Top 50 Data Science Interview Questions for 2026: Freshers to Experienced (With Answers)

Whether you are a fresher stepping into your first data science role or an experienced professional preparing for a senior position, acing the interview requires more than just knowing algorithms. You need to demonstrate statistical thinking, coding fluency, business acumen, and the ability to explain complex concepts clearly. This guide covers the top 50 data science interview questions for 2026, carefully organized by topic and difficulty level, from entry-level data science interview questions all the way to advanced questions for experienced professionals.

Each question comes with a clear, interview-ready answer.

Topics covered in this guide:

  • Core statistics and probability
  • Machine learning fundamentals
  • Python and SQL for data science
  • Data wrangling and feature engineering
  • Model evaluation and deployment
  • Behavioral and case-based questions

Quick Overview: Question Categories

| Category | Number of Questions | Suitable For |
| --- | --- | --- |
| Statistics & Probability | Q1 – Q8 | Freshers, Interns, Experienced |
| Machine Learning | Q9 – Q18 | Freshers, Interns, Experienced |
| Python for Data Science | Q19 – Q25 | Freshers, Interns |
| SQL & Data Wrangling | Q26 – Q32 | Freshers, Interns, Experienced |
| Model Evaluation & Metrics | Q33 – Q39 | Experienced |
| Advanced & System Design | Q40 – Q46 | Experienced |
| Behavioral & Case Questions | Q47 – Q50 | All Levels |

Section 1: Data Science Statistics Interview Questions

Statistics is the backbone of data science. These questions are almost universally asked across all interview rounds, from data science intern interview questions to senior-level assessments.

Q1. What is the difference between population and sample?
Answer: A population includes every member of the group being studied, while a sample is a subset of the population selected for analysis. In data science, we almost always work with samples and use statistical inference to draw conclusions about the population. Key concern: the sample must be representative to avoid bias.
Q2. What is the Central Limit Theorem, and why does it matter?
Answer: The Central Limit Theorem (CLT) states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the underlying population distribution. It matters because it justifies the use of parametric tests and confidence intervals even when the raw data are not normally distributed. A common rule of thumb is that n >= 30 is sufficient.
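To make the CLT concrete, here is a quick simulation sketch using only the standard library (the exponential population and sample sizes are illustrative choices, not part of the theorem):

```python
import random
import statistics

random.seed(0)

# Population: exponential with mean 1.0 -- clearly non-normal (right-skewed).
# Draw many samples of size n and record each sample mean.
n = 30
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(2000)
]

# CLT in action: the sample means cluster around the population mean (1.0)
# with spread close to sigma / sqrt(n) = 1 / sqrt(30), even though the raw
# draws are far from normal.
print(round(statistics.mean(sample_means), 2))
print(round(statistics.stdev(sample_means), 2))
```

Plotting a histogram of `sample_means` would show the familiar bell shape emerging from skewed raw data.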
Q3. Explain the difference between Type I and Type II errors.
Answer: A Type I error (false positive) occurs when you reject a true null hypothesis. A Type II error (false negative) occurs when you fail to reject a false null hypothesis. In practice, reducing Type I error increases the risk of Type II error. The trade-off is managed through the significance level (alpha) and statistical power.
Q4. What is a p-value, and how do you interpret it?
Answer: The p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. A p-value below the significance threshold (typically 0.05) indicates that the result is statistically significant and that we reject the null hypothesis. It does not measure the size or practical importance of an effect.
Q5. What is the difference between correlation and covariance?
Answer: Covariance measures the direction of the linear relationship between two variables, but is sensitive to scale, making it hard to interpret directly. Correlation normalizes covariance by the standard deviations of both variables, producing a unitless value between -1 and 1 that captures both direction and strength of the linear relationship.
Q6. What is a confidence interval?
Answer: A confidence interval is a range of values, derived from sample data, that is likely to contain the true population parameter at a specified confidence level (e.g., 95%). A 95% CI means that if we repeated the sampling process 100 times, approximately 95 of those intervals would contain the true parameter.
Q7. What is Bayes' Theorem? Give a practical example.
Answer: Bayes' Theorem describes how to update the probability of a hypothesis given new evidence: P(A|B) = P(B|A) * P(A) / P(B). A practical example is spam detection: given a word like 'free' appears in an email, what is the probability that the email is spam? Bayesian classifiers update prior probabilities using the likelihood of observed features.
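Plugging hypothetical numbers into the spam example makes the update step explicit (the 40%/60%/5% rates below are invented for illustration):

```python
# Assumed (made-up) rates: 40% of email is spam; 'free' appears in
# 60% of spam and in 5% of legitimate mail.
p_spam = 0.40
p_free_given_spam = 0.60
p_free_given_ham = 0.05

# Law of total probability: overall chance of seeing 'free'.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | 'free') = P('free' | spam) * P(spam) / P('free').
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_spam_given_free, 3))  # ~0.889
```

Seeing the word lifts the spam probability from a 40% prior to roughly 89%, which is the kind of update a Naive Bayes classifier performs for every feature.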
Q8. What is the difference between parametric and non-parametric tests?
Answer: Parametric tests assume the data follow a specific distribution (usually normal) and use parameters such as the mean and variance. Examples include t-tests and ANOVA. Non-parametric tests make no distributional assumptions and are used when the data is ordinal, or the sample is too small to verify normality. Examples include Mann-Whitney U and Kruskal-Wallis tests.

Section 2: Machine Learning: Data Science Technical Interview Questions

Machine learning questions are central to data science technical interview questions at all levels. These questions test both conceptual understanding and practical application.

Q9. What is the difference between supervised and unsupervised learning?
Answer: Supervised learning trains a model on labeled data where the correct output is known (e.g., classification, regression). Unsupervised learning finds hidden patterns in unlabeled data (e.g., clustering, dimensionality reduction). A third category, semi-supervised learning, uses a small amount of labeled data combined with a larger pool of unlabeled data.
Q10. Explain the bias-variance tradeoff.
Answer: Bias refers to errors from overly simplistic assumptions in the model (underfitting). Variance refers to the sensitivity of a model to fluctuations in the training data (overfitting). High bias produces consistently wrong predictions; high variance produces inconsistent predictions. The goal is to find the sweet spot that minimizes total error on unseen data. Techniques such as cross-validation, regularization, and ensemble methods help manage this trade-off.
Q11. What is regularization? Explain L1 vs L2.
Answer: Regularization adds a penalty term to the loss function to prevent overfitting. L1 regularization (Lasso) adds the sum of absolute values of coefficients, which can shrink some coefficients to exactly zero, effectively performing feature selection. L2 regularization (Ridge) adds the sum of squared coefficients, distributing the penalty more evenly and retaining all features but with smaller weights. Elastic Net combines both.
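The feature-selection effect of L1 can be shown in closed form for a single-feature least-squares fit (a sketch with made-up data and penalty strength; in practice you would use scikit-learn's Lasso and Ridge):

```python
# One-feature least squares with a penalty, solved analytically.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.11, 0.19, 0.33, 0.40]   # weak signal: true slope around 0.1

sxx = sum(x * x for x in xs)                # 30.0
sxy = sum(x * y for x, y in zip(xs, ys))    # ~3.08
w_ols = sxy / sxx                           # unpenalized slope

lam = 7.0  # illustrative regularization strength

# L2 (ridge) minimizes sum((y - w*x)^2) + lam * w^2:
# the slope shrinks toward zero but never reaches it exactly.
w_ridge = sxy / (sxx + lam)

# L1 (lasso) minimizes sum((y - w*x)^2) + lam * |w|:
# the soft-threshold solution can set the slope exactly to zero.
sign = 1.0 if sxy >= 0 else -1.0
w_lasso = sign * max(abs(sxy) - lam / 2, 0.0) / sxx
```

With this penalty, ridge keeps a small nonzero slope while lasso drops the feature entirely, which is the behavior interviewers want you to articulate.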
Q12. What is a decision tree, and what are its main hyperparameters?
Answer: A decision tree recursively splits the data based on feature values that best separate the target classes or minimize regression error. Key hyperparameters include max_depth (controls tree depth to prevent overfitting), min_samples_split (minimum samples required to split a node), min_samples_leaf, and the criterion (Gini impurity or entropy for classification; MSE for regression).
Q13. How does Random Forest work, and why is it better than a single decision tree?
Answer: Random Forest builds multiple decision trees using bootstrapped samples of the data and a random subset of features at each split, then aggregates predictions via majority vote (classification) or averaging (regression). It reduces variance compared to a single tree because the trees are decorrelated, so no single noisy feature dominates all trees. This makes it robust to outliers and less prone to overfitting.
Q14. Explain the concept behind gradient boosting.
Answer: Gradient boosting builds an ensemble of weak learners (usually shallow decision trees) sequentially. Each new tree is trained to correct the residual errors of the previous ensemble. The final prediction is the sum of all tree outputs scaled by a learning rate. Popular implementations include XGBoost, LightGBM, and CatBoost, which improve speed and add regularization over the basic algorithm.
Q15. What is the difference between bagging and boosting?
Answer: Bagging (Bootstrap Aggregating) trains multiple models in parallel on different bootstrapped subsets of the data and averages their predictions. It reduces variance. Boosting trains models sequentially, where each model corrects the errors of the previous one. It reduces both bias and variance but is more sensitive to noise. Random Forest is bagging; Gradient Boosting is boosting.
Q16. What is cross-validation, and why is it used?
Answer: Cross-validation is a technique for evaluating model performance on unseen data by splitting the dataset into multiple folds. In k-fold cross-validation, the data is split into k subsets; the model is trained on k-1 folds and evaluated on the remaining one, repeating k times. It provides a more reliable performance estimate than a single train-test split and helps detect overfitting.
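The fold mechanics can be sketched in a few lines (in practice scikit-learn's KFold handles this, including shuffling; the n and k below are arbitrary):

```python
# Minimal k-fold index split: yields (train_indices, val_indices) pairs.
def kfold_indices(n, k):
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    idx, start = list(range(n)), 0
    for size in fold_sizes:
        val = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, val
        start += size

folds = list(kfold_indices(10, 3))
```

Each index appears in exactly one validation fold, so every observation is used for evaluation exactly once across the k runs.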
Q17. What is the difference between classification and regression?
Answer: Classification predicts discrete class labels (e.g., spam/not spam, disease/no disease). Regression predicts continuous numerical values (e.g., house price, temperature). The model architecture and loss functions differ: classification typically uses cross-entropy loss while regression uses mean squared error or mean absolute error.
Q18. What is k-means clustering, and what are its limitations?
Answer: K-means partitions data into k clusters by minimizing the within-cluster sum of squared distances to the cluster centroid. It requires specifying k in advance, assumes clusters are spherical and similar in size, and is sensitive to outliers and the initial placement of centroids. It also does not perform well with non-convex cluster shapes. Alternatives include DBSCAN and hierarchical clustering.

Section 3: Python for Data Science: Entry Level & Intern Questions

These are common data science interview questions for freshers and for data science intern interviews that test Python proficiency.

Q19. What is the difference between a list and a tuple in Python?
Answer: A list is mutable (elements can be changed) and uses square brackets. A tuple is immutable and uses parentheses. Tuples are faster than lists for iteration and are used for fixed data, such as coordinates or function return values. Lists are preferred when the collection needs to be modified.
Q20. What are Pandas DataFrames and Series?
Answer: A Pandas Series is a one-dimensional labeled array that can hold any data type, similar to a column in a spreadsheet. A DataFrame is a two-dimensional labeled data structure, essentially a table with rows and columns, where each column is a Series. DataFrames are the primary data structure used for data manipulation and analysis in Python.
Q21. How do you handle missing values in Pandas?
Answer: Missing values can be detected with df.isnull().sum(). Common strategies include dropping rows or columns with df.dropna(), imputing with the mean, median, or mode using df.fillna(), or using forward/backward fill (ffill/bfill) for time series data. More advanced approaches use model-based imputation (e.g., KNNImputer from scikit-learn).

Example:
df['age'] = df['age'].fillna(df['age'].median())
Q22. What is the difference between loc and iloc in Pandas?
Answer: loc is label-based indexing, used to select rows and columns by their label names. iloc is an integer-position-based indexer, selecting by row and column number (0-indexed). Use loc when you know the index labels; use iloc when you need positional slicing.

df.loc[0, 'name'] # label-based
df.iloc[0, 2] # position-based
Q23. What is vectorization, and why is it important in NumPy?
Answer: Vectorization means applying operations on entire arrays at once without using explicit Python loops. NumPy performs these operations using highly optimized C code under the hood, making computations 10-100x faster than equivalent Python loops. This is critical in data science, where datasets can have millions of rows.
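A quick side-by-side sketch, assuming NumPy is installed (the array size and arithmetic are arbitrary):

```python
import numpy as np

a = np.arange(10_000, dtype=np.float64)

# Vectorized: a single C-level operation over the whole array.
vectorized = a * 2.0 + 1.0

# Equivalent explicit Python loop -- same result, far slower at scale.
loop = np.empty_like(a)
for i in range(a.size):
    loop[i] = a[i] * 2.0 + 1.0
```

Timing the two versions with `%timeit` on a realistically sized array is a common way to demonstrate the speed gap in an interview.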
Q24. How would you merge two DataFrames in Pandas?
Answer: You can use pd.merge() for SQL-style joins (inner, left, right, outer) or df.join() for index-based joins. The merge function requires specifying the key columns using the 'on' parameter.

result = pd.merge(df1, df2, on='customer_id', how='left')
Q25. What are list comprehensions, and when would you use them?
Answer: List comprehensions provide a concise way to create lists by applying an expression to each element of an iterable, optionally filtering elements. They are faster and more readable than equivalent for loops for simple transformations.

squares = [x**2 for x in range(10) if x % 2 == 0]

Section 4: SQL & Data Wrangling Questions

SQL is tested in nearly every data science interview, from entry-level to those for experienced professionals.

Q26. What is the difference between WHERE and HAVING?
Answer: WHERE filters rows before any grouping or aggregation occurs. HAVING filters groups after the GROUP BY clause has been applied. You cannot use aggregate functions like SUM() or COUNT() in a WHERE clause, but you can in HAVING.
Q27. Explain the difference between INNER JOIN, LEFT JOIN, and FULL OUTER JOIN.
Answer: INNER JOIN returns only rows that have matching values in both tables. LEFT JOIN returns all rows from the left table and matching rows from the right (unmatched right rows are NULL). FULL OUTER JOIN returns all rows from both tables, with NULLs where there is no match on either side.
Q28. What are window functions in SQL?
Answer: Window functions perform calculations across a set of rows related to the current row without collapsing the result into a single value (unlike GROUP BY). Common window functions include ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), and running totals with SUM() OVER(). They are essential for tasks like ranking customers or calculating moving averages.

SELECT name, salary,
       RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
FROM employees;
Q29. How do you find duplicate rows in a SQL table?
Answer: Group by all relevant columns and use HAVING COUNT(*) > 1 to identify duplicates.

SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
Q30. What is the difference between UNION and UNION ALL?
Answer: UNION combines the results of two SELECT statements and removes duplicate rows. UNION ALL combines the results and keeps all rows, including duplicates. UNION ALL is faster because it skips the deduplication step and should be preferred when duplicates are either not possible or acceptable.
Q31. How would you calculate a 7-day rolling average in SQL?
Answer: Use a window function with ROWS BETWEEN 6 PRECEDING AND CURRENT ROW to define the rolling window.

SELECT date, revenue,
       AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7d_avg
FROM daily_sales;
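You can verify this pattern end to end with Python's built-in sqlite3 module (window functions need SQLite 3.25+; the table name and revenue values below are made up):

```python
import sqlite3

# In-memory database with ten days of synthetic revenue: 10, 20, ..., 100.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_sales (date TEXT, revenue REAL)")
con.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [(f"2026-01-{d:02d}", float(d * 10)) for d in range(1, 11)],
)

# Same rolling-average query as above.
rows = con.execute("""
    SELECT date, revenue,
           AVG(revenue) OVER (
               ORDER BY date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_7d_avg
    FROM daily_sales
""").fetchall()
con.close()
```

Note that the first six rows average over fewer than seven days; only from day 7 onward is the window full.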
Q32. What is query optimization, and how do you improve SQL query performance?
Answer: Query optimization reduces query execution time. Common strategies include: using indexes on frequently filtered or joined columns, avoiding SELECT *, filtering early with WHERE before joining, using CTEs or subqueries to break complex logic into steps, and analyzing the query execution plan (EXPLAIN) to identify bottlenecks like full table scans.

Section 5: Model Evaluation & Metrics: Data Science Interview Questions for Experienced

These are common data science interview questions for experienced professionals and reflect real-world decision-making in building production models.

Q33. What is the ROC-AUC score, and when is it useful?
Answer: ROC (Receiver Operating Characteristic) curve plots the True Positive Rate against the False Positive Rate at various classification thresholds. AUC (Area Under the Curve) summarizes this into a single value between 0 and 1. AUC of 1 = perfect model; 0.5 = random. It is useful for comparing models across all thresholds, especially when dealing with imbalanced classes. However, for severely imbalanced datasets, Precision-Recall AUC is often more informative.
Q34. Explain precision, recall, and F1-score with a business example.
Answer: Precision = TP / (TP + FP): out of all predicted positives, how many are actually positive. Recall = TP / (TP + FN): out of all actual positives, how many did we catch? F1-score is their harmonic mean. Example: In fraud detection, high recall is critical (we want to catch most fraud cases even at the cost of some false alarms). In email filtering, high precision matters more (we do not want legitimate emails flagged as spam).
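With hypothetical confusion counts, the formulas reduce to a few lines:

```python
# Made-up confusion counts from a fraud model:
# 80 true positives, 20 false positives, 40 false negatives.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)  # of all flagged cases, how many were fraud
recall = tp / (tp + fn)     # of all actual fraud, how much did we catch
f1 = 2 * precision * recall / (precision + recall)
```

Here precision is 0.80 but recall is only about 0.67, so a third of fraud slips through; whether that trade-off is acceptable is exactly the business question the interviewer wants you to raise.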
Q35. How do you handle class imbalance in a classification problem?
Answer: Strategies include: resampling (oversampling the minority class with SMOTE or undersampling the majority class), using class weights in the loss function (class_weight='balanced' in scikit-learn), choosing appropriate evaluation metrics (F1, AUC-PR instead of accuracy), and using threshold tuning to shift the decision boundary. The best approach depends on the cost of false positives versus false negatives in the business context.
Q36. What is the difference between MSE, RMSE, and MAE?
Answer: MSE (Mean Squared Error) averages the squared differences between predicted and actual values, penalizing large errors heavily. RMSE is the square root of MSE, returning the error in the original units of the target variable. MAE (Mean Absolute Error) averages absolute differences and is more robust to outliers. Use RMSE when large errors are especially costly; use MAE when all errors should be treated equally.
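A toy calculation (illustrative numbers) shows how one large error dominates MSE/RMSE but not MAE:

```python
import math

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 6.0]   # one prediction is off by 2

errors = [p - t for p, t in zip(y_pred, y_true)]
mse = sum(e * e for e in errors) / len(errors)
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / len(errors)
```

RMSE (about 1.15) exceeds MAE (0.875) precisely because squaring amplifies the single large error, which is the behavior to call out when justifying a metric choice.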
Q37. What is overfitting, and how do you detect and prevent it?
Answer: Overfitting occurs when a model learns the noise and specific patterns of training data rather than generalizable relationships, resulting in high training accuracy but poor test accuracy. Detection: a large gap between training and validation metrics. Prevention strategies: use more training data, simplify the model, apply regularization (L1/L2), use dropout (neural networks), add early stopping, and always evaluate on a held-out test set.
Q38. What is feature importance, and how is it calculated in tree-based models?
Answer: Feature importance measures the extent to which each feature contributes to the model's predictions. In tree-based models (decision trees, random forests, gradient boosting), it is typically calculated as the total reduction in impurity (Gini or entropy) contributed by a feature across all splits and trees, normalized to sum to 1. SHAP (SHapley Additive exPlanations) provides a more reliable and model-agnostic alternative that accounts for feature interactions.
Q39. What is data leakage, and how do you prevent it?
Answer: Data leakage occurs when information outside the training data boundary (e.g., from the future or the target) is inadvertently used to train the model, leading to unrealistically optimistic performance metrics. Common sources include applying scaler or imputer transformations before the train-test split, using target-encoded features computed on the full dataset, and including features that are only available at prediction time. Prevention: always split data before any preprocessing; use pipelines to apply transformations within each cross-validation fold.
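A tiny sketch of the scaling pitfall (made-up numbers; real code would use a scikit-learn Pipeline):

```python
# One extreme point held out as the test set.
data = [1.0, 2.0, 3.0, 100.0]
train, test = data[:3], data[3:]

# WRONG: centering statistic computed on ALL data -- the test point
# has leaked into the value used to transform the training set.
full_mean = sum(data) / len(data)        # 26.5

# RIGHT: fit the transform on the training split only...
train_mean = sum(train) / len(train)     # 2.0

# ...then apply that same statistic to both splits.
scaled_train = [x - train_mean for x in train]
scaled_test = [x - train_mean for x in test]
```

The same discipline applies to imputation, target encoding, and feature selection: fit on train, transform everywhere.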

Section 6: Advanced Data Science Interview Questions for Experienced Professionals

Q40. What is the curse of dimensionality?
Answer: The curse of dimensionality refers to the phenomenon where the volume of the feature space grows exponentially as the number of dimensions increases, making data increasingly sparse. This causes distance-based algorithms (like KNN) to break down because all points become approximately equidistant. It also increases the risk of overfitting. Solutions include dimensionality reduction (PCA, t-SNE, UMAP), feature selection, and regularization.
Q41. Explain the difference between PCA and t-SNE.
Answer: PCA (Principal Component Analysis) is a linear dimensionality reduction technique that projects data onto orthogonal axes of maximum variance. It preserves global structure and is deterministic. t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear technique that preserves local neighborhood structure and is primarily used for 2D or 3D visualization. t-SNE is not suitable for preprocessing before modeling (due to non-determinism and inability to transform new data) but is excellent for exploratory visualization.
Q42. What is A/B testing, and how do you determine the sample size?
Answer: A/B testing is a controlled experiment that compares two variants (A and B) to determine which performs better on a given metric. Sample size is determined using statistical power analysis, which requires specifying the significance level (alpha, typically 0.05), the desired power (1 - beta, typically 0.80), and the minimum detectable effect (the smallest business-meaningful difference). Online calculators or Python's statsmodels library can automate this. Running the test for too short a period leads to underpowered results; stopping early based on interim results increases the risk of false positives (p-hacking).
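The sample-size calculation can be approximated with the standard library (a rough back-of-the-envelope two-proportion formula; statsmodels gives a more exact answer, and the baseline rate and lift below are illustrative):

```python
from statistics import NormalDist

alpha, power = 0.05, 0.80
p, delta = 0.10, 0.02   # 10% baseline conversion, detect a 2-point lift

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
z_beta = NormalDist().inv_cdf(power)            # ~0.84

# Approximate per-group sample size for comparing two proportions,
# assuming both are near the baseline rate p.
n = (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2
```

This works out to roughly 3,500 users per variant, and halving the detectable effect would quadruple the requirement, which is why the minimum detectable effect must be a deliberate business decision.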
Q43. What is the difference between a generative and discriminative model?
Answer: A discriminative model (e.g., logistic regression, SVM, neural networks) learns the decision boundary between classes directly — it models P(y|x). A generative model (e.g., Naive Bayes, Gaussian Mixture Models, VAEs) models the joint distribution P(x, y) or the class-conditional distribution P(x|y), and can generate new data samples. Discriminative models typically outperform generative models on classification tasks with sufficient labeled data, but generative models are more data-efficient and can handle missing data.
Q44. What is the difference between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent?
Answer: Batch gradient descent computes the gradient using the entire dataset before each update — accurate but slow and memory-intensive for large datasets. Stochastic gradient descent (SGD) updates weights after every single sample, fast but noisy. Mini-batch gradient descent updates after processing a small batch (typically 32-256 samples), balancing speed and stability. Mini-batches are the standard approach in deep learning and are natively supported by all major frameworks.
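A minimal mini-batch loop on synthetic noise-free data (learning rate, batch size, and epoch count are illustrative):

```python
import random

random.seed(1)

# Data generated from y = 2x; the optimizer should recover w = 2.
xs = [float(i) for i in range(1, 21)]
data = [(x, 2.0 * x) for x in xs]

w, lr, batch_size = 0.0, 0.001, 4
for epoch in range(200):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Mean-squared-error gradient over just this mini-batch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
```

Setting `batch_size = len(data)` turns this into batch gradient descent, and `batch_size = 1` turns it into pure SGD, which is a useful way to frame the three variants as points on one spectrum.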
Q45. How do you deploy a machine learning model to production?
Answer: A typical ML deployment pipeline includes: (1) serializing the trained model using joblib or pickle, (2) wrapping it in a REST API (e.g., FastAPI or Flask), (3) containerizing with Docker, (4) deploying to a cloud service (AWS SageMaker, GCP Vertex AI, Azure ML) or Kubernetes cluster, (5) setting up monitoring for model performance, data drift, and concept drift using tools like MLflow or Evidently AI, and (6) implementing CI/CD pipelines for automated retraining and redeployment.
Q46. What is the difference between online learning and batch learning?
Answer: Batch learning trains the model on the full dataset at once and deploys a static model — suitable when data does not change frequently. Online learning (also called incremental learning) continuously updates the model as new data arrives in a stream, making it suitable for dynamic environments such as recommendation engines, fraud detection, and stock prediction, where patterns shift over time. Models must be designed to accommodate online updates (e.g., SGD-based learners and the river library in Python).

Section 7: Behavioral & Case-Based Questions

These questions appear in all rounds, from data science interview questions for freshers through to senior positions, and assess communication, problem-solving, and business thinking.

Q47. Tell me about a data science project you are most proud of. What was your approach?
Answer: Structure your answer using the STAR method: Situation (context and business problem), Task (your specific role), Action (techniques used, challenges overcome, decisions made), Result (measurable impact). Quantify outcomes wherever possible: 'reduced churn prediction error by 18%', 'automated a process that saved 15 analyst hours per week'. Show that you understood the business goal, not just the technical execution.
Q48. How would you explain a complex model to a non-technical stakeholder?
Answer: Focus on what the model does, not how it does it. Use analogies from their domain. For example, explaining a credit scoring model: 'Think of it as an automated checklist that scores applicants the same way a senior underwriter would, based on the patterns we have observed in thousands of past applications.' Avoid jargon, use visuals if available, and always connect the model output to a business decision or action.
Q49. How would you approach a situation where your model's performance drops significantly after deployment?
Answer: First, check for data pipeline issues such as missing values, schema changes, or encoding bugs that corrupt input features. Next, check for data drift by comparing feature distributions between the training data and recent production data using statistical tests or tools such as Evidently AI. If drift is confirmed, retrain the model on more recent data. Finally, evaluate whether the business context has changed (concept drift) and whether the model's objective still aligns with the current business goal. Establish monitoring dashboards to catch this faster in the future.
Q50. You are given a dataset, but have no business context. How do you begin your analysis?
Answer: Start with exploratory data analysis (EDA): examine the shape of the data, data types, missing value counts, and basic summary statistics. Visualize distributions of key variables and look for correlations, outliers, and anomalies. Then ask business context questions: What decision will this data inform? What is the target variable? What time period does this cover? Is the data a sample or a population? Understanding the analysis goal before diving into modeling is critical to avoiding wasted effort.

Key Comparisons (For a Quick Reference)

| Concept A | Concept B | Key Difference |
| --- | --- | --- |
| Supervised Learning | Unsupervised Learning | Labeled vs unlabeled data |
| Classification | Regression | Discrete vs continuous output |
| L1 Regularization | L2 Regularization | Feature selection vs weight shrinkage |
| Bagging | Boosting | Parallel vs sequential ensemble |
| Precision | Recall | Predicted positives vs actual positives |
| MSE | MAE | Penalizes outliers vs robust to outliers |
| PCA | t-SNE | Linear/global vs nonlinear/local |
| WHERE clause | HAVING clause | Before vs after aggregation |
| Batch Gradient Descent | Mini-Batch GD | Full dataset vs small batch updates |
| Overfitting | Underfitting | High variance vs high bias |

Which Questions Should You Focus On?

| Level | Focus Areas | Recommended Questions |
| --- | --- | --- |
| Fresher / Entry Level | Statistics, Python basics, ML concepts, SQL fundamentals | Q1-Q10, Q19-Q29 |
| Data Science Intern | Python, EDA, basic ML, SQL, statistics | Q1-Q8, Q19-Q32 |
| 1-3 Years Experience | ML algorithms, model evaluation, advanced SQL, feature engineering | Q9-Q39 |
| Experienced (3+ Years) | System design, advanced ML, deployment, A/B testing, case questions | Q33-Q50 |

Interview Preparation Tips

For Data Science Interview Questions for Freshers

  • Build at least two end-to-end projects on Kaggle or GitHub and be ready to walk through every decision you made.
  • Demonstrate knowledge of the full data science lifecycle: problem definition, data collection, EDA, modeling, evaluation, and communication.
  • Practice explaining concepts aloud, not just writing code. Interviewers evaluate how clearly you think.
  • Know your statistics fundamentals thoroughly. Hypothesis testing and probability questions are almost always asked.

For Experienced Professionals

  • Prepare concrete examples of models you have taken to production and their measurable business impact.
  • Be ready for system design questions: 'How would you build a recommendation engine for 10 million users?'
  • Know trade-offs: when to use XGBoost vs. deep learning, and when to use precision vs. recall as your optimization target.
  • Demonstrate awareness of MLOps practices, including versioning, monitoring, drift detection, and retraining strategies.

For Data Science Intern Interview Questions

  • Focus on Python fluency, especially Pandas and NumPy operations.
  • SQL is heavily tested; therefore, practice joins, GROUP BY, window functions, and subqueries.
  • Show curiosity and a learning mindset, as internship interviewers value potential over perfection.

Conclusion

Data science interviews in 2026 test a wide range of skills, from statistical reasoning and Python fluency to machine learning depth and business communication. The top 50 data science interview questions in this guide cover every major topic you will encounter, regardless of your experience level. 

Key takeaways from this guide:

  • Statistics and probability questions are asked at every level — never underestimate them.
  • Machine learning theory must be paired with practical knowledge of when and why to use each algorithm.
  • SQL is non-negotiable in almost every data science interview.
  • Experienced candidates must go beyond algorithms and demonstrate deployment, monitoring, and business alignment skills.
  • Communication matters as much as technical depth; practice explaining your thinking out loud.

Use this guide alongside hands-on project work and consistent practice to walk into your 2026 data science interview with confidence.