A career in Data Science is one of the most robust careers at present. Because of the increasing importance of data, the scope of and demand for Data Scientists have been growing tremendously over the years. According to a report on IBM's predictions, demand for the Data Scientist role would rise by 28% by 2021. To equip oneself for a Data Scientist position, one must have an idea of the various questions that may be put forth in an interview.
This Data Science Interview Questions blog is designed specifically to provide you with the questions most frequently asked in interviews. These questions are useful for freshers who aspire to begin a career in the Data Science field. Experienced candidates can also brush up on their Data Science knowledge by referring to this blog before attending an interview.
Selection bias is the type of error that occurs when the researcher decides who is going to be studied. It is generally associated with research in which participants are not selected randomly. It is also known as the selection effect. It results from the distortion of statistical analysis caused by the way the samples were collected. If we fail to take selection bias into account, some conclusions of the study may not be accurate.
A feature vector is an n-dimensional vector of numerical features that represents some object. In Machine Learning, feature vectors are used to represent the numeric or symbolic characteristics of an object, called features, in a mathematical form that can be analyzed easily.
Root Cause Analysis is a problem-solving technique used to isolate the faults, or root cause, of a problem. It was initially developed to analyze industrial accidents, but it is now widely used in other areas. A factor is called a root cause when removing it from the problem-fault sequence averts the final undesirable event from recurring.
There are four types of Selection Bias and they are
Recommender systems are a subclass of information filtering systems that predict the rating or preference a user would give to a product.
The R-Square can be calculated using the formula:

R² = 1 − (Residual Sum of Squares / Total Sum of Squares)
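The formula can be checked directly in code. Here is a minimal sketch using only the Python standard library (the sample data is made up for illustration):

```python
from statistics import mean

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y, y)            # a perfect fit gives R^2 = 1
baseline = r_squared(y, [2.5] * 4)   # predicting the mean everywhere gives 0
```

A model that predicts no better than the mean of the targets scores 0, and a perfect fit scores 1, which is why R² is read as the share of variance explained.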
The statistical importance of an insight can be assessed using Hypothesis Testing.
Given below are the basic assumptions to be made for linear regression
A Type I error occurs if the null hypothesis is true but is rejected. A Type II error occurs if the null hypothesis is false but erroneously fails to be rejected. These are basic Data Science interview questions that are asked of a fresher in an interview.
Power analysis is an experimental design technique used to determine the sample size required to detect an effect of a given size.
Collaborative filtering is a filtering process used by most recommender systems to identify information or patterns by collaborating viewpoints, various data sources, and multiple agents.
Data sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points in order to find trends and patterns in the larger data set being examined.
Interpolation is the process of estimating a value between two known values in a list of values.
Extrapolation is the method of approximating a value by extending a known set of facts or values beyond the known range. These are common Data Science interview questions that are asked of a fresher in an interview.
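The difference between the two can be shown with a single linear estimator (a minimal sketch; the two known points are made-up examples):

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Linearly estimate y at x from two known points (x0, y0) and (x1, y1).
    If x lies between x0 and x1 this is interpolation; outside, extrapolation."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

interp = linear_estimate(0, 0, 10, 100, 5)    # x=5 is inside the known range
extrap = linear_estimate(0, 0, 10, 100, 15)   # x=15 is outside the known range
```

The same line is used in both cases; only the position of the query point relative to the known data distinguishes interpolation from extrapolation, which is why extrapolation is generally less reliable.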
An interaction occurs when the effect of an input variable on the output variable differs among the levels of another factor.
A/B testing is statistical hypothesis testing for randomized experiments with two variants, A and B. The objective of A/B testing is to identify changes to a web page that maximize or increase the outcome of a strategy.
Cross-validation is a model validation technique used to evaluate how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the aim is forecasting and one needs to estimate how accurately a model will perform in practice. The objective of cross-validation is to set aside part of the data to test the model during the training phase, limiting problems such as overfitting and giving insight into how well the model generalizes to an independent data set.
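A k-fold splitter can be sketched in a few lines of plain Python (a minimal illustration; in practice libraries such as scikit-learn provide this, and the "model" below is a toy that just predicts the training mean):

```python
import random
from statistics import mean

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation:
    shuffle once, split into k folds, and let each fold take a turn
    as the held-out test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Toy model: predict the mean of the training targets, scored by
# mean absolute error on the held-out fold.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
scores = []
for train, test in k_fold_splits(len(y), k=3):
    prediction = mean(y[i] for i in train)
    scores.append(mean(abs(y[i] - prediction) for i in test))
cv_error = mean(scores)   # average error across the 3 folds
```

Every observation is used for testing exactly once and for training k−1 times, which is what makes the averaged score a less biased estimate than a single train/test split.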
No; in some cases they reach only a local optimum, or local minimum, point. They would not reach the global optimum point, as this is governed by the data and the starting conditions. These are basic Data Science interview questions put forth to a fresher in an interview.
The drawbacks of the linear model are as follows
Confounding variables are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable, so that the estimates fail to account for the confounding factor.
The Law of Large Numbers describes the result of performing the same experiment a large number of times. It forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation converge to what they are estimating.
A star schema is a traditional database schema with a central fact table. Satellite tables, which map IDs to physical names or descriptions, are connected to the central fact table using the ID fields. These tables are known as lookup tables, and they are useful in real-time applications since they save a lot of memory. At times, star schemas involve layers of summarization to recover information more quickly.
You should update an algorithm when:
These are common Data Science interview questions that are asked of a fresher in an interview.
Selection bias is a common and problematic situation in which error is introduced due to a non-random population sample.
Resampling is done in any of the following cases,
Survivorship bias is the logical error of focusing on the aspects that survived some process while overlooking those that did not, because of their lack of prominence. It can lead to wrong conclusions in various ways.
Eigenvalues are used to understand linear transformations. In data analysis, they are generally used when calculating the eigenvectors of a covariance or correlation matrix.
Eigenvectors are the directions along which a particular linear transformation acts by flipping, stretching, or compressing. These are basic Data Science interview questions that are asked of a fresher in an interview.
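The dominant eigenvector of a matrix can be found by power iteration: repeatedly apply the matrix to a vector and the vector converges to the direction the transformation stretches most. This is a minimal pure-Python sketch, and the symmetric 2x2 matrix is a made-up example:

```python
def mat_vec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, iters=100):
    """Estimate the dominant eigenvalue/eigenvector of A by repeated
    multiplication and normalization."""
    v = [1.0] * len(A)
    for _ in range(iters):
        w = mat_vec(A, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v.T A v (with v normalized) estimates the eigenvalue
    Av = mat_vec(A, v)
    eigenvalue = sum(a * b for a, b in zip(v, Av))
    return eigenvalue, v

A = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues are 3 and 1; dominant eigenvector is along [1, 1]
val, vec = power_iteration(A)
```

This is the same idea used, at scale, when extracting principal components from a covariance matrix.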
Three types of Bias can occur during Sampling and they are,
The main idea behind this technique is that several weak learners are combined to produce a strong learner. The steps involved are:
| Supervised Learning | Unsupervised Learning |
| --- | --- |
| It requires labeled training data. | It does not require labeled data. |
Bias is the error introduced into a model due to the oversimplification of the machine learning algorithm, which can lead to underfitting. A high-bias model makes simplified assumptions to make the target function easier to learn.
There are four types of kernels in SVM: linear, polynomial, radial basis function (RBF), and sigmoid.
Recurrent nets are a type of artificial neural network designed to recognize patterns in sequences of data, such as time series from stock markets, government agencies, etc. To understand recurrent nets, you first have to understand the basics of feed-forward nets. Both RNNs and feed-forward networks are named after the way they channel information through a series of mathematical operations performed at the nodes of the network.
The Naive Bayes algorithm is based on Bayes' Theorem. Bayes' Theorem describes the probability of an event based on prior knowledge of conditions that may be related to the event.
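Bayes' Theorem itself is a one-line computation. The sketch below uses made-up numbers for a hypothetical diagnostic test, purely to show the formula in action:

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical example: a test detects a disease 90% of the time
# (P(pos|disease) = 0.9), the disease prevalence is 1% (P(disease) = 0.01),
# and the overall rate of positive tests is 10.8% (P(pos) = 0.108).
posterior = bayes(0.9, 0.01, 0.108)   # P(disease|pos), roughly 0.083
```

Even with a fairly accurate test, the posterior probability stays low because the prior (prevalence) is small, which is exactly the kind of reasoning Naive Bayes classifiers rely on.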
Boosting is an iterative technique that adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, boosting tries to increase the weight of that observation. Boosting decreases the bias error and builds strong predictive models.
Bagging tries to implement similar learners on small sample populations and then takes the mean of all the predictions. In generalized bagging, we can use different learners on different populations. This helps us reduce the variance error. These are basic Data Science interview questions that are asked of a fresher in an interview.
The classification of Algorithms are as follows
Linear regression is a statistical technique in which the score of a variable Y is predicted based on the score of a second variable X. X is generally known as the predictor variable, and Y is referred to as the criterion variable.
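A simple linear regression can be fitted with the ordinary least-squares formulas alone. This is a minimal sketch with made-up points that lie exactly on a line:

```python
from statistics import mean

def least_squares(xs, ys):
    """Ordinary least-squares fit of y = a + b*x
    for predictor X and criterion Y."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b

# Points on y = 2x + 1 should recover intercept a = 1 and slope b = 2.
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```

With noisy real data the recovered line is the one minimizing the residual sum of squares, which links back to how R² is computed.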
Outlier values are identified using univariate or other graphical analysis methods. If there are only a few outlier values, they can be assessed individually; for a large number of outliers, the values can be substituted with the 99th or 1st percentile values. Not all extreme values are outlier values. The common ways to treat outlier values are:
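Percentile capping (often called winsorizing) can be sketched in plain Python. This uses the simple nearest-rank percentile convention, which is one of several conventions, and a made-up data set with one extreme value:

```python
def percentile(values, p):
    """Nearest-rank percentile (p from 0 to 100) of a list of numbers."""
    s = sorted(values)
    k = round(p / 100 * (len(s) - 1))
    return s[max(0, min(len(s) - 1, k))]

def cap_outliers(values, low=1, high=99):
    """Substitute values below the 1st percentile or above the 99th
    percentile with those percentile values (winsorizing)."""
    lo, hi = percentile(values, low), percentile(values, high)
    return [min(max(v, lo), hi) for v in values]

data = list(range(1, 100)) + [1000]   # 1000 is an extreme value
capped = cap_outliers(data)           # 1000 is pulled back to the 99th percentile
```

Capping keeps the observation in the data set, unlike deletion, while limiting its leverage on means and regression fits.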
The collaborative filtering concept is used in recommending movies on Netflix, BookMyShow, IMDB, and product recommenders in e-commerce sites such as Amazon, Flipkart, YouTube, and eBay. Also, this is used in gaming recommendations in Xbox.
By following the methods we can select the important variables,
A trial may be terminated at an extreme value (usually for ethical reasons), but the extreme value is likely to have been reached by the variable with the largest variance, even though all the variables have a similar mean.
| Covariance | Correlation |
| --- | --- |
| Covariances are difficult to compare. For example, if we calculate the covariance of salary ($) and age (years), we get covariances that cannot be compared because the variables have unequal scales. | Correlation is the standardized form of covariance. To combat situations like the salary/age example, the correlation value is calculated to lie between -1 and 1, irrespective of the variables' scales. |
Yes, we can use the ANCOVA (analysis of covariance) technique to capture the association between continuous and categorical variables. These are common Data Science interview questions asked of freshers and experienced candidates in an interview.
This occurs when specific subsets of data are chosen to support a conclusion, or when data are rejected as bad on arbitrary grounds, instead of according to previously stated or generally agreed-upon criteria.
Yes, they are related. True Positive Rate = Recall. The formula is TP / (TP + FN).
We can assign a unique category to the missing values, knowing that the missing values may reveal some trend; or we can remove them outright; or we can sensibly check their distribution against the target variable, find the pattern in the missing values, assign them to a new category, and remove the others.
We won’t use either of these.
For a time series problem, k-fold can be troublesome because there may be some pattern in year 4 or 5 that is not present in year 3. Resampling the data set would separate those trends, and we might end up validating on past years, which is incorrect. Instead, we can use a forward chaining strategy with 5 folds. These are common Data Science interview questions that are asked of a fresher in an interview.
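The forward chaining idea can be sketched with a small expanding-window splitter (a minimal illustration; in practice scikit-learn's TimeSeriesSplit does this):

```python
def forward_chaining_splits(n, k):
    """Expanding-window splits for time series: each fold trains on all
    observations up to a cut-off and validates on the next block, so the
    model never sees the future."""
    fold = n // (k + 1)
    for i in range(1, k + 1):
        train = list(range(0, i * fold))
        test = list(range(i * fold, min((i + 1) * fold, n)))
        yield train, test

# With 12 time-ordered points and 3 folds:
# train [0..2] / test [3..5], train [0..5] / test [6..8], train [0..8] / test [9..11]
splits = list(forward_chaining_splits(12, 3))
```

Unlike shuffled k-fold, every training index precedes every validation index, which is exactly the property time series validation needs.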
The Central Limit Theorem is a statistical theory which states that, given a sufficiently large sample size drawn from a population with a finite level of variance, the mean of the sample means will be approximately equal to the mean of the total population.
The types of Sampling are as follows
The term ROC stands for Receiver Operating Characteristic. An ROC curve is basically a plot of the true positive rate against the false positive rate. It helps in finding the correct trade-off between the true and false positive rates for various probability thresholds of the predicted values. The closer the curve is to the upper left corner, the better the model. Put simply, whichever curve has the greater area under it is the better model.
There are three types of Algorithms available and they are,
save(x, file="x.Rdata")
Group functions are used to obtain summary statistics of a data set.
Some of the group (aggregate) functions are COUNT, MAX, MIN, AVG, and SUM; DISTINCT is often used with them to operate on unique values only.
UNION removes all the duplicate records (when all columns in the results are the same).
The UNION ALL does not remove the duplicate records.
Yes, the query results display the duplicate value by default, when the table contains the duplicate rows.
We can eliminate the duplicate rows with the DISTINCT clause.
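The UNION vs. UNION ALL behavior can be demonstrated end to end with Python's built-in sqlite3 module on an in-memory database (the two tables and their values are made-up examples):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a(x INTEGER);
    CREATE TABLE b(x INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

# UNION removes duplicate rows across the combined result sets.
union = conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x").fetchall()
# UNION ALL keeps every row, duplicates included.
union_all = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x").fetchall()
# union     -> [(1,), (2,), (3,)]
# union_all -> [(1,), (2,), (2,), (3,)]
```

The value 2 appears in both tables, so it shows up once under UNION and twice under UNION ALL; UNION ALL is also faster, since it skips the deduplication step.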
| SQL | MySQL or SQL Server |
| --- | --- |
| SQL stands for Structured Query Language. It is the standard language used for accessing and manipulating databases. | MySQL is a database management system, like SQL Server, Oracle, Postgres, and Informix. |
In a Venn diagram, an inner join returns rows where both tables match. A left join returns all rows from the left table, with nulls where the right table has no match. A right join is the opposite of a left join. A full join returns all the data combined from both tables. These are common Data Science interview questions for freshers.
They are not different from each other; the terms are simply used in different contexts. "Mean" is usually used when talking about a probability distribution or a sample population, whereas "expected value" is generally used in the context of a random variable.
Data is distributed in various ways, with a bias to the left or to the right, or it can all be jumbled up. However, there are cases where the data is distributed around a central value without any bias to the left or right, and it approaches a normal distribution in the form of a bell-shaped curve. Such random variables are distributed in the form of a symmetrical bell-shaped curve.
Yes, we can use machine learning for the time series analysis. But it depends on the applications.
The mean value and the expected value are the same irrespective of the distribution, under the condition that the distribution is the same in the same population.
The p-value is used to determine the significance of results after a hypothesis test in statistics. The p-value helps the reader draw conclusions, and it is always between 0 and 1.
There are different ways to assess the results of the logistic regression analysis:
It can be done using the enumerate function, which takes every element in a sequence (just like a list) and adds its location (index) alongside it.
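A quick illustration of enumerate on a small made-up list:

```python
# enumerate pairs each element of a sequence with a counter (its position).
fruits = ["apple", "banana", "cherry"]
indexed = list(enumerate(fruits))
# indexed -> [(0, 'apple'), (1, 'banana'), (2, 'cherry')]

# The counter can start at any value, e.g. 1 for human-friendly numbering.
numbered = list(enumerate(fruits, start=1))
```

This avoids the error-prone pattern of maintaining a separate index variable inside the loop.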
Mean values are the values that arrive from sampling data. The expected value is the mean of all the sample means, i.e. the value built from numerous samples. The expected value is the population mean.
The classification technique is commonly used in mining to classify data sets. These are common Data Science interview questions asked of fresher and experienced candidates in an interview.
The gradient measures how much the output of a function changes when you change the inputs little by little; it measures the change in all the weights with respect to the change in error. Gradient descent is an optimization algorithm that moves down the slope of a function to find its minimum.
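Gradient descent on a one-dimensional function fits in a few lines. This minimal sketch minimizes the made-up function f(x) = (x − 3)², whose gradient is 2(x − 3):

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to walk down the slope."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# The minimum of f(x) = (x - 3)^2 is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The learning rate controls the step size: too small and convergence is slow, too large and the iterates overshoot or diverge, which is the same trade-off faced when training neural networks.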
Backpropagation is the training algorithm used for multi-layer neural networks. Using this method, the error is moved from the end of the network back across all the weights inside the network, thus permitting efficient computation of the gradient.
The following are the steps that are used in Back Propagation,
There are three variants in Back Propagation and they are,
False positives are cases where you wrongly classify a non-event as an event. This is also known as a Type I error.
The Test Set is used for testing or evaluating the performance of the trained Machine Learning Model.
The validation set is considered part of the training set, since it is used for parameter selection and to avoid overfitting of the model being built.
False negatives are cases where you wrongly classify an event as a non-event. This is also known as a Type II error. These are basic Data Science interview questions for experienced candidates.
The activation function is the function in an artificial neuron that delivers an output based on the inputs. It is used to introduce non-linearity into the neural network, helping it learn complex functions. Without an activation function, the neural network would only be able to learn linear functions, i.e. linear combinations of the input data.
Autoencoders are simple learning networks that aim to transform inputs into outputs with the minimum possible error. This means the desired output is close to the input, while the hidden layers are smaller than the input layer. Autoencoders mostly receive unlabeled input, which is encoded and then used to reconstruct the input.
We can use the analysis of the covariance technique to find out the correlation between a continuous variable and a categorical variable.
The Support Vector Machine learning algorithm generally performs better in a reduced space. It is beneficial to perform dimensionality reduction before fitting an SVM when the number of features is large compared to the number of observations.
There are three different types and they are
A regression model that uses the L1 regularization technique is known as Lasso Regression, and a model that uses L2 regularization is called Ridge Regression. The major difference between the two is the penalty term.
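The two penalty terms can be written out directly (a minimal sketch; the weight vector and alpha are made-up values, and real implementations add these penalties to the least-squares loss):

```python
def l1_penalty(weights, alpha):
    """Lasso penalty: alpha * sum of absolute weights, which tends to
    drive some weights exactly to zero (feature selection)."""
    return alpha * sum(abs(w) for w in weights)

def l2_penalty(weights, alpha):
    """Ridge penalty: alpha * sum of squared weights, which shrinks
    all weights smoothly toward zero."""
    return alpha * sum(w * w for w in weights)

w = [0.5, -2.0, 0.0]
p1 = l1_penalty(w, 1.0)   # 0.5 + 2.0 + 0.0 = 2.5
p2 = l2_penalty(w, 1.0)   # 0.25 + 4.0 + 0.0 = 4.25
```

Because the L2 penalty grows quadratically, large weights are punished much harder under Ridge, while the L1 penalty's constant slope is what lets Lasso zero out small weights entirely.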
When the range of key values is larger than the size of the hash table, which is the common case, we must account for the possibility that two different records with two different keys hash to the same table index. There are quite a few ways to resolve this issue; in hash table vernacular, the solution implemented is referred to as collision resolution.
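One common collision resolution strategy, separate chaining, can be sketched in plain Python (a deliberately tiny table is used here to force collisions; the class name and sizes are illustrative):

```python
class ChainedHashTable:
    """Separate chaining: each slot holds a list of (key, value) pairs,
    so two keys that hash to the same index can coexist."""

    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: update it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # colliding keys share the bucket

    def get(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(size=2)          # only 2 slots, so collisions are guaranteed
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    table.put(k, v)
```

The alternative family of strategies is open addressing (linear or quadratic probing, double hashing), where a colliding record is instead placed in another slot of the same array.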
Precision describes the percentage of positive predictions that are correct. Recall describes the percentage of true positives that the model identifies as positive. An ROC curve displays the relationship between the true positive rate and the false positive rate. Precision, recall, and the ROC are measures used to identify how useful a given classification model is.
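Both metrics come straight from the confusion-matrix counts. A minimal sketch with made-up labels and predictions:

```python
def precision_recall(y_true, y_pred):
    """Precision = TP/(TP+FP): share of positive predictions that are correct.
    Recall    = TP/(TP+FN): share of actual positives the model finds."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

# 2 true positives, 1 false positive, 1 false negative
prec, rec = precision_recall([1, 1, 0, 1, 0], [1, 1, 1, 0, 0])
```

The two metrics trade off against each other: lowering the decision threshold raises recall at the cost of precision, which is exactly the trade-off an ROC or precision-recall curve visualizes.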
Random Forest is a versatile machine learning method that performs both classification and regression tasks. It also helps in areas such as treating missing values, handling outlier values, and dimensionality reduction. It gathers multiple weak models that come together to form a robust model.
Data Science is the mining and analysis of relevant information from data to solve analytically complicated problems. It draws widely on techniques from Machine Learning and Artificial Intelligence. These are basic Data Science interview questions for freshers in an interview.
The method of removing the sub-nodes of a decision node is called pruning. It is also known as the opposite process of splitting.
In the coming days, we will add many more relevant Data Science interview questions to this blog. Our institute also provides professional training for Data Science under the guidance of working professionals with market-relevant skills. The Data Science Course in Jalandhar at O7 Solutions provides in-depth knowledge of Data Science along with certification. Enroll, learn the course professionally with certification, and enhance your career opportunities.