Machine Learning Interview Questions – Set 11

What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?

Both algorithms are methods for finding a set of parameters that minimize a loss function by evaluating parameters against data and then making adjustments.

In standard gradient descent, you’ll evaluate all training samples for each set of parameters. This is akin to taking big, slow steps toward the solution.

In stochastic gradient descent, you’ll evaluate only 1 training sample for the set of parameters before updating them. This is akin to taking small, quick steps toward the solution.

What are the advantages and disadvantages of using an Array?

Advantages:

  1. Random access is enabled
  2. Saves memory
  3. Cache friendly
  4. Predictable compile timing
  5. Helps in re-usability of code
    Disadvantages:
  6. Addition and deletion of records is time consuming even though we get the element of interest immediately through random access. This is due to the fact that the elements need to be reordered after insertion or deletion.
  7. If contiguous blocks of memory are not available in the memory, then there is an overhead on the CPU to search for the most optimal contiguous location available for the requirement.
  8. Now that we know what arrays are, we shall understand them in detail by solving some interview questions. Before that, let us see the functions that Python as a language provides for arrays, also known as, lists.
  • append() – Adds an element at the end of the list
  • copy() – returns a copy of a list.
  • reverse() – reverses the elements of the list
  • sort() – sorts the elements in ascending order by default.

Define and explain the concept of Inductive Bias with some examples.

Inductive Bias is a set of assumptions that humans use to predict outputs given inputs that the learning algorithm has not encountered yet. When we are trying to learn Y from X and the hypothesis space for Y is infinite, we need to reduce the scope by our beliefs/assumptions about the hypothesis space which is also called inductive bias. Through these assumptions, we constrain our hypothesis space and also get the capability to incrementally test and improve on the data using hyper-parameters. Examples:

  • We assume that Y varies linearly with X while applying Linear regression.
  • We assume that there exists a hyperplane separating negative and positive examples.

What is normal distribution?

The distribution having the below properties is called normal distribution.

  • The mean, mode and median are all equal.
  • The curve is symmetric at the center (i.e. around the mean, μ).
  • Exactly half of the values are to the left of center and exactly half the values are to the right.
  • The total area under the curve is 1.

What are the drawbacks of a linear model?

There are a couple of drawbacks of a linear model:

  • A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity
  • A linear model can’t be used for discrete or binary outcomes.
  • You can’t vary the model flexibility of a linear model.

What do you understand by selection bias?

  • It is a statistical error that causes a bias in the sampling portion of an experiment.
  • The error causes one sampling group to be selected more often than other groups included in the experiment.
  • Selection bias may produce an inaccurate conclusion if the selection bias is not identified.

How would you predict who will renew their subscription next month? What data would you need to solve this? What analysis would you do? Would you build predictive models? If so, which algorithms?

  • Let’s assume that we’re trying to predict renewal rate for Netflix subscription. So our problem statement is to predict which users will renew their subscription plan for the next month.
  • Next, we must understand the data that is needed to solve this problem. In this case, we need to check the number of hours the channel is active for each household, the number of adults in the household, number of kids, which channels are streamed the most, how much time is spent on each channel, how much has the watch rate varied from last month, etc. Such data is needed to predict whether or not a person will continue the subscription for the upcoming month.
  • After collecting this data, it is important that you find patterns and correlations. For example, we know that if a household has kids, then they are more likely to subscribe. Similarly, by studying the watch rate of the previous month, you can predict whether a person is still interested in a subscription. Such trends must be studied.
  • The next step is analysis. For this kind of problem statement, you must use a classification algorithm that classifies customers into 2 groups:
  • Customers who are likely to subscribe next month
  • Customers who are not likely to subscribe next month
  • Would you build predictive models? Yes, in order to achieve this you must build a predictive model that classifies the customers into 2 classes like mentioned above.
  • Which algorithms to choose? You can choose classification algorithms such as Logistic Regression, Random Forest, Support Vector Machine, etc.
  • Once you’ve opted the right algorithm, you must perform model evaluation to calculate the efficiency of the algorithm. This is followed by deployment.

Why ensemble learning is used?

Ensemble learning is used to improve the classification, prediction, function approximation etc of a model.

What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?

Gradient Descent and Stochastic Gradient Descent are the algorithms that find the set of parameters that will minimize a loss function.
The difference is that in Gradient Descend, all training samples are evaluated for each set of parameters. While in Stochastic Gradient Descent only one training sample is evaluated for the set of parameters identified.

When should ridge regression be preferred over lasso?

We should use ridge regression when we want to use all predictors and not remove any as it reduces the coefficient values but does not nullify them.