Machine Learning Interview Questions – Set 13

Do you have experience with Spark or big data tools for machine learning?

You’ll want to get familiar with the meaning of big data for different companies and the different tools they’ll want. Spark is the big data tool most in demand now, able to handle immense datasets with speed. Be honest if you don’t have experience with the tools demanded, but also take a look at job descriptions and see what tools pop up: you’ll want to invest in familiarizing yourself with them.

Mention why feature engineering is important in model building and list out some of the techniques used for feature engineering.

Algorithms necessitate features with some specific characteristics to work appropriately. The data is initially in a raw form. You need to extract features from this data before supplying it to the algorithm. This process is called feature engineering. When you have relevant features, the complexity of the algorithms reduces. Then, even if a non-ideal algorithm is used, results come out to be accurate.

Feature engineering primarily has two goals:

  • Prepare the suitable input data set to be compatible with the machine learning algorithm constraints.
  • Enhance the performance of machine learning models.
  • Some of the techniques used for feature engineering include Imputation, Binning, Outliers Handling, Log transform, grouping operations, One-Hot encoding, Feature split, Scaling, Extracting date.

What are overfitting and underfitting? Why does the decision tree algorithm suffer often with overfitting problem?What are overfitting and underfitting? Why does the decision tree algorithm suffer often with overfitting problem?

Overfitting is a statistical model or machine learning algorithm which captures the noise of the data. Underfitting is a model or machine learning algorithm which does not fit the data well enough and occurs if the model or algorithm shows low variance but high bias.

In decision trees, overfitting occurs when the tree is designed to perfectly fit all samples in the training data set. This results in branches with strict rules or sparse data and affects the accuracy when predicting samples that aren’t part of the training set.

What is Genetic Programming?

Genetic programming is one of the two techniques used in machine learning. The model is based on the testing and selecting the best choice among a set of results.

What is the main key difference between supervised and unsupervised machine learning?

Supervised learning technique needs labeled data to train the model. For example, to solve a classification problem (a supervised learning task), you need to have label data to train the model and to classify the data into your labeled groups. Unsupervised learning does not need any labelled dataset. This is the main key difference between supervised learning and unsupervised learning.

What are your favorite use cases of machine learning models?

The Quora thread below contains some examples, such as decision trees that categorize people into different tiers of intelligence based on IQ scores. Make sure that you have a few examples in mind and describe what resonated with you. It’s important that you demonstrate an interest in how machine learning is implemented.

In what real world applications is Naive Bayes classifier used?

Some of real world examples are as given below

  • To mark an email as spam, or not spam?
  • Classify a news article about technology, politics, or sports?
  • Check a piece of text expressing positive emotions, or negative emotions?
  • Also used for face recognition software

Explain the difference between KNN and k.means clustering?

  •  it is a supervised machine learning algorithm where we need to provide the labelled data to the model it then classifies the points based on the distance of the point from the nearest points.
  • Whereas, on the other hand, K-Means clustering is an unsupervised machine learning algorithm thus we need to provide the model with unlabelled data and this algorithm classifies points into clusters based on the mean of the distances between different points

What is sequence learning?

Sequence learning is a method of teaching and learning in a logical manner.

How is a decision tree pruned?

Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-down, with approaches such as reduced error pruning and cost complexity pruning.

Reduced error pruning is perhaps the simplest version: replace each node. If it doesn’t decrease predictive accuracy, keep it pruned. While simple, this heuristic actually comes pretty close to an approach that would optimize for maximum accuracy.