Explain the differences between Random Forest and Gradient Boosting machines.
Random Forest is an ensemble of many decision trees whose predictions are pooled at the end by averaging (regression) or majority vote (classification). Gradient Boosting Machines also combine decision trees, but sequentially rather than at the end: each new tree is fit to correct the errors of the ensemble built so far. Random Forest builds each tree independently of the others, while Gradient Boosting develops one tree at a time. With carefully tuned parameters, Gradient Boosting often yields better results than Random Forest, but it is not a good option if the dataset contains a lot of outliers, anomalies, or noise, since the sequential error-correction can overfit to them. Random Forest performs well for tasks such as multiclass object detection; Gradient Boosting performs well on imbalanced data, such as real-time risk assessment.
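A minimal sketch of the two ensembles side by side, assuming scikit-learn is available; the synthetic dataset and the hyperparameters are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random Forest: trees are grown independently and their votes pooled at the end.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Gradient Boosting: trees are grown sequentially, each one fitting the
# residual errors of the ensemble built so far.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=42).fit(X_train, y_train)

print("Random Forest accuracy:    ", rf.score(X_test, y_test))
print("Gradient Boosting accuracy:", gbm.score(X_test, y_test))
```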
Which algorithms can be used for important variable selection?
Tree-based models such as Random Forest and XGBoost can be used for variable selection: train the model, plot a variable importance chart, and keep the top-ranked variables.
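A minimal sketch of this approach, assuming scikit-learn and matplotlib; the 0.02 importance threshold is an arbitrary illustration:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Plot the variable importance chart.
importances = rf.feature_importances_
plt.barh(data.feature_names, importances)
plt.xlabel("Importance")
plt.tight_layout()
plt.show()

# Keep only variables above an (illustrative) importance threshold.
selected = [name for name, imp in zip(data.feature_names, importances) if imp > 0.02]
print(selected)
```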
Why do we need a validation set and a test set?
We split the data into three different sets while creating a model:
- Training set: We use the training set to build the model and adjust its parameters. However, we cannot rely on the correctness of a model evaluated only on the data it was trained on; it might give incorrect outputs when fed new inputs.
- Validation set: We use the validation set to check the model's response on samples that do not exist in the training dataset, and we tune hyperparameters based on the performance measured on it.
When we evaluate the model's response using the validation set, we are indirectly training the model on it. This can lead to the model overfitting to that specific data, so it may not be strong enough to give the desired response on real-world data.
- Test set: The test set is a subset of the actual dataset that has not yet been used to train the model. The model is unaware of this dataset, so by using it we can compute the model's response on unseen data. We evaluate the model's final performance on the basis of the test set.
Note: We expose the model to the test dataset only after tuning the hyperparameters on the validation set.
Evaluating the model on the validation set alone is not enough, so we use the test set to measure the final efficiency of the model.
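A minimal sketch of such a three-way split, assuming scikit-learn is available; the 60/20/20 proportions are illustrative, not prescriptive:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve out the test set, then split the remainder into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

# Tune hyperparameters against (X_val, y_val); touch (X_test, y_test) only once,
# at the very end, to report final performance.
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```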
How do you deal with class imbalance in a classification problem?
Class imbalance can be dealt with in the following ways:
- Using class weights
- Using sampling techniques (over-sampling the minority class or under-sampling the majority class)
- Using SMOTE (Synthetic Minority Over-sampling Technique)
- Choosing loss functions like Focal Loss
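Here is a minimal sketch of two of these options, assuming scikit-learn and the imbalanced-learn package; the 9:1 imbalance and the choice of LogisticRegression are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# A synthetic dataset with roughly a 9:1 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Option 1: use class weights to penalize mistakes on the minority class more heavily.
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: use SMOTE to synthesize new minority-class samples before fitting.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = LogisticRegression().fit(X_res, y_res)
```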
What are the different Algorithm techniques in Machine Learning?
The different types of techniques in Machine Learning are:
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
- Transduction
- Learning to Learn
You are given a data set in which some variables have more than 30% missing values. Let's say, out of 50 variables, 8 have more than 30% of their values missing. How will you deal with them?
We can deal with them in the following ways:
- Assign a unique category to the missing values; they might reveal a trend of their own.
- Remove the affected variables outright.
- Or, sensibly check the distribution of missingness against the target variable; if a pattern is found, keep those variables, assigning the missing values a new category, while removing the others.
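A minimal sketch of the first and third options, assuming pandas; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52_000, np.nan, 61_000, np.nan, 48_000],
                   "target": [1, 0, 1, 1, 0]})

# Treat "missing" as its own category via an indicator flag; it may carry signal.
df["income_missing"] = df["income"].isna().astype(int)

# Check whether missingness relates to the target before deciding to keep
# the flag or drop the variable outright.
print(df.groupby("income_missing")["target"].mean())
```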
What is Deep Learning?
Deep Learning is a subset of machine learning that involves systems that learn from data using artificial neural networks, in a way loosely inspired by the human brain. The term 'deep' comes from the fact that such networks can have several layers.
One of the primary differences between machine learning and deep learning is that feature engineering is done manually in machine learning. In the case of deep learning, the model consisting of neural networks will automatically determine which features to use (and which not to use).
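A minimal sketch of the "deep" idea, using scikit-learn's MLPClassifier on raw pixel values with no manual feature engineering; the layer sizes here are illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Raw 8x8 pixel intensities; no hand-crafted features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers: the network learns its own internal representations.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```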
Which one is better, Naive Bayes Algorithm or Decision Trees?
Although it depends on the problem you are solving, some general advantages are as follows:
Naive Bayes:
- Works well with small datasets, whereas decision trees tend to need more data
- Less prone to overfitting
- Smaller model size and faster at processing
Decision Trees:
- Decision Trees are very flexible, easy to understand, and easy to debug
- No preprocessing or transformation of features is required
- Prone to overfitting, but you can use pruning or Random Forests to avoid that
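A minimal sketch comparing the two, assuming scikit-learn; the dataset and the depth cap (a simple stand-in for pruning) are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Naive Bayes: few parameters, fast, resists overfitting on small data.
print(cross_val_score(GaussianNB(), X, y, cv=5).mean())

# Decision Tree: flexible and interpretable; capping the depth curbs overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print(cross_val_score(tree, X, y, cv=5).mean())
```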
Pick an algorithm. Write the pseudo-code for a parallel implementation.
This kind of question tests your ability to think about parallelism and to handle concurrency in programming implementations that deal with big data. Pseudocode frameworks such as Peril-L and visualization tools such as Web Sequence Diagrams can help you write code that reflects parallelism.
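As one possible answer, here is a sketch in Python (rather than Peril-L) of parallel Random Forest training: since each tree is independent, the forest is embarrassingly parallel and trees can be fit in separate processes. This is an illustrative sketch under those assumptions, not a production implementation:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

def fit_one_tree(seed):
    # Each worker draws its own bootstrap sample and fits one tree.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), len(X))
    return DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])

if __name__ == "__main__":
    # Trees are trained concurrently; no coordination is needed until voting.
    with ProcessPoolExecutor() as pool:
        forest = list(pool.map(fit_one_tree, range(100)))

    # Prediction: majority vote across the independently trained trees.
    votes = np.stack([tree.predict(X) for tree in forest])
    y_pred = np.round(votes.mean(axis=0))
```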
Differentiate between Statistical Modeling and Machine Learning.
Machine learning models are built to make accurate predictions, for example, forecasting footfall in restaurants or stock prices, whereas statistical models are designed for inference about the relationships between variables, for example, what drives the sales in a restaurant: the food or the ambience?
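A minimal sketch of the contrast, assuming statsmodels and scikit-learn; the data are synthetic and the variable names hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({"food_quality": rng.normal(size=200),
                   "ambience": rng.normal(size=200)})
df["sales"] = 3 * df["food_quality"] + df["ambience"] + rng.normal(size=200)

# Statistical modeling: inference. Which variable drives sales, and is the
# effect significant? Inspect coefficients, p-values, confidence intervals.
ols = sm.OLS(df["sales"], sm.add_constant(df[["food_quality", "ambience"]])).fit()
print(ols.summary())

# Machine learning: prediction. How accurately can we forecast sales for new data?
rf = RandomForestRegressor(random_state=0).fit(df[["food_quality", "ambience"]], df["sales"])
print(rf.predict(df[["food_quality", "ambience"]].head()))
```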