How would you build a data pipeline?

Data pipelines are the bread and butter of machine learning engineers, who take data science models and find ways to automate and scale them. Make sure you’re familiar with the tools to build data pipelines (such as Apache Airflow) and the platforms where you can host models and pipelines (such as Google Cloud or AWS or Azure). Explain the steps required in a functioning data pipeline and talk through your actual experience building and scaling them in production.

You are asked to build a multiple regression model but your model R² isn’t as good as you wanted. For improvement, you remove the intercept term now your model R² becomes 0.8 from 0.3. Is it possible? How?

Yes, it is possible.

The intercept term refers to model prediction without any independent variable or in other words, mean prediction
R² = 1 – ∑(Y – Y´)²/∑(Y – Ymean)² where Y´ is the predicted value.
In the presence of the intercept term, R² value will evaluate your model with respect to the mean model.
In the absence of the intercept term (Ymean), the model can make no such evaluation,
With large denominator,
Value of ∑(Y – Y´)²/∑(Y)² equation becomes smaller than actual, thereby resulting in a higher value of R².

What are the components of relational evaluation techniques?

The important components of relational evaluation techniques are

Data Acquisition
Ground Truth Acquisition
Cross Validation Technique
Query Type
Scoring Metric
Significance Test

List the most popular distribution curves along with scenarios where you will use them in an algorithm.

The most popular distribution curves are as follows- Bernoulli Distribution, Uniform Distribution, Binomial Distribution, Normal Distribution, Poisson Distribution, and Exponential Distribution.
Each of these distribution curves is used in various scenarios.

Bernoulli Distribution can be used to check if a team will win a championship or not, a newborn child is either male or female, you either pass an exam or not, etc.

Uniform distribution is a probability distribution that has a constant probability. Rolling a single dice is one example because it has a fixed number of outcomes.

Binomial distribution is a probability with only two possible outcomes, the prefix ‘bi’ means two or twice. An example of this would be a coin toss. The outcome will either be heads or tails.

Normal distribution describes how the values of a variable are distributed. It is typically a symmetric distribution where most of the observations cluster around the central peak. The values further away from the mean taper off equally in both directions. An example would be the height of students in a classroom.

Poisson distribution helps predict the probability of certain events happening when you know how often that event has occurred. It can be used by businessmen to make forecasts about the number of customers on certain days and allows them to adjust supply according to the demand.

Exponential distribution is concerned with the amount of time until a specific event occurs. For example, how long a car battery would last, in months.

Define Precision and Recall.

Precision
Precision is the ratio of several events you can correctly recall to the total number of events you recall (mix of correct and wrong recalls).

Precision = (True Positive) / (True Positive + False Positive)

Recall
A recall is the ratio of a number of events you can recall the number of total events.

Recall = (True Positive) / (True Positive + False Negative)

You’re asked to build a random forest model with 10000 trees. During its training, you got training error as 0.00. But, on testing the validation error was 34.23. What is going on? Haven’t you trained your model perfectly?

The model is overfitting the data.
Training error of 0.00 means that the classifier has mimicked the training data patterns to an extent.
But when this classifier runs on the unseen sample, it was not able to find those patterns and returned the predictions with more number of errors.
In Random Forest, it usually happens when we use a larger number of trees than necessary. Hence, to avoid such situations, we should tune the number of trees using cross-validation.

What are the different methods for Sequential Supervised Learning?

The different methods to solve Sequential Supervised Learning problems are

Sliding-window methods
Recurrent sliding windows
Hidden Markow models
Maximum entropy Markow models
Conditional random fields
Graph transformer networks

How do we check the normality of a data set or a feature?

Visually, we can check it using plots. There is a list of Normality checks, they are as follow:

Shapiro-Wilk W Test
Anderson-Darling Test
Martinez-Iglewicz Test
Kolmogorov-Smirnov Test
D’Agostino Skewness Test

What is Decision Tree Classification?

A decision tree builds classification (or regression) models as a tree structure, with datasets broken up into ever-smaller subsets while developing the decision tree, literally in a tree-like way with branches and nodes. Decision trees can handle both categorical and numerical data.

What are parametric models? Give an example.

Parametric models are those with a finite number of parameters. To predict new data, you only need to know the parameters of the model. Examples include linear regression, logistic regression, and linear SVMs.

Non-parametric models are those with an unbounded number of parameters, allowing for more flexibility. To predict new data, you need to know the parameters of the model and the state of the data that has been observed. Examples include decision trees, k-nearest neighbors, and topic models using latent dirichlet analysis.