What does NLP stand for?

NLP stands for Natural Language Processing. It is a branch of artificial intelligence that gives machines the ability to read and understand human languages.

What distance metrics can be used in KNN?

Following distance metrics can be used in KNN.

Manhattan
Minkowski
Tanimoto
Jaccard
Mahalanobis

What is algorithm independent machine learning?

Machine learning in where mathematical foundations is independent of any particular classifier or learning algorithm is referred as algorithm independent machine learning?

You are given a data set consisting of variables having more than 30% missing values? Let’s say, out of 50 variables, 8 variables have missing values higher than 30%. How will you deal with them?

Assign a unique category to the missing values, who knows the missing values might uncover some trend.
We can remove them blatantly.
Or, we can sensibly check their distribution with the target variable, and if found any pattern we’ll keep those missing values and assign them a new category while removing others.

What are Bayesian Networks (BN) ?

Bayesian Network is used to represent the graphical model for probability relationship among a set of variables.

Mention the difference between Data Mining and Machine learning?

Machine learning relates with the study, design and development of the algorithms that give computers the capability to learn without being explicitly programmed. While, data mining can be defined as the process in which the unstructured data tries to extract knowledge or unknown interesting patterns. During this process machine, learning algorithms are used.

How is linear classifier relevant to SVM?

An svm is a type of linear classifier. If you don’t mess with kernels, it’s arguably the most simple type of linear classifier.

Linear classifiers (all?) learn linear fictions from your data that map your input to scores like so: scores = Wx + b. Where W is a matrix of learned weights, b is a learned bias vector that shifts your scores, and x is your input data. This type of function may look familiar to you if you remember y = mx + b from high school.

A typical svm loss function ( the function that tells you how good your calculated scores are in relation to the correct labels ) would be hinge loss. It takes the form: Loss = sum over all scores except the correct score of max(0, scores – scores(correct class) + 1).

You’ve built a random forest model with 10000 trees. You got delighted after getting training error as 0.00. But, the validation error is 34.23. What is going on? Haven’t you trained your model perfectly?

The model has overfitted. Training error 0.00 means the classifier has mimiced the training data patterns to an extent, that they are not available in the unseen data. Hence, when this classifier was run on unseen sample, it couldn’t find those patterns and returned prediction with higher error. In random forest, it happens when we use larger number of trees than necessary. Hence, to avoid these situation, we should tune number of trees using cross validation.

What are some key business metrics for (S-a-a-S startup | Retail bank | e-Commerce site)?

Thinking about key business metrics, often shortened as KPI’s (Key Performance Indicators), is an essential part of a data scientist’s job. Here are a few examples, but you should practice brainstorming your own.

Tip: When in doubt, start with the easier question of “how does this business make money?”

S-a-a-S startup: Customer lifetime value, new accounts, account lifetime, churn rate, usage rate, social share rate
Retail bank: Offline leads, online leads, new accounts (segmented by account type), risk factors, product affinities
e-Commerce: Product sales, average cart value, cart abandonment rate, email leads, conversion rate

Explain the handling of missing or corrupted values in the given dataset.

An easy way to handle missing values or corrupted values is to drop the corresponding rows or columns. If there are too many rows or columns to drop then we consider replacing the missing or corrupted values with some new value.

Identifying missing values and dropping the rows or columns can be done by using IsNull() and dropna( ) functions in Pandas. Also, the Fillna() function in Pandas replaces the incorrect values with the placeholder value.