What is Pruning in Decision Trees, and How Is It Done?
Pruning is a technique in machine learning that reduces the size of decision trees. It reduces the complexity of the final classifier and hence improves predictive accuracy by reducing overfitting.
Pruning can occur in a:
- Top-down fashion, traversing nodes and trimming subtrees starting at the root
- Bottom-up fashion, starting at the leaf nodes
There is a popular pruning algorithm called reduced error pruning, in which:
- Starting at the leaves, each node is replaced with its most popular class
- If the prediction accuracy is not affected, the change is kept
- It has the advantage of simplicity and speed (see the pruning sketch below)
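scikit-learn does not implement reduced error pruning itself, but its cost-complexity pruning (the ccp_alpha parameter of DecisionTreeClassifier) is a related bottom-up pruning technique; a minimal sketch, assuming scikit-learn is available:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: grows until the leaves are pure, so it tends to overfit
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: a larger ccp_alpha removes subtrees that add little accuracy
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("unpruned leaves:", full_tree.get_n_leaves(), "test accuracy:", full_tree.score(X_test, y_test))
print("pruned leaves:", pruned_tree.get_n_leaves(), "test accuracy:", pruned_tree.score(X_test, y_test))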
What is the “Curse of Dimensionality?”
Searching through a solution space becomes much harder as you add more features (dimensions).
Consider the analogy of looking for a penny in a line vs. a field vs. a building. The more dimensions you have, the higher volume of data you’ll need.
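A minimal NumPy sketch (an illustration added here, not part of the original answer) of why distance-based search degrades in high dimensions: as the number of dimensions grows, the farthest random point is barely farther away than the nearest one.

import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 100, 1000):
    points = rng.random((500, d))  # 500 random points in a d-dimensional unit cube
    dists = np.linalg.norm(points - rng.random(d), axis=1)  # distances to a random query point
    # The max/min ratio shrinks toward 1 as d grows, so "nearest" loses meaning
    print(f"d={d:4d}  max/min distance ratio: {dists.max() / dists.min():.2f}")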
What is an Array?
An array is defined as a collection of similar items stored in a contiguous manner. Arrays are an intuitive concept, as the need to group similar objects together arises in our day-to-day lives, and arrays satisfy exactly that need. How are they stored in memory? Arrays consume blocks of memory, where each element in the array consumes one unit of memory. The size of the unit depends on the type of data being stored. For example, if the data type of the array's elements is int, then 4 bytes will be used to store each element; for the character data type, 1 byte will be used. This is implementation-specific, and the above sizes may vary from computer to computer.
Example:
fruits = ['apple', 'banana', 'pineapple']
In the above case, fruits is a list comprising three fruits. To access them individually, we use their indexes. Python and C are 0-indexed languages, that is, the first index is 0. MATLAB, on the contrary, starts from 1 and is thus a 1-indexed language.
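A short sketch of index-based access on the same list (the printed values follow from the list above):

fruits = ['apple', 'banana', 'pineapple']
print(fruits[0])  # 'apple' -- the first element lives at index 0
print(fruits[2])  # 'pineapple' -- the third (last) element lives at index 2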
State the limitations of Fixed Basis Function.
Linear separability in feature space doesn't imply linear separability in input space, so inputs are non-linearly transformed using vectors of basis functions with increased dimensionality. The limitations of fixed basis functions are:
- Non-linear transformations cannot remove the overlap between two classes, but they can increase it.
- It is often not clear which basis functions are the best fit for a given task, so learning the basis functions can be preferable to using fixed ones.
- If we want to use only fixed basis functions, we can use a large number of them and let the model figure out the best fit, but that would overfit the model and thereby make it unstable (see the sketch after this list).
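A minimal sketch with scikit-learn's PolynomialFeatures, a classic fixed basis: the number of basis functions explodes as the polynomial degree grows, which is exactly why "using a lot of them" invites overfitting.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(5, 10)  # 5 samples with 10 original input features

for degree in (2, 3, 4):
    expanded = PolynomialFeatures(degree=degree).fit_transform(X)
    print(f"degree {degree}: {X.shape[1]} inputs -> {expanded.shape[1]} fixed basis functions")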
Which sampling technique is most suitable when working with time-series data?
We can use a custom iterative sampling scheme in which we continuously add samples to the training set: the sample used for validation in one iteration is added to the training set of the next, and a new, later sample is used for validation.
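scikit-learn's TimeSeriesSplit implements exactly this kind of expanding-window scheme; a minimal sketch, assuming scikit-learn is available:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered observations

# Each split trains on everything seen so far and validates on the next chunk;
# earlier validation samples become part of later training sets
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train:", train_idx, "-> validate:", val_idx)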
Why is Naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
One major drawback of Naive Bayes is that it relies on a strong assumption: the features are assumed to be independent of one another (given the class), which is rarely the case in practice.
One way to improve such an algorithm that uses Naive Bayes is by decorrelating the features so that the assumption holds true.
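A minimal sketch of this idea, assuming scikit-learn: PCA yields decorrelated components, which are then fed to Gaussian Naive Bayes (the dataset and the choice of 10 components are only for illustration).

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

plain_nb = GaussianNB()
# Scale, project onto uncorrelated principal components, then apply Naive Bayes
decorrelated_nb = make_pipeline(StandardScaler(), PCA(n_components=10), GaussianNB())

print("plain NB accuracy:       ", cross_val_score(plain_nb, X, y).mean())
print("decorrelated NB accuracy:", cross_val_score(decorrelated_nb, X, y).mean())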
Which algorithm can be used for value imputation of both categorical and continuous data?
KNN is the only algorithm that can be used for imputation of both categorical and continuous variables.
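A minimal sketch with scikit-learn's KNNImputer, assuming numeric data (categorical columns would have to be encoded first):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is filled with the mean of that feature over
# the 2 nearest complete neighbours of the incomplete row
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))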
How would you explain Machine Learning to a school-going kid?
- Suppose your friend invites you to his party where you meet total strangers. Since you have no idea about them, you will mentally classify them on the basis of gender, age group, dressing, etc.
- In this scenario, the strangers represent unlabeled data and the process of classifying unlabeled data points is nothing but unsupervised learning.
- Since you didn’t use any prior knowledge about people and classified them on-the-go, this becomes an unsupervised learning problem.
We have two options for serving ads within Newsfeed: 1 – out of every 25 stories, one will be an ad; 2 – every story has a 4% chance of being an ad. For each option, what is the expected number of ads shown in 100 news stories? If we go with option 2, what is the chance a user will be shown only a single ad in 100 stories? What about no ads at all?
- The expected number of ads shown in 100 news stories for option 1 is equal to 4 (100/25 = 4).
- Similarly, for option 2, the expected number of ads shown in 100 news stories is also equal to 4 (4/100 = 1/25, which means one out of every 25 stories will be an ad on average, so 100 news stories will contain 4 ads).
- Therefore, for each option, the expected number of ads shown in 100 news stories is 4.
- The second part of the question can be solved using the binomial distribution, which is described by three quantities:
- The probability of success, which in our case is 4% (and hence a 96% probability of failure).
- The total number of trials, which is 100 in our case.
- The number of successes whose probability we want, here the number of stories shown as ads (for example, exactly one ad in 100 stories).
p(single ad) = C(100, 1) * (0.04)^1 * (0.96)^99 ≈ 0.0703
p(no ads) = (0.96)^100 ≈ 0.0169
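A quick check of both probabilities, assuming SciPy is available:

from scipy.stats import binom

n, p = 100, 0.04  # 100 stories, each with a 4% chance of being an ad

print("P(exactly one ad):", binom.pmf(1, n, p))  # about 0.0703
print("P(no ads at all):", binom.pmf(0, n, p))   # about 0.0169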
What is ensemble learning?
To solve a particular computational problem, multiple models such as classifiers or experts are strategically generated and combined. This process is known as ensemble learning.
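A minimal sketch of ensemble learning with scikit-learn (a voting ensemble is just one ensemble strategy, alongside bagging and boosting; the dataset is only for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three different "experts" trained on the same task and combined by majority vote
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
])

print("ensemble accuracy:", cross_val_score(ensemble, X, y).mean())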