What are some of the best practices for data cleaning?
Some of the best practices for data cleaning include:
- Sort data by different attributes
- For large datasets, cleanse the data stepwise, improving it with each step until you achieve good data quality
- Break large datasets into smaller chunks. Working with less data will increase your iteration speed
- To handle common cleansing tasks, create a set of utility functions, tools, or scripts. These might include remapping values based on a CSV file or SQL database, regex search-and-replace, or blanking out all values that don’t match a regex
- If you have several data cleanliness issues, rank them by estimated frequency and attack the most common problems first
- Analyze the summary statistics for each column (standard deviation, mean, number of missing values)
- Keep track of every data cleaning operation, so you can alter or remove operations if required
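A few of these practices can be sketched in Python. This is only a minimal illustration using the standard library; the function names (`remap_values`, `blank_non_matching`, `column_summary`, `apply_step`) and the sample values are hypothetical, not part of any particular tool.

```python
import re
import statistics

def remap_values(values, mapping):
    """Replace each value found in `mapping`; leave all other values unchanged."""
    return [mapping.get(v, v) for v in values]

def blank_non_matching(values, pattern):
    """Set to None every value that does not fully match the regex `pattern`."""
    compiled = re.compile(pattern)
    return [v if v is not None and compiled.fullmatch(v) else None for v in values]

def column_summary(values):
    """Mean, sample standard deviation, and missing count for a numeric column."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present) if present else None,
        "stdev": statistics.stdev(present) if len(present) > 1 else None,
        "missing": len(values) - len(present),
    }

def apply_step(log, name, func, values, *args):
    """Run one cleaning step and record its name so it can be reviewed later."""
    log.append(name)
    return func(values, *args)

# Hypothetical sample data
log = []
countries = apply_step(log, "remap country codes", remap_values,
                       ["US", "U.S.", "usa", "DE"], {"U.S.": "US", "usa": "US"})
zip_codes = apply_step(log, "blank invalid zip codes", blank_non_matching,
                       ["12345", "1234a", "98765"], r"\d{5}")
print(countries)   # ['US', 'US', 'US', 'DE']
print(zip_codes)   # ['12345', None, '98765']
print(log)         # ['remap country codes', 'blank invalid zip codes']
```

Keeping each operation as a small named function makes it easy to reorder or drop a step when the log shows it caused a problem.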
What are the important steps in the data validation process?
Data validation is performed in two steps:
Data Screening – In this step, various algorithms are used to screen the entire dataset for erroneous or questionable values. Such values need to be examined and handled.
Data Verification – In this step, each suspect value is evaluated on a case-by-case basis, and a decision is made whether to accept it as valid, reject it as invalid, or replace it with a substitute value.
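The two steps could look roughly like this in Python. This is a sketch under assumptions: the range check stands in for whatever screening algorithm is actually used, and the `decide` rule and sample ages are hypothetical.

```python
def screen(values, low, high):
    """Screening: return the indices of values that are missing or outside [low, high]."""
    return [i for i, v in enumerate(values) if v is None or not (low <= v <= high)]

def verify(values, suspects, decide):
    """Verification: apply a case-by-case decision to each suspect value.

    `decide(value)` returns ("accept", None), ("reject", None),
    or ("replace", new_value).
    """
    cleaned = list(values)
    for i in suspects:
        action, replacement = decide(values[i])
        if action == "reject":
            cleaned[i] = None
        elif action == "replace":
            cleaned[i] = replacement
        # "accept" keeps the original value
    return cleaned

def decide(value):
    """Hypothetical rule: treat 250 as a typo for 25, reject everything else."""
    if value == 250:
        return ("replace", 25)
    return ("reject", None)

ages = [34, 29, -1, 250, None, 41]       # hypothetical sample data
suspects = screen(ages, low=0, high=120)
print(suspects)                          # [2, 3, 4]
print(verify(ages, suspects, decide))    # [34, 29, None, 25, None, 41]
```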
What is aggregation and disaggregation of data?
Aggregation of data: Aggregation of data refers to the process of viewing numeric values or the measures at a higher and more summarized level of data. When you place a measure on a shelf, Tableau will automatically aggregate your data. You can determine whether the aggregation has been applied to a field or not, by simply looking at the function. This is because the function always appears in front of the field’s name when it is placed on a shelf.
Example: Sales field will become SUM(Sales) after aggregation.
You can aggregate measures using Tableau only for relational data sources. Multidimensional data sources contain aggregated data only. In Tableau, multidimensional data sources are supported only in Windows.
Disaggregation of data: Disaggregation of data allows you to view every row of the data source which can be useful while analyzing measures.
Example: Consider a scenario where you are analyzing results from a product satisfaction survey. Here the Age of participants is along one axis. Now, you can aggregate the Age field to determine the average age of participants, or you can disaggregate the data to determine the age at which the participants were most satisfied with their product.
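Outside Tableau, the same distinction can be shown with a small Python sketch. The survey rows below are hypothetical and only illustrate the difference between a summarized view and a row-level view.

```python
from statistics import mean

# Hypothetical survey rows: (age, satisfaction score from 1 to 5)
survey = [(25, 3), (31, 5), (31, 4), (47, 2), (52, 4)]

# Aggregated view: one summary number for the whole Age field, like AVG(Age)
average_age = mean(age for age, _ in survey)

# Disaggregated view: every row stays visible, so individual rows can be compared
most_satisfied_age = max(survey, key=lambda row: row[1])[0]

print(average_age)         # 37.2
print(most_satisfied_age)  # 31
```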
Explain what the K-means algorithm is.
K-means is a well-known partitioning method. Objects are classified as belonging to one of K groups, with K chosen a priori.
The K-means algorithm assumes that:
- The clusters are spherical: the data points in a cluster are centered around the centroid of that cluster
- The variance/spread of the clusters is similar: each data point is assigned to the closest cluster
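A minimal sketch of the standard iteration (often called Lloyd's algorithm) is shown below. It assumes the initial centroids are passed in explicitly and runs for a fixed number of iterations; real implementations usually choose the starting centroids randomly and stop on convergence. The function name `kmeans` and the sample points are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and updating centroids."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical sample data: two well-separated groups, K = 2
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignments = kmeans(points, centroids=[(0, 0), (10, 10)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Because each point is simply assigned to the nearest centroid, clusters that are elongated or differ greatly in spread tend to be split poorly, which is why the assumptions above matter.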
What are the problems that a Data Analyst can encounter while performing data analysis?
This is a critical data analyst interview question. A Data Analyst can confront the following issues while performing data analysis:
- Presence of duplicate entries and spelling mistakes. These errors can hamper data quality.
- Poor quality data acquired from unreliable sources. In such a case, a Data Analyst will have to spend a significant amount of time in cleansing the data.
- Data extracted from multiple sources may vary in representation. Once the collected data is combined after being cleansed and organized, the variations in data representation may cause a delay in the analysis process.
- Incomplete data is another major challenge in the data analysis process. It would inevitably lead to erroneous or faulty results.
Why Do You Want to Be a Data Analyst?
If you already have experience as a data analyst, this can be easier to answer: explain why you love working as a data analyst and why you want to continue. As a new data analyst, this question can catch you off-guard, but be prepared with an honest answer as to why you want to work in this industry. For example, you can say that you enjoy working with data, and it has always fascinated you.
Take a few minutes to explain how you would estimate how many tourists visit Paris every May.
Many interviewers ask this type of question to see an analyst’s thought process without the help of computers and data sets. After all, technology is only as good and reliable as the people behind it. In your answer, include how you identified the variables, how you communicated them, and the ideas you had for finding the answer.
This example answer touches on all these points:
“First, I would gather data on how many people live in Paris, how many tourists visit in May, and their average length of stay. I’d break down the numbers by age, gender, and income, and find the numbers on how many vacation days and bank holidays there are in France. I’d also figure out if the tourist office had any data I could look at.”
What is the purpose of trailing @ and @@? How do you use them?
The trailing @ is a line-hold specifier (when @ is followed by a column number, it acts as a column pointer instead). When we use the trailing @ at the end of an Input statement, it gives you the ability to read a part of the raw data line, test it, and decide how the additional data should be read from the same record.
- The single trailing @ tells the SAS system to “hold the line”.
- The double trailing @@ tells the SAS system to “hold the line more strongly”.
An Input statement ending with @@ instructs the program to release the current raw data line only when there are no data values left to be read from that line. The @@, therefore, holds the input record even across multiple iterations of the data step.
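The behaviour can be mimicked in Python as a rough analogy. This is not SAS code and it simplifies the real rules; both function names and the sample lines are hypothetical.

```python
def read_by_record_type(lines):
    """Analogy for the single trailing @: read part of the line, test it,
    then decide how to read the rest of the same record."""
    for line in lines:
        tokens = line.split()
        record_type = tokens[0]            # first read, line is held
        if record_type == "A":             # test the value just read
            yield (record_type, float(tokens[1]), float(tokens[2]))
        else:
            yield (record_type, float(tokens[1]))
        # the line is released here, at the end of the iteration

def read_double_trailing(lines, n_vars):
    """Analogy for the double trailing @@: keep producing observations from
    the same line across iterations until no values are left on it."""
    for line in lines:
        tokens = line.split()
        # Simplification: leftover values that do not fill an observation are dropped
        while len(tokens) >= n_vars:
            yield tuple(tokens[:n_vars])
            tokens = tokens[n_vars:]

print(list(read_by_record_type(["A 10 20", "B 5"])))
# [('A', 10.0, 20.0), ('B', 5.0)]
print(list(read_double_trailing(["1 2 3 4", "5 6"], n_vars=2)))
# [('1', '2'), ('3', '4'), ('5', '6')]
```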
How often should a data model be retrained?
A good data analyst understands the market dynamics and retrains a working data model accordingly, so that it adjusts to the new environment.
What is the KNN imputation method?
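KNN (K-nearest neighbour) is an algorithm that matches a point with its closest k neighbours in a multi-dimensional space. In KNN imputation, a missing attribute value is filled in using the values of that attribute from the k records most similar to the record with the missing value.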
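A minimal sketch of the idea in Python follows. It assumes numeric data, that only one column (`target`) has missing values, that at least one row is complete, and that the missing value is replaced by the mean of its neighbours. The function name `knn_impute` and the sample rows are hypothetical.

```python
import math

def knn_impute(rows, target, k=2):
    """Fill None entries in column `target` with the mean of that column
    over the k nearest complete rows, measured on the other columns."""
    def features(row):
        # All columns except the one being imputed
        return [v for i, v in enumerate(row) if i != target]

    complete = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        filled = list(row)
        if row[target] is None:
            # Rank complete rows by Euclidean distance and keep the k closest
            neighbours = sorted(
                complete, key=lambda r: math.dist(features(r), features(row))
            )[:k]
            filled[target] = sum(n[target] for n in neighbours) / len(neighbours)
        result.append(filled)
    return result

# Hypothetical sample data: the second column of the third row is missing
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [10.0, 100.0]]
print(knn_impute(rows, target=1, k=2))
# [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0], [10.0, 100.0]]
```

In practice the columns are usually scaled first, so that no single attribute dominates the distance.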