
Data Analytics Interview Questions

Now, you can start solving the problem by considering the number of cars racing. Since there are 25 cars racing with 5 lanes, 5 races are initially conducted, with each group having 5 cars. Next, a sixth race is conducted between the winners of the first 5 races to determine the 3 fastest cars (let us say X1, Y1, and Z1).

Now, suppose X1 is the fastest among the three; that means X1 is the fastest car among all 25 cars racing. But the question is how to find the 2nd and the 3rd fastest. We cannot assume that Y1 and Z1 are 2nd and 3rd, since the remaining cars from X1’s group could be faster than Y1 and Z1. So, to determine this, a 7th race is conducted between Y1, Z1, the next two cars from X1’s group (X2, X3), and the second car from Y1’s group (Y2).

So, the cars that finish 1st and 2nd in the 7th race are actually the 2nd and 3rd fastest cars among all 25.
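The reasoning above can be checked with a short simulation; the car names and speeds below are made up for illustration, and cars are only ever compared five at a time through the race helper:

```python
import random

def fastest_three(speeds):
    """Find the 3 fastest of 25 cars using 5-lane races (7 races total)."""
    def race(cars):
        # A 'race' ranks up to 5 cars, fastest first.
        return sorted(cars, key=lambda c: speeds[c], reverse=True)

    cars = list(speeds)
    groups = [cars[i:i + 5] for i in range(0, 25, 5)]
    heats = [race(g) for g in groups]            # races 1-5
    final = race([h[0] for h in heats])          # race 6: the group winners
    x, y, z = final[0], final[1], final[2]
    x_heat = next(h for h in heats if h[0] == x)
    y_heat = next(h for h in heats if h[0] == y)
    # race 7: Y1, Z1, X2, X3, Y2 -> its top two are 2nd and 3rd overall
    decider = race([y, z, x_heat[1], x_heat[2], y_heat[1]])
    return [x, decider[0], decider[1]]

random.seed(0)
speeds = {f"car{i}": random.random() for i in range(25)}
print(fastest_three(speeds))
```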

Communication is key in any position. Specifically, with a data analyst role, you will be expected to successfully present your findings and collaborate with the team. Assure them of your ability to communicate with an answer like this:

“My greatest communication strength would have to be my ability to relay information. I’m good at speaking in a simple, yet effective manner so that even people who aren’t familiar with the terms can grasp the overall concepts. I think communication is extremely valuable in a role like this, specifically when presenting my findings so that everyone understands the overall message.”

A hash table collision happens when two different keys hash to the same value. Two items cannot be stored in the same slot of the array.

There are many techniques to avoid hash table collisions; here we list two:

• Separate Chaining: Each slot holds a secondary data structure (such as a linked list) that stores all the items hashing to that slot.
• Open Addressing: On a collision, the table searches for another slot using a second function and stores the item in the first empty slot that is found.
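As a rough sketch, separate chaining can be implemented like this (a toy illustration, not production code):

```python
class ChainedHashTable:
    """Minimal hash table using separate chaining: each slot holds a list
    (chain) of (key, value) pairs that hashed to the same index."""

    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        chain = self.slots[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:            # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))  # collision or empty slot: append to chain

    def get(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable(size=2)        # tiny table to force collisions
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    t.put(k, v)
print(t.get("c"))
```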

Final question in our data analyst interview questions and answers guide. A Data Analyst can use conditional formatting to highlight the cells having negative values in an Excel sheet. Here are the steps for conditional formatting:

• First, select the cells that have negative values.
• Now, go to the Home tab and choose the Conditional Formatting option.
• Then, go to the Highlight Cell Rules and select the Less Than option.
• In the final step, you must go to the dialog box of the Less Than option and enter “0” as the value.

The best way to answer this question is to give an example of how you have handled stress in a previous job. That way, the interviewer can get a clear picture of how well you work in stressful situations. Avoid mentioning a time when you put yourself in a needlessly stressful situation. Rather, describe a time when you were given a difficult task or multiple assignments and rose to the occasion:

“I actually work better under pressure, and I’ve found that I enjoy working in a challenging environment. I thrive under quick deadlines and multiple projects. I find that when I’m under the pressure of a deadline, I can do some of my highest quality work. For example, I once had three large projects due in the same week, which was a lot of pressure. However, because I created a schedule that detailed how I would break down each project into small assignments, I completed all three projects ahead of time and avoided additional stress.”

A data collection plan is used to collect all the critical data in a system. It covers –

• Type of data that needs to be collected or gathered
• Different data sources for analyzing a data set

Up to 80% of a data analyst’s time can be spent on cleaning data. That makes this a very important concept to understand. Even more important when you consider that, if your data is unclean and produces inaccurate insights, it could lead to costly company actions based on false information. Yikes. That could mean trouble for you.

You need to demonstrate not only that you understand the difference between messy data and clean data but also that you used that knowledge to cleanse the data. The example below shows the sort of workflow an interviewer might be looking for in your response, as well as some methods for identifying inconsistent data and cleaning it.

Just as with any other question where you’re asked to describe a situation you’ve encountered in the past, it’s a good time to employ the STAR method: situation, task, action, result.

A client of ours was unhappy with our staffing reports, so I needed to pore over one to see what was causing their chagrin. I was looking at some data in a spreadsheet that contained information about when our call center employees went to break, took lunch, etc., and I noticed that the time stamps were inconsistent: some had a.m., some had p.m., some didn’t have any specifications for morning or night, and worst of all, many of these employees were located in different time zones, so this needed to be made more consistent as well.

To solve the a.m./p.m. dilemma, I made sure all times were specified in military time. This had two benefits: first, it eliminated the strings in the data and made the whole column numeric; second, it removed any need to specify morning or night, as military time does this inherently. Next, I converted all times to UTC so that all of the data was in the same time zone. This was important for the report I was working on because otherwise the data would be presented out of order, which could cause confusion for our client. Reorganizing the report’s data this way helped improve our relationship with the client, who, due to the time discrepancies, previously believed we were understaffed at specific times of day.

Yes, we can create one Pivot Table from multiple different tables when there is a connection between these tables.

There is no right or wrong answer to this question necessarily, but it’s good to be prepared for the possibility of it coming up. Being an analytical thinker and a good problem solver are two examples of answers you could use for this type of question.

As mentioned earlier, these data analyst interview questions are just sample questions that may or may not be asked in a data analyst interview, and it would largely vary based on the skillsets and the experience level the interviewer would be looking for. So, you need to be prepared for all kinds of questions on the related topics, including probability and statistics, regression and correlation, Python, R and SAS programming, and more.

Whether you’re new at data analysis or you’re looking to further your training, Simplilearn has a variety of courses and programs available to suit your needs and goals. Two popular choices include our Business Analytics Expert Master’s Program and our Business Analytics Certification Training with Excel. We also offer specialized training for those looking to learn more about a specific aspect of data analysis, such as our Python for Data Science Certification Training Course, Data Science Certification Training – R Programming Course, and Data Science with SAS Certification Training. Enroll in one of our highly accredited programs today and get a jumpstart on your career.

The ANYDIGIT function in SAS searches a character string for the first digit. It returns the position of the first digit found, or 0 if the string contains no digits.

Joining is used when you are combining data from the same source, for example, worksheets in an Excel file or tables in an Oracle database. Blending, by contrast, requires two completely separate data sources in your report.

To explain the Alternative Hypothesis, you can first explain what the Null Hypothesis is. The Null Hypothesis is the statistical statement that there is no real effect, i.e., that any observed result could plausibly be due to chance alone; it is the statement a test seeks to reject.

After this, you can say that the Alternative Hypothesis is the statistical statement contrary to the Null Hypothesis: it holds that the observations are the result of a real effect, with some chance variation superimposed.

Missing data may lead to some critical issues; hence, imputation is the methodology that can help to avoid pitfalls. It is the process of replacing missing data with substituted values. Imputation helps in preventing list-wise deletion of cases with missing values.

When you specify a single dash between the variables, that specifies consecutively numbered variables. Similarly, if you specify a double dash between the variables, that specifies all the variables available within the dataset between them, in dataset order.

For Example:
Consider the following data set:

Data Set: ID NAME X1 X2 Y1 X3

Then, X1-X3 would return X1 X2 X3

and X1--X3 would return X1 X2 Y1 X3
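SAS resolves these variable lists natively; as a hypothetical illustration, the two selection rules can be mimicked in Python over a list of column names:

```python
def single_dash(start, end):
    """X1-X3: consecutively numbered variables sharing a prefix."""
    prefix = start.rstrip("0123456789")
    lo, hi = int(start[len(prefix):]), int(end[len(prefix):])
    return [f"{prefix}{i}" for i in range(lo, hi + 1)]

def double_dash(columns, start, end):
    """X1--X3: every variable between start and end in dataset order."""
    return columns[columns.index(start):columns.index(end) + 1]

cols = ["ID", "NAME", "X1", "X2", "Y1", "X3"]
print(single_dash("X1", "X3"))        # ['X1', 'X2', 'X3']
print(double_dash(cols, "X1", "X3"))  # ['X1', 'X2', 'Y1', 'X3']
```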

To view the underlying SQL Queries in Tableau, we mainly have two options:

• Use the Performance Recording feature: Create a Performance Recording to capture information about the main events as you interact with the workbook. Users can then view the performance metrics in a workbook created by Tableau.
Help -> Settings and Performance -> Start Performance Recording
Help -> Settings and Performance -> Stop Performance Recording
• Review the Tableau Desktop Logs: You can review the Tableau Desktop logs located at C:\Users\<username>\My Documents\My Tableau Repository. For a live connection to the data source, you can check the log.txt and tabprotosrv.txt files. For an extract, check the tdeserver.txt file.

A Pivot Table is a Microsoft Excel feature used to summarize huge datasets quickly. It sorts, reorganizes, counts, or groups data stored in a database. This data summarization includes sums, averages, or other statistics.

A question about the tools you use most is something you’ll find in almost any set of data analytics interview questions.
The most useful tools for data analysis are:

• Tableau
• KNIME
• RapidMiner
• Solver
• OpenRefine
• NodeXL
• io

It is naïve because it assumes that all features in a dataset are equally important and independent of one another, which is rarely the case in a real-world scenario.

This method imputes missing attribute values using the attribute values that are most similar to the missing ones. The similarity of two attribute values is determined using distance functions.

Time series analysis can be done in two domains: the frequency domain and the time domain. In time series analysis, the output of a particular process can be forecast by analyzing previous data with the help of various methods like exponential smoothing, the log-linear regression method, etc.

To become a data analyst, you need:

• Robust knowledge of reporting packages (Business Objects), programming languages (XML, JavaScript, or ETL frameworks), and databases (SQL, SQLite, etc.)
• Strong skills in analyzing, organizing, collecting and disseminating big data with accuracy
• Technical knowledge of database design, data models, data mining and segmentation techniques
• Strong knowledge of statistical packages for analyzing large datasets (SAS, Excel, SPSS, etc.)

From a given dataset for analysis, it is extremely important to sort the information required for data analysis. Data cleaning is a crucial step in the analysis process wherein data is inspected to find any anomalies, remove repetitive data, eliminate any incorrect information, etc. Data cleansing does not involve deleting any existing information from the database, it just enhances the quality of data so that it can be used for analysis.
Some of the best practices for data cleansing include –

• Develop a data quality plan to identify where maximum data quality errors occur, so that you can assess the root cause and design the plan accordingly.
• Follow a standard process of verifying important data before it is entered into the database.
• Identify any duplicates and validate the accuracy of the data, as this will save a lot of time during analysis.
• Track all the cleaning operations performed on the data, so that you can repeat or remove any operations as necessary.

NVL(exp1, exp2) and NVL2(exp1, exp2, exp3) are functions which check whether the value of exp1 is null or not.

If we use the NVL(exp1, exp2) function, then if exp1 is not null, the value of exp1 will be returned; else the value of exp2 will be returned. Note that exp2 must be of the same data type as exp1.

Similarly, if we use NVL2(exp1, exp2, exp3) function, then if exp1 is not null, exp2 will be returned, else the value of exp3 will be returned.
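The behavior of both functions can be sketched in Python, with None standing in for SQL NULL (a toy model of the Oracle semantics, not a replacement for them):

```python
def nvl(exp1, exp2):
    """Oracle NVL: return exp1 unless it is NULL (None here), else exp2.
    In real SQL, exp2 must share exp1's data type."""
    return exp1 if exp1 is not None else exp2

def nvl2(exp1, exp2, exp3):
    """Oracle NVL2: return exp2 when exp1 is not NULL, else exp3."""
    return exp2 if exp1 is not None else exp3

print(nvl(None, "fallback"))          # fallback
print(nvl("value", "fallback"))       # value
print(nvl2("x", "not-null", "null"))  # not-null
```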

• Prepare a validation report that gives information on all suspected data. It should include details such as the validation criteria that the data failed and the date and time of occurrence.
• Experienced personnel should examine the suspicious data to determine their acceptability.
• Invalid data should be replaced with a validation code.
• To work on missing data, use the best analysis strategy, such as the deletion method, single imputation methods, model-based methods, etc.

Time series analysis can usually be performed in two domains – the time domain and the frequency domain.
In time series analysis, the output forecast of a process is produced by analyzing data collected in the past, using techniques like exponential smoothing, the log-linear regression method, etc.

You should easily be able to demonstrate to your interviewer that you know and understand these steps, so be prepared for this question if you are asked. Be sure to not only answer with the two different steps—data validation and data verification—but also how they are performed.

There are many different types of data analysts, including operations analysts, marketing analysts, financial analysts, and more. Explain which type you prefer. Be specific in your answer to indicate to the interviewer that you’ve done your research.

You might answer something like this:

“I would prefer to work as a marketing analyst because it’s in line with my skills and interests. In addition, I have seen that the companies who hire for this role work in industries that are booming and can therefore provide good career growth.”

You need to build the following equation:

The total distance that needs to be traveled both ways is 120 miles. The average speed that we need to obtain is 40 miles per hour; therefore, the car must travel for a total of 3 hours in order to achieve that:

120 miles/40 miles per hour = 3 hours

The car has already traveled for two hours:

60 miles/30 miles per hour = 2 hours

So, on the way back it needs to travel only 1 hour. The distance is 60 miles. Hence the car needs to travel at 60 miles per hour.
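The arithmetic can be verified with a few lines:

```python
# Targets taken from the puzzle statement
total_distance = 120                      # miles, both ways
target_avg_speed = 40                     # mph, required overall average
total_time = total_distance / target_avg_speed   # 3 hours in total

time_spent = 60 / 30                      # outbound: 60 miles at 30 mph = 2 hours
time_left = total_time - time_spent       # 1 hour remains for the return leg

return_speed = 60 / time_left             # 60 miles in 1 hour -> 60 mph
print(return_speed)
```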

The different types of hypothesis testing are as follows:

• T-test: T-test is used when the standard deviation is unknown and the sample size is comparatively small.
• Chi-Square Test for Independence: These tests are used to find out the significance of the association between categorical variables in the population sample.
• Analysis of Variance (ANOVA): This kind of hypothesis testing is used to analyze differences between the means of various groups. It is used similarly to a T-test, but for more than two groups.
• Welch’s T-test: This test is used to test for equality of means between two population samples when their variances are not assumed to be equal.

The various types of data validation methods used are:

• Field Level Validation – validation is done in each field as the user enters the data to avoid errors caused by human interaction.
• Form Level Validation – In this method, validation is done once the user completes the form, before the information is saved.
• Data Saving Validation – This type of validation is performed during the saving process of the actual file or database record. This is usually done when there are multiple data entry forms.
• Search Criteria Validation – This type of validation ensures that a user’s search returns results that actually match, to a reasonable degree, what the user is looking for.

As a data analyst, you don’t specifically need experience with statistical models, unless it’s required for the job you’re applying for. If you haven’t been involved in building, using, or maintaining statistical models, be open about it and mention any knowledge or partial experience you may have.

Example
“Being a data analyst, I can’t say I’ve had direct experience building statistical models. However, I’ve helped the statistical department by making sure they have access to the proper data and analyzing it. The model in question was built with the purpose of identifying the customers who were most inclined to buy additional products and predicting when they were most likely to make that decision. My job was to establish the appropriate variables used in the model and assess its performance once it was ready.”

Standard deviation is a very popular method to measure any degree of variation in a data set. It measures the average spread of data around the mean most accurately.
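As a quick illustration, the (population) standard deviation can be computed directly from its definition; the sample data below is made up:

```python
data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]    # made-up sample
mean = sum(data) / len(data)

# population standard deviation: square root of the mean squared deviation
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = variance ** 0.5

print(std_dev)                            # ~2.4 for this data
```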

It’s normal for a data analyst to have preferences of certain tasks over others. However, you’ll most probably be expected to deal with all steps of a project – from querying and cleaning, through analyzing, to communicating findings. So, make sure you don’t show antipathy to any of the above. Instead, use this question to highlight your strengths. Just focus on the task you like performing the most and explain why it’s your favorite.

Example
“If I had to select one step as a favorite, it would be analyzing the data. I enjoy developing a variety of hypotheses and searching for evidence to support or refute them. Sometimes, while following my analytical plan, I have stumbled upon interesting and unexpected learnings from the data. I believe there is always something to be learned from the data, whether big or small, that will help me in future analytical projects.”

An Affinity Diagram is an analytical tool used to cluster or organize data into subgroups based on their relationships. These data or ideas are mostly generated in discussions or brainstorming sessions and are used to analyze complex issues.

Data Profiling focuses on analyzing individual attributes of data, thereby providing valuable information on data attributes such as data type, frequency, length, along with their discrete values and value ranges. On the contrary, data mining aims to identify unusual records, analyze data clusters, and sequence discovery, to name a few.

The standardized coefficient is interpreted in terms of standard deviation while unstandardized coefficient is measured in actual values.

The complete Hadoop Ecosystem was developed for processing large datasets in a distributed computing environment. The Hadoop Ecosystem consists of the following components.

HDFS -> Hadoop Distributed File System
YARN -> Yet Another Resource Negotiator
MapReduce -> Data processing using programming
Spark -> In-memory Data Processing
PIG, HIVE-> Data Processing Services using Query (SQL-like)
HBase -> NoSQL Database
Mahout, Spark MLlib -> Machine Learning
Apache Drill -> SQL on Hadoop
Zookeeper -> Managing Cluster
Oozie -> Job Scheduling
Flume, Sqoop -> Data Ingesting Services
Solr & Lucene -> Searching & Indexing
Ambari -> Provision, Monitor and Maintain cluster

A correlogram analysis is a common form of spatial analysis in geography. It consists of a series of estimated autocorrelation coefficients calculated for different spatial relationships. It can be used to construct a correlogram for distance-based data, when the raw data is expressed as distance rather than as values at individual points.

Various steps in an analytics project include

• Problem definition
• Data exploration
• Data preparation
• Modelling
• Validation of data
• Implementation and tracking

Data analysts require inputs from the business owners and a collaborative environment to operationalize analytics. To create and deploy predictive models in production there should be an effective, efficient and repeatable process. Without taking feedback from the business owner, the model will just be a one-and-done model.

The best way to answer this question would be to say that you would first partition the data into 3 different sets: Training, Testing and Validation. You would then show the results of the validation set to the business owner, having eliminated biases from the first 2 sets. The input from the business owner or the client will give you an idea of whether your model predicts customer churn with accuracy and provides the desired results.

According to the question, you must have country, state, profit, and sales fields in your dataset.

• Double-click on the country field.
• Drag the state and drop it into Marks card.
• Drag the sales and drop it into size.
• Drag profit and drop it into color.
• Click on size legend and increase the size.
• Right-click on the country field and select show quick filter.
• Select any country now and check the view.


To tackle multi-source problems, you need to:

• Identify similar data records and combine them into one record that will contain all the useful attributes, minus the redundancy.
• Facilitate schema integration through schema restructuring.

Shown in a box plot, the interquartile range is the difference between the lower and upper quartile, and is a measure of the dispersion of data. If you’re interviewing for a data analyst job, it’s important to be prepared with a similar answer and to answer confidently.

This question tells the interviewer if you have the hard skills needed and can provide insight into what areas you might need training in. It’s also another way to ensure basic competency. In your answer, include the software the job ad emphasized, any experience with that software you have, and use familiar terminology.

“I have a breadth of software experience. For example, at my current employer, I do a lot of ELKI data management and data mining algorithms. I can also create databases in Access and make tables in Excel.”

Pick a small family restaurant and not a chain of restaurants. This should make calculations much easier.

Then define the main parameters of the restaurant that we are talking about:

• Days of the week in which the restaurant is open
• Number of tables/seats
• Average number of visitors:
– during lunchtime;

– at dinner;

• Average expenditure:

– per client during lunch;

– per client during dinner.

The restaurant is open 6 days a week (closed on Mondays), which means it serves lunch and dinner about 25 days a month. It is a small family restaurant with around 60 seats. On average, 30 customers visit the restaurant at lunch and 40 people come to have dinner. The typical lunch menu costs 10 euro, while dinner at this restaurant costs twice that amount – 20 euro. Therefore, they are able to achieve revenues of:

25 (days) * 30 (customers) * 10 (EUR) = 7,500 EUR (lunch)

25 (days) * 40 (customers) * 20 (EUR) = 20,000 EUR (dinner)

The restaurant is able to achieve 27,500 EUR of sales. Besides the owner and his wife, 4 people work there as well. Let’s say that the 3 waiters make 2,000 EUR each and the chef makes 3,000 EUR (including social security contributions), so the cost of personnel is 9,000 EUR. Usually, food and drinks cost around one-third of the overall amount of sales, so the cost of goods sold amounts to roughly 9,167 EUR. Utility and other expenses are another 10% of sales, adding a cost of 2,750 EUR. The owners do not pay rent, because they own the place. After these calculations, the result is a monthly profit (before taxes) of roughly 6,583 EUR.
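The whole estimate can be laid out as a short calculation (treating the cost of goods sold as exactly one-third of sales, so the rounding may differ slightly from the figures quoted above):

```python
days_open = 25                     # ~6 days a week over a month

lunch_revenue = days_open * 30 * 10     # 30 covers at 10 EUR each
dinner_revenue = days_open * 40 * 20    # 40 covers at 20 EUR each
revenue = lunch_revenue + dinner_revenue    # 27,500 EUR

personnel = 3 * 2000 + 3000        # three waiters plus the chef
cogs = revenue / 3                 # food & drink ~ one-third of sales
utilities = 0.10 * revenue         # ~10% of sales; no rent (owners own the place)

profit = revenue - personnel - cogs - utilities
print(round(profit))               # monthly profit before taxes
```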

Variance and Covariance are two mathematical terms which are used frequently in statistics. Variance basically refers to how apart numbers are in relation to the mean. Covariance, on the other hand, refers to how two random variables will change together. This is basically used to calculate the correlation between variables.


Any observation that lies at an abnormal distance from other observations is known as an outlier. It indicates either a variability in the measurement or an experimental error.
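A common rule of thumb flags as outliers any points lying more than 1.5 times the interquartile range beyond the quartiles (Tukey's fences). A minimal sketch, with made-up data:

```python
import statistics

def iqr_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # lower and upper quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

readings = [10, 12, 11, 13, 12, 95, 11, 12]       # made-up data, one outlier
print(iqr_outliers(readings))                     # [95]
```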

Data analysts should have basic statistics knowledge and experience. That means you should be comfortable with calculating mean, median and mode, as well as conducting significance testing. In addition, as a data analyst, you must be able to interpret the above in connection to the business. If a higher level of statistics is required, it will be listed in the job description.

Example
“In my line of work, I’ve used basic statistics – mostly calculating means and standard deviations, as well as significance testing. The latter helped me determine the statistical significance of measurement differences between two populations for a project. I’ve also determined the relationship between 2 variables in a data set, working with correlation coefficients.”

The important Big Data analytics tools are –

• NodeXL
• KNIME
• Tableau
• Solver
• OpenRefine
• Rattle GUI
• Qlikview

The KNN imputation method seeks to fill in missing attribute values using the attribute values that are nearest to them; the similarity between two attribute values is determined using a distance function.
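A toy sketch of the idea (real libraries such as scikit-learn's KNNImputer handle scaling, ties, and weighting more carefully); the data here is made up:

```python
def knn_impute(rows, k=2):
    """Fill None entries from the k nearest complete rows (toy sketch).

    Distance is Euclidean over the columns both rows have observed;
    a missing entry is replaced by the mean of the neighbours' values.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        return sum((x - y) ** 2 for x, y in shared) ** 0.5

    complete = [r for r in rows if None not in r]
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        nearest = sorted(complete, key=lambda c: distance(row, c))[:k]
        result.append([
            x if x is not None
            else sum(n[j] for n in nearest) / len(nearest)
            for j, x in enumerate(row)
        ])
    return result

rows = [[1.0, 2.0], [1.1, 2.1], [10.0, 11.0], [1.05, None]]
print(knn_impute(rows, k=2))
```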

R-squared measures the proportion of variation in the dependent variables explained by the independent variables.

Adjusted R-squared gives the percentage of variation explained by those independent variables that in reality affect the dependent variable.
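Both quantities can be computed directly from their definitions; the observed values and predictions below are made up for illustration:

```python
def r_squared(y, y_pred):
    """R^2 = 1 - SS_res / SS_tot: share of variation explained."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalise R^2 for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = [3.0, 5.0, 7.0, 9.0]           # observed values (made up)
y_pred = [2.8, 5.1, 7.2, 8.9]      # model predictions (made up)
r2 = r_squared(y, y_pred)
print(r2, adjusted_r_squared(r2, n=4, p=1))
```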

The waterfall chart shows both the positive and negative values that lead to the final result value. For example, if you are analyzing a company’s net income, you can include all the cost values in this chart. With such a chart, you can visually see how the value moves from revenue to net income as all the costs are deducted.

In computing, a hash table is a map of keys to values. It is a data structure used to implement an associative array. It uses a hash function to compute an index into an array of slots, from which the desired value can be fetched.

Data cleaning, also referred to as data cleansing, deals with identifying and removing errors and inconsistencies from data in order to enhance its quality.

Common data quality problems include:

• A poorly formatted data file. For instance, CSV data with un-escaped newlines and commas in columns.
• Inconsistent and incomplete data, which can be frustrating to work with.
• Misspellings and duplicate entries, a common data quality problem most data analysts face.
• Different value representations and misclassified data.

A heat map is used for comparing categories with color and size. With heat maps, you can compare two different measures together. A treemap is a powerful visualization that does the same as that of the heat map. Apart from that, it is also used for illustrating hierarchical data and part-to-whole relationships.

A hierarchical clustering algorithm combines and divides existing groups, creating a hierarchical structure that showcases the order in which groups are divided or merged.

The core steps of a Data Analysis project include:

• The foremost requirement of a Data Analysis project is an in-depth understanding of the business requirements.
• The second step is to identify the most relevant data sources that best fit the business requirements and obtain the data from reliable and verified sources.
• The third step involves exploring the datasets, cleaning the data, and organizing the same to gain a better understanding of the data at hand.
• In the fourth step, Data Analysts must validate the data.
• The fifth step involves implementing and tracking the datasets.
• The final step is to create a list of the most probable outcomes and iterate until the desired results are accomplished.

Another must-know term for any data analyst, the outlier (whether multivariate or univariate), refers to a distant value that deviates from a sample’s pattern.

With a question like this, the interviewer is gaining insight into how you approach and solve problems. It also provides an idea of the type of work you have already done. Be sure to explain the event, action, and result (EAR), avoid blaming others, and explain why this project was difficult:

“My most difficult project was on endangered animals. I had to predict how many animals would survive to 2020, 2050, and 2100. Before this, I’d dealt with data that was already there, with events that had already happened. So, I researched the various habitats, the animal’s predators and other factors, and did my predictions. I have high confidence in the results.”

For the most part, this sort of question can serve as an icebreaker. However, sometimes, even if the interviewers don’t explicitly say it, they expect you to answer a more specific question: “Why do you want to be a data analyst for us?”

With these self-reflective questions, there’s not really a right answer I can offer you. There are wrong answers, though—red flags for which the employer is searching.

Answers that show you misunderstand the role are the main “wrong” answers here. Equally, an answer that makes you sound wishy-washy about data analysis can raise red flags.

A few things you probably want to get across include:

1. You love data.
2. You’ve researched the company and understand why your role as a data analyst will help it succeed.
3. You more or less understand what’s expected of your role.
4. You’re confident in your decision.

The basic syntax style of writing code in SAS is as follows:

1. Write the DATA statement which will basically name the dataset.
2. Write the INPUT statement to name the variables in the data set.
3. All the statements should end with a semi-colon.
4. There should be a proper space between word and a statement.

Data profiling is usually done to assess a dataset for its uniqueness, consistency and logic. It cannot identify incorrect or inaccurate data values.

Data mining is the process of finding relevant information which has not been found before. It is the way in which raw data is turned into valuable information.

During imputation, we replace missing data with substituted values. The types of imputation techniques are:

• Single Imputation
• Hot-deck imputation: A missing value is imputed from a randomly selected similar record (historically, drawn from a deck of punch cards)
• Cold-deck imputation: Works the same as hot-deck imputation, but is more advanced and selects donors from other datasets
• Mean imputation: Involves replacing a missing value with the mean of that variable for all other cases
• Regression imputation: Involves replacing a missing value with the predicted value of a variable based on other variables
• Stochastic regression: The same as regression imputation, but it adds the average regression variance to the regression imputation
• Multiple Imputation
• Unlike single imputation, multiple imputation estimates the values multiple times
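As a minimal illustration of single (mean) imputation from the list above, with made-up values:

```python
def mean_impute(values):
    """Single (mean) imputation: replace each None with the mean of the
    observed values. Simple, but it shrinks the variable's variance."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(mean_impute([10, None, 14, 12, None]))  # [10, 12.0, 14, 12, 12.0]
```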

Most large companies work with numerous scripting languages. So, a good command of more than one is definitely a plus. Nevertheless, if you aren’t well familiar with the main language used by the company you apply at, you can still make a good impression. Demonstrate enthusiasm to expand your knowledge, and point out that your fluency in other scripting languages gives you a solid foundation for learning new ones.

Example
“I’m most confident in using SQL, since that’s the language I’ve worked with throughout my Data Analyst experience. I also have a basic understanding of Python and have recently enrolled in a Python Programming course to sharpen my skills. So far, I’ve discovered that my expertise in SQL helps me advance in Python with ease.”

A Truth Table is a collection of facts used to determine the truth or falsity of a proposition; it works as a complete theorem-prover. It is commonly listed in three types:

• Accumulative Truth Table
• Photograph Truth Table
• Truthless Fact Table
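Setting the taxonomy aside, a truth table in propositional logic simply enumerates every truth assignment of the variables and the resulting value of the proposition. A minimal sketch, assuming the proposition is passed in as a function:

```python
# Truth-table generator: enumerate every True/False assignment of the
# named variables with itertools.product and evaluate the proposition.
from itertools import product

def truth_table(prop, names):
    """Return a list of (assignment, result) rows for proposition `prop`."""
    rows = []
    for values in product([True, False], repeat=len(names)):
        env = dict(zip(names, values))
        rows.append((values, prop(env)))
    return rows

# Example proposition: p AND (NOT q)
table = truth_table(lambda e: e["p"] and not e["q"], ["p", "q"])
for assignment, result in table:
    print(assignment, result)
```

Only the (True, False) row makes this proposition true, which the printed table confirms.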

In such a case, a data analyst needs to:

• Use data analysis strategies like deletion method, single imputation methods, and model-based methods to detect missing data.
• Prepare a validation report containing all information about the suspected or missing data.
• Scrutinize the suspicious data to assess their validity.
• Replace all the invalid data (if any) with a proper validation code.

The aim of principal component analysis is to explain the total variance in the variables, while the aim of factor analysis is to explain the covariance (the shared variance) between variables.

A Pivot Table is a simple feature in Microsoft Excel which allows you to quickly summarize huge datasets. It is really easy to use as it requires dragging and dropping rows/columns headers to create reports.

A Pivot table is made up of four different sections:

• Values Area: Values are reported in this area
• Rows Area: The headings to the left of the values area
• Columns Area: The headings at the top of the values area
• Filter Area: An optional filter used to drill down into the data set
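Conceptually, a pivot table groups a value field by a row field and a column field and aggregates it. A rough Python sketch of that idea, with hypothetical sales records (this illustrates the concept, not Excel itself):

```python
# Pivot-table concept: sum a value field ("sales") grouped by a row field
# ("region") and a column field ("product").
from collections import defaultdict

records = [
    {"region": "East", "product": "A", "sales": 100},
    {"region": "East", "product": "B", "sales": 150},
    {"region": "West", "product": "A", "sales": 200},
    {"region": "East", "product": "A", "sales": 50},
]

pivot = defaultdict(lambda: defaultdict(int))
for r in records:
    pivot[r["region"]][r["product"]] += r["sales"]  # Values area: SUM(sales)

print(dict(pivot["East"]))  # East row: A summed to 150, B to 150
```

In Excel, the drag-and-drop interface performs exactly this grouping and aggregation for you.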

Some of the best practices for data cleaning include:

• Sort data by different attributes
• For large datasets, cleanse stepwise and improve the data with each step until you achieve good data quality
• For large datasets, break them into smaller chunks; working with less data increases your iteration speed
• To handle common cleansing tasks, create a set of utility functions/tools/scripts. These might include remapping values based on a CSV file or SQL database, regex search-and-replace, or blanking out all values that don't match a regex
• If you have issues with data cleanliness, arrange them by estimated frequency and attack the most common problems first
• Analyze the summary statistics for each column (mean, standard deviation, number of missing values)
• Keep track of every data cleaning operation, so you can alter or undo changes if required

Data Validation is performed in 2 different steps-

Data Screening – In this step various algorithms are used to screen the entire data to find any erroneous or questionable values. Such values need to be examined and should be handled.

Data Verification – In this step, each suspect value is evaluated on a case-by-case basis, and a decision is made whether the value should be accepted as valid, rejected as invalid, or replaced with a substitute value.

Aggregation of data: Aggregation of data refers to the process of viewing numeric values or the measures at a higher and more summarized level of data. When you place a measure on a shelf, Tableau will automatically aggregate your data. You can determine whether the aggregation has been applied to a field or not, by simply looking at the function. This is because the function always appears in front of the field’s name when it is placed on a shelf.

Example: Sales field will become SUM(Sales) after aggregation.

You can aggregate measures using Tableau only for relational data sources. Multidimensional data sources contain aggregated data only. In Tableau, multidimensional data sources are supported only in Windows.

Disaggregation of data: Disaggregation of data allows you to view every row of the data source which can be useful while analyzing measures.

Example: Consider a scenario where you are analyzing results from a product satisfaction survey. Here the Age of participants is along one axis. Now, you can aggregate the Age field to determine the average age of participants, or you can disaggregate the data to determine the age at which the participants were most satisfied with their product.

K-means is a famous partitioning method. Objects are classified as belonging to one of K groups, with K chosen a priori.

In the K-means algorithm,

• The clusters are spherical: the data points in a cluster are centered around that cluster's mean
• The variance/spread of the clusters is similar: each data point is assigned to its closest cluster
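These assumptions can be seen in a bare-bones K-means sketch on 1-D data (the points and starting centroids below are hypothetical):

```python
# Minimal K-means on 1-D data: assign each point to its closest centroid,
# then recompute each centroid as the mean of its cluster, and repeat.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            # assignment step: closest centroid wins
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # update step: centroid becomes the mean of its members
        centroids = [sum(members) / len(members) if members else centroids[i]
                     for i, members in clusters.items()]
    return centroids

centers = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 5.0])
print(centers)  # centroids settle near the two groups of points
```

Real implementations work in many dimensions and use a distance function, but the assign-then-update loop is the same.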

A critical data analyst interview question you need to be aware of. A Data Analyst can confront the following issues while performing data analysis:

• Presence of duplicate entries and spelling mistakes. These errors can hamper data quality.
• Poor quality data acquired from unreliable sources. In such a case, a Data Analyst will have to spend a significant amount of time in cleansing the data.
• Data extracted from multiple sources may vary in representation. Once the collected data is combined after being cleansed and organized, the variations in data representation may cause a delay in the analysis process.
• Incomplete data is another major challenge in the data analysis process. It would inevitably lead to erroneous or faulty results.

If you already have experience as a data analyst, this can be easier to answer: explain why you love working as a data analyst and why you want to continue. As a new data analyst, this question can catch you off-guard, but be prepared with an honest answer as to why you want to work in this industry. For example, you can say that you enjoy working with data, and it has always fascinated you.

Many interviewers ask you this type of behavioral questions to see an analyst’s thought process without the help of computers and data sets. After all, technology is only as good and reliable as the people behind it. In your answer include: how you identified the variables, how you communicated them, and ideas you had to find the answer.

This example answer touches on all these points:

“First, I would gather data on how many people live in Paris, how many tourists visit in May, and their average length of stay. I’d break down the numbers by age, gender, and income, and find the numbers on how many vacation days and bank holidays there are in France. I’d also figure out if the tourist office had any data I could look at.”

The trailing @ is commonly known as the column pointer. When we use the trailing @ in the INPUT statement, it gives us the ability to read a part of the raw data line, test it, and decide how additional data should be read from the same record.

• The single trailing @ tells the SAS system to “hold the line”.
• The double trailing @@ tells the SAS system to “hold the line more strongly”.
An Input statement ending with @@ instructs the program to release the current raw data line only when there are no data values left to be read from that line. The @@, therefore, holds the input record even across multiple iterations of the data step.

A good data analyst would be able to understand the market dynamics and act accordingly to retain a working data model so as to adjust to the new environment.

KNN (K-nearest neighbour) is an algorithm that is used for matching a point with its closest k neighbours in a multi-dimensional space.
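A minimal KNN sketch in Python, using Euclidean distance and a majority vote over hypothetical labelled points:

```python
# k-nearest-neighbours sketch: classify a query point by majority vote
# among its k closest labelled training points.
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train is a list of ((x, y), label) pairs."""
    by_distance = sorted(train, key=lambda t: math.dist(t[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), "red"), ((1, 2), "red"), ((8, 8), "blue"),
         ((9, 8), "blue"), ((2, 1), "red")]
print(knn_predict(train, (1.5, 1.5)))  # the 3 nearest neighbours are all red
```

The same neighbour lookup is what makes KNN usable for imputation: a missing value is estimated from the records closest to it on the other variables.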

Knowing what the company wants will help you emphasize your ability to solve their problems. Do not discuss your personal goals outside of work, such as having a family or traveling around the world, in response to this question. This information is not relevant.

Instead, stick to something work-related like this:

“My long-term goals involve growing with a company where I can continue to learn, take on additional responsibilities, and contribute as much value as I can. I love that your company emphasizes professional development opportunities. I intend to take advantage of all of these.”

Although single imputation is widely used, it does not reflect the uncertainty created by data missing at random. So, multiple imputation is more favorable than single imputation when data are missing at random.

SQL is considered as one of the easiest scripting languages to learn. So, if you want to be competitive on the job market as a Data Analyst, you should be able to demonstrate excellent command of SQL. Even if you don’t have many years of experience, highlight how your skills have improved with each new project.

Example
“I’ve used SQL in at least 80% of my projects over a period of 5 years. Of course, I’ve also turned to other programming languages for the different phases of my projects. But, all in all, it’s SQL that I’ve utilized the most and consider the best for most of my data analyst tasks.”

Stories are used to narrate a sequence of events or make a business use-case. The Tableau Dashboard provides various options to create a story. Each story point can be based on a different view or dashboard, or the entire story can be based on the same visualization, just seen at different stages, with different marks filtered and annotations added.

To create a story in Tableau you can follow the below steps:

• Click the New Story tab.
• In the lower-left corner of the screen, choose a size for your story. Choose from one of the predefined sizes, or set a custom size, in pixels.
• By default, your story gets its title from its sheet name. To edit it, double-click the title. You can also change your title’s font, color, and alignment. Click Apply to view your changes.
• To start building your story, drag a sheet from the Story tab on the left and drop it into the center of the view.
• Click Add a caption to summarize the story point.
• To highlight a key takeaway for your viewers, drag a text object over to the story worksheet and type your comment.
• To further highlight the main idea of this story point, you can change a filter or sort on a field in the view, then save your changes by clicking Update above the navigator box.

There are many ways to validate datasets. Some of the most commonly used data validation methods by Data Analysts include:

• Field Level Validation – In this method, data validation is done in each field as and when a user enters the data. It helps to correct the errors as you go.
• Form Level Validation – In this method, the data is validated after the user completes the form and submits it. It checks the entire data entry form at once, validates all the fields in it, and highlights the errors (if any) so that the user can correct it.
• Data Saving Validation – This data validation technique is used during the process of saving an actual file or database record. Usually, it is done when multiple data entry forms must be validated.
• Search Criteria Validation – This validation technique is used to offer the user accurate and related matches for their searched keywords or phrases. The main purpose of this validation method is to ensure that the user’s search queries can return the most relevant results.

The fundamental steps involved in a data analysis project are –

• Get the data
• Explore and clean the data
• Validate the data
• Implement and track the data sets
• Make predictions
• Iterate

If you wish to select all the blank cells in Excel, then you can use the Go To Special Dialog Box in Excel. Below are the steps that you can follow to select all the blank cells in Excel.

• First, select the entire dataset and press F5. This will open a Go To Dialog Box.
• Click the 'Special' button, which will open a Go To Special dialog box.
• After that, select the Blanks and click on OK.
The final step will select all the blank cells in your dataset.

Logistic regression is a statistical method for examining a dataset in which one or more independent variables determine an outcome, typically a binary one.
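At the heart of logistic regression is the logistic (sigmoid) function, which squashes a linear combination of the inputs into a probability between 0 and 1. A small sketch with made-up, unfitted coefficients:

```python
# Logistic-regression prediction sketch: a weighted sum of the inputs
# passed through the sigmoid function yields a probability in (0, 1).
import math

def predict_proba(x, weights, bias):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid maps any real z into (0, 1)

# Hypothetical coefficients; in practice these are fitted to the data.
p = predict_proba([2.0, 1.0], weights=[0.8, -0.4], bias=-0.5)
print(round(p, 3))
```

Fitting the weights (e.g. by maximum likelihood) is what an actual implementation adds on top of this.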

A model does not hold any value if it cannot produce actionable results. An experienced data analyst will have a varying strategy based on the type of data being analysed. For example, if a customer complaint was retweeted, should that data be included or not? Also, any sensitive customer data needs to be protected, so it is advisable to consult with the stakeholder to ensure that you are following all the compliance regulations of the organization and disclosure laws, if any.

You can answer this question by stating that you would first consult with the stakeholder of the business to understand the objective of classifying this data. Then, you would use an iterative process by pulling new data samples and modifying the model accordingly and evaluating it for accuracy. You can mention that you would follow a basic process of mapping the data, creating an algorithm, mining the data, visualizing it and so on. However, you would accomplish this in multiple segments by considering the feedback from stakeholders to ensure that you develop an enriching model that can produce actionable results.

In simpler terms, data visualization is a graphical representation of information and data. It enables the users to view and analyze data in a smarter way and use technology to draw them into diagrams and charts.

A data scientist must have the following skills

• Database knowledge
• Database management
• Data blending
• Querying
• Data manipulation
• Predictive Analytics
• Basic descriptive statistics
• Predictive modeling
• Big Data Knowledge
• Big data analytics
• Unstructured data analysis
• Machine learning
• Presentation skill
• Data visualization
• Insight presentation
• Report design

For a data model to be considered as good and developed, it must depict the following characteristics:

• It should have predictable performance so that the outcomes can be estimated accurately, or at least, with near accuracy.
• It should be adaptive and responsive to changes so that it can accommodate the growing business needs from time to time.
• It should be capable of scaling in proportion to the changes in data.
• It should be consumable to allow clients/customers to reap tangible and profitable results.

The trick to this question is to demonstrate that you not only persuaded others of a decision, but that it was the right decision.

As a data analyst intern at my last company, we didn’t really have a modern means of transferring files between coworkers. We used flash drives. It took some work, but eventually I convinced my manager to let me research file-sharing services that would work best for our team. We tried Google Drive and Dropbox, but eventually we settled on using Sharepoint drives because it integrated well with some of the software we were already using on a daily basis, especially Excel. It definitely improved productivity and minimized the wasted time searching for who had what files at what times.

Working with large datasets and dealing with a substantial number of variables and columns is important for a lot of hiring managers. When answering the question, you don’t have to reveal background information about the project or how you managed each stage. Focus on the size and type of data.

“I believe the largest data set I’ve worked with was within a joint software development project. The data set comprised more than a million records and 600-700 variables. My team and I had to work with Marketing data which we later loaded into an analytical tool to perform EDA.”

Here, we will calculate the weeks between 31st December 2017 and 1st January 2018. 31st December 2017 was a Sunday, so 1st January 2018 was a Monday, which falls in the next (ISO) week.

• Hence, Weeks = 1 since both the days are in different weeks.
• Years = 1 since both the days are in different calendar years.
• Months = 1 since both the days are in different months of the calendar.
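You can verify the boundary with Python's datetime module, which uses ISO week numbering:

```python
# Check the week/month/year boundary between the two dates using
# ISO calendar numbering (week starts on Monday).
from datetime import date

d1 = date(2017, 12, 31)
d2 = date(2018, 1, 1)

# d1 is ISO year 2017, week 52, weekday 7 (Sunday);
# d2 is ISO year 2018, week 1, weekday 1 (Monday).
print(tuple(d1.isocalendar()))
print(tuple(d2.isocalendar()))
```

The two dates land in different ISO weeks, months, and calendar years, matching the three bullets above.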

PROC SQL is nothing but a simultaneous process for all the observations. The following steps occur when a PROC SQL gets executed:

• SAS scans each and every statement in the SQL procedure and checks the syntax errors.
• The SQL optimizer scans the query inside the statement. So, the SQL optimizer basically decides how the SQL query should be executed in order to minimize the runtime.
• If there are any tables in the FROM statement, then they are loaded into the data engine where they can then be accessed in the memory.
• Codes and Calculations are executed.
• The Final Table is created in the memory.
• The Final Table is sent to the output table described in the SQL statement.

KNN is used for missing values under the assumption that a point value can be approximated by the values of the points that are closest to it, based on other variables.

N-gram:

An n-gram is a contiguous sequence of n items from a given sequence of text or speech. It underpins a probabilistic language model for predicting the next item in such a sequence, in the form of an (n − 1)-order Markov model.
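Generating word n-grams is just a sliding window over the tokens; a minimal sketch:

```python
# n-gram extraction: each n-gram is a window of n consecutive tokens.

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox".split()
print(ngrams(words, 2))  # bigrams: three overlapping pairs of words
```

A language model then estimates the probability of each next word from counts of these windows.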

If you're an Excel pro, there is no need to recite each and every function you've used. Instead, highlight your advanced Excel skills, particularly statistical functions, pivot tables, and graphs; if you have experience with the more challenging functions, hiring managers will presume you can handle the basic ones. If you haven't attained these skills yet, it's worth investing in specialized Excel training to build a competitive skillset.

Example
“I think I’ve used Excel every day of my data analyst career in every single phase of my analytical projects. For example, I’ve checked, cleaned, and analyzed data sets using Pivot tables. I’ve also turned to statistical functions to calculate standard deviations, correlation coefficients, and others. Not to mention that the Excel graphing function is great for developing visual summaries of the data. As a case in point, I’ve worked with raw data from external vendors in many customer satisfaction surveys. First, I’d use sort functions and pivot tables to ensure the data was clean and loaded properly. In the analysis phase, I’d segment the data with pivot tables and the statistical functions, if necessary. Finally, I’d build tables and graphs for efficient visual representation.”

You can embed interactive Tableau views and dashboards into web pages, blogs, wiki pages, web applications, and intranet portals. Embedded views update as the underlying data changes, or as their workbooks are updated on Tableau Server. Embedded views follow the same licensing and permission restrictions used on Tableau Server. That is, to see a Tableau view that’s embedded in a web page, the person accessing the view must also have an account on Tableau Server.

Alternatively, if your organization uses a core-based license on Tableau Server, a Guest account is available. This allows people in your organization to view and interact with Tableau views embedded in web pages without having to sign in to the server. Contact your server or site administrator to find out if the Guest user is enabled for the site you publish to.

You can do the following to embed views and adjust their default appearance:

• Get the embed code provided with a view: The Share button at the top of each view includes embedded code that you can copy and paste into your webpage. (The Share button doesn’t appear in embedded views if you change the showShareOptions parameter to false in the code.)
• Customize the embed code: You can customize the embed code using parameters that control the toolbar, tabs, and more. For more information, see Parameters for Embed Code.
• Use the Tableau JavaScript API: Web developers can use Tableau JavaScript objects in web applications. To get access to the API, documentation, code examples, and the Tableau developer community, see the Tableau Developer Portal.

A data analyst interview question and answers guide will not complete without this question. An outlier is a term commonly used by data analysts when referring to a value that appears to be far removed and divergent from a set pattern in a sample. There are two kinds of outliers – Univariate and Multivariate.

The two methods used for detecting outliers are:

• Box plot method – According to this method, if a value lies more than 1.5*IQR (interquartile range) above the upper quartile (Q3) or more than 1.5*IQR below the lower quartile (Q1), the value is an outlier.
• Standard deviation method – This method states that if a value is higher or lower than mean ± (3*standard deviation), it is an outlier.
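The box plot rule can be sketched in a few lines of Python (the sample data is made up; statistics.quantiles requires Python 3.8+):

```python
# IQR (box plot) outlier rule: flag values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
import statistics

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 100]
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(outliers)  # only the extreme value 100 lies beyond the whiskers
```

The standard deviation method is written analogously, with the fences at mean ± 3 standard deviations.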

Since data preparation is a critical approach to data analytics, the interviewer might be interested in knowing what path you will take up to clean and transform raw data before processing and analysis. As an answer to this data analytics interview question, you should discuss the model you will be using, along with logical reasoning for it. In addition, you should also discuss how your steps would help you to ensure superior scalability and accelerated data usage.

Well, the answer to this question varies on a case-to-case basis. But, here are a few common questions that you can ask while creating a dashboard in Excel.

• Purpose of the Dashboards
• Different data sources
• Usage of the Excel Dashboard
• The frequency at which the dashboard needs to be updated
• The version of Office the client uses.

• Tableau
• RapidMiner
• OpenRefine
• KNIME
• Solver
• NodeXL
• io
• Wolfram Alpha

• The developed model should have predictable performance.
• A good data model can adapt easily to any changes in business requirements.
• Any major data changes in a good data model should be scalable.
• A good data model is one that can be easily consumed for actionable results.

Since it is easier to view and understand complex data in the form of charts or graphs, the trend of data visualization has picked up rapidly.

A measure of the dispersion of data that is shown in a box plot is referred to as the interquartile range. It is the difference between the upper and the lower quartile.

Variance and covariance are both statistical terms. Variance measures how far the values of a quantity are spread out around their mean, so it tells you only the magnitude of spread in a single quantity. On the contrary, covariance depicts how two random variables change together; thus, covariance gives both the direction and the magnitude of how two quantities vary with respect to each other.
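Both quantities are easy to compute by hand; a small sketch using the sample (n − 1) denominator:

```python
# Variance vs covariance, hand-rolled: variance is the spread of one
# variable around its mean; covariance is how two variables move together.

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(variance(x))       # spread of x around its mean
print(covariance(x, y))  # positive: x and y increase together
```

Here the covariance is positive because y grows whenever x does; a negative value would mean the two move in opposite directions.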

This question is a measure of your enthusiasm and passion for the field; it serves as a pretty good ice breaker or segue between questions. Really, about the only thing you don't want to say is that you don't have any feeling for data.

I feel that data is king. If you just think about it at a sensory level, data propels everything we do. We take sensory input such as sight, taste, sound, smell, or touch, and we convert that data into actionable insights: only we do it so fast we don’t even realize. But that’s exactly what we do. I’m just the weird type of person who stops to think about the sources of that data and wants to learn what more I can glean from data and how I can use it both more efficiently and effectively.

For hiring managers, it's important that they pick a data analyst who is not only knowledgeable but also confident enough to initiate a change that would improve the company's status quo. When talking about the recommendation you made, give as many details as possible, including your reasoning behind it. Even if the recommendation you made was not implemented, it still demonstrates that you're driven and you strive for improvement.

Example
“Although data from non-technical departments is usually handled by data analysts, I’ve worked for a company where colleagues who were not on the data analysis side had access to data. This brought on many cases of misinterpreted data that caused significant damage to the overall company strategy. I gathered examples and pointed out that working with data dictionaries can actually do more harm than good. I recommended that my coworkers depend on data analysts for data access. Once we implemented my recommendation, the cases of misinterpreted data dropped drastically.”

This question is straightforward enough. You could, theoretically, compute the solution simply by adding the numbers in sequence, like so: 1+2+3… But this is impractical and probably not what the interviewer is looking for. Fortunately, there’s a formula called a series sum: the number multiplied by one more than itself, with the result divided by 2.

n(n+1)/2

Sample answer: Thankfully, there’s a formula that can help with this: 100(100 + 1) = 10,100; 10,100 / 2 = 5,050.
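The formula is easy to sanity-check in code:

```python
# Verify the series-sum formula n(n+1)/2 against a brute-force sum.
n = 100
formula = n * (n + 1) // 2
brute = sum(range(1, n + 1))
print(formula, brute)  # both give 5050
```

The same check works for any n, which is a quick way to convince an interviewer (or yourself) that the formula holds.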

Working with numbers is not the only aspect of a data analyst job. Data analysts also need strong writing skills, so they can present the results of their analysis to management and stakeholders efficiently. If you think you are not the greatest data “storyteller”, make sure you’re making efforts in that direction, e.g. through additional training.

Example
“Over time, I’ve had plenty of opportunities to enhance my writing skills, be it through email communication with coworkers, or through writing analytical project summaries for the upper management. I believe I can interpret data in a clear and succinct manner. However, I’m constantly looking for ways to improve my writing skills even further.”

We can read the last observation to a new dataset using end = dataset option.

For example:

data example.newdataset;
  set example.olddataset end=last;
  if last;
run;
Where newdataset is a new data set to be created and olddataset is the existing data set. last is the temporary variable (initialized to 0) which is set to 1 when the set statement reads the last observation.

When there is a doubt in data or there is missing data, then:

• Make a validation report to provide information on the suspected data.
• Have an experienced personnel look at it so that its acceptability can be determined.
• Invalid data should be updated with a validation code.
• Use the best analysis strategy to work on the missing data like simple imputation, deletion method or case wise imputation.

Criteria for a good data model includes

• It can be easily consumed
• Large data changes in a good model should be scalable
• It should provide predictable performance
• A good model can adapt to changes in requirements

Dashboards are essential for managers, as they visually capture KPIs and metrics and help them track business goals. That said, data analysts are often involved in both building and updating dashboards. Some of the best tools for the purpose are Excel, Tableau, and Power BI (so make sure you’ve got a good command of those). When you talk about your experience, outline the types of data visualizations, and metrics you used in your dashboard.

Example
“In my line of work, I’ve created dashboards related to customer analytics in both Power BI and Excel. That means I used marketing metrics, such as brand awareness, sales, and customer satisfaction. To visualize the data, I operated with pie charts, bar graphs, line graphs, and tables.”

If you examine the condition in the question, you will notice a circular misplacement: since every jar is wrongly labeled, the jar labeled Black + White cannot contain the mix; it must contain either only Black balls or only White balls. So pick one ball from the jar labeled Black + White. Suppose it is a Black ball; then that jar is actually the Black jar. The jar labeled White cannot contain White balls, and it cannot be the Black jar (we have just identified that one), so it must be the Black + White jar. The remaining jar, labeled Black, must then be the White jar. Thus, by picking just one ball, you can correctly label all the jars.

Clustering is a method in which data is classified into clusters and groups. A clustering algorithm has the following properties:

• Hierarchical or flat
• Hard and soft
• Iterative
• Disjunctive

The most popular tools used in data analytics are:

• Tableau
• Konstanz Information Miner (KNIME)
• RapidMiner
• Solver
• OpenRefine
• NodeXL
• Io
• Pentaho
• SQL Server Reporting Services (SSRS)
• Microsoft data management stack

A Print Area in Excel is a range of cells that you designate to print whenever you print that worksheet. For example, if you just want to print the first 20 rows from the entire worksheet, then you can set the first 20 rows as the Print Area.

Now, to set the Print Area in Excel, you can follow the below steps:

• Select the cells for which you want to set the Print Area.
• Then, click on the Page Layout Tab.
• Click on Print Area.
• Click on Set Print Area.

The difference between data mining and data profiling is that

Data profiling: It focuses on instance-level analysis of individual attributes. It gives information on various attributes such as value range, discrete values and their frequency, occurrence of null values, data type, length, etc.

Data mining: It focuses on cluster analysis, detection of unusual records, dependencies, sequence discovery, relation holding between several attributes, etc.

Metadata refers to the detailed information about the data system and its contents. It helps to define the type of data or information that will be sorted.

At the close of the interview, most interviewers ask whether you have any questions about the job or company. It’s always a good idea to have a few ready so that you show you’ve prepared for the interview and have thought about some things relative to the company or to the role that you would like to explore further.

Questions about the role: This is a unique opportunity to learn more about what you’ll do, if it hasn’t already been thoroughly covered in the earlier part of the interview. For example:
• Can you share more about the day-to-day responsibilities of this position? What’s a typical day like?

Questions about the company or the interviewer: This is also a good opportunity to get a sense of company culture and how the company is doing.
• What’s the company organization and culture like?

It’s important to be prepared to respond effectively to the interview questions that employers typically ask at job interviews. Since these questions are so common, hiring managers and interviewers will expect you to be able to answer them smoothly and without hesitation.

You don’t need to memorize your answers to the point you sound like a robot, but do think about what you’re going to say so you’re not put on the spot during the job interview. Practice with a friend so you’re familiar and comfortable with the questions. Good luck!

Collaborative filtering is a simple algorithm to create a recommendation system based on user behavioral data. The most important components of collaborative filtering are users- items- interest.

A good example of collaborative filtering is when you see a statement like “recommended for you” on online shopping sites, which pops up based on your browsing history.
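A toy user-based collaborative filtering sketch, recommending items liked by the most similar user (the users and items below are entirely hypothetical):

```python
# User-based collaborative filtering sketch: similarity is measured by the
# overlap of liked items, and we recommend what the most similar user
# likes that the target user hasn't seen yet.

likes = {
    "ana":   {"book", "laptop", "mouse"},
    "boris": {"book", "laptop", "desk"},
    "carol": {"plant", "lamp"},
}

def recommend(user):
    others = [(len(likes[user] & likes[o]), o) for o in likes if o != user]
    _, most_similar = max(others)                # largest item overlap wins
    return likes[most_similar] - likes[user]     # items new to this user

print(recommend("ana"))  # boris is most similar, so his unseen items surface
```

Production systems replace the overlap count with similarity measures like cosine similarity over rating vectors, but the users-items-interests structure is the same.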

One of the popular data analyst interview questions. Normal distribution, better known as the Bell Curve or Gaussian curve, refers to a probability function that describes how the values of a variable are distributed in terms of their mean and standard deviation. The curve is symmetric: most of the observations cluster around the central peak, and the probabilities for values taper off equally in both directions as they move further from the mean.

Data mining is a process in which you identify patterns, anomalies, and correlations in large data sets to predict outcomes. On the other hand, data profiling lets analysts monitor and cleanse data.

Whereas data mining is concerned with collecting knowledge from data, data profiling is concerned primarily with evaluating the quality of data.

To conduct a meaningful analysis, data analysts must use both the quantitative and qualitative data available to them. In surveys, there are both quantitative and qualitative questions, so merging those 2 types of data presents no challenge whatsoever. In other cases, though, a data analyst must use creativity to find matching qualitative data. That said, when answering this question, talk about the project where the most creative thinking was required.

Example
“In my experience, I’ve performed a few analyses where I had qualitative survey data at my disposal. However, I realized I can actually enhance the validity of my recommendations by also implementing valuable data from external survey sources. So, for a product development project, I used qualitative data provided by our distributors, and it yielded great results.”

The SUM function returns the sum of non-missing arguments, whereas the “+” operator returns a missing value if any of the arguments are missing. Consider the following example.

Example:

data exampledata1;
input a b c;
cards;
44 4 4
34 3 4
34 3 4
. 1 2
24 . 4
44 4 .
25 3 1
;
run;
data exampledata2;
set exampledata1;
x = sum(a,b,c);
y = a + b + c;
run;
In the output, the value of y is missing for the 4th, 5th, and 6th observations, as we have used the “+” operator to calculate the value of y.

x y
52 52
41 41
41 41
3 .
28 .
48 .
29 29
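The same behavior can be sketched in plain Python, treating None as SAS's missing value: a SUM-style function that skips missing arguments versus a “+”-style expression that propagates them.

```python
rows = [
    (44, 4, 4), (34, 3, 4), (34, 3, 4),
    (None, 1, 2), (24, None, 4), (44, 4, None), (25, 3, 1),
]

def sas_sum(*args):
    """Like SAS SUM(): ignores missing (None) arguments."""
    present = [a for a in args if a is not None]
    return sum(present) if present else None

def plus(*args):
    """Like the SAS "+" operator: any missing argument makes the result missing."""
    return None if any(a is None for a in args) else sum(args)

for a, b, c in rows:
    print(sas_sum(a, b, c), plus(a, b, c))
# The 4th-6th rows print a value for x but None for y, as in the SAS output.
```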

Some of the common problems faced by data analysts are:

• Common misspelling
• Duplicate entries
• Missing values
• Illegal values
• Varying value representations
• Identifying overlapping data

The K-means algorithm partitions a data set into clusters such that each cluster is homogeneous and the points within it are close to each other. The algorithm tries to maintain enough separation between these clusters. Due to its unsupervised nature, the clusters have no labels.
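As a rough sketch of the idea (not a production implementation), a minimal K-means loop alternates between assigning points to the nearest centroid and moving each centroid to the mean of its cluster:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: assign each point to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(col) / len(col) for col in zip(*clusters[i]))
            if clusters[i] else centroids[i]
            for i in range(k)
        ]
    return centroids

# Two well-separated groups -> one centroid settles on each group's mean
pts = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
print(sorted(kmeans(pts, 2)))
```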

All jobs have their challenges, and your interviewer not only wants to test your knowledge on these common issues but also know that you can easily find the right solutions when available. In your answer, you can address some common issues, such as having a data file that’s poorly formatted or having incomplete data.

Data analysts often face the challenge of communicating findings to coworkers from different departments or senior management with limited understanding of data. This requires excellent skills in interpreting specific terms using non-technical language. Moreover, it also requires extra patience to listen to your coworkers’ questions and provide answers in an easy-to-digest way. Show the interviewer that you’re capable of working efficiently with people from different types of background who don’t speak your “language”.

Example
“In my work with stakeholders, it often comes down to the same challenge – facing a question I don’t have the answer to, due to limitations of the gathered data or the structure of the database. In such cases, I analyze the available data to deliver answers to the most closely related questions. Then, I give the stakeholders a basic explanation of the current data limitations and propose the development of a project that would allow us to gather the unavailable data in the future. This shows them that I care about their needs and I’m willing to go the extra mile to provide them with what they need.”

The approach to answering this question is simple. Cut the pumpkin horizontally through the center, then make 2 vertical cuts intersecting each other. This gives you your 8 equal pieces.

K-means is a partitioning technique in which objects are categorized into K groups. In this algorithm, the clusters are spherical, with the data points aligned around each cluster centre, and the variance of the clusters is similar to one another.

The most popular statistical methods used in data analytics are –

• Linear Regression
• Classification
• Resampling Methods
• Subset Selection
• Shrinkage
• Dimension Reduction
• Nonlinear Models
• Tree-Based Methods
• Support Vector Machines
• Unsupervised Learning

Well, there are various ways to handle slow Excel workbooks. Here are a few:

• Try using manual calculation mode.
• Maintain all the referenced data in a single sheet.
• Use Excel tables and named ranges.
• Use helper columns instead of array formulas.
• Try to avoid using entire rows or columns in references.
• Convert all the unused formulas to values.

Overfitting – In overfitting, a statistical model describes random error or noise instead of the underlying relationship; it occurs when a model is overly complex. An overfit model has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Underfitting – In underfitting, a statistical model is unable to capture the underlying data trend. This type of model also shows poor predictive performance.

Data Mining: Data Mining refers to the analysis of data with respect to finding relations that have not been discovered earlier. It mainly focuses on the detection of unusual records, dependencies and cluster analysis.

Data Profiling: Data Profiling refers to the process of analyzing individual attributes of data. It mainly focuses on providing valuable information on data attributes such as data type, frequency etc.

Tools used in Big Data include:

• Hive
• Pig
• Flume
• Mahout
• Sqoop

Univariate analysis refers to a descriptive statistical technique that is applied to datasets containing a single variable. The univariate analysis considers the range of values and also the central tendency of the values.

Bivariate analysis simultaneously analyzes two variables to explore the possibilities of an empirical relationship between them. It tries to determine if there is an association between the two variables and the strength of the association, or if there are any differences between the variables and what is the importance of these differences.

Multivariate analysis is an extension of bivariate analysis. Based on the principles of multivariate statistics, the multivariate analysis observes and analyzes multiple variables (two or more independent variables) simultaneously to predict the value of a dependent variable for the individual subjects.

KPI: It stands for Key Performance Indicator, a metric that consists of any combination of spreadsheets, reports or charts about business processes.

Design of experiments: It is the initial process used to split your data, sample it and set it up for statistical analysis.

80/20 rule: It means that 80 percent of your income comes from 20 percent of your clients.

The R-Squared technique is a statistical measure of the proportion of variation in the dependent variables, as explained by the independent variables. The Adjusted R-Squared is essentially a modified version of R-squared, adjusted for the number of predictors in a model. It provides the percentage of variation explained by the specific independent variables that have a direct impact on the dependent variables.
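A common formula for the adjustment (with n observations and k predictors) can be sketched as follows; note how it penalizes models that add predictors without improving fit:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared: r2 = ordinary R-squared,
    n = number of observations, k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R-squared, more predictors -> lower adjusted R-squared
print(round(adjusted_r_squared(0.80, n=50, k=2), 3))   # 0.791
print(round(adjusted_r_squared(0.80, n=50, k=10), 3))  # 0.749
```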

This question takes many forms, but the premise of it is quite simple. It’s asking you to work through a mathematical problem, usually figuring out the number of an item in a certain place, or figuring out how much of something could potentially be sold somewhere. Here are some real examples from Glassdoor:

• “How many piano tuners are in the city of Chicago?” (Quicken Loans)
• “How many windows are in New York City, by your estimation?” (Petco)
• “How many gas stations are there in the United States?” (Progressive)
The idea here is to put you in a situation where you can’t possibly know something off the top of your head, but to see you work through it anyway. That’s the trap, though. You don’t want to just give up and say, well, gee, I don’t know. As James Patounas, associate director and senior data analyst at Source One, puts it, “I have been asked something similar as well as asked something similar. I personally would not accept ‘you can’t really know’ as an answer; or, at least, I would not hire someone that thought this was a sufficient answer.”

He went on: “Mathematical modeling is typically an approximation of the real world. It is rarely an exact representation.”

Basically, you want to pull the data you do have, or at least can approximate, and work yourself through a solution. Let’s take the number of windows in New York City as an example for the sample answer below.

Note: Figures in this answer do not necessarily realistically reflect facts; they are approximations (there are actually 8.6 million people in NYC, according to 2017 data, for example).

I believe there are about 10 million people in New York, give or take a couple million. Assuming each of them lives in a residential building, with three rooms or more, if there were one window per room, that would make approximately 30 million windows. I’m making a few different assumptions that are probably inaccurate. For instance, that everyone lives alone and that the average size of their residences is just three rooms with one window per room. Obviously, there will be a lot of variations in reality. But I think, in terms of residences, 30 million windows could be close.

Then you’d have to take windows for businesses, subway rail cars, and personal vehicles. If the average subway car seats 1,000 people, with 1 window per 2 seats, that’s 500 windows per car. A little more math: I’d guess there are at least enough subway cars to support the whole population of New York: so 10 million divided by 1,000 comes out to 10,000. So there are another 5 million windows for subway cars. If half of all people own their own vehicle, that’s another six windows per person, so 30 million more windows. I’d guess there are at least 100,000 businesses with windows in NYC. Let’s just say for the sake of argument there’s an average of 10 windows each. That’s another million. I’m sure there’s way more than that.

Overall, we’re at 66 million windows (30,000,000 x 2 + 5,000,000 + 1,000,000). All of this pretty much hinges on how close I am to the actual population of New York City. Also, there are other places to find windows, such as buses or boats. But that’s a start.

Strong presentation skills are extremely valuable for any data analyst. Employers are looking for candidates who not only possess brilliant analytical skills, but also have the confidence and eloquence to present their results to different audiences, including upper-level management and executives, and non-technical coworkers. So, when talking about the audiences you’ve presented to, make sure you mention the following:

• Size of the audience;
• Whether it included executives;
• Departments and background of the audience;
• Whether the presentation was in person or remote, as the latter can be very challenging.
Example
“In my role as a Data Analyst, I have presented to various audiences made up of coworkers and clients with differing backgrounds. I’ve given presentation to both small and larger groups. I believe the largest so far has been around 30 people, mostly colleagues from non-technical departments. All of these presentations were conducted in person, except for 1 which was remote via video conference call with senior management.”

The default TCP port assigned by the official Internet Assigned Numbers Authority (IANA) for SQL Server is 1433.

Hadoop is a framework developed by Apache for processing large data sets in a distributed computing environment, and MapReduce is its programming model.

There is no difference between recall and sensitivity; they are the same metric, with the formula:

(true positive)/(true positive + false negative)
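A one-line sketch of the formula in code, with hypothetical counts:

```python
def recall(true_positive, false_negative):
    """Recall (a.k.a. sensitivity, the true positive rate)."""
    return true_positive / (true_positive + false_negative)

# Of 50 actual positives, the model caught 40 and missed 10:
print(recall(40, 10))  # 0.8
```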

It is important to be able to define the role you’re interviewing for clearly. Some of the different responsibilities of a data analyst you can use in your response include: analyzing all information related to data, creating business reports with data, and identifying areas that need improvement.

When starting an analysis, most data analysts have a rough prediction of the outcome rested on findings from previous projects. But there’s always room for surprise, and sometimes the results are completely unexpected. This question gives you a chance to talk about the types of analytical projects you’ve been involved in. Plus, it allows you to demonstrate your excitement about drawing new learnings from your projects. And don’t forget to mention the action you and the stakeholders took as a result of the unexpected outcome.

Example
“While performing routine analysis of a customer database, I was completely surprised to discover a customer subsegment that the company could target with a new suitable product and a relevant message. That presented a great opportunity for additional revenue for the company by utilizing a subset of an existing customer base. Everyone on my team was pleasantly surprised and soon enough we began devising strategies with Product Development to address the needs of this newly discovered subsegment.”

The solution to this puzzle is very simple. You just need to pick 1 coin from the 1st stack, 2 coins from the 2nd stack, 3 coins from the 3rd stack, and so on up to 10 coins from the 10th stack. Adding up the number of coins gives a total of 55.

So, if none of the coins are defective, the total weight would be 55*10 = 550 grams.

Yet, if stack 1 turns out to be defective, the total weight would be 1 gram less than 550 grams, that is 549 grams. Similarly, if stack 2 were defective, the total weight would be 2 grams less than 550 grams, that is 548 grams. You can work out the remaining 8 cases the same way.

So, just one measurement is needed to identify the defective stack.
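The arithmetic can be sketched quickly (assuming, as the puzzle implies, that a genuine coin weighs 10 grams and a defective one weighs 1 gram less):

```python
def weigh(defective_stack, coin_weight=10, defect_delta=-1, stacks=10):
    """Total weight when taking i coins from stack i (1..10) and the
    defective stack's coins each weigh 1 gram less."""
    total = coin_weight * stacks * (stacks + 1) // 2  # 55 coins * 10 g = 550 g
    return total + defect_delta * defective_stack

# The shortfall from 550 g identifies the defective stack in one weighing:
for stack in (1, 2, 10):
    print(stack, weigh(stack))  # 1 549 / 2 548 / 10 540
```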

Collaborative filtering is an algorithm that creates a recommendation system based on a user’s behavioral data. For instance, online shopping sites usually compile a list of items under “recommended for you” based on your browsing history and previous purchases. The crucial components of this algorithm are users, items, and their interests.
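A toy sketch of user-based collaborative filtering, with hypothetical users and ratings: find the most similar other user by cosine similarity, then recommend the items that user liked which the target user hasn't rated yet.

```python
import math

# Hypothetical user -> {item: rating} data
ratings = {
    "alice": {"laptop": 5, "mouse": 4, "desk": 1},
    "bob":   {"laptop": 4, "mouse": 5, "lamp": 4},
    "carol": {"lamp": 2, "monitor": 4},
}

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user):
    """Suggest items the most similar other user rated but `user` hasn't."""
    others = [(cosine(ratings[user], ratings[o]), o) for o in ratings if o != user]
    _, nearest = max(others)
    return sorted(i for i in ratings[nearest] if i not in ratings[user])

print(recommend("alice"))  # bob is most similar, so: ['lamp']
```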

The primary benefits of version control are –

• Enables comparing files, identifying differences, and merging the changes
• Allows keeping track of application builds by identifying which version is under development, QA, and production
• Helps to improve the collaborative work culture
• Keeps different versions and variants of code files secure
• Allows seeing the changes made in the file’s content
• Keeps a complete history of the project files in case of central server breakdown

Multiple sorting refers to sorting one column and then sorting another column while keeping the first column’s order intact. In Excel, you can definitely sort multiple columns at one time.

To do multiple sorting, you need to use the Sort Dialog Box. Now, to get this, you can select the data that you want to sort and then click on the Data Tab. After that, click on the Sort icon.

In this Dialog box, you can specify the details for one column, and then sort to another column, by clicking on the Add Level button.

Some of the vital Python libraries used in Data Analysis include –

• Bokeh
• Matplotlib
• NumPy
• Pandas
• SciKit
• SciPy
• Seaborn
• TensorFlow
• Keras

As the name suggests, Data Validation is the process of validating data. This step mainly involves two processes: Data Screening and Data Verification.

• Data Screening: Different kinds of algorithms are used in this step to screen the entire data to find out any inaccurate values.
• Data Verification: Each and every suspected value is evaluated on various use-cases, and then a final decision is taken on whether the value has to be included in the data or not.

This data analyst interview question tests your knowledge about the required skill set to become a data analyst.
To become a data analyst, you need to:

• Be well-versed with programming languages (e.g., XML, JavaScript, or ETL frameworks), databases (SQL, SQLite, Db2, etc.), and also have extensive knowledge of reporting packages (Business Objects).
• Be able to analyze, organize, collect and disseminate Big Data efficiently.
• Have substantial technical knowledge in fields like database design, data mining, and segmentation techniques.
• Have a sound knowledge of statistical packages for analyzing massive datasets, such as SAS, Excel, and SPSS, to name a few.

Well, the answer to this question may vary from person to person. But below are a few criteria that I think must be considered to decide whether a developed data model is good or not:

• A model developed for the dataset should have predictable performance. This is required to predict the future.
• A model is said to be a good model if it can easily adapt to changes according to business requirements.
• If the data gets changed, the model should be able to scale according to the data.
• The model developed should also be easily consumed by clients to produce actionable and profitable results.

Map-reduce is a framework to process large data sets, splitting them into subsets, processing each subset on a different server and then blending results obtained on each.
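The idea can be sketched with a toy word count: a map phase emits intermediate pairs for each chunk (subset), and a reduce phase blends the per-chunk results into the final answer.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for one subset of the data."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: blend the per-chunk results into final counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each "server" processes its own chunk; results are then merged.
chunks = ["big data big insight", "big wins"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))  # {'big': 3, 'data': 1, 'insight': 1, 'wins': 1}
```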

The main advantages of version control are –

• It allows you to compare files, identify differences, and consolidate the changes seamlessly.
• It helps to keep track of application builds by identifying which version is under which category – development, testing, QA, and production.
• It maintains a complete history of project files that comes in handy if ever there’s a central server breakdown.
• It is excellent for storing and maintaining multiple versions and variants of code files securely.
• It allows you to see the changes made in the content of different files.

Data analysis deals with collecting, inspecting, cleansing, transforming and modelling data to glean valuable insights and support better decision making in an organization. The various steps involved in the data analysis process include –

Data Exploration –

Having identified the business problem, a data analyst has to go through the data provided by the client to analyse the root cause of the problem.

Data Preparation

This is the most crucial step of the data analysis process wherein any data anomalies (like missing values or detecting outliers) with the data have to be modelled in the right direction.

Data Modelling

The modelling step begins once the data has been prepared. Modelling is an iterative process wherein the model is run repeatedly for improvements. Data modelling ensures that the best possible result is found for a given business problem.

Validation

In this step, the model provided by the client and the model developed by the data analyst are validated against each other to find out if the developed model will meet the business requirements.

Database Management System (DBMS) is a software application that interacts with the user, applications and the database itself to capture and analyze data. The data stored in the database can be modified, retrieved and deleted, and can be of any type like strings, numbers, images etc.

There are mainly 4 types of DBMS, which are Hierarchical, Relational, Network, and Object-Oriented DBMS.

Hierarchical DBMS: As the name suggests, this type of DBMS has a predecessor-successor style of relationship, with a structure similar to that of a tree, wherein the nodes represent records and the branches of the tree represent fields.
Relational DBMS (RDBMS): This type of DBMS uses a structure that allows users to identify and access data in relation to another piece of data in the database.
Network DBMS: This type of DBMS supports many to many relations wherein multiple member records can be linked.
Object-oriented DBMS: This type of DBMS uses small individual software called objects. Each object contains a piece of data and the instructions for the actions to be done with the data.

The missing patterns that are generally observed are

• Missing completely at random
• Missing at random
• Missing that depends on the missing value itself
• Missing that depends on unobserved input variable

The criteria that define a good data model are:

• It is intuitive.
• Its data can be easily consumed.
• The data changes in it are scalable.
• It can evolve and support new business cases.

There are quite a few answers you can give to this question, so be prepared to answer without much hesitation. Some of the examples you should give to your interviewer include the simplex algorithm, Markov process, and Bayesian method.

A data analyst is usually seen as a professional with a technical background and excellent math and statistical skills. However, even though creativity is not the first data analyst quality that comes to your mind, it’s still important in developing analytical plans and data visualizations, and even finding unorthodox solutions to data issues. That said, provide an answer with examples of your out-of-the-box way of thinking.

Example
“I can say creativity can make all the difference in a data analyst’s work. In my personal experience, it has helped me find intriguing ways to present analysis results to clients. Moreover, it has helped me devise new data checks that identify issues resulting in anomalous results during data analysis.”

The solution to the above problem can be as follows:

The relative velocity of the two buses approaching each other = (40 + 40) km/hr = 80 km/hr.
The time taken for the buses to collide = 80 km / 80 km/hr = 1 hour.
The total distance traveled by the bird = 100 km/hr * 1 hr = 100 km.
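The same computation, spelled out (the 80 km gap between the buses is taken from the puzzle setup):

```python
bus_speed = 40       # km/h, each bus
gap = 80             # km between the buses
bird_speed = 100     # km/h

closing_speed = bus_speed * 2            # 80 km/h
time_to_collide = gap / closing_speed    # 1 hour
print(bird_speed * time_to_collide)      # 100.0 km flown by the bird
```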

This is a pretty straightforward question, aiming to assess if you have industry-specific skills and experience. Even if you don’t, make sure you’ve prepared an answer in advance where you explain how you can apply your background skills from a different field to the benefit of the company.

Example
“As a data analyst with financial background, I can say there are a few similarities between this industry and healthcare. I think the most prominent one is data security. Both industries utilize highly sensitive personal data that must be kept secure and confidential. This leads to 2 things: more restricted access to data, and, consequently, more time to complete its analysis. This has taught me to be more time efficient when it comes to passing through all the security. Moreover, I learned how important it is to clearly state the reasons behind requiring certain data for my analysis.”

The statistical methods that are mostly used by data analysts are:

• Bayesian method
• Markov process
• Simplex algorithm
• Imputation
• Spatial and cluster processes
• Rank statistics, percentile, outliers detection
• Mathematical optimization

Collaborative filtering is a technique used by recommender systems to make automatic predictions about (i.e., filter) a user’s interests. This is achieved by collecting preference information from many users.

A/B testing is the statistical hypothesis testing for a randomized experiment with two variables A and B. Also known as the split testing, it is an analytical method that estimates population parameters based on sample statistics. This test compares two web pages by showing two variants A and B, to a similar number of visitors, and the variant which gives better conversion rate wins.

The goal of A/B testing is to measure the effect of any changes to a web page. For example, if you have spent an ample amount of money on a banner ad, you can find out the return on investment, i.e. the click-through rate of the banner ad.
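A common way to check whether the difference in conversion rates between the two variants is statistically meaningful is a two-proportion z-test; here is a minimal sketch with hypothetical visitor counts:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for comparing conversion rates of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant B converts 120/1000 visitors vs A's 100/1000:
z = two_proportion_z(100, 1000, 120, 1000)
print(round(z, 2))  # |z| > 1.96 would be significant at the 5% level
```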

This is the most commonly asked data analyst interview question. You must have a clear idea as to what your job entails.
A data analyst is required to perform the following tasks:

• Collect and interpret data from multiple sources and analyze results.
• Filter and “clean” data gathered from multiple sources.
• Offer support to every aspect of data analysis.
• Analyze complex datasets and identify the hidden patterns in them.
• Keep databases secured.

Business data keeps changing on a day-to-day basis, but the format doesn’t change. As and when a business operation enters a new market, sees a sudden rise in competition, or sees its own position rising or falling, it is recommended to retrain the model. So, as and when the business dynamics change, the model should be retrained with the changing behaviors of customers.

Clustering is a method of grouping that is applied to data. A clustering algorithm divides a data set into natural groups or clusters.

Properties for clustering algorithm are

• Hierarchical or flat
• Iterative
• Hard and soft
• Disjunctive

The responsibilities of a data analyst include:

• Provide support to all data analysis and coordinate with customers and staff
• Resolve business-associated issues for clients and perform audits on data
• Analyze results, interpret data using statistical techniques and provide ongoing reports
• Prioritize business needs and work closely with management on information needs

• Identify new process or areas for improvement opportunities
• Analyze, identify and interpret trends or patterns in complex data sets
• Acquire data from primary or secondary data sources and maintain databases/data systems
• Filter and “clean” data, and review computer reports
• Determine performance indicators to locate and correct code problems
• Securing database by developing access system by determining user level of access

Data Profiling, also referred to as Data Archeology, is the process of assessing the data values in a given dataset for uniqueness, consistency and logic. Data profiling cannot fix incorrect or inaccurate data; it can only detect business rule violations or anomalies. The main purpose of data profiling is to find out whether the existing data can be used for various other purposes.

Data Mining refers to the analysis of datasets to find relationships that have not been discovered earlier. It focuses on sequenced discoveries or identifying dependencies, bulk analysis, finding various types of attributes, etc.

Normalization is the process of organizing data to avoid duplication and redundancy. There are many successive levels of normalization. These are called normal forms. Each consecutive normal form depends on the previous one. The first three normal forms are usually adequate.

• First Normal Form (1NF) – No repeating groups within rows
• Second Normal Form (2NF) – Every non-key (supporting) column value is dependent on the whole primary key.
• Third Normal Form (3NF) – Dependent solely on the primary key and no other non-key (supporting) column value.
• Boyce-Codd Normal Form (BCNF) – BCNF is the advanced version of 3NF. A table is in BCNF if it is in 3NF and, for every functional dependency X -> Y, X is a super key of the table.

The advantages of normalization include:
• Better Database organization
• More Tables with smaller rows
• Efficient data access
• Greater Flexibility for Queries
• Quickly find the information
• Easier to implement Security
• Allows easy modification
• Reduction of redundant and duplicate data
• More Compact Database
• Ensure Consistent data after modification

In KNN imputation, missing attribute values are imputed using the values from the records that are most similar to the record whose values are missing. The similarity of two records is determined using a distance function.
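A minimal sketch of the idea, with a tiny hypothetical dataset: distances are computed over the non-missing columns, and the missing value is filled with the mean of its k nearest complete rows.

```python
import math

def knn_impute(rows, target_row, target_col, k=2):
    """Fill rows[target_row][target_col] with the average of that column
    among the k nearest complete rows (distance over the other columns)."""
    incomplete = rows[target_row]
    candidates = []
    for idx, row in enumerate(rows):
        if idx == target_row or row[target_col] is None:
            continue
        dist = math.sqrt(sum(
            (a - b) ** 2
            for col, (a, b) in enumerate(zip(incomplete, row))
            if col != target_col and a is not None and b is not None
        ))
        candidates.append((dist, row[target_col]))
    nearest = sorted(candidates)[:k]
    return sum(v for _, v in nearest) / len(nearest)

data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [8.0, 9.0, 50.0],
    [1.05, 2.05, None],   # missing value to impute
]
print(knn_impute(data, target_row=3, target_col=2))  # 11.0 (mean of rows 1-2)
```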

A t-test is usually used when we have a sample size of less than 30, and a z-test when we have a sample size greater than 30.

If you’re interviewing for a data analyst job, you’ll likely be asked this question, and it’s one that your interviewer will expect you to answer quickly, so be prepared. Be sure to go into detail and list and describe the different steps of a typical data analyst process. These steps include data exploration, data preparation, data modeling, validation, and implementation of the model and tracking.

When answering this question, keep in mind that the hiring manager would like to hear something different than “communication skills”. Think of an approach you’ve used in your role as a data analyst to improve the quality of work in a cross-functional team.

Example
“I think the role of a data analyst goes beyond explaining technical terms in a non-technical language. I always strive to gain a deeper understanding of the work of my colleagues, so I can bridge my explanation of statistical concepts to the specific parts of the business they deal with, and how these concepts relate to the tasks at hand they need to solve.”

There are 5 basic best practices for data cleaning:

• Make a data cleaning plan by understanding where the common errors take place and keep communications open.
• Standardise the data at the point of entry. This way it is less chaotic and you will be able to ensure that all information is standardised, leading to fewer errors on entry.
• Focus on the accuracy of the data. Maintain the value types of data, provide mandatory constraints and set cross-field validation.
• Identify and remove duplicates before working with the data. This will lead to an effective data analysis process.
• Create a set of utility tools/functions/scripts to handle common data cleaning tasks.

Hiring managers appreciate a candidate who is serious about advancing their career options through additional qualifications. Certificates prove that you have put in the effort to master new skills and knowledge of the latest analytical tools and subjects. While answering the question, list the certificates you have acquired and briefly explain how they’ve helped you boost your data analyst career. If you haven’t earned any certifications so far, make sure you mention the ones you’d like to work towards and why.

Example
“I’m always looking for ways to upgrade my analytics skillset. This is why I recently earned a certification in Customer Analytics in Python. The training and requirements to finish it really helped me sharpen my skills in analyzing customer data and predicting the purchase behavior of clients.”

An n-gram is a contiguous sequence of n items from a given text or speech. More precisely, an n-gram model is a probabilistic language model used to predict the next item in a sequence based on the previous (n-1) items.

Yes, I have a fair idea of the job responsibilities of a data analyst. Their primary responsibilities are –

• To work in collaboration with IT, management and/or data scientist teams to determine organizational goals
• To dig data from primary and secondary sources
• To clean the data and discard irrelevant information
• To perform data analysis and interpret results using standard statistical methodologies
• To highlight changing trends, correlations and patterns in complicated data sets
• To strategize process improvement
• To ensure clear data visualizations for management

The differences between univariate, bivariate and multivariate analysis are as follows:

• Univariate: A descriptive statistical technique that involves only one variable at a given instance of time.
• Bivariate: This analysis examines two variables at a time to explore the relationship between them.
• Multivariate: The study of more than two variables is nothing but multivariate analysis. This analysis is used to understand the effect of variables on the responses.

If you are sitting for a data analyst job, this is one of the most frequently asked data analyst interview questions.
Data cleansing primarily refers to the process of detecting and removing errors and inconsistencies from the data to improve data quality.
The best ways to clean data are:

• Segregating data, according to their respective attributes.
• Breaking large chunks of data into small datasets and then cleaning them.
• Analyzing the statistics of each data column.
• Creating a set of utility functions or scripts for dealing with common cleaning tasks.
• Keeping track of all the data cleansing operations to facilitate easy addition or removal from the datasets, if required.

Box plot method: if a value is higher than the upper quartile (Q3) plus 1.5*IQR (inter-quartile range), or lower than the lower quartile (Q1) minus 1.5*IQR, it is considered an outlier.

Standard deviation method: if a value is higher or lower than mean ± (3*standard deviation), it is considered an outlier.
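Both rules can be sketched with the standard library (the sample data is hypothetical):

```python
import statistics

def outliers_iqr(values):
    """Box-plot rule: flag values beyond 1.5*IQR outside Q1..Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

def outliers_std(values):
    """Standard-deviation rule: flag values beyond mean ± 3*stdev."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > 3 * sd]

data = [10, 12, 11, 13, 12, 11, 10, 95]
print(outliers_iqr(data))  # [95]
print(outliers_std(data))  # the 3-sigma rule is less sensitive on small samples
```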

The following are a few problems that are usually encountered while performing data analysis.

• The presence of duplicate entries and spelling mistakes reduces data quality.
• If you are extracting data from a poor source, you will have to spend a lot of time cleaning the data.
• Data extracted from different sources may vary in representation; combining such data can lead to mismatches and delays.
• Lastly, incomplete data can make it difficult to perform a reliable analysis.

Statistical methods that are useful for data scientist are

• Bayesian method
• Markov process
• Spatial and cluster processes
• Rank statistics, percentile, outliers detection
• Imputation techniques, etc.
• Simplex algorithm
• Mathematical optimization
