When do you think you should retrain a model? Is it dependent on the data?
Business data changes on a day-to-day basis, but its format usually doesn't. Whenever the business dynamics shift, for example when an operation enters a new market, faces a sudden rise in competition, or sees its own position rising or falling, it is recommended to retrain the model so that it reflects the changing behavior of customers.
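One simple way to detect such a change is to test whether incoming data has drifted away from the data the model was trained on. Below is a minimal sketch (the feature and threshold are hypothetical, and the two-sample Kolmogorov-Smirnov test is just one of several drift tests you could use):

```python
import numpy as np
from scipy.stats import ks_2samp

def should_retrain(train_feature, live_feature, alpha=0.05):
    """Flag retraining when a feature's live distribution drifts
    away from the distribution the model was trained on."""
    statistic, p_value = ks_2samp(train_feature, live_feature)
    # A small p-value suggests the two samples come from different
    # distributions, i.e. customer behavior has shifted.
    return p_value < alpha

# Hypothetical example: customer spend before and after a market shift
rng = np.random.default_rng(0)
train_spend = rng.normal(100, 15, size=5000)  # behavior at training time
live_spend = rng.normal(120, 20, size=5000)   # behavior after the shift
print(should_retrain(train_spend, live_spend))  # True -> consider retraining
```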
What is clustering? What are the properties of clustering algorithms?
Clustering is an unsupervised learning method: a clustering algorithm divides a data set into natural groups, or clusters, of similar observations (see the sketch after the list below).
Properties of clustering algorithms include:
- Hierarchical or flat: clusters are either nested in a tree structure or form a single, flat partition
- Iterative: cluster assignments are refined over repeated passes
- Hard or soft: each point belongs to exactly one cluster, or to several with degrees of membership
- Disjunctive: an object may be allowed to belong to more than one cluster
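A minimal sketch using scikit-learn's k-means on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn from 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k-means illustrates the flat, iterative, hard case: assignments are
# refined over repeated passes, and each point lands in exactly one cluster.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster index for the first 10 points
print(kmeans.cluster_centers_)  # one centroid per cluster
```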
What are the responsibilities of a data analyst?
Responsibilities of a data analyst include:
- Provide support for all data analysis and coordinate with customers and staff
- Resolve business-related issues for clients and perform audits on data
- Analyze results, interpret data using statistical techniques, and provide ongoing reports
- Prioritize business and information needs and work closely with management
- Identify new processes or areas for improvement opportunities
- Analyze, identify, and interpret trends or patterns in complex data sets
- Acquire data from primary or secondary data sources and maintain databases/data systems
- Filter and “clean” data, and review computer reports
- Determine performance indicators to locate and correct code problems
- Secure databases by developing access systems and determining user levels of access
What is the difference between Data Mining and Data Profiling?
Data Profiling, also referred to as Data Archeology, is the process of assessing the values in a given dataset for uniqueness, consistency, and logic. Data profiling cannot identify incorrect or inaccurate data; it can only detect business rule violations or anomalies. The main purpose of data profiling is to find out whether the existing data can be used for other purposes.
Data Mining refers to the analysis of datasets to find relationships that have not been discovered earlier. It focuses on sequential discoveries, identifying dependencies, bulk analysis, finding various types of attributes, and so on.
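As an illustration of profiling, a minimal pandas sketch (the dataset and rules are hypothetical) that checks values for uniqueness, completeness, and rule violations:

```python
import pandas as pd

# Hypothetical dataset to be profiled
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", "b@x.com", None, "not-an-email"],
    "age": [34, 29, 29, -5],
})

# Uniqueness: is the supposed key actually unique?
print("duplicate ids:", df["customer_id"].duplicated().sum())

# Completeness: how many values are missing per column?
print(df.isna().sum())

# Logic/consistency: flag values that violate simple business rules
print("invalid ages:", (df["age"] < 0).sum())
print("bad emails:", (~df["email"].str.contains("@", na=False)).sum())
```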
What is Normalization? Explain different types of Normalization with advantages.
Normalization is the process of organizing data to avoid duplication and redundancy. There are many successive levels of normalization. These are called normal forms. Each consecutive normal form depends on the previous one. The first three normal forms are usually adequate.
- First Normal Form (1NF) – No repeating groups within rows
- Second Normal Form (2NF) – Every non-key (supporting) column value is dependent on the whole primary key.
- Third Normal Form (3NF) – Every non-key (supporting) column value depends solely on the primary key and on no other non-key column.
- Boyce-Codd Normal Form (BCNF) – BCNF is an advanced version of 3NF. A table is in BCNF if it is in 3NF and, for every functional dependency X → Y, X is a super key of the table.
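As a worked illustration, a minimal pandas sketch (the tables and columns are hypothetical) of decomposing a denormalized orders table; moving the customer columns into their own table removes the redundancy that 2NF/3NF target:

```python
import pandas as pd

# Hypothetical denormalized table: customer details repeat on every order
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 20],
    "customer_name": ["Ada", "Ada", "Bob"],     # depends only on customer_id,
    "customer_city": ["Pune", "Pune", "Oslo"],  # not on the order itself
    "amount": [250, 120, 300],
})

# Decompose toward 3NF: customer attributes move to their own table
customers = (orders[["customer_id", "customer_name", "customer_city"]]
             .drop_duplicates())
orders_normalized = orders[["order_id", "customer_id", "amount"]]

print(customers)           # one row per customer, no repetition
print(orders_normalized)   # orders reference customers by key
```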
Some of the advantages of normalization are:
- Better database organization
- More tables with smaller rows
- Efficient data access
- Greater flexibility for queries
- Quickly finding the information you need
- Easier implementation of security
- Easy modification
- Reduction of redundant and duplicate data
- More compact database
- Consistent data after modification
What is the KNN imputation method?
In KNN imputation, missing values are imputed using the values from the k records most similar to the record with missing data. The similarity of two records is determined by a distance function.
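For example, scikit-learn ships a KNN imputer; a minimal sketch on hypothetical data:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data with missing values encoded as np.nan
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that feature across the
# 2 nearest neighbors, found with a nan-aware Euclidean distance.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```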
What is the condition for using a t-test or a z-test?
A t-test is usually used when we have a sample size of less than 30 (or the population standard deviation is unknown), and a z-test when we have a sample size greater than 30 and the population standard deviation is known.
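A minimal sketch with simulated data, using scipy for the t-test and statsmodels for the z-test (the hypothesized mean of 100 is arbitrary):

```python
import numpy as np
from scipy.stats import ttest_1samp
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(1)

# Small sample (n < 30): t-test against a hypothesized mean of 100
small_sample = rng.normal(105, 10, size=20)
t_stat, t_p = ttest_1samp(small_sample, popmean=100)
print(f"t-test: t={t_stat:.2f}, p={t_p:.3f}")

# Large sample (n > 30): z-test against the same hypothesized mean
large_sample = rng.normal(105, 10, size=200)
z_stat, z_p = ztest(large_sample, value=100)
print(f"z-test: z={z_stat:.2f}, p={z_p:.3f}")
```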
What Does the Standard Data Analysis Process Look Like?
If you’re interviewing for a data analyst job, you’ll likely be asked this question, and it’s one that your interviewer will expect you to answer quickly, so be prepared. Be sure to go into detail: list and describe the steps of a typical data analysis process. These include data exploration, data preparation, data modeling, validation, and implementation of the model and tracking.
What are the most important skills a data analyst should possess to work efficiently with team members with various backgrounds, roles, and duties?
When answering this question, keep in mind that the hiring manager would like to hear something different than “communication skills”. Think of an approach you’ve used in your role as a data analyst to improve the quality of work in a cross-functional team.
Example
“I think the role of a data analyst goes beyond explaining technical terms in a non-technical language. I always strive to gain a deeper understanding of my colleagues’ work, so I can bridge my explanation of statistical concepts to the specific parts of the business they deal with, and show how these concepts relate to the tasks they need to solve.”
What are the best practices for data cleaning?
There are 5 basic best practices for data cleaning:
- Make a data cleaning plan by understanding where the common errors take place and keep communications open.
- Standardize the data at the point of entry. This way it is less chaotic, and you will be able to ensure that all information is standardized, leading to fewer errors on entry.
- Focus on the accuracy of the data. Maintain the value types of data, provide mandatory constraints and set cross-field validation.
- Identify and remove duplicates before working with the data. This will lead to an effective data analysis process.
- Create a set of utility tools/functions/scripts to handle common data cleaning tasks.
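For the last point, a minimal sketch of such a utility in pandas (the column names and constraints are hypothetical):

```python
import pandas as pd

def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Reusable cleaning routine: standardize text, drop duplicates,
    and enforce simple value constraints."""
    out = df.copy()
    # Standardize string columns (trim whitespace, lowercase)
    for col in out.select_dtypes(include="object"):
        out[col] = out[col].str.strip().str.lower()
    # Remove exact duplicates before analysis
    out = out.drop_duplicates()
    # Example mandatory constraint: age must be a plausible value
    if "age" in out.columns:
        out = out[out["age"].between(0, 120)]
    return out

raw = pd.DataFrame({
    "name": [" Ada ", "ada", "Bob"],
    "age": [34, 34, 999],
})
print(clean_dataframe(raw))  # one 'ada' row remains; bad age is dropped
```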