1. What are the key differences between Data Analysis and Data Mining?
Data analysis involves the process of cleaning, organizing, and using data to produce meaningful insights.
Data mining is used to search for hidden patterns in the data.
Data analysis produces results that are far more comprehensible by a variety of audiences than the results from data mining.
2. What is Data Validation?
Data validation, as the name suggests, is the process of determining both the accuracy of the data and the quality of its source.
There are many processes in data validation but the main ones are data screening and data verification.
· Data screening: Making use of a variety of models to ensure that the data is accurate and no redundancies are present.
· Data verification: If a redundancy is found, it is evaluated through several steps, and then a decision is made on whether to keep, correct, or remove the data item.
3. What Is Metadata?
Metadata is data that describes other data. It documents what a dataset contains, how its elements are related, and the access rights to the data that you’re working with.
4. What Is KNN Imputation?
In KNN imputation, missing attribute values are imputed using the attribute values that are most similar to the attribute whose values are missing.
By using a distance function, the similarity of two attributes is determined.
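For illustration, here is a minimal sketch of KNN-based imputation using scikit-learn's KNNImputer (scikit-learn and NumPy are assumed to be available; the small array is made-up example data):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up example data with missing values (np.nan)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is filled using the mean of that feature
# from the 2 nearest neighbours (nearest by distance over the
# observed features).
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```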
5. Define Outlier. Explain Steps To Treat an Outlier in a Dataset.
An outlier is a piece of data that varies significantly from the average features of the dataset that it is in.
There are two methods to detect and treat outliers (a short code sketch follows this list):
· Box plot method. A value is classified as an outlier if it lies more than 1.5 times the interquartile range (IQR) above the upper quartile or below the lower quartile of the dataset.
· Standard deviation method. A value is classified as an outlier if it is greater than mean + (3*standard deviation) or less than mean - (3*standard deviation).
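A minimal sketch of both rules in Python, assuming pandas is available (the series is made-up data):

```python
import pandas as pd

# Made-up example data with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 14, 11, 13, 12, 10] * 2 + [120])

# Box plot (IQR) method: flag values more than 1.5*IQR outside the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Standard deviation method: flag values outside mean +/- 3*std
mean, std = s.mean(), s.std()
std_outliers = s[(s < mean - 3 * std) | (s > mean + 3 * std)]

print(iqr_outliers.tolist(), std_outliers.tolist())
```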
6. What Is Data Visualization? How Many Types of Visualization Are There?
Data visualization is the practice of representing data and data-based insights in graphical form.
Visualization makes it easy for viewers to quickly glean the trends and outliers in a dataset.
There are several types of data visualizations (a small plotting sketch follows this list), including:
· Pie charts
· Column charts
· Bar graphs
· Scatter plots
· Heat maps
· Line graphs
· Bullet graphs
· Waterfall charts
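As a quick illustration, a couple of these chart types can be produced with a plotting library such as matplotlib (assumed here; the figures are made up):

```python
import matplotlib.pyplot as plt

# Made-up monthly sales figures
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(months, sales)               # column/bar chart
ax1.set_title("Column chart")

ax2.plot(months, sales, marker="o")  # line graph
ax2.set_title("Line graph")

plt.tight_layout()
plt.show()
```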
A pivot table is a data analysis tool that groups values from a larger dataset and presents the grouped values in tabular form for easier analysis.
The purpose is to make it easier to spot figures or trends in the data by applying a particular aggregation function to the values that have been grouped together.
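For example, a pivot table can be built in pandas with pivot_table (the DataFrame below is made-up illustration data):

```python
import pandas as pd

# Made-up sales records
df = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "sales":   [100, 150, 200, 120, 80],
})

# Group by region and product, aggregating sales with sum
pivot = pd.pivot_table(df, values="sales", index="region",
                       columns="product", aggfunc="sum", fill_value=0)
print(pivot)
```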
A data warehouse is a data storage system that collects data from various disparate sources and stores them in a way that makes it
easy to produce important business insights. Data warehousing is the process of identifying heterogeneous data sources, sourcing data,
cleaning it, and transforming it into a manageable form for storage in a data warehouse.
Ans. Below are some of the major differences between data mining and data profiling.
Data Mining | Data Profiling
Refers to the process of identifying patterns in a pre-built database | Analyses raw data from existing datasets
Turns raw data into useful information by evaluating the existing database and large datasets | Collects statistics or informative summaries of the data
Identifies hidden patterns and searches for new, valuable, and non-trivial knowledge to generate useful information | Helps evaluate data sets for consistency, uniqueness, and logic
Cannot identify incorrect or inaccurate data values | Identifies incorrect data at an early stage of the analysis
Ans. Some important Big Data analytics tools are:
· NodeXL
· KNIME
· Tableau
· Solver
· OpenRefine
· Rattle GUI
· Qlikview
Ans. In simpler terms, data visualization is the graphical representation of information and data. It enables users to view and analyze data more effectively by presenting it in diagrams and charts.
12. What Do You Mean by Hierarchical Clustering?
Hierarchical clustering is a data analysis method that first considers every data point as its own cluster. It then uses the
following iterative method to create larger clusters:
· Identify the two clusters, which initially are individual data points, that are closest to each other.
· Merge the two most compatible clusters into a single cluster.
· Repeat until all points belong to one cluster, or until the desired number of clusters is reached (see the sketch below).
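A minimal sketch of agglomerative (bottom-up) hierarchical clustering with scikit-learn, on made-up 2-D points:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up 2-D points forming two loose groups
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [5, 5], [5.1, 4.8], [4.9, 5.2]])

# Start with every point as its own cluster and iteratively
# merge the closest pair until 2 clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)
```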
This is a standard data analyst interview question, frequently asked to check your understanding of the skill set required for the role.
To become a data analyst, you need to:
· Be well-versed with programming languages and frameworks (XML, JavaScript, or ETL frameworks), databases (SQL, SQLite, Db2, etc.), and reporting packages (Business Objects).
· Be able to analyze, organize, collect and disseminate Big Data efficiently.
· Have substantial technical knowledge in fields like database design, data mining, and segmentation techniques.
· Have a sound knowledge of statistical packages for analyzing massive datasets, such as SAS, Excel, and SPSS, to name a few.
· Be proficient in using data visualization tools for comprehensible representation of data.
· Data cleaning
· Strong Microsoft Excel skills
· Linear algebra and calculus
Yes, data analysts should use version control when working with any dataset. This ensures that you retain original datasets and
can revert to a previous version even if a new operation corrupts the data in some way. Tools like Pachyderm and Dolt can be
used for creating versions of datasets.
Data profiling is a methodology that involves analyzing all entities present in data to a greater depth. The goal here is to provide highly
accurate information based on the data and its attributes such as the datatype, frequency of occurrence, and more.
Collaborative filtering is a simple algorithm for creating a recommendation system based on user behavioral data. Its most important components are users, items, and interests.
A good example of collaborative filtering is the “recommended for you” section on online shopping sites, which is generated based on your browsing history.
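A toy sketch of user-based collaborative filtering with cosine similarity, assuming NumPy (the ratings matrix is made up):

```python
import numpy as np

# Made-up user x item ratings matrix (0 = not rated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 2],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of the target user to every other user
target = 0
sims = np.array([cosine_sim(ratings[target], ratings[u])
                 for u in range(len(ratings))])
sims[target] = 0  # ignore self-similarity

# Score items the target user has not rated as a
# similarity-weighted average of other users' ratings
unrated = ratings[target] == 0
scores = sims @ ratings / sims.sum()
recommended = np.argsort(-scores * unrated)
print("recommend item index:", recommended[0])
```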
16. What are hash table collisions? How is it avoided?
A hash table collision happens when two different keys hash to the same value. Because two items cannot be stored in the same slot of the array, the collision has to be resolved. There are many techniques to handle hash table collisions; two common ones are listed here (a toy sketch follows):
· Separate chaining: Uses a secondary data structure (such as a linked list) at each slot to store all items that hash to that slot.
· Open addressing: Probes for other slots using a second function and stores the item in the first empty slot that is found.
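Here is a toy sketch of separate chaining in Python (a simplified illustration, not how Python's built-in dict is implemented):

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by separate chaining."""

    def __init__(self, size=8):
        # Each slot holds a list (chain) of (key, value) pairs
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:            # key already present: update it
                chain[i] = (key, value)
                return
        chain.append((key, value))  # new key or collision: append to chain

    def get(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for k, v in chain:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(size=2)   # tiny size to force collisions
table.put("alice", 1)
table.put("bob", 2)
table.put("carol", 3)
print(table.get("bob"))
```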
17. What are the criteria for a good data model?
The criteria for a good data model include:
· It can be easily consumed
· It should scale well as the underlying data grows or changes
· It should provide predictable performance
· A good model can adapt to changes in requirements
This is one of the most basic data analyst interview questions. The common steps involved in any analytics project are as follows:
· Understand the business problem, define the organizational goals, and plan for a lucrative solution.
· Gather the right data from various sources and other information based on your priorities.
· Clean the data to remove unwanted, redundant, and missing values, and make it ready for analysis.
· Use data visualization and business intelligence tools, data mining techniques, and predictive modeling to analyze the data.
· Interpret the results to uncover hidden patterns and future trends, and to gain insights.
19. What is MapReduce?
Ans. MapReduce is a framework that enables you to write applications to process large data sets, splitting them into subsets,
processing each subset on a different server, and then blending results obtained on each. It consists of two tasks, namely Map and Reduce.
The map performs filtering and sorting while reduce performs a summary operation. As the name suggests, the Reduce process occurs after the map task.
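The idea can be sketched in plain Python with a word-count example; this only mimics the Map, shuffle, and Reduce phases locally, whereas a real MapReduce job runs on a framework such as Hadoop:

```python
from collections import defaultdict

documents = ["big data is big", "data analysis needs data"]

# Map phase: emit (word, 1) pairs from each document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: summarise each group (here, a sum)
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```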
20. What is the significance of Exploratory Data Analysis (EDA)?
· Exploratory data analysis (EDA) helps to understand the data better.
· It helps you obtain confidence in your data to a point where you’re ready to engage a machine learning algorithm.
· It allows you to refine your selection of feature variables that will be used later for model building.
· You can discover hidden trends and insights from the data (a typical first EDA pass is sketched below).
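In practice, a first EDA pass often looks something like the sketch below with pandas (the file name is a hypothetical placeholder):

```python
import pandas as pd

# Hypothetical input file, used only for illustration
df = pd.read_csv("sales.csv")

print(df.shape)                    # number of rows and columns
print(df.dtypes)                   # data type of each column
print(df.isna().sum())             # missing values per column
print(df.describe())               # summary statistics for numeric columns
print(df.corr(numeric_only=True))  # correlations between numeric features
```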
This question is usually asking if you have a basic understanding of statistics and how you have used them in your previous data analysis work.
If you are entry-level and not familiar with statistical methods, make sure to research the following concepts:
· Standard deviation
· Variance
· Regression
· Sample size
· Descriptive and inferential statistics
· Mean
If you do have some knowledge, be specific about how statistical analysis ties into business goals. List the types of statistical calculations
you’ve used in the past and what business insights those calculations yielded.
As this is the technical part of the data analyst interview questions, you’ll likely need to demonstrate your skills to some degree.
The interviewer may give you either a problem or a selection of data, and you’ll need to write queries to store, edit, retrieve or
remove data accordingly. The difficulty of this task usually depends on the role you’re applying for and its seniority.
Ans: The various types of data validation methods used are:
1. Field Level Validation – Field level validation is done in each field as the user enters the data to avoid errors caused by human interaction.
2. Form Level Validation – In this method, validation is done once the user completes the form, before the information is saved.
3. Data Saving Validation – This type of validation is performed during the saving process of the actual file or database record.
This is usually done when there are multiple data entry forms.
4. Search Criteria Validation – This type of validation ensures that the user’s search criteria are matched to relevant records to a certain degree, so that meaningful results are actually returned.
24. Explain what you do with suspicious or missing data.
Ans. When data is suspicious or missing, then:
· Make a validation report to provide information on the suspected data.
· Have experienced personnel look at it so that its acceptability can be determined.
· Invalid data should be updated with a validation code.
· Use the best analysis strategy to work on the missing data like simple imputation, deletion method, or case-wise imputation.
25. What statistical methods have you used in data analysis?
What they’re really asking: Do you have basic statistical knowledge?
Most entry-level data analyst roles will require at least a basic competency in statistics and an understanding of how
statistical analysis ties into business goals. List the statistical calculations you’ve used and what business insights those calculations yielded.
If you’ve ever worked with or created statistical models, mention that. If you’re not already, familiarise yourself with the following statistical concepts:
· Mean
· Standard deviation
· Variance
· Regression
· Sample size
· Descriptive and inferential statistics
26. What are the different types of sampling techniques used by data analysts?
Sampling is a statistical method to select a subset of data from an entire dataset (population) to estimate the characteristics of the whole population.
There are five major types of sampling methods (two of them are sketched in code after this list):
· Simple random sampling
· Systematic sampling
· Cluster sampling
· Stratified sampling
· Judgmental or purposive sampling
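Two of these can be sketched quickly with pandas: simple random sampling and a basic form of stratified sampling (the DataFrame is made-up data):

```python
import pandas as pd

# Made-up population with a categorical stratum column
population = pd.DataFrame({
    "customer_id": range(1, 101),
    "segment": ["retail"] * 70 + ["corporate"] * 30,
})

# Simple random sampling: every record has the same chance of selection
simple_sample = population.sample(n=10, random_state=42)

# Stratified sampling: sample 10% from each segment
stratified_sample = (
    population.groupby("segment", group_keys=False)
    .apply(lambda g: g.sample(frac=0.1, random_state=42))
)

print(simple_sample["segment"].value_counts())
print(stratified_sample["segment"].value_counts())
```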
This is one of the most frequently asked data analyst interview questions, and the interviewer expects you to give a detailed answer here,
and not just the names of the methods. There are four common methods to handle missing values in a dataset (two of them are sketched in code after this list):
· Listwise deletion: an entire record is excluded from analysis if any single value is missing.
· Average imputation: take the average value of the other participants' responses and fill in the missing value.
· Regression substitution: use multiple-regression analyses to estimate the missing value.
· Multiple imputation: create plausible values based on the correlations for the missing data, then average the simulated datasets, incorporating random errors in your predictions.
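Listwise deletion and average (mean) imputation, for example, are one-liners in pandas (made-up data):

```python
import pandas as pd

# Made-up survey responses with missing values
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 29],
    "income": [40000, None, 52000, 61000, 45000],
})

# Listwise deletion: drop any record with a missing value
listwise = df.dropna()

# Average (mean) imputation: fill each missing value with the column mean
mean_imputed = df.fillna(df.mean(numeric_only=True))

print(listwise)
print(mean_imputed)
```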
28. Explain the term Normal Distribution.
Normal distribution refers to a continuous probability distribution that is symmetric about the mean. On a graph, a normal distribution appears as a bell curve. Its key properties are listed below (and checked numerically in the sketch that follows):
· The mean, median, and mode are equal.
· All of them are located at the center of the distribution.
· Approximately 68% of the data falls within one standard deviation of the mean.
· Approximately 95% of the data falls within two standard deviations of the mean.
· Approximately 99.7% of the data falls within three standard deviations of the mean.
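The 68-95-99.7 rule can be checked numerically with NumPy (a quick simulation sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=1_000_000)  # standard normal sample

for k in (1, 2, 3):
    # Share of observations within k standard deviations of the mean
    share = np.mean(np.abs(data - data.mean()) <= k * data.std())
    print(f"within {k} std dev: {share:.3f}")  # ~0.683, ~0.954, ~0.997
```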
A data analyst interview question and answers guide will not be complete without this question. An outlier is a term commonly
used by data analysts when referring to a value that appears to be far removed and divergent from a set pattern in a sample.
Outlier values vary greatly from the rest of the dataset; they can be smaller or larger, but they lie away from the main body of values. There can be many reasons behind outlier values, such as measurement errors or data entry errors. There are two kinds of outliers – univariate and multivariate.
The two methods used for detecting outliers are:
1. Box plot method – According to this method, a value is an outlier if it lies more than 1.5*IQR (interquartile range) above the upper quartile (Q3) or more than 1.5*IQR below the lower quartile (Q1).
2. Standard deviation method – This method states that if a value is higher or lower than mean ± (3*standard deviation), it is an outlier.
Ans. Since data preparation is a critical approach to data analytics, the interviewer might be interested in knowing what path you will take up
to clean and transform raw data before processing and analysis. As an answer to such data analyst interview questions, you should discuss
the model you will be using, along with logical reasoning for it. In addition, you should also discuss how your steps would help you to
ensure superior scalability and accelerated data usage.
31. How do you treat outliers in a dataset?
An outlier is a data point that is distant from other similar points. They may be due to variability in the measurement or may indicate experimental errors.
To deal with outliers, you can use the following four methods:
· Drop the outlier records
· Cap your outlier data (see the capping sketch after this list)
· Assign a new value
· Try a new transformation
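For instance, capping (winsorizing) values at the IQR fences could look like this in pandas; the series is made-up data:

```python
import pandas as pd

# Made-up data with one extreme value
s = pd.Series([230, 250, 245, 260, 255, 248, 252, 900])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap: pull values outside the fences back to the nearest fence
capped = s.clip(lower=lower, upper=upper)
print(capped.tolist())
```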
32. Define “Collaborative Filtering”.
Collaborative filtering is an algorithm that creates a recommendation system based on the behavioral data of a user. For instance,
online shopping sites usually compile a list of items under “recommended for you” based on your browsing history and previous purchases.
The crucial components of this algorithm are users, items, and their interests. It is used to broaden the range of options presented to users.
Online entertainment applications are another example of collaborative filtering.
For example, Netflix shows recommendations based on the user’s behavior. Collaborative filtering follows various techniques, such as:
i) Memory-based approach
ii) Model-based approach
Time Series analysis is a statistical procedure that deals with the ordered sequence of values of a variable at equally spaced time intervals.
Time series data are collected at adjacent periods. So, there is a correlation between the observations. This feature distinguishes time-series data from cross-sectional data.
Daily counts of coronavirus cases are an example of time-series data.
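A small sketch of working with equally spaced time-series data in pandas (the values are made up):

```python
import pandas as pd

# Made-up daily case counts at equally spaced (daily) intervals
dates = pd.date_range("2024-01-01", periods=14, freq="D")
cases = pd.Series([5, 8, 13, 21, 30, 44, 60, 75, 90, 110, 128, 150, 170, 195],
                  index=dates)

# A 7-day rolling mean smooths short-term fluctuation and shows the trend
weekly_trend = cases.rolling(window=7).mean()

# Autocorrelation at lag 1: adjacent observations are correlated
print(cases.autocorr(lag=1))
print(weekly_trend.tail())
```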
This is one of the important data analyst interview questions. When two separate keys hash to a common value, a hash table collision occurs.
Because two different items cannot be stored in the same slot, the collision has to be resolved.
Hash collisions can be avoided by:
· Separate chaining – In this method, a data structure is used to store multiple items hashing to a common slot.
· Open addressing – This method seeks out empty slots and stores the item in the first empty slot available.
A better way to prevent hash collisions is to use a good, appropriate hash function. A good hash function distributes the elements uniformly over the hash table, so there is less chance of collisions occurring in the first place.
There are many skills that a budding data analyst needs. Here are some of them:
· Being well-versed in languages and frameworks such as XML, JavaScript, and ETL frameworks
· Proficient in databases such as SQL, MongoDB, and more
· Ability to effectively collect and analyze data
· Knowledge of database designing and data mining
· Having the ability/experience of working with large datasets
Again, your interviewer might seek to test your understanding of SQL principles by asking about specific SQL queries and terms
and what they do. It’s worth preparing your knowledge of terms such as:
· Clustered vs. non-clustered index
· Constraints
· Cursor
· DBMS vs. RDBMS
· ETL
· Index
37. What are the problems that a Data Analyst can encounter while performing data analysis?
A critical data analyst interview question you need to be aware of. A Data Analyst can confront the following issues while performing data analysis:
· Presence of duplicate entries and spelling mistakes. These errors can hamper data quality.
· Poor quality data acquired from unreliable sources. In such a case, a Data Analyst will have to spend a significant amount of time in cleansing the data.
· Data extracted from multiple sources may vary in representation. Once the collected data is combined after being cleansed
and organized, the variations in data representation may cause a delay in the analysis process.
· Incomplete data is another major challenge in the data analysis process. It would inevitably lead to erroneous or faulty results.
The main advantages of version control are –
· It allows you to compare files, identify differences, and consolidate the changes seamlessly.
· It helps to keep track of application builds by identifying which version is under which category – development, testing, QA, and production.
· It maintains a complete history of project files that comes in handy if ever there’s a central server breakdown.
· It is excellent for storing and maintaining multiple versions and variants of code files securely.
· It allows you to see the changes made in the content of different files.
Version control can also be called source control. It tracks the changes made to software and, using certain algorithms and functions, manages those changes so that the team responsible for the task can work on the software effectively without losing efficiency. Version control is done with dedicated version control tools, which manage and save the changes made to a computer program. For example, in a Google Doc, whatever has been added to the document can be accessed by the user the next time they open it, without the need to save each change. The changes and edits also appear in real time to all users who have access to the document.
Ans. An Affinity Diagram is an analytical tool to cluster or organize data into subgroups based on their relationships. These data or ideas
are mostly generated from discussions or brainstorming sessions and used to analyse complex issues.
Ans. Some of the vital Python libraries used in Data Analysis include –
· Bokeh
· Matplotlib
· NumPy
· Pandas
· SciKit
· SciPy
· Seaborn
· TensorFlow
· Keras
MapReduce is a framework used to process large data sets by splitting them into subsets, processing each subset on a different server, and then blending the results obtained from each.
43. What is the difference between COUNT, COUNTA, COUNTBLANK, and COUNTIF in Excel?
· COUNT function returns the count of numeric cells in a range
· COUNTA function counts the non-blank cells in a range
· COUNTBLANK function gives the count of blank cells in a range
· COUNTIF function returns the count of values that meet a given condition (rough pandas equivalents are sketched below)
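For reference, roughly equivalent counts can be computed in pandas; this is a pandas analogue for illustration, not Excel itself, and the column is made-up data:

```python
import pandas as pd
import numpy as np

col = pd.Series([10, 25, np.nan, "n/a", 40, np.nan, 55])

count      = pd.to_numeric(col, errors="coerce").notna().sum()  # like COUNT: numeric cells
counta     = col.notna().sum()                                  # like COUNTA: non-blank cells
countblank = col.isna().sum()                                   # like COUNTBLANK: blank cells
countif    = (pd.to_numeric(col, errors="coerce") > 20).sum()   # like COUNTIF(range, ">20")

print(count, counta, countblank, countif)
```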
Ans. To deal with multi-source problems, one can:
· Restructure the schemas to accomplish a schema integration
· Identify similar records and merge them into a single record containing all relevant attributes
There are many types of hypothesis testing. Some of them are as follows (a scipy-based sketch follows this list):
· Analysis of variance (ANOVA): Here, the analysis is conducted between the mean values of multiple groups.
· T-test: This form of testing is used when the standard deviation is not known, and the sample size is relatively small.
· Chi-square Test: This kind of hypothesis testing is used when there is a requirement to find the level of association between the categorical variables in a sample.
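These tests are available in scipy.stats; below is a brief sketch on made-up samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(50, 5, size=25)   # made-up samples
group_b = rng.normal(53, 5, size=25)
group_c = rng.normal(49, 5, size=25)

# One-way ANOVA: compares the means of multiple groups
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Two-sample t-test: small samples, population std deviation unknown
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)

# Chi-square test of independence on a made-up contingency table
table = np.array([[30, 10],
                  [20, 25]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(p_anova, p_ttest, p_chi2)
```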
Ans. Imputation is the process of replacing missing data with substituted values. While there are many ways to approach missing data, imputation techniques fall into two types, single and multiple (a short sketch follows the list below):
· Single Imputation: In this, you find a single estimate of the missing value. The following are the single imputation techniques:
· Mean imputation: Replace the missing value with the mean of that variable for all other cases.
· Hot deck imputation: Identify all the sample subjects who are similar on other variables, then randomly choose one of their values on the missing variable.
· Cold deck imputation: Works just like hot deck imputation, but the replacement value is chosen systematically from an individual with similar values on the other variables.
· Regression imputation: Replace the missing value with the value predicted by regressing the missing variable on the other variables.
· Stochastic regression imputation: Works like regression imputation but adds the average regression variance to the regression prediction.
· Substitution: Impute the value from a new individual who was not selected to be in the sample.
· Multiple Imputation: The missing values are estimated multiple times, producing several completed datasets whose results are then combined.
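Regression-style and multiple-imputation-style approaches are available, for example, through scikit-learn's IterativeImputer, shown here as an illustrative sketch (note that it has to be explicitly enabled as an experimental feature):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up data with missing entries
X = np.array([
    [25.0, 40000.0],
    [32.0, np.nan],
    [np.nan, 52000.0],
    [41.0, 61000.0],
])

# Each feature with missing values is modelled as a regression on the
# other features; sample_posterior=True adds randomness, so repeated
# runs can be combined in a multiple-imputation style.
imputer = IterativeImputer(sample_posterior=True, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```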
VLOOKUP is used when you need to find things in a table or a range by row.
VLOOKUP accepts the following four parameters:
lookup_value - The value to look for in the first column of a table
table - The table from where you can extract value
col_index - The column from which to extract value
range_lookup - [optional] TRUE = approximate match (default). FALSE = exact match
Let’s understand VLOOKUP with an example.
If you wanted to find the department to which Stuart belongs, you could use the VLOOKUP function as follows: =VLOOKUP(A11, A2:E7, 3, 0).
Here, cell A11 holds the lookup value, A2:E7 is the table array, 3 is the index of the column containing the department information, and 0 (FALSE) requests an exact-match lookup.
If you hit Enter, it will return “Marketing”, indicating that Stuart is from the marketing department.
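The same lookup can be reproduced in pandas with a merge; this is a rough analogue for illustration, not Excel itself, and the table is made up:

```python
import pandas as pd

# Made-up employee table (playing the role of the Excel range A2:E7)
employees = pd.DataFrame({
    "name":        ["Stuart", "Maria", "Chen"],
    "employee_id": [101, 102, 103],
    "department":  ["Marketing", "Finance", "IT"],
})

# Lookup table of names we want departments for (like the lookup value in A11)
lookups = pd.DataFrame({"name": ["Stuart"]})

# A left merge behaves like an exact-match VLOOKUP on the 'name' column
result = lookups.merge(employees[["name", "department"]], on="name", how="left")
print(result)   # Stuart -> Marketing
```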
Ans. There are various ways you can answer this question: the data might be very badly formatted, there might not be enough data to work with, clients might provide data they have supposedly cleaned but actually made worse, the data might not be up to date, or there might be factual/data entry errors.
As a standard practice, a t-test is used when the sample size is less than 30, and a z-test is considered when the sample size exceeds 30 in most cases.
Ans. The primary benefits of version control are –
· Enables comparing files, identifying differences, and merging the changes
· Allows keeping track of application builds by identifying which version is under development, QA, and production
· Helps to improve the collaborative work culture
· Keeps different versions and variants of code files secure
· Allows seeing the changes made in the file’s content
· Keeps a complete history of the project files in case of central server breakdown
Ans. A data collection plan describes how all the critical data in a system will be collected. It covers:
· The type of data that needs to be collected or gathered
· The different data sources for analyzing the data set