1. What are the key differences between Data Analysis and Data Mining?
Data analysis involves the process of cleaning, organizing, and using data to produce meaningful insights.
Data mining is used to search for hidden patterns in the data.
Data analysis produces results that are far more comprehensible by a variety of audiences than the results from data mining.
2. What is Data Validation?
Data validation, as the name suggests, is the process of determining both the accuracy of the data and the quality of its source.
There are many processes in data validation but the main ones are data screening and data verification.
· Data screening: Making use of a variety of models to ensure that the data is accurate and no redundancies are present.
· Data verification: If a redundancy is found, it is evaluated through several steps, and then a decision is made on whether to keep, correct, or remove the data item.
3. What Is Metadata?
Metadata is data that describes other data. It documents what a dataset contains, how its elements are related, and the access rights to the data that you’re working with.
4. What Is KNN Imputation?
In KNN imputation, missing attribute values are imputed using the attribute values that are most similar to the attribute whose values are missing.
By using a distance function, the similarity of two attributes is determined.
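For illustration, here is a minimal sketch of KNN-based imputation using scikit-learn's KNNImputer (scikit-learn and NumPy are assumed to be available; the small array is made-up example data):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up example data with missing values (np.nan)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is filled using the mean of that feature
# from the 2 nearest neighbours (nearest by distance over the
# observed features).
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```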
5. Define Outlier. Explain Steps To Treat an Outlier in a Dataset.
An outlier is a piece of data that varies significantly from the average features of the dataset that it is in.
There are two methods to detect and treat outliers (a short code sketch follows this list):
· Box plot method. A value is classified as an outlier if it lies more than 1.5 times the interquartile range (IQR) above the upper quartile or below the lower quartile of the dataset.
· Standard deviation method. A value is classified as an outlier if it is greater than mean + (3*standard deviation) or less than mean - (3*standard deviation).
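A minimal sketch of both rules in Python, assuming pandas is available (the series is made-up data):

```python
import pandas as pd

# Made-up example data with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 14, 11, 13, 12, 10] * 2 + [120])

# Box plot (IQR) method: flag values more than 1.5*IQR outside the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Standard deviation method: flag values outside mean +/- 3*std
mean, std = s.mean(), s.std()
std_outliers = s[(s < mean - 3 * std) | (s > mean + 3 * std)]

print(iqr_outliers.tolist(), std_outliers.tolist())
```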
6. What Is Data Visualization? How Many Types of Visualization Are There?
Data visualization is the practice of representing data and data-based insights in graphical form.
Visualization makes it easy for viewers to quickly glean the trends and outliers in a dataset.
There are several types of data visualizations (a small plotting sketch follows this list), including:
· Pie charts
· Column charts
· Bar graphs
· Scatter plots
· Heat maps
· Line graphs
· Bullet graphs
· Waterfall charts
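As a quick illustration, a couple of these chart types can be produced with a plotting library such as matplotlib (assumed here; the figures are made up):

```python
import matplotlib.pyplot as plt

# Made-up monthly sales figures
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(months, sales)               # column/bar chart
ax1.set_title("Column chart")

ax2.plot(months, sales, marker="o")  # line graph
ax2.set_title("Line graph")

plt.tight_layout()
plt.show()
```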
A pivot table is a data analysis tool that groups values from a larger dataset and presents the grouped values in tabular form for easier analysis.
The purpose is to make it easier to spot figures or trends in the data by applying a particular aggregation function to the values that have been grouped together.
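For example, a pivot table can be built in pandas with pivot_table (the DataFrame below is made-up illustration data):

```python
import pandas as pd

# Made-up sales records
df = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "sales":   [100, 150, 200, 120, 80],
})

# Group by region and product, aggregating sales with sum
pivot = pd.pivot_table(df, values="sales", index="region",
                       columns="product", aggfunc="sum", fill_value=0)
print(pivot)
```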
A data warehouse is a data storage system that collects data from various disparate sources and stores them in a way that makes it
easy to produce important business insights. Data warehousing is the process of identifying heterogeneous data sources, sourcing data,
cleaning it, and transforming it into a manageable form for storage in a data warehouse.
Ans. Below are some of the major differences between data mining and data profiling.
Data Mining | Data Profiling
Refers to the process of identifying patterns in a pre-built database | Analyses raw data from existing datasets
Turns raw data into useful information by evaluating the existing database and large datasets | Collects statistics or informative summaries of the data
Identifies hidden patterns and searches for new, valuable, and non-trivial knowledge to generate useful information | Helps evaluate data sets for consistency, uniqueness, and logic
Cannot identify incorrect or inaccurate data values | Identifies incorrect data at an early stage of the analysis
Ans. Some important Big Data analytics tools are:
· NodeXL
· KNIME
· Tableau
· Solver
· OpenRefine
· Rattle GUI
· Qlikview
Ans. In simpler terms, data visualization is the graphical representation of information and data. It enables users to view and analyze data more effectively by presenting it in diagrams and charts.
12. What Do You Mean by Hierarchical Clustering?
Hierarchical clustering is a data analysis method that first considers every data point as its own cluster. It then uses the
following iterative method to create larger clusters:
· Identify the two clusters, which initially are individual data points, that are closest to each other.
· Merge the two most compatible clusters into a single cluster.
· Repeat until all points belong to one cluster, or until the desired number of clusters is reached (see the sketch below).
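A minimal sketch of agglomerative (bottom-up) hierarchical clustering with scikit-learn, on made-up 2-D points:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up 2-D points forming two loose groups
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [5, 5], [5.1, 4.8], [4.9, 5.2]])

# Start with every point as its own cluster and iteratively
# merge the closest pair until 2 clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)
```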
This is a standard data analyst interview question, frequently asked to check your understanding of the skill set required for the role.
To become a data analyst, you need to:
· Be well-versed with programming languages and frameworks (XML, JavaScript, or ETL frameworks), databases (SQL, SQLite, Db2, etc.), and reporting packages (Business Objects).
· Be able to analyze, organize, collect and disseminate Big Data efficiently.
· Have substantial technical knowledge in fields like database design, data mining, and segmentation techniques.
· Have a sound knowledge of statistical packages for analyzing massive datasets, such as SAS, Excel, and SPSS, to name a few.
· Be proficient in using data visualization tools for comprehensible representation of data.
· Data cleaning
· Strong Microsoft Excel skills
· Linear algebra and calculus
Yes, data analysts should use version control when working with any dataset. This ensures that you retain original datasets and
can revert to a previous version even if a new operation corrupts the data in some way. Tools like Pachyderm and Dolt can be
used for creating versions of datasets.
Data profiling is a methodology that involves analyzing all entities present in data to a greater depth. The goal here is to provide highly
accurate information based on the data and its attributes such as the datatype, frequency of occurrence, and more.
Collaborative filtering is a simple algorithm for creating a recommendation system based on user behavioral data. Its most important components are users, items, and interests.
A good example of collaborative filtering is the “recommended for you” section on online shopping sites, which is generated based on your browsing history.
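A toy sketch of user-based collaborative filtering with cosine similarity, assuming NumPy (the ratings matrix is made up):

```python
import numpy as np

# Made-up user x item ratings matrix (0 = not rated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 2],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of the target user to every other user
target = 0
sims = np.array([cosine_sim(ratings[target], ratings[u])
                 for u in range(len(ratings))])
sims[target] = 0  # ignore self-similarity

# Score items the target user has not rated as a
# similarity-weighted average of other users' ratings
unrated = ratings[target] == 0
scores = sims @ ratings / sims.sum()
recommended = np.argsort(-scores * unrated)
print("recommend item index:", recommended[0])
```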
16. What are hash table collisions? How is it avoided?
A hash table collision happens when two different keys hash to the same value. Because two items cannot be stored in the same slot of the array, the collision has to be resolved. There are many techniques to handle hash table collisions; two common ones are listed here (a toy sketch follows):
· Separate chaining: Uses a secondary data structure (such as a linked list) at each slot to store all items that hash to that slot.
· Open addressing: Probes for other slots using a second function and stores the item in the first empty slot that is found.
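Here is a toy sketch of separate chaining in Python (a simplified illustration, not how Python's built-in dict is implemented):

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by separate chaining."""

    def __init__(self, size=8):
        # Each slot holds a list (chain) of (key, value) pairs
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:            # key already present: update it
                chain[i] = (key, value)
                return
        chain.append((key, value))  # new key or collision: append to chain

    def get(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for k, v in chain:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(size=2)   # tiny size to force collisions
table.put("alice", 1)
table.put("bob", 2)
table.put("carol", 3)
print(table.get("bob"))
```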
17. What are the criteria for a good data model?
The criteria for a good data model include:
· It can be easily consumed
· It should scale well as the underlying data grows or changes
· It should provide predictable performance
· A good model can adapt to changes in requirements
This is one of the most basic data analyst interview questions. The common steps involved in any analytics project are as follows:
· Understand the business problem, define the organizational goals, and plan for a lucrative solution.
· Gather the right data from various sources and other information based on your priorities.
· Clean the data to remove unwanted, redundant, and missing values, and make it ready for analysis.
· Use data visualization and business intelligence tools, data mining techniques, and predictive modeling to analyze the data.
· Interpret the results to uncover hidden patterns and future trends, and to gain insights.
19. What is MapReduce?
Ans. MapReduce is a framework that enables you to write applications to process large data sets, splitting them into subsets,
processing each subset on a different server, and then blending results obtained on each. It consists of two tasks, namely Map and Reduce.
The map performs filtering and sorting while reduce performs a summary operation. As the name suggests, the Reduce process occurs after the map task.
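The idea can be sketched in plain Python with a word-count example; this only mimics the Map, shuffle, and Reduce phases locally, whereas a real MapReduce job runs on a framework such as Hadoop:

```python
from collections import defaultdict

documents = ["big data is big", "data analysis needs data"]

# Map phase: emit (word, 1) pairs from each document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: summarise each group (here, a sum)
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```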
20. What is the significance of Exploratory Data Analysis (EDA)?
· Exploratory data analysis (EDA) helps to understand the data better.
· It helps you obtain confidence in your data to a point where you’re ready to engage a machine learning algorithm.
· It allows you to refine your selection of feature variables that will be used later for model building.
· You can discover hidden trends and insights from the data (a typical first EDA pass is sketched below).
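In practice, a first EDA pass often looks something like the sketch below with pandas (the file name is a hypothetical placeholder):

```python
import pandas as pd

# Hypothetical input file, used only for illustration
df = pd.read_csv("sales.csv")

print(df.shape)                    # number of rows and columns
print(df.dtypes)                   # data type of each column
print(df.isna().sum())             # missing values per column
print(df.describe())               # summary statistics for numeric columns
print(df.corr(numeric_only=True))  # correlations between numeric features
```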
This question is usually asking if you have a basic understanding of statistics and how you have used them in your previous data analysis work.
If you are entry-level and not familiar with statistical methods, make sure to research the following concepts:
· Standard deviation
· Variance
· Regression
· Sample size
· Descriptive and inferential statistics
· Mean
If you do have some knowledge, be specific about how statistical analysis ties into business goals. List the types of statistical calculations
you’ve used in the past and what business insights those calculations yielded.
As this is the technical part of the data analyst interview questions, you’ll likely need to demonstrate your skills to some degree.
The interviewer may give you either a problem or a selection of data, and you’ll need to write queries to store, edit, retrieve or
remove data accordingly. The difficulty of this task usually depends on the role you’re applying for and its seniority.
Ans: The various types of data validation methods used are:
1. Field Level Validation – Field level validation is done in each field as the user enters the data to avoid errors caused by human interaction.
2. Form Level Validation – In this method, validation is done once the user completes the form, before the information is saved.
3. Data Saving Validation – This type of validation is performed during the saving process of the actual file or database record.
This is usually done when there are multiple data entry forms.
4. Search Criteria Validation – This type of validation ensures that the user’s search criteria are matched to relevant records to a certain degree, so that meaningful results are actually returned.
24. Explain what you do with suspicious or missing data.
Ans. When data is suspicious or missing, then:
· Make a validation report to provide information on the suspected data.
· Have experienced personnel look at it so that its acceptability can be determined.
· Invalid data should be updated with a validation code.
· Use the best analysis strategy to work on the missing data like simple imputation, deletion method, or case-wise imputation.
25. What statistical methods have you used in data analysis?
What they’re really asking: Do you have basic statistical knowledge?
Most entry-level data analyst roles will require at least a basic competency in statistics and an understanding of how
statistical analysis ties into business goals. List the statistical calculations you’ve used and what business insights those calculations yielded.
If you’ve ever worked with or created statistical models, mention that. If you’re not already, familiarise yourself with the following statistical concepts:
· Mean
· Standard deviation
· Variance
· Regression
· Sample size
· Descriptive and inferential statistics
26. What are the different types of sampling techniques used by data analysts?
Sampling is a statistical method to select a subset of data from an entire dataset (population) to estimate the characteristics of the whole population.
There are five major types of sampling methods (two of them are sketched in code after this list):
· Simple random sampling
· Systematic sampling
· Cluster sampling
· Stratified sampling
· Judgmental or purposive sampling
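Two of these can be sketched quickly with pandas: simple random sampling and a basic form of stratified sampling (the DataFrame is made-up data):

```python
import pandas as pd

# Made-up population with a categorical stratum column
population = pd.DataFrame({
    "customer_id": range(1, 101),
    "segment": ["retail"] * 70 + ["corporate"] * 30,
})

# Simple random sampling: every record has the same chance of selection
simple_sample = population.sample(n=10, random_state=42)

# Stratified sampling: sample 10% from each segment
stratified_sample = (
    population.groupby("segment", group_keys=False)
    .apply(lambda g: g.sample(frac=0.1, random_state=42))
)

print(simple_sample["segment"].value_counts())
print(stratified_sample["segment"].value_counts())
```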
This is one of the most frequently asked data analyst interview questions, and the interviewer expects you to give a detailed answer here,
and not just the names of the methods. There are four common methods to handle missing values in a dataset (two of them are sketched in code after this list):
· Listwise deletion: an entire record is excluded from analysis if any single value is missing.
· Average imputation: take the average value of the other participants' responses and fill in the missing value.
· Regression substitution: use multiple-regression analyses to estimate the missing value.
· Multiple imputation: create plausible values based on the correlations for the missing data, then average the simulated datasets, incorporating random errors in your predictions.
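Listwise deletion and average (mean) imputation, for example, are one-liners in pandas (made-up data):

```python
import pandas as pd

# Made-up survey responses with missing values
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 29],
    "income": [40000, None, 52000, 61000, 45000],
})

# Listwise deletion: drop any record with a missing value
listwise = df.dropna()

# Average (mean) imputation: fill each missing value with the column mean
mean_imputed = df.fillna(df.mean(numeric_only=True))

print(listwise)
print(mean_imputed)
```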
28. Explain the term Normal Distribution.
Normal distribution refers to a continuous probability distribution that is symmetric about the mean. On a graph, a normal distribution appears as a bell curve. Its key properties are listed below (and checked numerically in the sketch that follows):
· The mean, median, and mode are equal.
· All of them are located at the center of the distribution.
· Approximately 68% of the data falls within one standard deviation of the mean.
· Approximately 95% of the data falls within two standard deviations of the mean.
· Approximately 99.7% of the data falls within three standard deviations of the mean.
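The 68-95-99.7 rule can be checked numerically with NumPy (a quick simulation sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=1_000_000)  # standard normal sample

for k in (1, 2, 3):
    # Share of observations within k standard deviations of the mean
    share = np.mean(np.abs(data - data.mean()) <= k * data.std())
    print(f"within {k} std dev: {share:.3f}")  # ~0.683, ~0.954, ~0.997
```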
A data analyst interview question and answers guide will not be complete without this question. An outlier is a term commonly
used by data analysts when referring to a value that appears to be far removed and divergent from a set pattern in a sample.
Outlier values vary greatly from the rest of the dataset; they can be smaller or larger, but they lie away from the main body of values. There can be many reasons behind outlier values, such as measurement errors or data entry errors. There are two kinds of outliers – univariate and multivariate.
The two methods used for detecting outliers are:
1. Box plot method – According to this method, a value is an outlier if it lies more than 1.5*IQR (interquartile range) above the upper quartile (Q3) or more than 1.5*IQR below the lower quartile (Q1).
2. Standard deviation method – This method states that if a value is higher or lower than mean ± (3*standard deviation), it is an outlier.
Ans. Since data preparation is a critical approach to data analytics, the interviewer might be interested in knowing what path you will take up
to clean and transform raw data before processing and analysis. As an answer to such data analyst interview questions, you should discuss
the model you will be using, along with logical reasoning for it. In addition, you should also discuss how your steps would help you to
ensure superior scalability and accelerated data usage.
31. How do you treat outliers in a dataset?
An outlier is a data point that is distant from other similar points. They may be due to variability in the measurement or may indicate experimental errors.
To deal with outliers, you can use the following four methods:
· Drop the outlier records
· Cap your outlier data (see the capping sketch after this list)
· Assign a new value
· Try a new transformation
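For instance, capping (winsorizing) values at the IQR fences could look like this in pandas; the series is made-up data:

```python
import pandas as pd

# Made-up data with one extreme value
s = pd.Series([230, 250, 245, 260, 255, 248, 252, 900])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap: pull values outside the fences back to the nearest fence
capped = s.clip(lower=lower, upper=upper)
print(capped.tolist())
```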
32. Define “Collaborative Filtering”.
Collaborative filtering is an algorithm that creates a recommendation system based on the behavioral data of a user. For instance,
online shopping sites usually compile a list of items under “recommended for you” based on your browsing history and previous purchases.
The crucial components of this algorithm are users, items, and their interests. It is used to broaden the range of options presented to users.
Online entertainment applications are another example of collaborative filtering.
For example, Netflix shows recommendations based on the user’s behavior. Collaborative filtering follows various techniques, such as:
i) Memory-based approach
ii) Model-based approach
Time Series analysis is a statistical procedure that deals with the ordered sequence of values of a variable at equally spaced time intervals.
Time series data are collected at adjacent periods. So, there is a correlation between the observations. This feature distinguishes time-series data from cross-sectional data.
Daily counts of coronavirus cases are an example of time-series data.
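A small sketch of working with equally spaced time-series data in pandas (the values are made up):

```python
import pandas as pd

# Made-up daily case counts at equally spaced (daily) intervals
dates = pd.date_range("2024-01-01", periods=14, freq="D")
cases = pd.Series([5, 8, 13, 21, 30, 44, 60, 75, 90, 110, 128, 150, 170, 195],
                  index=dates)

# A 7-day rolling mean smooths short-term fluctuation and shows the trend
weekly_trend = cases.rolling(window=7).mean()

# Autocorrelation at lag 1: adjacent observations are correlated
print(cases.autocorr(lag=1))
print(weekly_trend.tail())
```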
This is one of the important data analyst interview questions. When two separate keys hash to a common value, a hash table collision occurs.
Because two different items cannot be stored in the same slot, the collision has to be resolved.
Hash collisions can be avoided by:
· Separate chaining – In this method, a data structure is used to store multiple items hashing to a common slot.
· Open addressing – This method seeks out empty slots and stores the item in the first empty slot available.
A better way to prevent hash collisions is to use a good, appropriate hash function. A good hash function distributes the elements uniformly over the hash table, so there is less chance of collisions occurring in the first place.
There are many skills that a budding data analyst needs. Here are some of them:
· Being well-versed in languages and frameworks such as XML, JavaScript, and ETL frameworks
· Proficient in databases such as SQL, MongoDB, and more
· Ability to effectively collect and analyze data
· Knowledge of database designing and data mining
· Having the ability/experience of working with large datasets
Again, your interviewer might seek to test your understanding of SQL principles by asking about specific SQL queries and terms
and what they do. It’s worth preparing your knowledge of terms such as:
· Clustered vs. non-clustered index
· Constraints
· Cursor
· DBMS vs. RDBMS
· ETL
· Index
37. What are the problems that a Data Analyst can encounter while performing data analysis?
A critical data analyst interview question you need to be aware of. A Data Analyst can confront the following issues while performing data analysis:
· Presence of duplicate entries and spelling mistakes. These errors can hamper data quality.
· Poor quality data acquired from unreliable sources. In such a case, a Data Analyst will have to spend a significant amount of time in cleansing the data.
· Data extracted from multiple sources may vary in representation. Once the collected data is combined after being cleansed
and organized, the variations in data representation may cause a delay in the analysis process.
· Incomplete data is another major challenge in the data analysis process. It would inevitably lead to erroneous or faulty results.
The main advantages of version control are –
· It allows you to compare files, identify differences, and consolidate the changes seamlessly.
· It helps to keep track of application builds by identifying which version is under which category – development, testing, QA, and production.
· It maintains a complete history of project files that comes in handy if ever there’s a central server breakdown.
· It is excellent for storing and maintaining multiple versions and variants of code files securely.
· It allows you to see the changes made in the content of different files.
Version control can also be called source control. It tracks the changes made to software and, using certain algorithms and functions, manages those changes so that the team responsible for the task can work on the software effectively without losing efficiency. Version control is done with dedicated version control tools, which manage and save the changes made to a computer program. For example, in a Google Doc, whatever has been added to the document can be accessed by the user the next time they open it, without the need to save each change. The changes and edits also appear in real time to all users who have access to the document.
Ans. An Affinity Diagram is an analytical tool to cluster or organize data into subgroups based on their relationships. These data or ideas
are mostly generated from discussions or brainstorming sessions and used to analyse complex issues.
Ans. Some of the vital Python libraries used in Data Analysis include –
· Bokeh
· Matplotlib
· NumPy
· Pandas
· SciKit
· SciPy
· Seaborn
· TensorFlow
· Keras
MapReduce is a framework used to process large data sets by splitting them into subsets, processing each subset on a different server, and then blending the results obtained from each.
43. What is the difference between COUNT, COUNTA, COUNTBLANK, and COUNTIF in Excel?
· COUNT function returns the count of numeric cells in a range
· COUNTA function counts the non-blank cells in a range
· COUNTBLANK function gives the count of blank cells in a range
· COUNTIF function returns the count of values that meet a given condition (rough pandas equivalents are sketched below)
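For reference, roughly equivalent counts can be computed in pandas; this is a pandas analogue for illustration, not Excel itself, and the column is made-up data:

```python
import pandas as pd
import numpy as np

col = pd.Series([10, 25, np.nan, "n/a", 40, np.nan, 55])

count      = pd.to_numeric(col, errors="coerce").notna().sum()  # like COUNT: numeric cells
counta     = col.notna().sum()                                  # like COUNTA: non-blank cells
countblank = col.isna().sum()                                   # like COUNTBLANK: blank cells
countif    = (pd.to_numeric(col, errors="coerce") > 20).sum()   # like COUNTIF(range, ">20")

print(count, counta, countblank, countif)
```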
Ans. To deal with multi-source problems, one can:
· Restructure the schemas to accomplish a schema integration
· Identify similar records and merge them into a single record containing all relevant attributes
There are many types of hypothesis testing. Some of them are as follows (a scipy-based sketch follows this list):
· Analysis of variance (ANOVA): Here, the analysis is conducted between the mean values of multiple groups.
· T-test: This form of testing is used when the standard deviation is not known, and the sample size is relatively small.
· Chi-square Test: This kind of hypothesis testing is used when there is a requirement to find the level of association between the categorical variables in a sample.
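These tests are available in scipy.stats; below is a brief sketch on made-up samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(50, 5, size=25)   # made-up samples
group_b = rng.normal(53, 5, size=25)
group_c = rng.normal(49, 5, size=25)

# One-way ANOVA: compares the means of multiple groups
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Two-sample t-test: small samples, population std deviation unknown
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)

# Chi-square test of independence on a made-up contingency table
table = np.array([[30, 10],
                  [20, 25]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(p_anova, p_ttest, p_chi2)
```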
Ans. Imputation is the process of replacing missing data with substituted values. While there are many ways to approach missing data, imputation techniques fall into two types, single and multiple (a short sketch follows the list below):
· Single Imputation: In this, you find a single estimate of the missing value. The following are the single imputation techniques:
· Mean imputation: Replace the missing value with the mean of that variable for all other cases.
· Hot deck imputation: Identify all the sample subjects who are similar on other variables, then randomly choose one of their values on the missing variable.
· Cold deck imputation: Works just like hot deck imputation, but the replacement value is chosen systematically from an individual with similar values on the other variables.
· Regression imputation: Replace the missing value with the value predicted by regressing the missing variable on the other variables.
· Stochastic regression imputation: Works like regression imputation but adds the average regression variance to the regression prediction.
· Substitution: Impute the value from a new individual who was not selected to be in the sample.
· Multiple Imputation: The missing values are estimated multiple times, producing several completed datasets whose results are then combined.
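Regression-style and multiple-imputation-style approaches are available, for example, through scikit-learn's IterativeImputer, shown here as an illustrative sketch (note that it has to be explicitly enabled as an experimental feature):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up data with missing entries
X = np.array([
    [25.0, 40000.0],
    [32.0, np.nan],
    [np.nan, 52000.0],
    [41.0, 61000.0],
])

# Each feature with missing values is modelled as a regression on the
# other features; sample_posterior=True adds randomness, so repeated
# runs can be combined in a multiple-imputation style.
imputer = IterativeImputer(sample_posterior=True, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```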
VLOOKUP is used when you need to find things in a table or a range by row.
VLOOKUP accepts the following four parameters:
lookup_value - The value to look for in the first column of a table
table - The table from where you can extract value
col_index - The column from which to extract value
range_lookup - [optional] TRUE = approximate match (default). FALSE = exact match
Let’s understand VLOOKUP with an example.
If you wanted to find the department to which Stuart belongs, you could use the VLOOKUP function as follows: =VLOOKUP(A11, A2:E7, 3, 0).
Here, cell A11 holds the lookup value, A2:E7 is the table array, 3 is the index of the column containing the department information, and 0 (FALSE) requests an exact-match lookup.
If you hit Enter, it will return “Marketing”, indicating that Stuart is from the marketing department.
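The same lookup can be reproduced in pandas with a merge; this is a rough analogue for illustration, not Excel itself, and the table is made up:

```python
import pandas as pd

# Made-up employee table (playing the role of the Excel range A2:E7)
employees = pd.DataFrame({
    "name":        ["Stuart", "Maria", "Chen"],
    "employee_id": [101, 102, 103],
    "department":  ["Marketing", "Finance", "IT"],
})

# Lookup table of names we want departments for (like the lookup value in A11)
lookups = pd.DataFrame({"name": ["Stuart"]})

# A left merge behaves like an exact-match VLOOKUP on the 'name' column
result = lookups.merge(employees[["name", "department"]], on="name", how="left")
print(result)   # Stuart -> Marketing
```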
Ans. There are various ways you can answer this question: the data might be very badly formatted, there might not be enough data to work with, clients might provide data they have supposedly cleaned but actually made worse, the data might not be up to date, or there might be factual/data entry errors.
As a standard practice, a t-test is used when the sample size is less than 30, and a z-test is considered when the sample size exceeds 30 in most cases.
Ans. The primary benefits of version control are –
· Enables comparing files, identifying differences, and merging the changes
· Allows keeping track of application builds by identifying which version is under development, QA, and production
· Helps to improve the collaborative work culture
· Keeps different versions and variants of code files secure
· Allows seeing the changes made in the file’s content
· Keeps a complete history of the project files in case of central server breakdown
Ans. A data collection plan describes how all the critical data in a system will be collected. It covers:
· The type of data that needs to be collected or gathered
· The different data sources for analyzing the data set