1. What is the importance of validation of data?
From a business perspective, data validation is an important tool at every stage because it
ensures reliability and accuracy. It also ensures that the data stored in your system is
accurate, clean and useful. Improper validation or incorrect data has a direct impact on sales,
revenue figures and the wider economy.
2. What are the various approaches to dealing with missing values?
Missing values or missing data can be dealt with by taking the following approaches-
● Encoding NAs- this used to be a very common method in the early days, when working
with machine learning algorithms was not yet widespread
● Deleting missing data case wise- this method works well for large datasets with very few
missing values
● Using mean/median value to replace missing values- this method works very well for
numerical features
● Running predictive models to impute missing values- this is highly effective, particularly
when the imputation is consistent with the final model (see the sketch after this list)
● Linear regression- works well to provide good estimates for missing values
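A minimal sketch of how some of these approaches look in practice, assuming pandas and scikit-learn are available; the DataFrame and its columns are hypothetical:
```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Hypothetical data set with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "income": [50000, 60000, np.nan, 80000, 75000],
})

# Case-wise deletion: drop rows containing any missing value
df_dropped = df.dropna()

# Mean/median imputation for numerical features
df_mean = df.fillna(df.mean(numeric_only=True))
df_median = df.fillna(df.median(numeric_only=True))

# Model-based imputation: each feature with missing values is predicted
# from the other features using a regression model
df_model = pd.DataFrame(IterativeImputer().fit_transform(df), columns=df.columns)
print(df_model)
```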
3. How do you know if a developed data model is good or bad?
A developed data model should fulfil the following criteria to qualify as a good model-
● Whether the data in the model can be easily consumed
● Whether the model remains scalable even as the underlying data changes significantly
● Whether performance can be predicted or not
● How well and how quickly the model can adapt to changes
4. What are some of the challenges I can face if I were to perform a data analysis?
Performing data analysis may involve the following challenges-
● Collecting too much data, which can often overwhelm data analysts or other employees
● Differentiating between meaningful and useless data
● Incoherent visual representation of data
● Collating and analysing data from multiple sources
● Storing massive amounts of generated data
● Ensuring the security and privacy of both stored and generated data
● A shortage of industry professionals who understand big data in depth
● Exposure to poor-quality or inaccurate data
5. Explain the method of KNN imputation.
The term imputation means replacing the missing values in a data set with other plausible
values. KNN imputation deals with missing data by matching a given point with its K nearest
neighbours in a multi-dimensional feature space and using their values to fill the gap. This
has been a highly popular method in pattern recognition and statistical estimation since the
beginning of the 1970s.
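A minimal sketch of KNN imputation using scikit-learn's KNNImputer; the sample array below is a hypothetical illustration:
```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical feature matrix with missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by averaging that feature over the
# K nearest neighbours (here K = 2), with distances computed in the
# multi-dimensional space of the non-missing features.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```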
6. What does transforming data mean?
Data transformation is the process of converting data or information from one format into the
format required by a target system. While data transformation mostly involves the conversion of
documents, it can occasionally also mean converting a program from one computer language to
another so that it is readable by the system.
Data transformation comprises two key phases: data mapping, which defines how source fields
correspond to target fields so the transformation runs smoothly, and code generation, which
produces the actual transformation that runs on computer systems.
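A minimal sketch of those two phases in pandas; the source fields, target fields and date formats are hypothetical assumptions, not taken from the text:
```python
import pandas as pd

# Source data arrives in one format...
source = pd.DataFrame({
    "customer_name": ["Alice", "Bob"],
    "order_date": ["01/31/2024", "02/15/2024"],   # MM/DD/YYYY strings
})

# Data mapping: declare how source fields correspond to target fields
column_map = {"customer_name": "name", "order_date": "date"}

# Code generation / transformation: rename the columns and convert dates to ISO format
target = source.rename(columns=column_map)
target["date"] = pd.to_datetime(target["date"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
print(target)
```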
7. State the difference between null and alternative hypothesis.
A null hypothesis states that there is no significant relationship between the two variables; it is
what the researcher is trying to disprove. Under the null hypothesis no effects are observed and
no changes in actions or opinions follow, because the researcher's observations are assumed to
be purely the result of chance.
An alternative hypothesis, on the other hand, is the opposite of the null hypothesis: it states that
there is a significant relationship between the two measured and verified phenomena. Some
effects are observed under an alternative hypothesis, and since this is what the researcher is
trying to prove, changes in opinions and actions do follow. An alternative hypothesis reflects a
real effect.
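As an illustration (not from the source), a two-sample t-test in SciPy makes the two hypotheses concrete; the samples and the 0.05 significance level are hypothetical choices:
```python
import numpy as np
from scipy import stats

# H0: the two samples come from populations with equal means
# H1: the population means differ
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=100)   # hypothetical sample A
group_b = rng.normal(loc=52, scale=5, size=100)   # hypothetical sample B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# A small p-value means the observed difference is unlikely under pure chance,
# so we reject the null hypothesis in favour of the alternative.
if p_value < 0.05:
    print("Reject H0: the difference is statistically significant")
else:
    print("Fail to reject H0: no significant difference detected")
```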
8. What would you mean by principal component analysis?
Principal component analysis is a method used to reduce the dimensionality of large data sets
by transforming a large set of variables into a smaller one that still retains most of the principal
information. This is done mainly because smaller data sets are easier to explore and visualise,
which makes data analysis and machine learning considerably faster.
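A minimal sketch of PCA with scikit-learn, using the built-in iris data set as a stand-in for a larger data set:
```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                      # 150 samples, 4 features
X_scaled = StandardScaler().fit_transform(X)

# Project the 4-dimensional data onto its first 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# The explained variance ratio shows how much of the original
# information each principal component retains
print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)
```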
9. Define the term - logistic regression.
Logistic regression is a form of predictive analysis in machine learning that attempts to identify
relationships between variables. It is used to explain the relationship between a binary
dependent variable and one or more nominal, ordinal, interval or ratio-level independent
variables, while also describing the data. Logistic regression is therefore used when the
dependent variable is categorical.
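A minimal sketch of logistic regression on a binary dependent variable, using scikit-learn and its built-in breast cancer data set as an example:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # y is binary: malignant vs benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, then fit a logistic regression classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

print(model.score(X_test, y_test))       # accuracy on held-out data
print(model.predict_proba(X_test[:3]))   # class probabilities for new observations
```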
10. How can I deal with multi-source problems?
Storing the same data across multiple sources often causes quality problems in analytics.
Depending on the magnitude of the issues, a complete data management system may need to
be put in place. Data reconciliation, elaborate and informative databases, and pooling
segmented data all help in dealing with multi-source problems. Aggregation and data integration
are also helpful when dealing with multi-source data.
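A minimal sketch of one such reconciliation step in pandas; the two sources, their keys and their columns are hypothetical:
```python
import pandas as pd

# Two hypothetical sources describing the same customers
crm = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "total_spend": [120.0, 250.0, 90.0],
})

# De-duplicate within each source, then integrate on the common key
crm_clean = crm.drop_duplicates(subset="customer_id")
combined = crm_clean.merge(billing, on="customer_id", how="outer")
print(combined)
```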
11. List the most important types of clustering algorithms.
The most important types of clustering algorithms are-
● Connectivity models- based on the idea that data points farther apart in the data space
are less similar than data points that are closer together
● Centroid models- the notion of similarity is derived from the closeness of a data point to
the cluster centroid
● Distribution models- based on the probability that all data points in the same cluster are
part of the same distribution
● Density models- search for areas of varying density of data points in the data space (a
short sketch contrasting these families follows this list)
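A minimal sketch contrasting these model families on the same data, using scikit-learn; the synthetic blobs and parameter choices are illustrative assumptions:
```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Connectivity model: hierarchical clustering based on pairwise distances
connectivity = AgglomerativeClustering(n_clusters=3).fit_predict(X)
# Centroid model: similarity defined by distance to the cluster centroid
centroid = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Distribution model: each cluster is assumed to follow its own Gaussian
distribution = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
# Density model: clusters are dense regions separated by sparse ones
density = DBSCAN(eps=1.0).fit_predict(X)

print(set(density))   # -1 marks points DBSCAN leaves as noise
```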
12. Why do we scale data?
Scaling is important because a data set will often have features that differ completely or
partially in units, range and magnitude. While scaling has little or no effect on certain
algorithms, it can have a clearly positive impact on many others. It is an important data
pre-processing step that also helps to normalise data within a given range, and it often speeds
up the calculations an algorithm performs.
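A minimal sketch of two common scaling approaches in scikit-learn; the feature values are hypothetical and deliberately on very different scales:
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature in the thousands, one between 0 and 1
X = np.array([
    [1000.0, 0.5],
    [2000.0, 0.1],
    [3000.0, 0.9],
])

# Standardisation: rescale each feature to zero mean and unit variance
print(StandardScaler().fit_transform(X))

# Min-max normalisation: rescale each feature to the range [0, 1]
print(MinMaxScaler().fit_transform(X))
```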