**1. What is the importance of validation of data?**

From a business perspective, data validation is important at every stage because it ensures that the data stored in your systems is reliable, accurate, clean and useful. Improper validation or incorrect data has a direct impact on sales, revenue figures and the overall health of the business.

**2. What are the various approaches to dealing with missing values?**

Missing values or missing data can be dealt with by taking the following approaches-

● Encoding NAs as a distinct value- this was a very common method in the early days, when working with machine learning algorithms was not yet widespread

● Deleting missing data case-wise- this method works well for large datasets with very few missing values

● Replacing missing values with the mean or median- this method works very well for numerical features (see the sketch after this list)

● Running predictive models to impute missing values- this is highly effective because the imputed values are consistent with the final model

● Linear regression- can provide good estimates for missing numerical values
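
As an illustration of the mean/median approach, here is a minimal sketch using pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical dataset with missing numerical values
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29],
    "income": [52000, None, 61000, 87000, None],
})

# The median is often preferred over the mean for skewed features
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```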

**3. How do you know if a developed data model is good or bad?**

A developed data model should fulfil the following criteria to qualify as a good model-

● Whether the data in the model can be easily consumed

● Whether the model remains scalable even when the underlying data changes significantly

● Whether its performance is predictable

● How well and how quickly the model can adapt to changes

**4. What are some of the challenges I could face if I were to perform a data analysis?**

Performing data analysis may involve the following challenges-

● Collecting too much data, which can overwhelm data analysts and employees

● Differentiation between meaningful and useless data

● Incoherent visual representation of data

● Collating and analysing data from multiple sources

● Storing massive amounts of generated data

● Ensuring the security and privacy of both stored and newly generated data

● A shortage of experts and industry professionals who understand big data in depth

● Exposure to poor quality or inaccurate data

**5. Explain the method of KNN imputation.**

The term imputation means replacing the missing values in a data set with other plausible values. KNN imputation deals with missing data by matching each affected data point with its K nearest neighbours in a multi-dimensional feature space and deriving the replacement value from them. This has been a highly popular method in pattern recognition and statistical estimation since the beginning of the 1970s.
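
A minimal sketch using scikit-learn's KNNImputer; the data here is made up:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data with missing entries (np.nan)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that feature across
# the 2 nearest neighbours (nearest by Euclidean distance)
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```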

**6. What does transforming data mean?**

Data transformation is the process of converting data or information from one format into the format required by a target system. While transforming data mostly involves the conversion of documents, it occasionally also means translating a program from one computer language to another so that it is readable by the system.

Data transformation comprises two key phases: data mapping, which defines how elements of the source correspond to the target and ensures a smooth transformation, and code generation, which produces the actual transformation program that runs on computer systems.
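
As a toy illustration of the data-mapping phase, the sketch below reshapes records from a hypothetical source format into a hypothetical target format; the field names and `transform` function are all assumptions:

```python
# Hypothetical source records
source_records = [
    {"fname": "Ada", "lname": "Lovelace", "dob": "1815-12-10"},
    {"fname": "Alan", "lname": "Turing", "dob": "1912-06-23"},
]

# Data mapping: source field -> target field
field_map = {"fname": "first_name", "lname": "last_name", "dob": "birth_date"}

def transform(record: dict) -> dict:
    """Apply the field mapping to a single record."""
    return {target: record[source] for source, target in field_map.items()}

target_records = [transform(r) for r in source_records]
print(target_records)
```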

**7. State the difference between null and alternative hypothesis.**

A null hypothesis states that there is no significant relationship between two variables; it is what the researcher tries to disprove. Under a null hypothesis no effects are observed, no changes in actions or opinions follow, and the researcher's observations are attributed purely to chance.

An alternative hypothesis, on the other hand, is the opposite of the null hypothesis: it asserts a significant relationship between the two measured and verified phenomena. Some effects are observed under an alternative hypothesis, and since this is what the researcher is trying to prove, some changes in opinions and actions are involved. An alternative hypothesis is the result of a real effect.
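
As a concrete illustration, a two-sample t-test pits the null hypothesis (the two group means are equal) against the alternative (they differ). A minimal sketch with SciPy, using made-up measurements:

```python
from scipy import stats

# Hypothetical measurements from two groups
group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.0, 5.7, 6.1, 5.9]

# H0: mean(group_a) == mean(group_b); H1: the means differ
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# A small p-value (e.g., < 0.05) is evidence against the null hypothesis
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```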

**8. What do you mean by principal component analysis?**

Principal component analysis (PCA) is a method for reducing the dimensionality of large data sets by transforming a large set of variables into a smaller one while retaining most of the principal information. This is done mainly because smaller data sets are easier to explore and faster to process, which speeds up subsequent data analysis and machine learning.
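
A minimal sketch with scikit-learn, reducing 10 hypothetical features to 2 components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Standardise first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

# Keep the 2 components that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```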

**9. Define the term - logistic regression.**

Logistic regression is a form of predictive analysis in machine learning that identifies the relationship between a binary dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables, while also describing the data. It is used when the dependent variable is categorical.
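
A minimal sketch with scikit-learn on synthetic binary-classification data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a binary target variable
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the model and report accuracy on the held-out set
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

# predict_proba returns the estimated probability of each class
print(model.predict_proba(X_test[:3]))
```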

**10. How can I deal with multi-source problems?**

Storing the same data in multiple places often causes quality problems in analytics. Depending on the magnitude of the issues, a complete data management system may need to be put in place. Data reconciliation, elaborate and well-documented databases, and pooling segmented data can all help in dealing with multi-source problems. Aggregation and data integration are also helpful when dealing with multi-source data.
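
As a small sketch of one such integration step, the code below merges and de-duplicates records from two hypothetical source systems using pandas:

```python
import pandas as pd

# Hypothetical customer records from two source systems
crm = pd.DataFrame({"email": ["a@x.com", "b@x.com"], "name": ["Ann", "Bob"]})
billing = pd.DataFrame({"email": ["b@x.com", "c@x.com"], "balance": [10.0, 25.0]})

# Integrate on a shared key, then drop duplicate records
merged = crm.merge(billing, on="email", how="outer")
merged = merged.drop_duplicates(subset="email")
print(merged)
```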

**11. List the most important types of clustering algorithms.**

The most important types of clustering algorithms are-

● Connectivity models- based on the idea that data points farther apart in the data space are less similar than data points that are closer together

● Centroid models- the closeness of a data point to the cluster centroid defines the notion of similarity for this model (k-means, sketched after this list, is the classic example)

● Distribution models- based on the probability that all data points in the same cluster belong to the same distribution

● Density models- search for areas of varying density of data points in the data space
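
A minimal k-means sketch with scikit-learn, run on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural clusters
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# k-means assigns each point to its nearest cluster centroid
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels[:10])
print(kmeans.cluster_centers_)
```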

**12. Why do we scale data?**

Scaling is important because a data set will often contain features that differ completely or partially in units, range and magnitude. While scaling has minimal or zero effect on certain algorithms, it can have a clearly positive impact on others, particularly distance-based methods. It is an important data pre-processing step that normalises the data within a given range, and it often speeds up algorithm calculations.
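
A minimal sketch of two common scaling techniques with scikit-learn; the features below are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features with very different units and ranges
X = np.array([
    [25, 52000],   # age in years, income in dollars
    [32, 61000],
    [41, 87000],
])

# Standardisation: zero mean, unit variance per feature
print(StandardScaler().fit_transform(X))

# Min-max scaling: each feature mapped to the range [0, 1]
print(MinMaxScaler().fit_transform(X))
```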