“Missing values in data analysis” refers to values that are absent from a dataset or were never recorded for a given variable. In this post, we walk through the handling of missing data, a critical part of data pre-processing that requires both accuracy and judgement. We’ll cover the causes and types of missingness, as well as missing value treatment.

Common Causes of Missing Values in Data Analysis

Missing data affects all data-related professions and can lead to a number of challenges, such as lower performance, data processing difficulties, and biased conclusions caused by discrepancies between complete and missing information. Some of the probable causes of missing data are:

Frequent missingness can reduce overall statistical power and introduce bias into estimates. How much missing values matter depends on the amount of missing data, its pattern, and the mechanism that caused it. A deliberate strategy is therefore always necessary when dealing with missing data, as poor handling can produce significantly biased study results and lead to inaccurate conclusions.
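As a first step, the amount and pattern of missingness mentioned above can be measured directly. The following is a minimal sketch using pandas; the toy dataset and its column names (age, income, smoker) are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, np.nan, 41, 29, np.nan],
    "income": [42000, np.nan, 38000, np.nan, 51000, np.nan],
    "smoker": ["no", "yes", "no", None, "yes", "no"],
})

# Magnitude: share of missing values in each column.
missing_share = df.isna().mean()

# Pattern: how often each combination of missing/present columns occurs.
missing_patterns = df.isna().value_counts()

print(missing_share)
print(missing_patterns)
```

A column with a high share of missing values, or a pattern in which several columns tend to be missing together, is usually worth investigating before any treatment is chosen.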

Various Types of Missing Values in Data Analysis and Their Impacts

MCAR or Missing Completely at Random

In MCAR, missingness has no relationship with either observed or unobserved values in the dataset. Simply put, the lack of data occurs at random, with no clear pattern. 

A classic example of MCAR occurs when a survey participant inadvertently skips a question. The chance of data being absent is independent of any other information in the dataset. This mechanism is regarded as the most favourable for data analysis, since it introduces no bias.
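The claim that MCAR introduces no bias can be checked with a small simulation. This is only a sketch under assumed numbers: the variable (a survey answer drawn from a Poisson distribution) and the 30% missing rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical survey answers (for example, cigarettes smoked per day).
answers = rng.poisson(lam=8, size=n).astype(float)

# MCAR: every answer has the same 30% chance of being skipped,
# regardless of its own value or of anything else in the data.
skipped = rng.random(n) < 0.3
observed = answers.copy()
observed[skipped] = np.nan

full_mean = answers.mean()
observed_mean = np.nanmean(observed)  # mean of the answers that remain

print(full_mean, observed_mean)
```

The two means should be very close. What is lost under MCAR is sample size, and with it statistical power, rather than accuracy of the estimate.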

MAR or Missing at Random

In MAR, the missingness can be explained by some of the observed variables in the dataset. Although the data is missing systematically, it is still deemed random because, once those observed variables are taken into account, the missingness has no relationship to the unobserved values.

For example, in tobacco research, younger individuals may report their values less frequently (independent of their smoking status), resulting in systematic missingness due to age.
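The tobacco example can be sketched as a simulation. The numbers below are assumptions made purely for illustration: two age groups, a different average consumption in each group, and a higher non-response rate among younger participants. Within an age group, non-response does not depend on how much a person smokes, which is what makes this MAR.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed setup: half the participants are "young", and (hypothetically)
# younger participants smoke fewer cigarettes per day on average.
young = rng.random(n) < 0.5
cigs = rng.poisson(lam=np.where(young, 5.0, 10.0)).astype(float)

# MAR: the chance of a missing answer depends only on age (observed),
# not on the smoking value itself.
p_missing = np.where(young, 0.6, 0.1)
missing = rng.random(n) < p_missing
observed = np.where(missing, np.nan, cigs)

# Naive estimate: ignores that young participants are under-represented.
naive_mean = np.nanmean(observed)

# Adjusted estimate: average within each age group, then weight each
# group by its share of the full sample.
adjusted_mean = sum(
    (young == g).mean() * np.nanmean(observed[young == g])
    for g in (True, False)
)

print(cigs.mean(), naive_mean, adjusted_mean)
```

In this setup the naive mean overstates consumption, because older participants answer more often, while the age-adjusted mean lands close to the true value. The bias under MAR can be corrected precisely because the variable driving the missingness is observed.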

MNAR or Missing Not at Random

MNAR occurs when the missingness is linked to the unobserved values themselves. In this situation, the missing data is not random: whether a value is missing depends on what that value would have been.

Returning to the tobacco research example, individuals who smoke the most may purposefully conceal their smoking habits, resulting in systematic missingness that depends on the very values that are missing.
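A simulation of this scenario shows why MNAR is the hardest case. The relationship between consumption and non-response used here (an 8 percentage point increase per cigarette, capped at 90%) is an arbitrary assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical cigarettes smoked per day.
cigs = rng.poisson(lam=8, size=n).astype(float)

# MNAR: the heavier the smoker, the more likely the answer is withheld.
# The missingness depends on the value that ends up missing.
p_missing = np.minimum(0.08 * cigs, 0.9)
missing = rng.random(n) < p_missing
observed = np.where(missing, np.nan, cigs)

true_mean = cigs.mean()
naive_mean = np.nanmean(observed)

print(true_mean, naive_mean)
```

The observed mean clearly understates the true mean, and nothing in the observed data reveals by how much, since the variable that drives the missingness is exactly the one that was not recorded.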

Treatment of Missing Values: Approach for Handling

Three commonly utilised approaches to address missing data include:

All these methods can be further categorised.
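As an illustration of what such treatments look like in practice, here is a minimal pandas sketch of two widely used options, listwise deletion and mean imputation. These two are chosen as assumptions for the example and are not necessarily the exact approaches meant above; the dataset and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps in both columns.
df = pd.DataFrame({
    "age": [23.0, 35.0, np.nan, 41.0, 29.0],
    "cigs_per_day": [0.0, np.nan, 12.0, 5.0, np.nan],
})

# Deletion: keep only the rows that have no missing values.
complete_cases = df.dropna()

# Imputation: replace each missing value with the mean of its column.
mean_imputed = df.fillna(df.mean())

print(complete_cases)
print(mean_imputed)
```

Deletion is simple but discards information, and it is generally safe only when the data is MCAR. Mean imputation keeps every row but shrinks the variance of the imputed column, so neither should be applied without considering the missingness mechanism.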

Furthermore, choosing the right treatment will depend on several factors:

Implications/Impacts of Various Types of Missing Data

MCAR:

MAR:

MNAR:

Final Words

Understanding the factors that cause missing data is critical for any data scientist or analyst. Each mechanism – MCAR, MAR, and MNAR – has particular challenges and consequences for data processing.

As data scientists, it is critical to identify the mechanism at work and apply appropriate imputation or handling procedures. Failure to treat missing data appropriately can jeopardise the integrity of the analysis and lead to incorrect results. Choosing a strategy that suits the mechanism reduces the influence of missing data.

To learn more about data science and analytics concepts, enrol in the data science course by Imarticus.