Missing values in data analysis are values that are absent from a dataset or were never recorded for a given variable. In this post, we will take a tour through the complex terrain of handling missing data, a critical part of data pre-processing that requires both accuracy and judgement. We'll look at the causes and types of missingness, as well as how missing values can be treated.
Common Causes of Missing Values in Data Analysis
Missing data affects every data-related profession and can lead to a number of challenges, such as lower model performance, data processing difficulties, and biased conclusions caused by systematic differences between the cases with complete data and those with missing data. Some of the probable causes of missing data are:
- Human errors during data collection and entry
- Machine errors caused by equipment or software malfunctions
- Participant drop-outs from the study
- Respondents refusing to answer certain questions
- Study duration and nature
- Data transmission and conversion
- Integrating unrelated datasets
Frequent missingness can reduce overall statistical power and introduce bias into estimates. How much missing values matter depends on the proportion of data that is missing, the pattern of the missingness, and the process that caused it. A deliberate strategy is therefore always necessary when dealing with missing data, as poor handling can produce significantly biased study results and lead to inaccurate conclusions.
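As a starting point, the proportion and pattern of missingness can be measured directly. The sketch below uses only the Python standard library; the `records` data and the helper names `missing_share` and `missing_patterns` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical survey records; None marks a missing value.
records = [
    {"age": 23, "income": 41000, "smoker": None},
    {"age": 35, "income": None, "smoker": "no"},
    {"age": None, "income": None, "smoker": "yes"},
    {"age": 52, "income": 58000, "smoker": "no"},
]

def missing_share(rows):
    """Return the fraction of missing (None) entries for each column."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}

def missing_patterns(rows):
    """Count how often each combination of missing columns occurs."""
    return Counter(
        tuple(sorted(c for c, v in r.items() if v is None)) for r in rows
    )

share = missing_share(records)        # per-column magnitude of missingness
patterns = missing_patterns(records)  # which columns go missing together

print(share)
print(patterns)
```

A column with a high share, or columns that tend to go missing together, is a hint that the missingness may not be purely random.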
Types of Missing Values in Data Analysis and Their Impacts
MCAR or Missing Completely at Random
In MCAR, missingness has no relationship with either observed or unobserved values in the dataset. Simply put, the lack of data occurs at random, with no clear pattern.
A classic example of MCAR occurs when a survey participant inadvertently skips a question. The chance of a value being absent is independent of any other information in the dataset. This mechanism is regarded as the most benign for data analysis: it costs sample size, but analysing the remaining data does not introduce bias.
MAR or Missing at Random
In MAR, the missingness can be explained by some of the observed variables in the dataset. Although the data is missing systematically, it is still deemed random because, once those observed variables are taken into account, the missingness has no relationship to the unobserved values.
For example, in tobacco research, younger individuals may report their smoking habits less frequently (regardless of how much they actually smoke), resulting in systematic missingness that is explained by age.
MNAR or Missing Not at Random
MNAR happens when the missingness is linked to the unobserved data itself. In this situation, whether a value is missing depends on what that value would have been, even after accounting for everything that was observed.
Referring to the tobacco research example, individuals who smoke the most may purposefully conceal their smoking habits, resulting in systematic missingness that depends on the very value that goes unreported.
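The three mechanisms can be compared on synthetic data. The sketch below is an illustration under made-up assumptions: the population, the link between age and smoking, and the missingness probabilities are all invented, and `observed_mean` is a hypothetical helper. It drops values under each mechanism and compares the mean of what remains with the true mean.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic population (illustrative assumption): older people smoke more.
people = []
for _ in range(20_000):
    age = random.randint(18, 70)
    cigs = 8 + 0.2 * (age - 18) + random.gauss(0, 2)  # cigarettes per day
    people.append((age, cigs))

def observed_mean(rows, p_missing):
    """Mean of cigs over the rows that were not dropped.

    p_missing(age, cigs) gives the probability that a row's value is missing.
    """
    kept = [cigs for age, cigs in rows if random.random() >= p_missing(age, cigs)]
    return mean(kept)

true_mean = mean(cigs for _, cigs in people)

# MCAR: every value has the same 30% chance of being missing.
mcar_mean = observed_mean(people, lambda age, cigs: 0.3)
# MAR: missingness depends only on an observed variable (age).
mar_mean = observed_mean(people, lambda age, cigs: 0.6 if age < 35 else 0.1)
# MNAR: missingness depends on the unreported value itself.
mnar_mean = observed_mean(people, lambda age, cigs: 0.6 if cigs > 15 else 0.1)

print(f"true {true_mean:.2f}  MCAR {mcar_mean:.2f}  "
      f"MAR {mar_mean:.2f}  MNAR {mnar_mean:.2f}")
```

In this setup the MCAR mean stays close to the true mean, while the unadjusted MAR and MNAR means drift away from it. The difference is that the MAR drift can be corrected using age, which is observed, whereas the MNAR drift cannot be corrected from the observed data alone.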
Treatment of Missing Values: Approaches for Handling
Three commonly utilised approaches to address missing data include:
- Deletion method
- Imputation method
- Model-based method
Each of these methods has several variants. Deletion, for example, can be listwise or pairwise, and imputation can be single or multiple.
Furthermore, choosing the right treatment will depend on several factors:
- Type of missing data: MCAR, MAR, or MNAR
- Missing value proportion
- Data type and distribution
- Analytical objectives and assumptions
Implications of the Various Types of Missing Data
MCAR:
- MCAR data can be handled with simple methods such as listwise deletion or mean imputation without biasing estimates of the mean, although deletion reduces the sample size and mean imputation understates variability.
- Statistical results derived from MCAR data are usually unbiased and reliable.
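A minimal sketch of the two simple methods, using made-up numbers: listwise deletion keeps only the complete values, while mean imputation fills each gap with the observed mean. The data and variable names are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical measurements; None marks a missing value.
values = [4.0, None, 6.0, 8.0, None, 10.0]

# Listwise deletion: keep only the observed values.
complete = [v for v in values if v is not None]

# Mean imputation: replace each missing value with the observed mean.
fill = mean(complete)
imputed = [fill if v is None else v for v in values]

print(mean(complete), stdev(complete))  # fewer rows, original spread
print(mean(imputed), stdev(imputed))    # same mean, smaller spread
```

Both versions have the same mean, but the imputed series has a smaller standard deviation because every filled-in value sits exactly at the centre. Anything that relies on the spread of the data, such as standard errors or correlations, is affected by this.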
MAR:
- MAR data requires more involved handling techniques, such as multiple imputation or maximum likelihood estimation, that make use of the observed variables explaining the missingness.
- Failing to account for MAR properly may introduce bias and undermine the validity of statistical analyses.
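In multiple imputation, the dataset is imputed several times, the analysis is run on each completed dataset, and the results are combined. The sketch below covers only that final pooling step, using the standard combining rules usually attributed to Rubin. The function name `pool_rubin` and the input numbers are hypothetical, and producing the imputed datasets themselves is left to a statistics library.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # spread across the imputed datasets
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 3 imputed datasets.
pooled, total = pool_rubin([10.0, 11.0, 12.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

The between-imputation term is what separates this from single imputation: disagreement between the imputed datasets is carried into the final uncertainty instead of being hidden.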
MNAR:
- MNAR data is the most difficult to handle, as the reasons for missingness are not captured within the observed data.
- Traditional imputation methods may not be appropriate for MNAR data, and specialised techniques that explicitly model the reasons for missingness are required.
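One common way to explore MNAR data is a sensitivity analysis: since the observed data cannot reveal how different the missing values are, the analysis is repeated under several assumed differences. The sketch below is a simple delta-adjustment illustration, not a full MNAR model. The data, the `delta` values, and the helper `delta_adjusted_mean` are hypothetical.

```python
from statistics import mean

observed = [8.0, 10.0, 12.0, 9.0]  # reported cigarettes per day (hypothetical)
n_missing = 2                      # respondents who did not answer

def delta_adjusted_mean(obs, n_miss, delta):
    """Impute each missing value as the observed mean plus delta, then average."""
    imputed = [mean(obs) + delta] * n_miss
    return mean(obs + imputed)

# delta = 0 assumes non-respondents resemble respondents;
# larger deltas assume they smoke more than they would admit.
results = {d: delta_adjusted_mean(observed, n_missing, d) for d in (0.0, 3.0, 6.0)}
print(results)
```

If the conclusion of the study changes across plausible values of `delta`, the result is sensitive to the MNAR assumption and should be reported with that caveat.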
Final Words
Understanding the factors that cause missing data is critical for any data scientist or analyst. Each mechanism - MCAR, MAR, and MNAR - has particular challenges and consequences for data processing.
As data scientists, we need to identify the likely missingness mechanism and then apply an appropriate imputation or handling procedure. Failing to treat missing data appropriately can jeopardise the integrity of an analysis and lead to incorrect results, whereas a well-chosen strategy reduces the influence of missing data.
To learn more about data science and analytics concepts, enrol in the data science course by Imarticus.