
Missing Data Interpretation | Data Interpretation for Competitive Examinations - Bank Exams

Introduction

  • Missing data, or missing values, occur when you don’t have data stored for certain variables or participants. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons.

Types of missing data

  • Missing data are errors because your data don’t represent the true values of what you set out to measure. The reason for the missing data is important to consider, because it helps you determine the type of missing data and what you need to do about it. There are three main types of missing data.

1. Missing completely at random

  • When data are missing completely at random (MCAR), the probability of any particular value being missing from your dataset is unrelated to the value itself and to every other variable you measured.
  • The missing values are randomly distributed, so they can come from anywhere in the whole distribution of your values. These MCAR data are also unrelated to other unobserved variables.
    Example: MCAR data
    • You note that there are a few missing values in your holiday spending dataset. Some people started answering your survey but dropped out or skipped a question.
    • However, you note that you have data points from a wide distribution, ranging from low to high values.
    • Therefore, you conclude that the missing values aren’t related to any specific holiday spending amount range.
  • Data are often considered MCAR if they seem unrelated to specific values or other variables. In practice, it’s hard to meet this assumption because “true randomness” is rare. When data are missing due to equipment malfunctions or lost samples, they are usually treated as MCAR, because such failures rarely depend on the values being measured.

2. Missing at random

  • Data missing at random (MAR) are not actually missing at random; this term is a bit of a misnomer.
  • This type of missing data systematically differs from the data you’ve collected, but it can be fully accounted for by other observed variables.
  • The likelihood of a data point being missing is related to another observed variable but not to the specific value of that data point itself.
    Example: MAR data
    • You repeat your data collection with a new group. You notice that there are more missing values for adults aged 18–25 than for other age groups.
    • But looking at the observed data for adults aged 18–25, you notice that the values are widely spread. It’s unlikely that the missing data are missing because of the specific values themselves.
    • Instead, some younger adults may be less inclined to reveal their holiday spending amounts for unrelated reasons (e.g., more protective of their privacy).

3. Missing not at random

  • Data missing not at random (MNAR) are missing for reasons related to the values themselves.
    Example: MNAR data
    • In the new dataset, you also notice that there are fewer low values. Some participants with low incomes avoid reporting their holiday spending amounts because they are low.
  • This type of missing data is important to look for because you may lack data from key subgroups within your sample. Your sample may not end up being representative of your population.
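
The three mechanisms can be compared with a small simulation. The following is a minimal sketch, not a definitive model: the population, the age-group flag, the spending range, and the missingness probabilities are all invented for illustration. It also assumes spending is independent of age group, which is why the MAR mean stays close to the true mean here.

```python
import random
from statistics import mean

def simulate(n=20_000, seed=0):
    """Return the true mean and the observed mean under each mechanism."""
    rng = random.Random(seed)
    # Hypothetical population: an age-group flag and holiday spending.
    # Spending is drawn independently of age group (an assumption).
    people = [
        {"young": rng.random() < 0.3, "spend": rng.uniform(0, 1000)}
        for _ in range(n)
    ]

    def observed_mean(p_missing):
        # p_missing(person) is the probability that "spend" goes missing.
        kept = [p["spend"] for p in people if rng.random() >= p_missing(p)]
        return mean(kept)

    return {
        "true": mean(p["spend"] for p in people),
        # MCAR: the same probability for everyone.
        "mcar": observed_mean(lambda p: 0.2),
        # MAR: depends on an observed variable (age group), not on spend.
        "mar": observed_mean(lambda p: 0.5 if p["young"] else 0.1),
        # MNAR: depends on the value itself (low spenders skip the question).
        "mnar": observed_mean(lambda p: 0.6 if p["spend"] < 300 else 0.1),
    }

results = simulate()
for name, value in results.items():
    print(f"{name:>5}: {value:.1f}")
```

Under these assumptions, the MCAR and MAR means land near the true mean, while the MNAR mean is pushed upward because low values are underrepresented.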

Attrition bias

  • In longitudinal studies, attrition bias can be a form of MNAR data. Attrition bias means that some participants are more likely to drop out than others.
  • For example, in long-term medical studies, some participants may drop out because they become more and more unwell as the study continues. Their data are MNAR because their health outcomes are worse, so your final dataset may only include healthy individuals, and you miss out on important data.

Are missing data problematic?

  • Missing data are problematic because, depending on the type, they can sometimes cause sampling bias. This means your results may not be generalizable outside of your study because your data come from an unrepresentative sample.
  • In practice, you can often consider two types of missing data ignorable because whether a value is missing does not depend on the missing value itself:
    • MCAR data
    • MAR data
  • For these two data types, the likelihood of a data point being missing has nothing to do with the value itself (for MAR data, once the related observed variables are taken into account). So it’s unlikely that your missing values are significantly different from comparable observed values.
  • On the flip side, you have a biased dataset if the missing data systematically differ from your observed data. Data that are MNAR are called non-ignorable for this reason.

How to prevent missing data

  • Missing data often come from attrition bias, nonresponse, or poorly designed research protocols. When designing your study, it’s good practice to make it easy for your participants to provide data.
  • Here are some tips to help you minimize missing data:
    • Limit the number of follow-ups
    • Minimize the amount of data collected
    • Make data collection forms user friendly
    • Use data validation techniques
    • Offer incentives
  • After you’ve collected data, it’s important to store them carefully, with multiple backups.

How to deal with missing values

  • To tidy up your data, your options usually include accepting, removing, or recreating the missing data.
  • You should consider how to deal with each case of missing data based on your assessment of why the data are missing.
    • Are these data missing for random or non-random reasons?
    • Are the data missing because they represent zero or null values?
    • Was the question or measure poorly designed?
  • Your data can be accepted, or left as is, if they’re MCAR or MAR. However, MNAR data may need more complex treatment.

Acceptance

  • The most conservative option involves accepting your missing data: you simply leave these cells blank.
  • It’s best to do this when you believe you’re dealing with MCAR or MAR values. When you have a small sample, you’ll want to conserve as much data as possible because any data removal can affect your statistical power.
  • You might also recode all missing values with labels of “N/A” (short for “not applicable”) to make them consistent throughout your dataset.
  • These actions help you retain data from as many research subjects as possible with few or no changes.
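
Recoding missing values to one consistent label can be sketched as follows. The rows and the set of markers treated as missing (`MISSING_MARKERS`) are assumptions made for this illustration; a real dataset may use different markers.

```python
# Hypothetical raw survey rows; blanks and ad-hoc markers all mean "no answer".
raw_rows = [
    {"id": 1, "age": "34", "spend": "250"},
    {"id": 2, "age": "", "spend": "NA"},
    {"id": 3, "age": "n/a", "spend": "410"},
    {"id": 4, "age": "27", "spend": None},
]

# Assumed set of markers that count as missing; adjust for your own data.
MISSING_MARKERS = {"", "na", "n/a", "none", "-"}

def recode_missing(rows, label="N/A"):
    """Return copies of the rows with every missing marker replaced by one label."""
    def is_missing(value):
        return value is None or str(value).strip().lower() in MISSING_MARKERS

    return [
        {key: (label if is_missing(value) else value) for key, value in row.items()}
        for row in rows
    ]

clean_rows = recode_missing(raw_rows)
for row in clean_rows:
    print(row)
```

No case is dropped and no observed value is changed; only the way missing cells are labelled becomes uniform.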

Deletion

  • You can remove missing data from statistical analyses using listwise or pairwise deletion.

1. Listwise deletion

  • Listwise deletion means deleting data from all cases (participants) who have data missing for any variable in your dataset. You’ll have a dataset that’s complete for all participants included in it.
  • A downside of this technique is that you may end up with a much smaller and/or a biased sample to work with. If significant amounts of data are missing from some variables or measures in particular, the participants who provide those data might significantly differ from those who don’t.
  • Your sample could be biased because it doesn’t adequately represent the population.
    Example: Listwise deletion
    • You decide to remove all participants with missing data from your survey dataset. This reduces your sample from 114 to 77 participants.
    • You notice that most of the participants with missing data left a specific question about their opinions unanswered. Many of those participants were also women, so your sample now mainly consists of men.
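
Listwise deletion can be sketched in a few lines. The five survey rows below are hypothetical, and `None` is assumed to mark a missing answer.

```python
# Hypothetical survey rows; None marks a missing answer.
survey = [
    {"id": 1, "gender": "F", "age": 31, "opinion": 4},
    {"id": 2, "gender": "M", "age": 45, "opinion": None},
    {"id": 3, "gender": None, "age": 29, "opinion": 2},
    {"id": 4, "gender": "M", "age": None, "opinion": 5},
    {"id": 5, "gender": "F", "age": 52, "opinion": 3},
]

def listwise_delete(rows):
    """Keep only the cases that have a value for every variable."""
    return [row for row in rows if all(v is not None for v in row.values())]

complete_cases = listwise_delete(survey)
print(len(survey), "->", len(complete_cases))  # prints "5 -> 2"
```

It can be worth comparing the makeup of the sample before and after deletion (for example, the share of each gender) to check whether the remaining cases still resemble the original sample.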

2. Pairwise deletion

  • Pairwise deletion lets you keep more of your data by leaving a case out only of the analyses that need a value it is missing. It conserves more of your data because all available data from each case are included.
  • It also means that you have an uneven sample size for each of your variables. But it’s helpful when you have a small sample or a large proportion of missing values for some variables.
  • When you perform analyses with multiple variables, such as a correlation, only cases (participants) with complete data for each variable are included.
    Example: Pairwise deletion
    • You decide to only remove missing values, while retaining the other data points for these participants. This does not reduce your overall sample size.
    • 12 people didn’t answer a question about their gender, reducing the sample size from 114 to 102 participants for the variable “gender.”
    • 3 people didn’t answer a question about their age, reducing the sample size from 114 to 111 participants for the variable “age.”
  • You are able to retain more values this way, but the sample size now differs across variables.
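
A minimal sketch of pairwise deletion follows, using hypothetical rows with `None` as the missing marker. The helper `available` is an illustrative name, not a library function.

```python
from statistics import mean

# Hypothetical survey rows; None marks a missing answer.
survey = [
    {"id": 1, "age": 31, "spend": 250, "rating": 4},
    {"id": 2, "age": 45, "spend": None, "rating": 3},
    {"id": 3, "age": None, "spend": 410, "rating": 2},
    {"id": 4, "age": 29, "spend": 180, "rating": None},
    {"id": 5, "age": 52, "spend": 600, "rating": 5},
]

def available(rows, *variables):
    """Return the rows that have a value for every variable in this analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

# Single-variable analyses use every case that has that variable.
n_age = len(available(survey, "age"))
mean_spend = mean(r["spend"] for r in available(survey, "spend"))

# A two-variable analysis uses only cases complete on both variables.
pairs = [(r["age"], r["spend"]) for r in available(survey, "age", "spend")]

print(n_age, mean_spend, len(pairs))  # prints "4 360 3"
```

No row is ever removed from `survey`; each analysis simply draws on the cases it can use, which is why the sample size varies from one analysis to the next.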

Imputation

  • Imputation means replacing a missing value with another value based on a reasonable estimate. You use other data to recreate the missing value for a more complete dataset.
  • You can choose from several imputation methods.
  • The easiest method of imputation involves replacing missing values with the mean or median value for that variable.
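
Mean or median imputation can be sketched as below. The spending values are made up, with `None` standing in for a missing value and one deliberately extreme value (1800) to show how the two statistics differ.

```python
from statistics import mean, median

# Hypothetical spending values; None marks a missing value.
spend = [120, 300, None, 250, 1800, None, 210]

def impute(values, statistic=median):
    """Replace each None with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

median_filled = impute(spend)        # gaps filled with 250
mean_filled = impute(spend, mean)    # gaps filled with 536

print(median_filled)
print(mean_filled)
```

Here the single large value pulls the mean well above most observations, so the median gives a fill value closer to a typical response. Which statistic suits your data is a judgment call.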

1. Hot-deck imputation

  • In hot-deck imputation, you replace each missing value with an existing value from a similar case or participant within your dataset. For each case with missing values, the missing value is replaced by a value from a so-called “donor” that’s similar to that case based on data for other variables.
    Example: Hot-deck imputation
    • In a survey, you ask participants to answer questions about how they rate a new shopping app from 1 to 5. You notice that two participants skipped Question 3, so these cells are empty.
    • You sort the data based on other variables and search for participants who responded similarly to other questions compared to your participants with missing values.
    • You take the answer to Question 3 from a donor and use it to fill in the blank cell for each missing value.
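
The hot-deck steps above can be sketched as follows. The ratings are hypothetical, and measuring similarity as the sum of absolute differences across the other questions is an assumption chosen for simplicity; other matching rules are possible.

```python
# Hypothetical app ratings (1-5); None marks a skipped Question 3.
responses = [
    {"id": "A", "q1": 5, "q2": 4, "q3": 5, "q4": 4},
    {"id": "B", "q1": 2, "q2": 1, "q3": 2, "q4": 2},
    {"id": "C", "q1": 5, "q2": 5, "q3": None, "q4": 4},
    {"id": "D", "q1": 1, "q2": 2, "q3": None, "q4": 1},
    {"id": "E", "q1": 3, "q2": 3, "q3": 3, "q4": 3},
]

def hot_deck(rows, target, match_on):
    """Fill missing `target` values from the most similar complete case."""
    donors = [r for r in rows if r[target] is not None]

    def distance(a, b):
        # Assumed similarity measure: sum of absolute rating differences.
        return sum(abs(a[k] - b[k]) for k in match_on)

    filled = []
    for row in rows:
        if row[target] is None:
            donor = min(donors, key=lambda d: distance(row, d))
            row = {**row, target: donor[target]}  # copy; original row untouched
        filled.append(row)
    return filled

completed = hot_deck(responses, target="q3", match_on=["q1", "q2", "q4"])
for row in completed:
    print(row["id"], row["q3"])
```

In this toy data, participant C is closest to A and receives a 5, while D is closest to B and receives a 2.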

2. Cold-deck imputation

  • Alternatively, in cold-deck imputation, you replace missing values with existing values from similar cases from other datasets. The new values come from an unrelated sample.
    Example: Cold-deck imputation
    • Instead of replacing the missing values with answers from participants from the same sample, you open a different dataset from a coworker. They conducted a similar survey but used a different sample.
    • You search for participants who responded similarly to other questions compared to your participants with missing values.
    • You take the answer to Question 3 from the other dataset and use it to fill in the blank cell for each missing value.
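
Cold-deck imputation differs from hot-deck imputation only in where the donors come from. In this sketch both datasets are invented, and the similarity rule (sum of absolute differences on the shared questions) is again an assumption.

```python
# Your own sample; None marks a skipped Question 3.
own_sample = [
    {"id": "C", "q1": 5, "q2": 5, "q3": None},
    {"id": "D", "q1": 1, "q2": 2, "q3": None},
    {"id": "F", "q1": 4, "q2": 4, "q3": 4},
]

# Hypothetical coworker dataset from a similar survey with a different sample.
external_sample = [
    {"id": "X1", "q1": 5, "q2": 4, "q3": 4},
    {"id": "X2", "q1": 1, "q2": 1, "q3": 1},
    {"id": "X3", "q1": 3, "q2": 3, "q3": 3},
]

def cold_deck(rows, external_rows, target, match_on):
    """Fill missing `target` values using donors from a separate dataset."""
    donors = [r for r in external_rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is None:
            # Pick the external case whose other answers are closest.
            donor = min(
                donors,
                key=lambda d: sum(abs(row[k] - d[k]) for k in match_on),
            )
            row = {**row, target: donor[target]}
        filled.append(row)
    return filled

completed = cold_deck(own_sample, external_sample, "q3", ["q1", "q2"])
for row in completed:
    print(row["id"], row["q3"])
```

Observed answers in your own sample, such as F's, are left unchanged; only the blank cells take values from the external donors.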

Use imputation carefully

  • Imputation is a delicate task because you have to weigh a more complete dataset against the risk of distorting it.
  • Although you retain all of your data, this method can create research bias and lead to inaccurate results. You can never know for sure whether the replaced value accurately reflects what would have been observed or answered. That’s why it’s best to apply imputation with caution.
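
One concrete way imputation can distort results is that filling gaps with the mean leaves the mean unchanged but shrinks the spread of the variable. The sketch below uses invented values to show the effect; it illustrates a single risk and is not a full assessment of imputation.

```python
from statistics import mean, pstdev

# Hypothetical observed values plus three missing entries.
observed = [120, 300, 250, 480, 210, 390]
with_gaps = observed + [None, None, None]

fill = mean(observed)
imputed = [fill if v is None else v for v in with_gaps]

spread_before = pstdev(observed)  # spread of the observed values only
spread_after = pstdev(imputed)    # spread after mean imputation

print(f"mean before: {mean(observed):.1f}, after: {mean(imputed):.1f}")
print(f"spread before: {spread_before:.1f}, after: {spread_after:.1f}")
```

Every imputed value sits exactly at the centre of the distribution, so the dataset looks less variable than the responses you actually collected.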
The document Missing Data Interpretation | Data Interpretation for Competitive Examinations - Bank Exams is a part of the Bank Exams Course Data Interpretation for Competitive Examinations.