Data Screening and Handling Missing Data
- Before you can assess if one construct is influencing another, you need to make sure you are actually capturing the construct of interest via your observed variables/indicators.
- In this session, I will go over how to initially screen your data for problems before you even start the analysis.
Learn to Handle Missing Data
The tutorial discusses in detail how to find missing data, check data for respondent misconduct and abandonment, and finally, how to impute missing data using the series mean and linear interpolation methods.
- The first step before analyzing your model is to examine your data to make sure there are no errors, outliers, or respondent misconduct. We also need to assess if you have any missing data.
- Once your data has been keyed into a data software program like Excel, SAS, or SPSS, the first thing you need to do is set up an “ID” column. I usually do this in the first column of the data, and it is simply an increasing number from 1 (on the first row) to the last row of the data.
- This is done to make it easier to find a specific case, especially if you have sorted on different columns. After forming an ID column, it is a good idea to initially examine if you have any respondent misconduct.
- The quickest and easiest way to see if respondent abandonment has occurred is to sort the data in ascending order on the last few columns. Respondents who dropped out of the survey and stopped answering questions are then grouped together, which makes them easy to spot.
- These incomplete rows are then subject to deletion. If a respondent failed to answer the last few questions, you need to determine if this amount of missing data is sufficiently acceptable to retain the respondent’s other answers. If the respondent has an excessive amount of missing data, then you are better off just deleting that respondent from the overall data.
- After making a determination if respondents who failed to complete the survey should be deleted, the next thing you need to assess is respondent misconduct. Let’s say you have a survey asking Likert scale questions (1 to 7 scale).
- You want to assess if a respondent simply marked the same answer for every question. The likelihood that a respondent feels exactly the same way about every question is small, so such a response is subject to deletion because of respondent misconduct.
- Sometimes you will also hear this called “yea-saying”, where the respondent is not reading the questions and just marks agreement at the same level for the rest of the survey.
- An additional step you can take to assess if respondent misconduct is taking place is to add attention check measures to your survey. These questions are added simply to make sure the respondent is paying attention to the questions, and they may ask the respondent to specifically select a number on the 1 to 7 scale.
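The ID column and the abandonment check described above can also be scripted. The following is a minimal Python sketch, not part of the original tutorial: the sample rows, the `trailing_missing` helper, and the cutoff of three unanswered closing items are all illustrative assumptions, and `None` stands for an unanswered question.

```python
# Hypothetical survey rows: each list holds one respondent's answers in
# questionnaire order; None marks an unanswered item.
responses = [
    [5, 6, 4, 5, 6, 5],
    [3, 4, None, 4, 3, 4],           # one skipped item mid-survey
    [6, 5, 6, None, None, None],     # stopped answering near the end
]


def trailing_missing(answers):
    """Count consecutive unanswered items at the end of a response."""
    count = 0
    for value in reversed(answers):
        if value is not None:
            break
        count += 1
    return count


# Add an ID that runs from 1 on the first row to the last row.
records = [{"id": i, "answers": row} for i, row in enumerate(responses, start=1)]

# Assumed cutoff: flag anyone who left the last 3 or more items blank.
MAX_TRAILING_MISSING = 3
abandoned_ids = [
    rec["id"]
    for rec in records
    if trailing_missing(rec["answers"]) >= MAX_TRAILING_MISSING
]
print(abandoned_ids)
```

Flagged IDs are candidates for review only; whether the remaining answers are worth keeping is still the researcher's call.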
Data Screening using Standard Deviation
- To see if you have a problem in your data set, examining the standard deviation of answers for each specific respondent is a good way to assess if respondent misconduct is present. While SPSS is great at analyzing data, accomplishing this task in SPSS is quite laborious.
- If your data is in SPSS, a better (and quicker) option is to use Microsoft Excel; you copy the “ID” column and the Likert scale indicator questions from SPSS and paste this data into Excel.
- Go to the first blank column after the data and enter the standard deviation function =STDEV.P(selected columns), highlighting only the Likert scale items in the row (do not include the ID column).
- This will allow you to see the standard deviation for each row (respondent). Anything with a standard deviation that is less than .25 is subject to deletion because there is little to no variance among the responses across the survey.
- That said, a standard deviation under .25 does not mean you need to automatically delete the record. As the researcher, you need to determine what is an acceptable level of agreement (or disagreement) within the questions, and this can be a matter of how large or small the survey is as well.
- You may have an extremely hard to get sample with a short survey, and in that instance, you might want to lower the value before deleting records.
- There are no golden rules that apply to every situation, but if you have a standard deviation of a respondent that is under .25, then you need to strongly consider if this respondent’s answers are valid moving forward.
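For readers who prefer a script to a spreadsheet, the per-respondent standard deviation check can be sketched in Python as follows. The data and variable names are hypothetical; `statistics.pstdev` computes the population standard deviation, which is the same quantity Excel's STDEV.P returns, and the .25 cutoff is the one discussed above.

```python
from statistics import pstdev

# Hypothetical 1-to-7 Likert answers keyed by respondent ID.
likert = {
    1: [5, 6, 4, 5, 6, 3, 5],
    2: [4, 4, 4, 4, 4, 4, 4],   # same answer on every item
    3: [7, 7, 7, 7, 7, 7, 6],   # one item differs by a single point
    4: [2, 5, 3, 6, 1, 4, 7],
}

SD_CUTOFF = 0.25  # threshold from the text; adjust for your own survey

# Population standard deviation per respondent. The ID is the dict key,
# so it is never included in the calculation.
row_sd = {rid: pstdev(answers) for rid, answers in likert.items()}

# Flag for review rather than delete automatically.
flagged = [rid for rid, sd in row_sd.items() if sd < SD_CUTOFF]
print(flagged)
```

In this seven-item example, respondent 3 has a standard deviation of roughly 0.35 and is not flagged, so on a short survey the .25 cutoff catches only fully identical rows. This is one reason the threshold should be judged against the length of the survey.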
Screening for Impermissible Values in the Data
How to Find Missing Data
- We have already addressed how to find respondent abandonment, but finding missing data that takes place in a random manner can be more challenging. To initially see if any data is missing, let’s start in the SPSS data file.
- In SPSS, Select Analyze -> Descriptive Statistics -> Frequencies
- Next, Select the Variables from the Variable List Box on the left and Add them to Variable(s)
- Press OK
- The frequency table shows 5 missing values for each of the indicators from AT1 to AT5.
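The same valid/missing counts that the SPSS frequency table reports can be tallied in a few lines of Python. This is only an illustrative sketch: the data, the `missing_summary` helper, and the use of `None` for a missing answer are assumptions, not output from the tutorial's data file.

```python
# Hypothetical data set: indicator name -> answers, None = missing.
data = {
    "AT1": [5, None, 6, 4, 7, None, 5, 6],
    "AT2": [4, 5, None, 4, 6, 5, 5, 7],
    "AT3": [6, 6, 5, 5, 7, 4, 6, 6],
}


def missing_summary(columns):
    """Return the valid and missing counts for each indicator."""
    summary = {}
    for name, values in columns.items():
        n_missing = sum(value is None for value in values)
        summary[name] = {"valid": len(values) - n_missing, "missing": n_missing}
    return summary


summary = missing_summary(data)
for name, counts in summary.items():
    print(name, counts)
```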
How to Address Missing Data
- There are two prominent ways to handle missing data:
- listwise/pairwise deletion, and
- imputation.
- I do not encourage deletion because you throw away a lot of data by doing this. With listwise deletion, if a respondent misses one question, that respondent's whole survey is dropped from the analysis.
- Previous research has shown that you can remedy up to 20%–30% of missing data with an imputation technique and still have good parameter estimates (Hair et al. 2009; Eekhout et al. 2013). Thus, imputation is often a better option if you do not have an excessive amount of missing data.
- Imputation is where your software program will replace each missing value with a numeric guess.
- The most popular imputation method is replacing a missing value with a series mean of the indicator. This is usually done for its ease of use, but it has the drawback of reducing the variance of the variables involved (Schafer and Graham 2002). Not to mention, this fails to account for the individual differences of the specific respondent.
- A second way to impute data is to use a linear interpolation option. This method examines the last valid value before the missing data and then examines the next valid value after the missing data and imputes a value that is between those two values.
- Linear interpolation imputes on the assumption that the values change in a straight line between those two points.
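To see why deletion can be costly, the short Python sketch below compares the share of missing answers with the share of respondents lost to listwise deletion. The six respondents are made-up data chosen for illustration.

```python
# Hypothetical respondents with five indicators each; None = missing.
rows = [
    [5, 6, 4, 5, 6],
    [3, None, 4, 4, 3],
    [6, 5, 6, 7, 6],
    [2, 3, 3, None, 2],
    [4, 4, 5, 4, None],
    [7, 6, 6, 7, 7],
]

# Listwise deletion keeps only respondents with no missing answers.
complete_rows = [row for row in rows if None not in row]

total_cells = sum(len(row) for row in rows)
missing_cells = sum(value is None for row in rows for value in row)

pct_cells_missing = 100 * missing_cells / total_cells
pct_rows_dropped = 100 * (len(rows) - len(complete_rows)) / len(rows)

print(f"{pct_cells_missing:.0f}% of answers are missing")
print(f"{pct_rows_dropped:.0f}% of respondents are dropped by listwise deletion")
```

Here only 3 of 30 answers are missing, yet listwise deletion discards half of the respondents, because each missing answer sits in a different row.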
Method 1: Series Mean Imputation
- Both series mean imputation and linear interpolation imputation can be easily accomplished in SPSS.
- To replace missing values in SPSS, select Transform -> Replace Missing Values.
- A pop-up window will appear where you will need to select which indicators have missing values and need to be imputed.
- When you select the indicators to impute, the default imputation is “series mean”, labeled as “SMEAN”.
- SPSS will impute the series mean for these indicators and create a new variable whose name is the original name followed by an underscore and “1”.
- For instance, the imputed version of AT1 is named AT1_1, where all the missing values in this indicator are replaced with the series mean.
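What series mean imputation does to an indicator can be sketched in plain Python. This is an illustration of the technique under assumed data, not a reproduction of SPSS internals; the `series_mean_impute` helper is hypothetical, and the `AT1_1` name simply mirrors the naming described above.

```python
from statistics import mean, pvariance

# Hypothetical indicator with two missing answers (None).
AT1 = [5, None, 6, 4, 7, None, 5, 6]


def series_mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Mirror the naming from the text: the imputed copy is AT1_1.
AT1_1 = series_mean_impute(AT1)
print(AT1_1)

# The filled-in values sit exactly at the mean, so the variance shrinks.
observed_var = pvariance([v for v in AT1 if v is not None])
imputed_var = pvariance(AT1_1)
print(observed_var, imputed_var)
```

The second print shows the drawback noted earlier: the variance of the imputed series is smaller than the variance of the observed answers.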
Method 2: Linear Interpolation Method
- Imputing with the linear interpolation method follows a very similar procedure.
- After selecting the “Transform” and “Replace Missing Values” options again, you need to select each indicator for imputation.
- As stated earlier, the default method is series mean, but another option under the “Name and Method” section is linear interpolation.
- You will need to highlight each indicator (individually), change the method to “Linear interpolation”, and then hit the “Change” button.
- Make sure to hit the “Change” button for each indicator.
- You will also see that SPSS again creates a new column for each imputed indicator.
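The idea behind linear interpolation can likewise be sketched in Python. The `linear_interpolate` helper and the data are hypothetical. Leaving a gap at the very start or end of the series unfilled is an assumption of this sketch, since such a gap has no valid neighbour on one side.

```python
def linear_interpolate(values):
    """Fill interior gaps on a straight line between the nearest valid values.

    Assumption: leading or trailing gaps have no valid neighbour on one
    side, so this sketch leaves them as None.
    """
    result = list(values)
    valid = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of neighbouring valid positions and fill between.
    for left, right in zip(valid, valid[1:]):
        gap = right - left
        for i in range(left + 1, right):
            weight = (i - left) / gap
            result[i] = values[left] + weight * (values[right] - values[left])
    return result


# Hypothetical indicator: None marks a missing answer.
AT1 = [4, None, 6, None, None, 3, None]
AT1_1 = linear_interpolate(AT1)
print(AT1_1)
```

Because each imputed value depends on the rows immediately before and after it, the result depends on the order of the cases in the file. The ID column is therefore useful for restoring the original order before imputing.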
Collier, J. E. (2020). Applied structural equation modeling using AMOS: Basic to advanced techniques. Routledge.