Is Your Data Dirty? The Importance of Conducting Frequencies First
By Jennifer Ann Morrow, Ph.D.

Data, like life, can be messy. I’ve worked with all types of data, both collected by me and by my clients, for over 25 years and I ALWAYS check my data before conducting my proposed analyses. Sometimes, this part of the analysis process is quick and easy but most of the time it’s like an investigation…you need to be thorough, take your time, and provide evidence for your decision making.
Data Cleaning Step 3: Perform Initial Frequencies
After you have drafted your codebook and analysis plan you should conduct frequencies on all of your variables in your dataset, both numeric and string variables. I typically use Excel or SPSS to do this, my colleague Dr. Louis Rocconi prefers R, but you can use any statistical software that you feel most comfortable with. At this step I conduct frequencies and request graphics (e.g., bar chart, histogram) for every variable. This output will be invaluable as you work through your next data cleaning steps.
So, what should you be looking at when reviewing your frequencies? One thing that I will make note of is any discrepancies in coding between my data and what is listed in my codebook. I’ll flag any spelling issues in my variable names/labels and note anything that doesn’t match my codebook. One thing that I always check is that my value labels (what labels are given to my numeric categories) are the same as my codebook and consistent across sets of variables. Many times, if you are using an online survey software package to collect your data there can easily have been programming mistakes when creating the survey that results in mislabeled values. Also, if you have had many individuals enter data into your database it can increase the chances that mistakes were made during the data entry process. During this step I will also check to make sure that I have properly labeled any values that I’m using to designate missing data and that this is consistent with what I have listed in my codebook.
Lastly, I will highlight when I see variables that may have extreme scores (i.e., potential outliers), variables with more than 5% missing data, and variables with very low sample size in any of their response categories. I’ll use this output in future data cleaning steps to aid in my decision making on variable modification.
Data Cleaning Step 4: Check for Coding Mistakes
At this step I will take my output that I highlighted potential issues with coding and start reviewing and making variable modification decisions at this step. Coding issues are more common when you have data that has been manually entered but you can still have coding errors in online data collection! Any variables that have coding issues I first determine if I can verify the data from the original/another source. For data that has been manually entered I’ll go back to the organization/paper survey/data form to verify the data. If it needs to be changed to the correct response I will make a note of this to fix in my next data cleaning step. If I cannot verify the datapoint (like when you have collected your data anonymously) and the value doesn’t fall in the possible values listed in my codebook then I make a note to set the value as missing when I get to the next data cleaning step.
Additional Advice
As I am going through my frequencies I will highlight/enter notes directly in the output to make things easier as I move forward through the data cleaning process. I’ll also put notes in my project notebook summarizing any issues and then once I make decisions on variable modifications, I note these in my notebook as well. You will use the output from Step 3 in the next few data cleaning steps to aid in your decision making so keep it handy!
Resources
12 Steps of Data Cleaning Handout: https://www.dropbox.com/scl/fi/x2bf2t0q134p0cx4kvej0/TWELVE-STEPS-OF-DATA-CLEANING-BRIEF-HANDOUT-MORROW-2017.pdf?rlkey=lfrllz3zya83qzeny6ubwzvjj&dl=0
https://davenport.libguides.com/data275/spss-tutorial/cleaning
https://libguides.library.kent.edu/SPSS/FrequenciesCategorical
https://www.datacamp.com/tutorial/tutorial-data-cleaning-tutorial