Common “Dirty Data” Problems I Encounter and How to Save Time Fixing Them

By M. Andrew Young

Hello, my name is M. Andrew Young. I'm a third-year Ph.D. student in the Evaluation, Statistics and Methodology program in the Educational Leadership & Policy Studies department at the University of Tennessee. For the past four, going on five, years I have served as a higher education evaluator in the role of Director of Assessment. In every job I've held since finishing my undergraduate degree in 2011, I have dealt with dirty data. Now that I work daily with data from a variety of sources, supplied by people who are content experts in their fields but not necessarily research methodologists, I encounter a lot of creative, but not useful, solutions for managing data. If you, like me, have a full plate every single day, shaving seconds and minutes off your cleaning tasks can really make your life easier.
We are often told "there is no perfect evaluation," "there is no perfect survey," or even "there is no perfect data set," but what does that look like in practical terms? Even when we design the data collection instrument(s) ourselves, our data can be messy. But what happens when we come in well after the fact to someone else's dataset, for an instrument we didn't design, administer, or manage? In those instances, we can find ourselves having to riddle out someone else's solution to data management. Sometimes those solutions are good, but we weren't given the key to how the data were coded; sometimes they are downright horrible, because they were designed by a human to appeal to human senses instead of to be interpreted by a computational device such as a computer.
I don't have a ton of programming language experience, so I have had to rely on ChatGPT, for which I pay for a premium subscription, to help write code. CAVEAT: ChatGPT can be highly inaccurate, it can devise clunky or improper solutions based on the information you give it, and its knowledge of Python and R packages can be woefully out of date! I suggest contacting a local programming community, and using GitHub with AI plugins and debuggers to help you. I had to learn how to debug and evaluate ChatGPT's code, which took a long time and iterative rounds of testing to see what happened and where it failed.
So, let’s get right down to it. I will share the most common dirty data problems I encounter, how to identify them, and what my solution is. They are in no particular order, but I have encountered them all:
File formats that aren’t usable.
Some data repositories that I have had to analyze data from will export a file with a .xls extension, but the actual contents are encoded differently, often as HTML. Sounds pretty trivial, but if you must download dozens of files, this can be a real time-waster.
Solution: Python does some cool stuff, and if you can learn to use Pandas, openpyxl, and Beautiful Soup, you can get this file conversion done quickly for an entire folder. At the end of this blog post, I'll place a share link to some extra resources, including my Python script for this solution.
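I won't reproduce my full script here, but as a rough illustration of the idea: pandas can parse the HTML tables hiding inside those fake .xls files and rewrite them as real .xlsx workbooks. A minimal sketch, with folder names as placeholders (pandas.read_html needs lxml or html5lib plus Beautiful Soup installed, and to_excel needs openpyxl):

```python
from pathlib import Path
import pandas as pd

src = Path("downloads")    # folder of mislabeled .xls files (placeholder name)
dst = Path("converted")
dst.mkdir(exist_ok=True)

for f in src.glob("*.xls"):
    tables = pd.read_html(f)  # parses the HTML tables hiding inside the file
    # keep the first table and save it as a genuine .xlsx workbook
    tables[0].to_excel(dst / f.with_suffix(".xlsx").name, index=False)
```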
Merged Cells, empty rows, leading/trailing spaces, carriage returns, color formatting as data, etc.
In my workplace, and I would guess in many other workplaces, Excel is the preferred place to put data. It isn't always the best, but it is what people are used to. Sometimes people will attempt to make Excel sheets pleasing to the eye, but this often makes the file unreadable to Excel itself or to packages like R without modifications. Since I am a novice R user, and I like to be able to see my data while I'm cleaning it in a dynamic environment, I use Excel for most of my cleaning unless the dataset is too large and unwieldy for Excel to handle.
Leading and trailing spaces, carriage returns, and special characters that we can't see in a cell can make a unique identifier, such as a first/last name combination or an email address, "look different" to Excel. That means Excel won't find your match unless you use "fuzzy" matching formulae, which I tend to avoid; cleaning the data is, in my opinion, better in the long run. I have provided a VBA script that does this cleaning. I have written it so that it allows you to choose the sheet to run the script for instead of just the active sheet; you can change that chunk of code if you want it to behave differently. The carriage return remover can be modified to remove other special characters, or to search for all of them. Here is a link to that list: https://excelx.com/characters/list/
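If you'd rather script the cleanup than run VBA, the same pass is only a few lines of pandas. A minimal sketch, where the file and column names are hypothetical:

```python
import pandas as pd

df = pd.read_excel("roster.xlsx")  # hypothetical file

# Strip carriage returns, non-breaking spaces, and leading/trailing whitespace
# from the columns that form your unique identifier
for col in ["First Name", "Last Name", "Email"]:  # hypothetical column names
    df[col] = (
        df[col].astype(str)
               .str.replace("\r", "", regex=False)
               .str.replace("\u00a0", " ", regex=False)  # invisible non-breaking space
               .str.strip()
    )

df.to_excel("roster_clean.xlsx", index=False)
```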
What about colors? I encountered a dataset where the owner's solution to designating different statuses for participant records was color-coding. Unfortunately, those color codes were not mutually exclusive, and some depended on each other in a hierarchical or funnel-flow manner. I always tell people "Columns are free!", meaning: create an additional column, code those data with numbers, and, oh, provide a key in your data journal so the person behind you can figure out precisely what you were doing.
I don't have an elegant solution for this other than formatting the range of data as a table, using the filter and sort options to filter by color, and then copying and pasting your numeric code into those rows for each filtering option.
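For large files, a scripted pass can at least pull each row's fill color into a real column that you can then recode by hand. A rough sketch with openpyxl, where the file name and layout are assumptions (note that theme-based fills may come back as None rather than an RGB string):

```python
from openpyxl import load_workbook

wb = load_workbook("color_coded.xlsx")  # hypothetical file
ws = wb.active

# Write each row's fill color into a new column at the right edge of the data
new_col = ws.max_column + 1
ws.cell(row=1, column=new_col, value="fill_rgb")
for row in ws.iter_rows(min_row=2):
    # read the fill of the first cell in each record row
    ws.cell(row=row[0].row, column=new_col, value=str(row[0].fill.start_color.rgb))

wb.save("color_coded_tagged.xlsx")
```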
Reconciling mismatches due to form design
I encounter this all the time in repeated-measures designs. A participant is asked to do a pretest in one semester/year, then a posttest in a separate semester/year. The only way you can identify that Participant1 at the pretest is Participant1 at the posttest is a unique identifier. The form designer asks for email. Great! It's free text, and Participant1 entered two different emails at the two time points, made a typo in one of them, and used their full name in one time period but their shortened nickname (or a misspelled name) in the other. Removing leading and trailing spaces won't help you there.
I encountered a situation where the data collectors administered a pre/posttest design in the same semester. They even used a forced-response option for the students to indicate what course they were enrolled in for the evaluation. Sounds good so far. However, I found out later that many of the courses were cross-listed, or had different names and numbers altogether depending on the department hosting the enrollee's major field of study. All of the cross-listed courses were options, and there were no screening questions to filter for that, so Participant234 selected one course at the pretest and a different one at the posttest, even though they were the same course held at the same place and time and taught by the same faculty. Excel doesn't know that. In large datasets this can be challenging, but going back to your client and asking questions can reap a solution. My solution was to get a cross-listed course crosswalk, set a single identifier, and then use a formula to replace all of the cross-listed courses (in a new column, of course) with a single descriptor.
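In script form, that crosswalk is just a lookup table mapped over the course column. A minimal pandas sketch, where the file and column names are hypothetical:

```python
import pandas as pd

responses = pd.read_excel("responses.xlsx")     # hypothetical survey export
xwalk = pd.read_excel("course_crosswalk.xlsx")  # hypothetical columns: ListedCourse, CanonicalCourse

# Map every cross-listed course code to a single canonical descriptor,
# in a new column so the original answer is preserved
mapping = dict(zip(xwalk["ListedCourse"], xwalk["CanonicalCourse"]))
responses["CourseCanonical"] = responses["Course"].map(mapping).fillna(responses["Course"])
```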
There are more scenarios, but this is common for me to encounter.
Connecting data for participants from multiple datasets
Client A shares a folder with you containing three different forms, all with multiple tabs, and is scratching their head over how to connect participant data across the datasets, because the answer to their question lies in the connection of the three sources. Unfortunately for you, there was no unique identifier created to link all three, and that's why you are there (according to them). If I knew SQL, it might not be as big an issue, but I got my start in Excel, so I'll show you what I do in Excel to connect those sources, before OR after I've created a UniqueID. Sometimes I use this method to HELP create a UniqueID:
Excel has VLOOKUP, HLOOKUP, and now XLOOKUP, but a nested INDEX(MATCH()) formula is much faster than those on larger sets, so I always use it (Excel XLOOKUP vs INDEX MATCH, 2024).
First, using table references is much less typing than ranges, so my first step in Excel is to ALWAYS create a table AND name it.
How to use =INDEX(MATCH()) properly:
1) For when you have a SINGLE UniqueID: Start in the table or sheet you want to pull data INTO, and type =INDEX(OtherTable[column of data you want to get], MATCH([@[column where your UniqueID lives]], OtherTable[column where the same UniqueID lives], 0))
This will bring over the data you want, matching on a SINGLE criterion using an exact match (that's done by the "0" before the closing parentheses).
2) For when you need to match on multiple criteria: {=INDEX(OtherTable[column of data you want to get], MATCH(1, ([@[criteria col1]]=OtherTable[matching criteria column1]) * ([@[criteria col2]]=OtherTable[matching criteria column2]) * (etc.), 0))} <- you get the {} by pressing CTRL + SHIFT + ENTER at the end of the formula to designate an array formula. It will return a whole column of #N/A's if you don't! Also, set your table to auto-fill formulae down the column to save time. A scripted equivalent follows below.
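If you ever move this workflow into Python, the multi-criteria version of this lookup is just a merge on several key columns. A minimal pandas sketch, where the file and column names are hypothetical:

```python
import pandas as pd

form_a = pd.read_excel("form_a.xlsx")  # hypothetical file
form_b = pd.read_excel("form_b.xlsx")  # hypothetical file

# Equivalent of the array INDEX/MATCH: match rows on two criteria at once
# and pull over the column you want (here, "Score")
merged = form_a.merge(
    form_b[["FirstName", "LastName", "Score"]],  # hypothetical column names
    on=["FirstName", "LastName"],
    how="left",  # keep every row of form_a, like INDEX/MATCH returning #N/A on no match
)
```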
Finding duplicate entries
Some of the most common mistakes I find in dirty data are duplicate entries the original owners didn't know they had. This happens when data collectors don't configure their survey platform to block duplicate submissions. The result is two different sets of answers from the same person, days or weeks apart, on the same form. If Participant30 took the pretest twice and the posttest once, which pretest entry do you keep?
a) Look at completion first, and if there is a large disparity, keep the more complete submission.
b) If they are both equally complete, negotiate with the client on what they believe is the more "valid" response. In my references is a cool study about how this is done in a manufacturing process environment: the article by Eckert et al. (2022). If you don't have access to an institutional library, you may not be able to view it.
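Finding the duplicates in the first place is quick to script. A minimal pandas sketch that flags duplicate IDs and keeps the most complete row, where the file and column names are hypothetical:

```python
import pandas as pd

df = pd.read_excel("pretest.xlsx")  # hypothetical file

# Count answered items per row as a rough completeness measure
df["n_answered"] = df.notna().sum(axis=1)

# All rows whose UniqueID appears more than once, for manual review
dupes = df[df.duplicated("UniqueID", keep=False)]

# Or automatically keep the most complete submission per UniqueID
keep = (df.sort_values("n_answered", ascending=False)
          .drop_duplicates("UniqueID", keep="first"))
```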
Pairwise, listwise, or analysis-specific deletion, and why
When do you use pairwise, listwise, or analysis-specific "deletion"? I will say, in the famous words of Dr. Morrow (https://faculty.utk.edu/Jennifer.Morrow), "It depends." Each case calls for different handling, and there are several ways to go about it, but these two resources may help:
https://www.ibm.com/support/pages/pairwise-vs-listwise-deletion-what-are-they-and-when-should-i-use-them
Twelve Steps of Quantitative Data Cleaning: Strategies for Dealing with Dirty Data by Morrow & Skolits (2017)
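To see the distinction concretely, here is a purely illustrative pandas sketch (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_excel("survey.xlsx")  # hypothetical file

# Listwise deletion: drop a case entirely if it is missing ANY value
listwise = df.dropna()

# Pairwise deletion: each statistic uses all cases available for that pair of
# variables; pandas computes correlations pairwise-complete by default
pairwise_corr = df.corr(numeric_only=True)
```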
Trust, but verify
Last, but not least: what if, just by chance, the original data owners made a data entry error themselves?
*GASP* “NEVER!” It happens, trust me.
I have encountered cases where, in the same column for the same survey item, the categorical data included "5 – Strongly Agree", "5 – Strongly Disagree", and "1 – Strongly Disagree". Well, which is the right entry for those participants? The client did not have a copy of the originally developed form, so we had to go back and reconstruct the original scale. And since many entries in the same column had been overwritten with plain numeric data (probably an errant "find & replace" operation), it was even harder to determine whether 5's were positive or negative, and whether the "5 – Strongly Disagree" entries were supposed to be "5 – Strongly Agree" or "1 – Strongly Disagree".
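A quick frequency audit of each column surfaces this kind of label inconsistency before you get deep into analysis. A minimal pandas sketch, where the file name is hypothetical:

```python
import pandas as pd

df = pd.read_excel("survey.xlsx")  # hypothetical file

# List every distinct label and its count, including blanks, so mixed scales
# like "5 - Strongly Agree" vs "5 - Strongly Disagree" jump out immediately
for col in df.columns:
    print(col)
    print(df[col].value_counts(dropna=False), "\n")
```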
Again, it was a negotiation with the client and a bit of data inference (guided by Morrow & Skolits, 2017, and Enders, 2022) to recover their responses.
All in all, a lot of dealing with dirty data, especially when that data isn't your own, comes down to making collaborative choices with the owners of the data, documenting those choices, and defending those choices. The phrase "garbage in, garbage out" may feel overused, but it is nevertheless true. Data cleaning, particularly in light of data equity concerns, is a much larger topic than this tiny little blog post can cover. I hope this helps you along your journey toward tidy data, and if you have solutions that I just am not aware of (very likely), feel free to pass them along to myoung96@vols.utk.edu (my UTK email). I love learning time-saving techniques, and I am willing to share my dirty data secrets too!
Additional Resources
Link to additional resources: Dirty Data
Eckert, C., Isaksson, O., Hane-Hagström, M., & Eckert, C. (2022). My Facts Are not Your Facts: Data Wrangling as a Socially Negotiated Process, A Case Study in a Multisite Manufacturing Company. Journal of Computing and Information Science in Engineering, 22(6), 060906. https://doi.org/10.1115/1.4055953
Enders, C. K. (2022). Applied missing data analysis (Second Edition). The Guilford Press.
Excel XLOOKUP vs INDEX MATCH: Which is better and faster? (2024, January 24). Ablebits.Com. https://www.ablebits.com/office-addins-blog/xlookup-vs-index-match-excel/
JanChaPatGud36850. (2019, August 13). Characters in Excel. Excel. https://excelx.com/characters/list/
Jennifer Ann Morrow Profile | University of Tennessee Knoxville. (n.d.). Retrieved September 9, 2024, from https://faculty.utk.edu/Jennifer.Morrow
Morrow, J. A., & Skolits, G. (2017). Twelve steps of quantitative data cleaning: Strategies for dealing with dirty data. AEA 2017.
Pairwise vs. Listwise deletion: What are they and when should I use them? (2020, April 16). [CT741]. https://www.ibm.com/support/pages/pairwise-vs-listwise-deletion-what-are-they-and-when-should-i-use-them