Solving a large data cleaning problem: an experiential live blog

10:11 p.m. About five hours ago, a dear colleague said to me, “We need to have this data cleaned for the reports, oh, and it’s also going to be shown at a [high-level government function.]

Here’s a fact. This data has been a disaster for a long time. Well, not the data itself but the cleaning process. Google forms, which is what we used, has a habit of naming each column exactly what the question is. On the one hand, this makes sense; what else would it be called? But this creates a number of problems when the question is something like 50 words long. Added on this, there were about 20 different datasets that had to be merged, all with slightly different wording.

For the longest time, I had been struggling with this; merging columns and then trying to exact and recoded based on the information inside. However, I think I have a better, more crude, yet simpler solution that I’m going to try to do to make this data nice and shiny.

Going back into the google forms

10:17 p.m. So, if the problem was in the original column names, I am going to go back into the google forms and just make the simpler. The first thing I have to do is copy the data over to keep the originals nice and clean.

10:55 p.m. I did the easiest version as a test case first–the daily outreach report. The relabeling went well, and there weren’t any major problems. As a proof of concept, it worked? There was a little bug where the changes in the google form didn’t save, but I couldn’t tell if this was me or THEM.

On to the next chunks!

July 12

12:02 am. This thing where the changes don’t save in google sheets is driving me nuts. I can’t tell why these aren’t persisting. Going back into version control isn’t doing any good. But, ten minutes later, I’ve made the manual changes, and the outreach forms are done.

1:30 a.m. Dude, those forms took forever. I ended up having to manually cut and paste, which was one of the problems that my colleagues had diagnosed, but hadn’t really been specifically identified before.

All original forms cleaned

2:03 am. Okay, so these should be good to go. Now the moment of truth in merging them all together…..

4:15 a.m. I zoned out for a bit making this graph

A marked improvement. But I’ve lost steam and need to sleep, I think….