Tuesday, February 5, 2013

"Tidy data"

http://vita.had.co.nz/papers/tidy-data.pdf

The article linked above describes a common but rarely diagnosed source of wasted effort in data analysis: untidy data. It explains what 'tidy data' looks like and illustrates some tools that help you make the change.

Keeping data tidy saves a lot of effort. "Tidy data" is not a table format that is visually pleasing for a presentation. It is the format you'd most like data to be in for manipulations. In fact, storing data in formats that make for visually pleasing tables usually makes them especially difficult for other folks to use within programming-style analysis tools like R and Matlab. I was reminded of this when a coworker asked for help turning his manual Excel workflow into an automated Matlab workflow.
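To make the contrast concrete, here is a minimal Python sketch of turning a presentation-style table into a tidy one, in the sense the linked paper uses: one variable per column, one observation per row. This is only an illustration, not code from the paper. The gene names, the numbers, and the `melt` helper are all made up for the example.

```python
# Hypothetical "presentable" layout: one row per gene, one column per condition.
wide_header = ["gene", "control", "treated"]
wide_rows = [
    ["geneA", 1.0, 2.5],
    ["geneB", 0.8, 0.4],
]


def melt(header, rows, id_column, variable_name, value_name):
    """Return one dict per (id, column) pair, i.e. one observation per row."""
    id_index = header.index(id_column)
    tidy = []
    for row in rows:
        for index, column in enumerate(header):
            if index == id_index:
                continue  # the id column labels the row; it is not a measurement
            tidy.append({
                id_column: row[id_index],
                variable_name: column,
                value_name: row[index],
            })
    return tidy


tidy_rows = melt(wide_header, wide_rows, "gene", "condition", "expression")
for record in tidy_rows:
    print(record)
```

In the tidy version the condition is a value in a column rather than a column heading, so filtering, grouping, or plotting by condition no longer requires knowing how the table was laid out.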

After trying to get all kinds of different types of data incorporated into an analysis related to my current project, often from the Supplementary Info in scientific papers, I've found that the less creative the authors are with their data presentation, the easier the job is.

Excel unintentionally encourages the basic problem. Since you constantly see the data, and there are all kinds of features to make borders, change fonts, join cells and pretty things up, it is hard to resist the temptation to make it into a pretty table. So your workflow looks like this:

data  -->   presentable table   (usually stored in an Excel file and given as Supplementary Info)
presentable table  -->   Analysis and Graphics

That last step is hard because most presentable data is not readily amenable to downstream analysis. If instead you program your data analysis (or make use of Excel's more advanced features like pivot tables), your workflow can look like this:

data  -->   presentable table
data  -->   Analysis and Graphics

As it turns out, liberating yourself from the need to have your data look presentable on its own lets you structure it in a way that makes for rapid and painless plotting and analysis. Optimize the data format for manipulability, and save your time and others'.
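Here is a rough sketch of the second workflow in Python, assuming the data is kept as tidy records. The sample names, values, and the `mean_by` and `pivot` helpers are hypothetical, written just for this illustration. The point is that both the analysis and the presentable table are derived from the same tidy rows, and neither depends on the other.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tidy records: one measurement per row.
tidy_rows = [
    {"sample": "s1", "condition": "control", "value": 1.0},
    {"sample": "s2", "condition": "control", "value": 3.0},
    {"sample": "s1", "condition": "treated", "value": 2.0},
    {"sample": "s2", "condition": "treated", "value": 6.0},
]


def mean_by(rows, key, value):
    """data --> analysis: average `value` within each group of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {name: mean(values) for name, values in groups.items()}


def pivot(rows, index, columns, value):
    """data --> presentable table: one row per `index`, one column per `columns`."""
    column_names = sorted({row[columns] for row in rows})
    table = defaultdict(dict)
    for row in rows:
        table[row[index]][row[columns]] = row[value]
    header = [index] + column_names
    body = [[name] + [table[name].get(c) for c in column_names]
            for name in sorted(table)]
    return [header] + body


condition_means = mean_by(tidy_rows, "condition", "value")
presentable = pivot(tidy_rows, "sample", "condition", "value")

print(condition_means)
for line in presentable:
    print(line)
```

Going the other way, from the pivoted table back to something you can group and average, is the part that usually costs the effort.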

- Gautham
