Data Organization in Spreadsheets

Date Created: July 5, 2024

Date Modified:

Article Image

Organizing data in spreadsheets may seem straightforward, but improper techniques can lead to errors and inefficiencies. This blog post explores some fundementals for organizing spreadsheet data, drawing insights from W. Broman and Kara H. Woo's comprehensive guide in "Data Organization in Spreadsheets," published in The American Statistician. These principles, in my opinion, are essential for anyone handling data. It will ensure clarity, consistency, and ease of analysis.

Be Consistent

Consistency is crucial in data organization. Here's how you can maintain it across your spreadsheets:

Succinct Name for Subjects

Choosing clear and concise names for your subjects is vital for readability and analysis.

Example of Good and Bad Subject Names
Good Name Good Alternative Avoid
Max_temp_C MaxTemp Maximum Temp (◦C)
Precipitation_mm Precipitation precmm
Mean_year_growth MeanYearGrowth Mean growth/year
sex sex M/F
weight weight w.
cell_type CellType Cell Type
Observation_01 first_observation 1st Obs.

Write Dates in a Consistent Manner

Other ways of writing dates is to use seperate columns for day, month and year or a straight 8-digit integer of form YYYYMMDD.

Example of Consistent Date Entry
id date glucose
101 2015-06-14 149.3
102 2015-06-15 95.6
103 2015-06-18 97.5
104 2015-06-19 121.2
105 2015-06-20 111.6
106 2015-06-21 108.6
107 2015-06-22 169.4

No Empty Cells

Never leave cells empty unintentionally. Use NA or a hyphen (-) to indicate missing data. Define categorical variables explicitly and fill them in, avoiding merged cells which complicate data analysis in programming languages.

Example of Proper Data Entry without Empty Cells
strain genotype min replicate response
A normal 1 1 370
A normal 1 2 160
B normal 1 1 356
B normal 1 2 355
A mutant 1 1 252

One Type of Data per Cell

Each cell should contain only one type of data. Avoid combining multiple data types in a single cell.

Good Example: One Type of Data per Cell
Product Price (USD) Quantity
Apples 1.20 10
Bananas 0.80 5
Cherries 3.00 7
Bad Example: Multiple Types of Data in a Single Cell
Product Price and Quantity
Apples $1.20 for 10
Bananas $0.80 for 5
Cherries $3.00 for 7

Use a Clear Layout

The best layout for data within a spreadsheet is a single, large rectangle where rows correspond to subjects and columns correspond to variables. For CSV files, keep each table in one file. If using multiple tables, place them in separate worksheets but maintain consistency.

So, instead of complicated layouts with merged cells:

1 min 5 min
strain normal mutant normal mutant
A 261 379 192 340 329 335 443 294
B 407 388 350 257 268 305 276 451

Use a tidy version:

strain genotype min replicate response
A normal 1 1 370
A normal 1 2 160
B normal 1 1 356
B normal 1 2 355
A mutant 1 1 252
A mutant 1 2 320
B mutant 1 1 397
B mutant 1 2 314
A normal 5 1 227
A normal 5 2 187
B normal 5 1 453
B normal 5 2 283
A mutant 5 1 267
A mutant 5 2 425
B mutant 5 1 283
B mutant 5 2 273

Create a Metadata Table

Include a separate table explaining all the variables. This metadata table should have:

Metadata Table

Additional Tips

In adhering to these practices, you can ensure that your data is consistent and well-organized. Proper data organization not only facilitates accurate analysis but also enhances the reproducibility and integrity of your work.

More Examples of Bad and Good Data Organization

Bad Data Organization
Some layouts that are causing problems
Good Data Organization
Reorganization of the table D above
Bad Data Organization
Instead of this, we can use a single header and name them date_W4, date_W6, …
Good Data Organization
Or like this

Back to Home