Data Organization in Spreadsheets
Date Created: July 5, 2024
Date Modified:

Organizing data in spreadsheets may seem straightforward, but improper techniques can lead to errors and inefficiencies. This blog post explores some fundementals for organizing spreadsheet data, drawing insights from W. Broman and Kara H. Woo's comprehensive guide in "Data Organization in Spreadsheets," published in The American Statistician. These principles, in my opinion, are essential for anyone handling data. It will ensure clarity, consistency, and ease of analysis.
Be Consistent
Consistency is crucial in data organization. Here's how you can maintain it across your spreadsheets:
- Consistent Values for Categorical Variables: Ensure that values for categorical variables are consistent.
For example, use "male" and "female" or "M" and "F" without mixing these altogether.
Similarly, for missing values, use a standard representation such as
NA
(a favourite in R) or a hyphen (-
). Never use different notations interchangeably. And never explain why it is missing in place of the missing value. You should make a seperate column for that. - Consistent Naming and Layouts:
- Variable and Subject Names: Use the same format throughout. Choose either underscores or hyphens for spaces in names, but not both.
- Date Formats: Use a consistent date format, preferably the ISO 8601 standard (
yyyy-mm-dd
). - Phrases in Notes: Ensure phrases used in notes are uniform.
- No Extra Spaces: Avoid extra spaces within cells as they can cause inconsistencies and errors in data analysis.
Succinct Name for Subjects
Choosing clear and concise names for your subjects is vital for readability and analysis.
- Avoid Spaces: Spaces in names can lead to errors and inconsistencies when importing data into software for analysis. Use underscores (_) or hyphens (-) instead.
- Be Consistent: Again, decide on either underscores or hyphens and stick to that choice throughout your dataset. Inconsistencies can make data processing scripts more complicated and error-prone.
- Avoid Special Characters: Special characters can interfere with data parsing and cause unexpected issues. Stick to letters, numbers, underscores, and hyphens. "
Good Name | Good Alternative | Avoid |
---|---|---|
Max_temp_C | MaxTemp | Maximum Temp (◦C) |
Precipitation_mm | Precipitation | precmm |
Mean_year_growth | MeanYearGrowth | Mean growth/year |
sex | sex | M/F |
weight | weight | w. |
cell_type | CellType | Cell Type |
Observation_01 | first_observation | 1st Obs. |
Write Dates in a Consistent Manner
- Use a Single Date Format: Choose a single date format and stick to it throughout your dataset. Mixing date formats can lead to confusion and errors.
- Use ISO 8601 Standard: The ISO 8601 standard (
yyyy-mm-dd
) is widely accepted and avoids ambiguity. It is also compatible with most software and programming languages. - Prevent Automatic Conversions: To prevent automatic conversions by spreadsheet software, set the text format for the date column before entering dates, or add an apostrophe (
'
) before the date ('2014-06-14).
Other ways of writing dates is to use seperate columns for day, month and year or a straight 8-digit integer of form YYYYMMDD.
id | date | glucose |
---|---|---|
101 | 2015-06-14 | 149.3 |
102 | 2015-06-15 | 95.6 |
103 | 2015-06-18 | 97.5 |
104 | 2015-06-19 | 121.2 |
105 | 2015-06-20 | 111.6 |
106 | 2015-06-21 | 108.6 |
107 | 2015-06-22 | 169.4 |
No Empty Cells
Never leave cells empty unintentionally. Use NA
or a hyphen (-
) to indicate missing data.
Define categorical variables explicitly and fill them in, avoiding merged cells which complicate data analysis in programming languages.
strain | genotype | min | replicate | response |
---|---|---|---|---|
A | normal | 1 | 1 | 370 |
A | normal | 1 | 2 | 160 |
B | normal | 1 | 1 | 356 |
B | normal | 1 | 2 | 355 |
A | mutant | 1 | 1 | 252 |
One Type of Data per Cell
Each cell should contain only one type of data. Avoid combining multiple data types in a single cell.
Product | Price (USD) | Quantity |
---|---|---|
Apples | 1.20 | 10 |
Bananas | 0.80 | 5 |
Cherries | 3.00 | 7 |
Product | Price and Quantity |
---|---|
Apples | $1.20 for 10 |
Bananas | $0.80 for 5 |
Cherries | $3.00 for 7 |
Use a Clear Layout
The best layout for data within a spreadsheet is a single, large rectangle where rows correspond to subjects and columns correspond to variables. For CSV files, keep each table in one file. If using multiple tables, place them in separate worksheets but maintain consistency.
So, instead of complicated layouts with merged cells:
1 min | 5 min | |||||||
---|---|---|---|---|---|---|---|---|
strain | normal | mutant | normal | mutant | ||||
A | 261 | 379 | 192 | 340 | 329 | 335 | 443 | 294 |
B | 407 | 388 | 350 | 257 | 268 | 305 | 276 | 451 |
Use a tidy version:
strain | genotype | min | replicate | response |
---|---|---|---|---|
A | normal | 1 | 1 | 370 |
A | normal | 1 | 2 | 160 |
B | normal | 1 | 1 | 356 |
B | normal | 1 | 2 | 355 |
A | mutant | 1 | 1 | 252 |
A | mutant | 1 | 2 | 320 |
B | mutant | 1 | 1 | 397 |
B | mutant | 1 | 2 | 314 |
A | normal | 5 | 1 | 227 |
A | normal | 5 | 2 | 187 |
B | normal | 5 | 1 | 453 |
B | normal | 5 | 2 | 283 |
A | mutant | 5 | 1 | 267 |
A | mutant | 5 | 2 | 425 |
B | mutant | 5 | 1 | 283 |
B | mutant | 5 | 2 | 273 |
Create a Metadata Table
Include a separate table explaining all the variables. This metadata table should have:

- The exact variable name as in the data file.
- A version of the variable name that might be used in data visualizations.
- A longer explanation of what the variable means.
- The measurement units.
- Expected minimum and maximum values.
Additional Tips
- Create README file: Keep the description of the dataset, its source, and any important details such as explaining the structure of the data, including the layout of the spreadsheet and the meaning of each column.
- No Calculations in Raw Data Files: Keep calculations separate from raw data to avoid accidental alterations.
- Use Data Validation: This helps prevent data entry errors by restricting the type of data that can be entered.
- No Formatting in Data: Avoid using colors or highlighting in raw data files. Formatting should be applied in a copy of the data if needed for presentation.
- Make Backups: Store backups in multiple locations. Consider using a formal version control system and write-protect files to prevent accidental changes.
In adhering to these practices, you can ensure that your data is consistent and well-organized. Proper data organization not only facilitates accurate analysis but also enhances the reproducibility and integrity of your work.
More Examples of Bad and Good Data Organization



