Folio III Field Notes
Plate III.9 — Data Organisation in Spreadsheets

Data organisation in spreadsheets

Organising data in spreadsheets seems straightforward, but improper techniques lead to errors and inefficiencies fast. This plate covers some fundamentals for organising spreadsheet data, drawing from W. Broman and K. H. Woo's "Data Organization in Spreadsheets" in The American Statistician. These principles ensure clarity, consistency, and ease of analysis.

Be consistent

Consistency is crucial. Maintain it across your sheets:

  • Consistent values for categorical variables. Use "male" and "female" OR "M" and "F" — never mix. Same for missing values: pick one notation (e.g. NA, a favourite in R, or a hyphen -) and stick to it. Never explain why a value is missing in place of the value — use a separate column for that.
  • Consistent naming and layouts.
    • Variable and subject names. Same format throughout. Choose underscores OR hyphens for spaces in names, but not both.
    • Date formats. Use a consistent format — preferably ISO 8601 (yyyy-mm-dd).
    • Phrases in notes. Uniform.
    • No extra spaces. Extra spaces inside cells cause subtle errors in analysis.

Succinct names for subjects

Clear, concise names are vital for readability and analysis.

  • Avoid spaces. Spaces cause errors and inconsistencies when importing data. Use underscores (_) or hyphens (-) instead.
  • Be consistent. Pick one of underscore or hyphen and stick to it. Mixing makes data-processing scripts more complicated and error-prone.
  • Avoid special characters. They interfere with parsing and cause unexpected issues. Stick to letters, numbers, underscores, and hyphens.
Good and bad subject names
Good nameGood alternativeAvoid
Max_temp_CMaxTempMaximum Temp (◦C)
Precipitation_mmPrecipitationprecmm
Mean_year_growthMeanYearGrowthMean growth/year
sexsexM/F
weightweightw.
cell_typeCellTypeCell Type
Observation_01first_observation1st Obs.

Write dates in a consistent manner

  • Use a single date format. Pick one and stick to it. Mixing formats leads to confusion and errors.
  • Use ISO 8601. The yyyy-mm-dd standard is widely accepted and unambiguous. It's also compatible with most software and programming languages.
  • Prevent automatic conversions. To stop spreadsheet software from auto-converting dates, set the column to text format before entering data, or prefix the date with an apostrophe ('2014-06-14).

You can also write dates as separate day / month / year columns, or as a single 8-digit integer in YYYYMMDD form.

Consistent date entry
iddateglucose
1012015-06-14149.3
1022015-06-1595.6
1032015-06-1897.5
1042015-06-19121.2
1052015-06-20111.6
1062015-06-21108.6
1072015-06-22169.4

No empty cells

Never leave cells empty unintentionally. Use NA or a hyphen (-) to mark missing data. Define categorical variables explicitly and fill them in. Avoid merged cells — they complicate analysis in any programming language.

Proper data entry without empty cells
straingenotypeminreplicateresponse
Anormal11370
Anormal12160
Bnormal11356
Bnormal12355
Amutant11252

One type of data per cell

Each cell should contain only one type of data. Avoid combining multiple types in a single cell.

Good — one type of data per cell
ProductPrice (USD)Quantity
Apples1.2010
Bananas0.805
Cherries3.007
Bad — multiple types of data in one cell
ProductPrice and Quantity
Apples$1.20 for 10
Bananas$0.80 for 5
Cherries$3.00 for 7

Use a clear layout

The best layout for data within a spreadsheet is a single large rectangle where rows are subjects and columns are variables. For CSV files, keep each table in one file. If you must use multiple tables, place them in separate worksheets but maintain consistency.

Instead of complicated layouts with merged cells:

Wide format with merged headers (avoid)
1 min 5 min
strain normalmutant normalmutant
A261379192340329335443294
B407388350257268305276451

Use a tidy version:

Tidy long format (prefer)
straingenotypeminreplicateresponse
Anormal11370
Anormal12160
Bnormal11356
Bnormal12355
Amutant11252
Amutant12320
Bmutant11397
Bmutant12314
Anormal51227
Anormal52187
Bnormal51453
Bnormal52283
Amutant51267
Amutant52425
Bmutant51283
Bmutant52273

Create a metadata table

Include a separate table explaining all the variables. This metadata table should have:

Example metadata table
  • The exact variable name as it appears in the data file.
  • A display-friendly version of the variable name (for plots).
  • A longer explanation of what the variable means.
  • The measurement units.
  • Expected minimum and maximum values.

Additional tips

  • Create a README. Describe the dataset, its source, and important details — the structure of the data, the layout of the spreadsheet, and the meaning of each column.
  • No calculations in raw-data files. Keep calculations separate to avoid accidental alterations.
  • Use data validation. Restrict the type of data that can be entered so errors are caught at entry.
  • No formatting in data. Avoid colours or highlighting in raw-data files. Format a copy if needed for presentation.
  • Make backups. Multiple locations. Consider a formal version-control system and write-protect files to prevent accidental changes.

Adhering to these practices keeps your data consistent and well-organised. Proper data organisation facilitates accurate analysis and enhances the reproducibility and integrity of the work.

More examples — bad and good

Layouts that cause problems
Layouts that are causing problems
Reorganised version of bad example D
Reorganisation of table D above
Bad column-per-week layout
Instead of this, use a single header — date_W4, date_W6, …
Tidy long-format alternative
Or like this
back to folio III · field notes