Space Cat, Prince Among Thieves

Falsehoods Programmers Believe About CSVs

Much of my professional work for the last 10+ years has revolved around handing, importing and exporting CSV files. CSV files are frustratingly misunderstood, abused, and most of all underspecified. While RFC4180 exists, it is far from definitive and goes largely ignored.

Partially as a companion piece to my recent post about how CSV is an encoding nightmare, and partially an expression of frustration, I've decided to make a list of falsehoods programmers believe about CSVs. I recommend my previous post for a more in-depth coverage on the pains of CSVs encodings and how the default tooling (Excel) will ruin your day.

Everything on this list is a false assumption that developers make.

  1. All CSVs are ASCII
  2. All CSVs are Win1252
  3. All CSVs are in 8-bit encodings
  4. All CSVs are UTF-8
  5. All CSVs are UTF-16
  6. All CSVs contains a single consistent encoding
  7. All records contain a single consistent encoding
  8. All fields contain a single consistent encoding
  9. All CSVs contain records
  10. All records contain fields
  11. Fields never contain record separators
  12. Fields never contain delimiters
  13. Fields never contain control characters
  14. Delimiters are escaped with a \
  15. All fields are enclosed by double quotes
  16. All records are a single line
  17. All lines contain a single record
  18. All records contain the same number of fields
  19. All records contain the same number of fields as the header
  20. All records contain the same number of fields or fewer than the header
  21. All CSVs contain a header
  22. All record separators are CRLF
  23. All record separators are LF
  24. All record separators are a single byte
  25. All record separators are a single rune
  26. All newlines are a single byte
  27. All CSVs are delimited with a comma
  28. All CSVs are delimited with a comma, tab or semicolon
  29. TSV isn't CSV
  30. All delimiters are a single byte
  31. All commas are a single byte
  32. All CSVs are readable with Excel
  33. Excel is a good tool for working with CSVs
  34. Excel is an OK tool for working with CSVs
  35. Excel can losslessly save CSVs it opens
  36. Using ="{value}" is a good way to get around Excel auto-formatting
  37. The first line will never be a poorly supported instruction header
  38. Using sep={char} is a good way to get Excel to accept your delimiter
  39. Prepending a BOM is a good way to get Excel to read your encoding
  40. You can safely name your first column "ID"
  41. All CSVs follow RFC4180
  42. Most CSVs follow RFC4180
  43. All CSVs follow the same defined standard
  44. All CSVs follow a defined standard
  45. All CSVs have a .csv extension
  46. All CSV is human readable

Please take these into consideration next time you find yourself working with CSV. If you can think of anything I may have missed I'd be happy to add it.

As a suggested further reading, "The Art of Unix Programming" http://www.catb.org/esr/writings/taoup/html/ch05s02.html#id2901882 section on DSV style which notably says "CSV is a textbook example of how not to design a textual file format"

Updates:

  • Thanks to Max for pointing out the Excel supported sep={value} header I was strangely entirely unaware of.
  • Thanks to Don Hopkins for the note about not being able to start a header with ID

Read More / Comment »

Recent Comments

I use this all the time - thank you!
Link

Thanks this will help create my GTA V in Minecraft!!
Link

Thank you Man i searched for this for about 2 Days!! Good Tutorial! Recommended!
Link

Coincidentally, we can use this calculator to approximate pi. Set the calculator to filled circle. Force it to a circle. Do the ratio on how many where sha…
Link

Sorry for this multicomment, the computer has frozen :P
Link