
Falsehoods Programmers Believe About CSVs

Much of my professional work for the last 10+ years has revolved around handling, importing, and exporting CSV files. CSV files are frustratingly misunderstood, abused, and most of all underspecified. While RFC4180 exists, it is far from definitive and goes largely ignored.

Partially as a companion piece to my recent post about how CSV is an encoding nightmare, and partially as an expression of frustration, I've decided to make a list of falsehoods programmers believe about CSVs. I recommend my previous post for more in-depth coverage of the pains of CSV encodings and how the default tooling (Excel) will ruin your day.

Everything on this list is a false assumption that developers make.

  1. All CSVs are ASCII
  2. All CSVs are Win1252
  3. All CSVs are in 8-bit encodings
  4. All CSVs are UTF-8
  5. All CSVs are UTF-16
  6. All CSVs contain a single consistent encoding
  7. All records contain a single consistent encoding
  8. All fields contain a single consistent encoding
  9. All CSVs contain records
  10. All records contain fields
  11. Fields never contain record separators
  12. Fields never contain delimiters
  13. Fields never contain control characters
  14. Delimiters are escaped with a \
  15. All fields are enclosed by double quotes
  16. All records are a single line
  17. All lines contain a single record
  18. All records contain the same number of fields
  19. All records contain the same number of fields as the header
  20. All records contain the same number of fields or fewer than the header
  21. All CSVs contain a header
  22. All record separators are CRLF
  23. All record separators are LF
  24. All record separators are a single byte
  25. All record separators are a single rune
  26. All newlines are a single byte
  27. All CSVs are delimited with a comma
  28. All CSVs are delimited with a comma, tab or semicolon
  29. TSV isn't CSV
  30. All delimiters are a single byte
  31. All commas are a single byte
  32. All CSVs are readable with Excel
  33. Excel is a good tool for working with CSVs
  34. Excel is an OK tool for working with CSVs
  35. Excel can losslessly save CSVs it opens
  36. Using ="{value}" is a good way to get around Excel auto-formatting
  37. The first line will never be a poorly supported instruction header
  38. Using sep={char} is a good way to get Excel to accept your delimiter
  39. Prepending a BOM is a good way to get Excel to read your encoding
  40. You can safely name your first column "ID"
  41. All CSVs follow RFC4180
  42. Most CSVs follow RFC4180
  43. All CSVs follow the same defined standard
  44. All CSVs follow a defined standard
  45. All CSVs have a .csv extension
  46. All CSV is human readable
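
To make a few of these concrete (items 11, 12, and 16 in particular): a quoted field may legally contain both the delimiter and the record separator, so naively splitting on newlines and commas mangles data that any real CSV parser handles fine. Here's a minimal Python sketch with hypothetical file contents:

    import csv
    import io

    # One header row and one record; the second field legally contains
    # a comma AND a CRLF inside its quotes.
    data = '"name","note"\r\n"Alice","cheese, crackers\r\nand wine"\r\n'

    # Naive approach: split on line endings, then on commas.
    naive = [line.split(",") for line in data.splitlines() if line]
    print(len(naive))  # 3 "records" -- the embedded newline broke one record in two

    # A real CSV parser respects the quoting.
    rows = list(csv.reader(io.StringIO(data)))
    print(len(rows))   # 2 rows: the header plus one record
    print(rows[1][1])  # one intact field, comma and newline included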

Please take these into consideration next time you find yourself working with CSV. If you can think of anything I may have missed, I'd be happy to add it.

As suggested further reading, see "The Art of Unix Programming" (http://www.catb.org/esr/writings/taoup/html/ch05s02.html#id2901882), whose section on DSV style notably says "CSV is a textbook example of how not to design a textual file format".

Updates:

  • Thanks to Max for pointing out the Excel-supported sep={value} header, which I was strangely entirely unaware of.
  • Thanks to Don Hopkins for the note about not being able to start a header with "ID".

Comment by: chris

Interesting points, but nothing to back them up.

E.g. Using ="{value}" is an acceptable way to get around Excel auto-formatting.

I've seen that done before -- why isn't it acceptable?

Comment by: Jesse G. Donat

Chris, the biggest reason not to use ="{value}" is simply that your CSV will then *only* work in Excel and nowhere else.

Equations are not supported by any spec anywhere and the fact that this works is just a really gross hack.

I've done it, I've had to do it, but it's really really gross.
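
For anyone who hasn't seen the trick, here is what it looks like on disk (hypothetical contents; the inner quotes are doubled per CSV escaping rules):

    "Name","Zip"
    "Alice","=""02134"""

Excel evaluates ="02134" as a formula yielding the text 02134, so the leading zero survives; every other consumer just sees the literal junk ="02134".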

Comment by: Andrew

@chris It may be acceptable on a case-by-case basis, like if someone were using a version of Excel that had issues with CSVs without this code, and they were the main person who would be using the file. But what if:

Someone with a different version of Excel opens the file, and it fails to load because it uses a different parsing mechanism.

I get the file (as the developer) and now I've got to parse out all this ="{value}" junk that means nothing to me.

Someone years later opens the file and gets the literal string instead of the function value, and is seriously confused.

It's a hack, and should be avoided. It's not that it doesn't work, but a CSV is NOT an Excel file, and shouldn't be treated as such. Generate a real Excel file if the client needs it! There are libraries for that!

Comment by: Tom

What do you mean by "All CSV is human readable"?
Can you give me an example?

Comment by: Max

You should add:

- All CSVs have column headers
- All CSVs start with a line of data
- The first line of a CSV file is either a header or a row of data
- The first line will indicate the field separator via the Excel-friendly sep= pragma
- The first line will never include the Excel-friendly sep= pragma
- All CSVs from the same source will have a consistent first line of either headers or data

http://superuser.com/questions/773644/what-is-the-sep-metadata-you-can-add-to-csvs

For what it's worth, I think Excel is fine for CSV manipulation so long as you turn off the magic it does. For really large files I've found its column-oriented engine makes global editing and find-and-replace faster than vim and other text-oriented tools, while being more user-friendly (for non-devs/command-line experts) than head, sed, find, grep, and sort.
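
Concretely, a file using the pragma might start like this (hypothetical contents). Excel consumes the first line as a delimiter hint; most other parsers hand it back as a one-field record, which is exactly why falsehood 37 above exists:

    sep=;
    name;note
    Alice;cheese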

Comment by: dan

All rows have the same number of delimiters as the rest of the rows.

this is a fun one :)

Comment by: Don Hopkins

Falsehood: You can safely name your first column "ID".

https://support.microsoft.com/en-us/kb/323626

https://news.ycombinator.com/item?id=12041210

Comment by: Lyndsy Simon

I'd add:

- CSV files contain data

An empty file is valid CSV :)

Comment by: Alex

I don't know about programmers, but I've heard many a designer and product manager assume that Excel could export UTF-8 CSV, or some other ASCII-superset encoding that would be readable by a Unicode-aware program.

Nope, Excel is the only CSV-exporting program I've ever seen that doesn't export UTF-8 at all. And other programs generally export only UTF-8. There's no solution that works the same in both Excel and other programs.

Comment by: jhi

* UTF-8 CSV must have a BOM
* UTF-8 CSV must not have a BOM
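
Both halves of that contradiction are real: Excel generally wants the BOM before it will detect UTF-8 (falsehood 39), while plenty of Unix tools choke on it. As a sketch, Python's utf-8-sig codec shows the two behaviours side by side:

    # 'utf-8-sig' prepends a BOM when encoding and strips one when decoding.
    text = "café,naïve\n"

    with_bom = text.encode("utf-8-sig")
    print(with_bom[:3])                  # b'\xef\xbb\xbf' -- the three BOM bytes

    print(with_bom.decode("utf-8-sig"))  # BOM stripped cleanly
    print(with_bom.decode("utf-8"))      # '\ufeff' left on the front: a naive
                                         # parser now sees a first field named '\ufeffcafé'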

Comment by: Alex

Oh, here's another set of problems (false assumptions) I've run into:

- If there is a CSV header, then it's one record in length.
- If there is a header and it's one record in length, then it's the first line of the file.
- OK, it's at least the first non-empty line of the file.
- OK, well surely it's the first CSV record with non-empty fields...

Nope, figuring out the "header" of a CSV file is a black art. The only way to be sure is to ask the user. The CSV RFC doesn't even define such a thing as a "header".

- So we asked the user which line is the header. At least now we know that all the values in it will be unique (so we can use them as dictionary keys).
- Well, at least none of them are the empty string, right?
- OK, they can be anything at all. At least when I load them into this CSV-processing program, and save them again, they'll all be in the same order as before, so I can use column position as a key...
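
Even the Python standard library treats header detection as guesswork: csv.Sniffer().has_header() is documented as a rough heuristic, and it gives up exactly where Alex describes. A quick sketch:

    import csv

    numericish = "id,name\n1,Alice\n2,Bob\n"
    print(csv.Sniffer().has_header(numericish))  # True -- the first row isn't numeric
                                                 # like the rest, so it "looks like" a header

    all_text = "a,b\nc,d\ne,f\n"
    print(csv.Sniffer().has_header(all_text))    # False -- even if 'a,b' was meant as a
                                                 # header, nothing distinguishes it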

Comment by: Eze

Most of this is over my head... but some of it sinks in. Thx!

Comment by: David McKee

* Field N has the same meaning in all records of this CSV file.

If it's a CSV file that's been exported from Excel, there's nothing stopping it from having multiple independent sub-spreadsheets in one file...

(If you find yourself having to deal with this sort of stuff and you know python, https://sensiblecodeio.github.io/quickcode-ons-docs/ and the xypath library might help. Disclaimer -- I wrote them.)

Comment by: David McKee

Discovered that SQLite didn't output RFC-compliant CSV files on Linux (it was using system line endings, not CRLF). That was a surprise. I believe that behaviour has been patched.

Comment by: Gwen Patton

I used to do "meatball programming" for a company that sent out junk mail. Yeah, yeah, it paid the bills, don't judge me.

We had to take in data in CSV all the time, among other formats. It was a complete and utter pain in the rump, because none of these files had ANY consistency whatsoever. Over 50% of my job was analyzing these friggerty files to see just how bad bad could get, to slap together some code to read the garbage data we got from our clients. Another 25% of my job was cleaning up the complete mess of slop data we managed to coax out of the mishmash of characters they charitably called "data" so it could be used for something useful, such as addressing a post card.

The rest of the time was spent laboriously and sometimes repeatedly trying to get their data to work with our expensive and complex USPS-specified software to put their gawd-awful, nearly sense-free data into a consistent enough form that the US Snail Postal Disservice would interpret it well enough to grant us a full Zip+4 zip code and a correctly-formatted address and barcode suitable for presorted mail.

A nearly insignificant amount of time was spent running print jobs and making certain the data actually printed in legible form on the various envelopes, cards, letters, and labels. Sometimes we'd find that required data was only partially present after all, and that the error-checking routines in the Expensive USPS Software had missed the fact that required fields were missing, resulting in completely nonsensical information being printed in the wrong place on the material. Certainly the barcoding was complete gibberish, and the whole mess had to be done all over again.

But sure, send us your "data" in CSV from your Excel spreadsheet. Why don't you beat me with a crowbar while you're at it? You certainly couldn't cause me any MORE pain.

What I wouldn't give for someone with half the sense God gave a spastic gerbil who might possibly send me an actual file designed to hold data, along with an accurate data specification for the file, in a format compatible with an actual database. Of course, as soon as someone like that came along, the US Snail would come out with an update to their software, requiring that all presorted mail be printed using Sanskrit or some such.

Comment by: Text munging monkey

* 35a. Excel can correctly open CSVs which were saved in Excel

(not if you've got a line break within a field, iirc - probably some other circumstances too)

* You can write a VBA macro to load a CSV file, setting specific columns to Text format so that leading zeroes, things which look like dates (or are dates in a different locale), and numbers in parentheses load losslessly in Excel

(You can, but not if the file extension is ".csv"; for some reason using a known file extension silently overrides all the parameters you've carefully provided to the OpenText loading function. That was a painful morning of debugging...)

Comment by: Nigel Heffernan

The worst thing that Excel does is re-save your CSV *without the encapsulating quotes you had around the text fields*. If any field contains a comma, you're hosed.

Let's not talk about Excel reading (and then saving down) any field that looks like it can be converted to a date.

After that, try the falsehood "OLEDB database drivers will read your csv file as a table for your SQL": not if there's a BOM, they won't!

Comment by: Patrick O'Beirne

Also, don't assume that text fields never begin with a dash, which Excel will interpret as a minus sign and treat as a formula. +1 to the header bytes comment; I have to swap two header bytes on one survey website download to get real UTF-8. +1 to the date comment; see the recent articles on gene name corruption in published papers. Yes, there is a Text File Import, as described in https://sysmod.wordpress.com/2016/08/28/excel-gene-mutation-and-curation/

Comment by: SteveBETL

One that caught me:
Numbers you've written won't contain commas.
I wrote out a config matrix to a CSV file, then read it in later. It worked on my machine, but failed on the client machine.

Turns out that my Italian client was using "decimal commas" instead of "decimal points". It's easy to fix but, because I never even knew that different cultures might do something like this, hard to debug.
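
For the curious, this is reproducible from Python's locale module (a sketch; the exact locale name, "it_IT.UTF-8" here, varies by OS and must be installed):

    import locale

    # Hypothetical locale name; on Windows it might be "Italian_Italy".
    locale.setlocale(locale.LC_NUMERIC, "it_IT.UTF-8")

    print(locale.str(3.14))     # '3,14' -- a decimal comma, ready to collide with your delimiter
    print(locale.atof("3,14"))  # 3.14  -- reading it back requires the same locale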
