Space Cat, Prince Among Thieves

CSV: An Encoding Nightmare

The company I work for manages a self-service data import system, handling information from school districts “Student Information Systems”. There are hundreds if not thousands of SIS’s out there, but what the vast majority have in common is the ability to export CSV. It is the lingua franca, everyone can get us CSVs.

CSV looks like a simple portable format from the outside, but its looks are deceiving. If the data is pure ASCII (bytes 0-127) you’ll be fine. Yet, if there is a need for any sort of non-ASCII character, there is some work ahead.

In our case, handling school data from around the world, correctly handling non-ASCII characters is of the utmost importance.

Excel: A World of Hurt

The biggest problem is not CSV itself, but that the primary tool used to interact with it is Excel. Excel handles CSV encodings badly.

Creating a new document in Excel and saving as ”Comma Separated Values (.csv)” it uses your locale’s Windows or Mac codepage. Win-1252 and MacRoman respectively in America. If your codepage doesn’t support a character in your document, it will be silently replaced with an underscore _ character.

Because it uses codepages and not a Unicode encoding, it makes processing a painful chore. There is no way to tell the difference between different 8-bit codepages programatically.

One can use heuristics to sort them into an order of likelihood, but there is no way to ever know for sure. We chose to present the user with previews of the most likely codepages, and let them pick the correct one.

Excel for Mac: Broken Beyond Belief

The Macintosh version of Microsoft Excel is particularly harrowing.

I mentioned above that Excel saves your CSV in your locales codepage. One should note that the Mac codepages fell out of use with OS 9, almost 15 years ago. Microsoft did not get that memo.

While that by definition makes them unusable on Windows, the problem is more unfortunate than that. The Mac version can only read your locales Windows codepage. If you had any extended characters when you saved, even if they’re in your Mac codepage they are scrambled when you reopen it. This problem has persisted in every version of Mac Excel up to the current Excel 2016.

One simply cannot safely open a CSV created with the Mac version of Excel, on any platform, anywhere.

UTF-8? What’s that?

UTF-8 is the encoding of the 21st Century. It was quick in overtaking every other encoding. Almost every modern application supports it. This will be easy and we’ll be fine? Not so fast.

If one attempts to open a CSV file encoded as UTF-8 without a Byte Order Mark (BOM) as recommended, any non-ASCII characters are again scrambled.

This is the CSV format Apple’s Numbers exports by default, UTF-8 sans BOM.

If we try it again with a UTF-8 BOM prepended to the file and Excel will read it. This is deceptive because once saved the text will remain correctly encoded UTF-8, but bizarrely the BOM will be stripped causing the file to no longer be correctly readable.

Many naïve application output UTF-8 + BOM CSVs, and they read correctly but do not write correctly. I’ve seen this cause all sorts of headaches because it appears to work but does not.

The Right Way Functional Workaround

As of this writing, there exists a single usable CSV format that Microsoft Excel can both read and write safely across platforms. Tab delimited UTF-16LE with leading Byte Order Mark.

Using Excel to output into this format to begin with, you have to use the Save As dialog and choose “UTF-16 Unicode Text (.txt)”. Frustratingly, it includes a .txt extension by default which can be confusing for non-technical users. However, if you provide clear instructions to Windows users to save their filename in quotes with a CSV extension such as "example.csv" it is workable.

If you operate an application with a CSV exporter please use this as your default export format. It will save us all many headaches in the long run.

Comment by: Sam on

Sam GravatarSimple instructions:
Use google docs or libre office. Easy and FREE
Also: when your system already exports a csv file and you frequently need to edit them, your doing it wrong.

Comment by: Jesse G. Donat on

Jesse G. Donat GravatarYou forget how much mistrust there is for free software, particularly when dealing with confidential information. On top of that, most of the people we're dealing with aren't at liberty to install things on their computers.

Comment by: evonsdesigns on

evonsdesigns GravatarI was able to import correctly to excel by adding `\uFEFF` to the beginning of the file I was sending to the client from the server.

Comment by: Jesse G Donat on

Jesse G Donat Gravatar`\uFEFF` is a UTF-16BE (Big Endian) Byte Order Mark (BOM) used to indicate encoding.

Was your CSV content similarly UTF-16BE? I just tested a sample UTF-16BE CSV on Windows and Mac. Both come out scrambled, Mac being far more so.

I suspect that the content you prepended the BOM to are either Latin-1 or UTF-8 encoded text, in which case it might open correctly because you've forced Excel into an error mode, but it will not save/reopen correctly.

Comment by: David Giorgi on

David Giorgi GravatarIf you have an old version of Excel on Mac (eg Microsoft Office 2008), use LibreOffice instead

1] Open XLSX file in libre office
2] Export as CSV set filters to UTF8 encoding
3] Open again in Libre office and save as XLSX

Comment by: J Peraita on

J Peraita GravatarIf, instead of double clicking to open the file, you use Excel's "Data", "From text" option, to import the file, you can select the encoding. Usually you'll get the results you want... It handles nicely UTF-8 with no BOM (in Windows, at least)

Email address will never be publicly visible.