So, I was trying to export a CSV from MySQL to be processed by Ruby, and I was having a lot of trouble getting FasterCSV to parse the output.  It turns out that MySQL cannot produce the format described in RFC 4180: http://www.apps.ietf.org/rfc/rfc4180.html

After some doc reading and trial and error, I eventually got to this:

mysqldump -u username --tab=/Users/ablohowiak/dumpoutput --fields-enclosed-by=\"  --fields-terminated-by=, db_name table_name1 table_name2

Which got me pretty close.  The downside is that MySQL escapes newlines and quotes inside fields using the reverse solidus (the "\" character).  This posed a real problem because RFC 4180 CSVs don't use backslash escaping at all!  The only thing that needs "escaping" is the double-quote character, and any field that contains newlines, commas, or double quotes should be wrapped in double quotes.  The proper way to escape a double quote in CSV is to double it, so " becomes "".
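To make the doubling rule concrete, here's how Python's csv module (which follows the RFC 4180 quoting convention) writes a field containing a double quote:

```python
import csv
import io

# RFC 4180 style: wrap the field in double quotes and double any
# embedded double quote; no backslashes anywhere.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(['he said "hi"', 'plain'])

print(buf.getvalue())  # '"he said ""hi""","plain"\r\n'
```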

This seemed to produce a good document; however, I kept getting FasterCSV::MalformedCsvError from the mysqldump'ed CSVs.  ARGH!  Upon closer inspection, I learned something... odd.  When dumping to a flat file, MySQL wants to preserve the difference between an empty string and a NULL.  I could argue about the merit of this choice until the cows come home.  Regardless, there is no option to control the format of NULL values in the output of a --tab dump.  The only format a NULL can take is {escape}N.  So when the escape is its default \, you get \N.  When the escape is ", you get "N, which is absolutely terrible.
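To see just how terrible, feed a "N to an RFC-style parser.  This is my own illustration, not a line from the actual dump: Python's csv reader, which is lenient by default, treats the opening quote of "N as the start of a quoted field and slurps the comma and the next field into it (a strict parser like FasterCSV just raises instead):

```python
import csv
import io

# A NULL dumped as "N, followed by an ordinary quoted field.
# The stray opening quote swallows the field separator that follows it.
row = next(csv.reader(io.StringIO('"N,"value"\r\n')))
print(row)  # one mangled field instead of two
```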


AAARRRRRRRGH! 

So, I ended up googling the internets.  I even considered writing my own scanner-based CSV parser for Ruby (then I looked at the clock =/ )  Surely someone had solved this problem for me already!??!?  Turns out, someone else got me most of the way there.  I saw this post at StackOverflow: http://stackoverflow.com/questions/1105882/parsing-csv-files-with-escaped-newlines-in-ruby and it led me to Python's csv library, which is very, very well done!  While I wasn't having the newline problem mentioned in the post (yet), it did make me realize that I could have separate quote and escape characters in Python, which is what I needed.  So, I wrote a quick Python script that takes the output of the dump specified above (enclosed by " and escaped with the default \ ) and turns it into an RFC 4180-compliant CSV.

Here is that script:

import os
import csv
import glob

files = glob.glob("/Users/ablohowiak/dumpoutput/*.txt")

for f in files:
  # Read using MySQL's escaping: backslash before quotes and newlines.
  # Write back out RFC 4180 style: every field quoted, embedded quotes
  # doubled, lines terminated with \r\n.
  # (The csvs/ output directory must already exist.)
  with open(f, 'rb') as infile, open('csvs/' + os.path.basename(f), 'wb') as outfile:
    myReader = csv.reader(infile, delimiter=',', quotechar='"', escapechar='\\')
    myWriter = csv.writer(outfile, lineterminator='\r\n', delimiter=',',
                          quotechar='"', quoting=csv.QUOTE_ALL)
    for row in myReader:
      myWriter.writerow(row)
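The heart of the conversion is just reading with a backslash escape and re-writing with doubled quotes.  A minimal in-memory version of the same round trip:

```python
import csv
import io

# MySQL dump style: a quote escaped with a backslash inside a quoted field.
mysql_line = '"O\\"Reilly","second"\r\n'

# Read with escapechar='\\', then re-write RFC 4180 style, where the
# embedded quote comes out doubled instead.
row = next(csv.reader(io.StringIO(mysql_line), escapechar='\\'))
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_ALL).writerow(row)

print(out.getvalue())  # '"O""Reilly","second"\r\n'
```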


So, I ran the output files from here through my Ruby script.  Everything worked OK for the first few files, and then I got another FasterCSV::MalformedCsvError.  Crap!  Once again, I inspected the file.  This time, there was a newline in one of the fields.  Referencing the SO post above, I changed from FasterCSV to CSV and everything worked.  But it worked really, really slowly.  Forget that.  When I was reading through the FasterCSV documentation, I noticed that it tries to guess the line terminator for you.  UGH.  Don't guess! Use the standard!

Basically, what was happening is that the FasterCSV scanner was hitting the first \r in the middle of a field, and *then* trying to parse the line.  Since the end of the field (and its closing " ) hadn't been reached yet, what it thought was the first row was invalid.  The fix was easy: I just had to pass the :row_sep => "\r\n" option when calling FasterCSV.foreach and everything "just worked".

So I never really solved the MySQL NULL thing.  Now I have a bunch of CSVs that use "N" to indicate a NULL value.  I went back and checked the original MySQL dumps, and it turns out that in my 2 million rows, no field legitimately contains a bare "N".  So while it is brittle, it is a sufficient solution for me to treat any field whose value is exactly "N" as NULL.
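For what it's worth, that brittle cleanup is a one-liner per row.  This is my own sketch, not part of the original pipeline, and the choice of an empty field as the NULL stand-in is an assumption:

```python
import csv
import io

# Brittle by design: any field whose entire value is "N" is assumed
# to be a MySQL NULL and replaced with an empty field.
def null_out(row):
    return ['' if field == 'N' else field for field in row]

src = io.StringIO('"N","hello"\r\n"x","N"\r\n')
rows = [null_out(r) for r in csv.reader(src)]
print(rows)  # [['', 'hello'], ['x', '']]
```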

Tip of the hat to Python's csv library.  Wag of the finger at MySQL for the lack of options around NULLs, and a wag of the finger at FasterCSV for letting magic ruin spec conformance.

In all fairness to FasterCSV: if the first line is a header row that contains no \r or \n of its own, it will correctly pick up the \r\n at the end.