For a long time, I've enjoyed a handy method in Delphi's TStringList class to read in and write out a CSV-formatted line of text. For example,
- "Fred A.", "Thimsfrabble", "Lake Oswego", "OR"
is a typical CSV string with each field delimited by commas and each string delineated by double quotes. The CommaText function turns that string into a list of 4 strings:
- Fred A.
- Lake Oswego
Notice that both the field delimiting commas and the string quotes are removed from the parsed fields.
Now what happens if a string has an embedded quote?
- "Fred "buddy" A.", "Thimsfrabble", "Lake Oswego", "OR"
TStringList.CommaText falls apart. And well it should because the double-quotes are being used for two purposes: both as a field delimiters and as string quotes. Unfortunately, CommaText doesn't have any way to handle this.
There's another drawback to using CommaText, but it is a minor one. When exporting text, it only adds the string quotes if it really needs to. For example if I have the following fields in a TStringList:
- Fred A.
- Lake Oswego
and use it to export a CSV line, only the first and third would actually contain the string quote marks:
- "Fred A.", Thimsfrabble, "Lake Oswego", OR
This probably won't be a problem in most cases, but is inconsistent and if the application that needs to import this text expects string quotes, it will have problems on the fields without. Worse, if importing or viewing with a spreadsheet, unquoted numbers may be formatted or even interpreted incorrectly.
If you're still using Delphi 5, you're stuck with CommaText, using a third-party string library, or writing your own routines to handle these deficiencies.
However in Delphi 6, a new method appeared in the TStringList class: DelimitedText. This now allows the programmer to set the QuoteChar, a quote character that encompasses a string, and the Delimiter, a character to separate the fields. The default QuoteChar is a double-quote (") and the default Delimiter is a comma (,). Thus, by simply replacing all instances of CommaText with DelimitedText and doing nothing else, you'll have the same functionality as before (easy migration).
But with DelimitedText, you now have the capability to handle embedded quotes by using a different string QuoteChar. For example, with the pipe character (|) as QuoteChar, the following text will now be read in and parsed correctly:
- |Fred "buddy" A.|, |Thimsfrabble|, |Lake Oswego|, |OR|
And if you need to, you can even replace the field Delimiter with something else:
- |Fred "buddy" A.|~|Thimsfrabble|~|Lake Oswego|~|OR|
So why am I mentioning this now, when Delphi 6 came out 8 years ago in 2000? Because having used Delphi since Delphi 1 and Turbo Pascal before that, I've developed a set of routines for doing a lot of different things over the years and don't always read all of the What's New list in each version--especially since I don't always purchase each version. DelimitedText is actually a new find for me and just in case someone else finds themselves in the same position, I thought I'd share my late discovery.
Have a look at RFC4180 (http://tools.ietf.org/html/rfc4180) that attempts to standardize CSV formats.
If you have "s in your fields, you should double them up.
"Fred "buddy" A.", "Thimsfrabble", "Lake Oswego", "OR"
is incorrect, it should be:
"Fred ""buddy"" A.", "Thimsfrabble", "Lake Oswego", "OR"
Presumably, you should also get rid of the enclosing spaces, which can confuse parsers further:
"Fred ""buddy"" A.","Thimsfrabble","Lake Oswego","OR"
I'm fairly sure Delphi DelimitedText will read this just fine (with strictdelimiter set to true). Writing it? Try setting strictdelimiter to false, and you might only get in trouble with embedded tab characters etc.
When you have data like:
where the writers have enclosed a string in brackets because it contains a delimiter, you find a limitation in QuoteChar - it does not handle quotes that are not the same on the left and right. My example is real - it's a fragment of a national cattle database. Also, text editors now automate left and right quotes in unicode.