lazy-csv is a Haskell library for reading CSV (comma-separated value) data. It is lazier, faster, more space-efficient, and more flexible in its treatment of errors than any other extant Haskell CSV library on Hackage.
Detailed documentation of the lazy-csv API is generated automatically by Haddock directly from the source code.
You can choose between String and ByteString variants of the API, just by picking the appropriate import. The API is identical modulo the type. Here is an example program:
    module Main where

    import Text.CSV.Lazy.String
    import System

    -- read a CSV file, select the 3rd column, and print it out again.
    main = do
      [file] <- getArgs
      content <- readFile file
      let csv = parseCSV content
      case csvErrors csv of
        errs@(_:_) -> putStr (unlines (map ppCSVError errs))
        []         -> do content <- readFile file
                         let selection = map (take 1 . drop 2)
                                             (csvTable (parseCSV content))
                         putStrLn (ppCSVTable selection)
There are two useful things to note about the API, arising out of this example. First, parseCSV does not directly give you the value of the CSV table, but rather gives a CSVResult. You must project out either the errors (with csvErrors) or the values (with csvTable). Secondly, because the result of parseCSV is lazy, it is in fact more space-efficient (and also faster) to get hold of the valid table contents by reopening and reparsing the file after checking for errors. This also means, of course, that you can simply omit the step of checking for errors and ignore them if you wish.
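If you do choose to ignore errors, the program collapses to a single pass over the file. Here is a minimal sketch, using only the functions already shown above (it assumes, as described, that csvTable yields the well-formed rows):

```haskell
module Main where

import Text.CSV.Lazy.String
import System

-- Select the 3rd column, silently skipping any malformed input:
-- with no error check, the file is read only once, lazily.
main = do
  [file] <- getArgs
  content <- readFile file
  putStrLn (ppCSVTable (map (take 1 . drop 2) (csvTable (parseCSV content))))
```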
To illustrate the performance of lazy-csv, here is a micro-benchmark. We compare the same program (the example above) recoded with all of the CSV libraries available on Hackage. The libraries are:
library        | string type | parsing/lexing | results | error-reporting
---------------+-------------+----------------+---------+-------------------
csv            | String      | Parsec         | strict  | first error
bytestring-csv | ByteString  | Alex           | strict  | Nothing (Maybe)
spreadsheet    | String      | custom parser  | lazy    | first error
lazy-csv       | String      | custom lexer   | lazy    | all errors
lazy-csv       | ByteString  | custom lexer   | lazy    | all errors
lazy-csv       | ByteString  | custom lexer   | lazy    | discarding errors
The main differences are shown in the table. As far as error-reporting goes, the Parsec-based csv parser reports only the first error encountered. The Alex-based bytestring-csv stops at the first error, but gives no information about it. The spreadsheet library has a lazy parser, allowing you to retrieve the initial portion of valid data, up to the first error. The lazy-csv library reports all errors in the input, in addition to returning all the well-formed data it can find. Many possible CSV formatting errors are easily recoverable, such as an incorrect number of fields in a row, or bad quoting. Thus, with the lazy-csv library you can choose to halt on errors, to display them as warnings whilst continuing with the good data, or to ignore the errors completely and process whatever data is retrievable.
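The "warnings" style of use might be sketched like this (a sketch only, assuming the same API as the example above; note that demanding both csvErrors and csvTable from a single parse keeps the whole table live in memory, so for large inputs the reopen-and-reparse idiom shown earlier is preferable):

```haskell
import Text.CSV.Lazy.String
import System.IO (hPutStrLn, stderr)

-- Print every formatting error as a warning on stderr,
-- then carry on processing whatever well-formed rows remain.
processWithWarnings :: String -> IO ()
processWithWarnings content = do
  let csv = parseCSV content
  mapM_ (hPutStrLn stderr . ppCSVError) (csvErrors csv)
  putStrLn (ppCSVTable (csvTable csv))
```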
The choice of lazy vs strict input is extremely important when it comes to large file sizes. Here are some indicative performance figures, using as input a series of files of increasing size: 1Mb, 10Mb, 100Mb, 1Gb. For the purposes of comparison, I include three sets of numbers for the lazy-csv library: two with error-reporting, using the String and ByteString types respectively, and a third for ByteString with errors ignored. In all cases the good data is processed anyway, but the difference in error reporting leads to significant performance differences.
Finally, the nearest non-Haskell comparison I could think of is to use the Unix tool 'cut'. Of course, 'cut' (with comma as delimiter) is not a correct CSV-parser, but it does have the benefit of simplicity, and most closely resembles the lazy-csv library in terms of ignoring errors and continuing with good data. It also has lazy streaming behaviour on very large files.
library                               | 1Mb             | 10Mb            | 100Mb           | 1Gb
--------------------------------------+-----------------+-----------------+-----------------+-----------------
spreadsheet                           | runtime failure | runtime failure | runtime failure | runtime failure
csv                                   | 0.542           | 20.483          | stack overflow  | stack overflow
bytestring-csv                        | 0.273           | 2.656           | 27.187          | out of memory
lazy-csv (String)                     | 0.196           | 1.890           | 18.845          | 189.978
lazy-csv (ByteString)                 | 0.148           | 1.399           | 13.936          | 139.379
lazy-csv (ByteString, discard errors) | 0.087           | 0.817           | 8.102           | 80.835
cut -d',' -f3                         | 0.052           | 0.462           | 4.576           | 45.726
All timings are in seconds, measured best-of-3 with the unix time command, on a 2.26GHz Intel Core 2 Duo MacBook with 4Gb RAM, compiled with ghc-6.10.4 using -O optimisation.
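The reason 'cut' is not a correct CSV parser is that it treats every comma as a field separator, even inside quoted fields. A quick illustration in any POSIX shell:

```shell
# A quoted field containing a comma is wrongly split by cut:
printf '"a,b",c\n' | cut -d',' -f2
# prints: b"   (a correct CSV parser would report field 2 as: c)
```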
How much maximum live heap do these implementations use, for different input sizes? (All measured from ghc heap profiles.)
library               | 1Mb           | 10Mb  | 100Mb          | 1Gb
----------------------+---------------+-------+----------------+----------------
csv                   | 8Mb           | 120Mb | stack overflow | stack overflow
bytestring-csv        | empty profile | 52Mb  | 700Mb          | out of memory
lazy-csv (String)     | 12kb          | 12kb  | 12kb           | 12kb
lazy-csv (ByteString) | 3kb           | 3kb   | 3kb            | 3kb
My conclusions are these: laziness pays off handsomely, with the lazy-csv variants running in constant space regardless of input size, where the strict libraries blow the stack or exhaust memory; the ByteString variant is consistently faster than the String variant; and discarding errors cuts the runtime by a further 40% or so, bringing lazy-csv within a factor of two of the (simple but incorrect) 'cut'.
The package distribution contains a command-line tool called csvSelect. It is a fuller and more useful version of the demo program used to illustrate performance above. csvSelect chooses and re-arranges the columns of a CSV file as specified by its command-line arguments. Columns can be chosen by number (counting from 1) or by name (as in the header row of the input). Columns appear in the output in the same order as the arguments. A different delimiter than comma can be specified. If input or output files are not specified, then stdin/stdout are used.
    Usage: csvSelect [OPTION...] (num|fieldname)...
      select numbered/named columns from a CSV file
        -v, -V    --version       show version number
        -o FILE   --output=FILE   output FILE
        -i FILE   --input=FILE    input FILE
        -u        --unchecked     ignore CSV format errors
        -d @      --delimiter=@   delimiter char is @
Development version:
    darcs get http://code.haskell.org/lazy-csv
Current released version:
    lazy-csv-0.5, release date 2013.05.24 - on Hackage
Older versions:
    lazy-csv-0.5, release date 2013.05.24 - Fifth release, public. Fixes a bug when handling (rare) CR-only line-endings.
    lazy-csv-0.4, release date 2013.02.25 - Fourth release, first public.
    lazy-csv-0.3, release date 2011.12.12 - Third (non-public) release. Adds duplicate-header detection and repair.
    lazy-csv-0.2, release date 2011.10.11 - Second (non-public) release. Adds repairing of blank lines and short rows.
    lazy-csv-0.1, release date 2009.11.20 - First (non-public) release.
Complete Changelog
Licence: The library is Free and Open Source Software: copyright remains with the authors, but it is freely licensed for your use, modification, and re-distribution. The lazy-csv library is distributed under a BSD-like 3-clause Licence - see the file LICENCE-BSD3 for details.