Use Reducers/Transducers for better performance & resource handling
One problem with the clojure.data.csv library is that it is built upon lazy sequences, which can lead to inefficiencies when processing large amounts of data. For example, even before any transformation is done, baseline parsing of 1 GB of CSV data takes about 50 seconds on my machine, whereas other parsers available on the JVM can parse the same quantity of data in under 4 seconds.
I'd like to discuss how we might port clojure.data.csv to use a reducer/transducer model, for improved performance and resource handling. Broadly speaking I think there are a few options:
1. Implement this as a secondary alternative API in c.d.csv leaving the existing API and implementation as is for legacy users.
2. Replace the API entirely with no attempt at retaining backwards compatibility.
3. Retain the same public API contracts while reimplementing them underneath in terms of reducers/transducers: use transducers internally, but wrap them with `sequence` to preserve the current lazy-seq contract of parse-csv, while also exposing a new pure transducer/reducer-based API for users who don't require a lazy-seq based implementation.
Options 1 and 3 are essentially the same idea, except that in 3 existing users also get the benefit of a faster underlying implementation. There may also be other options.
I think 3, if possible, would be the best option.
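To make option 3 concrete, here is a minimal sketch. All names and the splitting logic here are hypothetical and for illustration only (a real implementation must handle quoting, escapes, and configurable separators): the parsing is expressed as a transducer, and the existing lazy-seq contract is preserved by wrapping it with `sequence`.

```clojure
(require '[clojure.string :as str])

;; Hypothetical transducer that parses raw lines into rows.
;; Naive split for illustration only; it ignores quoted fields.
(defn csv-rows-xf []
  (map #(str/split % #",")))

;; Legacy-compatible entry point: same lazy-seq contract as before,
;; but implemented on top of the transducer.
(defn read-csv [lines]
  (sequence (csv-rows-xf) lines))

;; New-style, non-lazy usage for users who don't need a lazy seq:
;; (into [] (csv-rows-xf) lines)
```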
Options 1 and 2 raise the question of whether we should make no attempt at backwards compatibility, or no attempt at improving the experience for legacy users.
Before delving into the details of the reducer/transducer implementation, I'm curious what the core team thinks of exploring this direction.
Here is what I meant last night, for anyone who is interested:
The only caveat that I can see is that newlines in quoted cells cannot possibly work down this path. That said, I've worked with CSV files a fair bit and I've never encountered quoted cells with newlines; not to mention that the existing lazy implementation can still be used exactly as before.
A pretty straightforward way to support transducers would be to extend the Read-CSV-From protocol to IReduceInit (allowing for an :xform param, obviously). This way one could call read-csv against the result of lines-reducible (https://blog.michielborkent.nl/2018/01/17/transducing-text/). The actual implementation would be literally a one-liner. I can prepare a PR if there is interest.
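For reference, here is a sketch of `lines-reducible` along the lines of the linked post: it wraps a `BufferedReader` in an `IReduceInit` so that lines can be reduced or transduced without building a lazy seq, and it closes the reader when the reduction finishes (including early termination via `reduced`).

```clojure
(defn lines-reducible
  "Reducible view over the lines of rdr; closes rdr when the reduction
   finishes. Adapted from the linked blog post."
  [^java.io.BufferedReader rdr]
  (reify clojure.lang.IReduceInit
    (reduce [_ f init]
      (try
        (loop [state init]
          (if (reduced? state)
            @state
            (if-let [line (.readLine rdr)]
              (recur (f state line))
              state)))
        (finally
          (.close rdr))))))

;; e.g. count the lines of a file in constant memory:
;; (transduce (map (constantly 1)) + 0
;;            (lines-reducible (clojure.java.io/reader "data.csv")))
```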
I agree not loading data into memory is a huge benefit, but we shouldn't necessarily conflate that streaming property with laziness/eagerness.
By using reducers/transducers you can still stream through a CSV file row by row and consume a constant amount of memory; e.g. reducing into a count of rows wouldn't require the rows to be retained in memory, even though it is eager. Likewise, if we used a transducer with a `CollReduce`able `CSVFile` object, you could request a lazy seq of results with `sequence`, where the parsing itself pays no laziness tax; alternatively you could request that results are loaded into memory eagerly by transducing into a vector.
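As a toy illustration of these three consumption modes, using a plain vector of lines as a stand-in for a reducible CSV source and a naive splitting transducer (both hypothetical; real parsing would handle quoting):

```clojure
(require '[clojure.string :as str])

(def lines ["a,b" "c,d" "e,f"])            ;; stand-in for a reducible CSV source
(def parse-row (map #(str/split % #",")))  ;; naive, illustration-only transducer

;; 1. Eager but constant-memory: count rows without retaining them.
(transduce (map (constantly 1)) + 0 lines)  ;; => 3

;; 2. Lazy consumption of the same transducer stack via `sequence`.
(sequence parse-row lines)

;; 3. Fully eager: load all parsed rows into a vector.
(into [] parse-row lines)  ;; => [["a" "b"] ["c" "d"] ["e" "f"]]
```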
Apologies for not providing any benchmark results with this ticket; it was actually Alex Miller who suggested I write this ticket after discussing things briefly with him on Slack, and he suggested that I needn't provide the timings because the costs of laziness are well known. Regardless, I'll tidy up the code I used to take the timings and put it into a gist or something, maybe later today.
Can you share this benchmark? I did some comparisons when I initially wrote the lib and I didn't see such big differences.
I think that the lazy approach is an important feature in many cases where you don't want all those gigabytes in memory.
If we add some non-lazy parsing for performance reasons, I would argue it should be an addition to the public API.