Pushback buffer overflow reading large json file with clojure.data.json 2.1.0
The following works with clojure.data.json version 1.0.0, but throws an exception in 2.0.2 and 2.1.0 (other versions not tested):
Execution error (IOException) at java.io.PushbackReader/unread (PushbackReader.java:179). Pushback buffer overflow`
It also throws the exception when reading from the file locally.
The problem stems from read-quoted-string which tries to read (for efficiency) a chunk of 64 bytes.
It then runs through these 64 bytes, and if it finds a end of string (\”) before it reaches 64 bytes, it will push back the remaining bytes onto the pushbackreader.
Now, you cannot push back more bytes to a pushbackreader than what is declared as its buffer size.
The reason the tests in json_test which exercises the Reader path doesn’t expose this bug, is that it doesn’t end up in read-quoted-string . Also, you’d need to have a situation where there are chars to be pushed back, which is now done in the test in the patch.
This bug was introduced as an oversight by me, since you’ll see in the code path where we read the string directly, I allocate a buffer of 64 for the pushbackreader (https://github.com/clojure/data.json/blob/master/src/main/clojure/clojure/data/json.clj#L302 ) and you will also notice that the repro gets fixed if you parse the json as a string as not as a stream (as mentioned as a workaround in the post on ask.clojure).
Applied for release.
Notifications should be going to original reporter, current assigned, and any watchers based on the notification scheme, and it seems like you are on the watch list. So I would expect you to be getting emails, unless you just added yourself to that.
Sorry I haven’t responded to this, but I haven’t got any emails about the comments
I have the same question as Andy... What's the problem, why does this fix address it? What I'm worried about is that this is related to the depth of open data arrays/maps and we are potentially recursively unreading where that matters. (Note that 1.1.0 used the default PushbackReader buffer size 1.)
Also, are there any perf implications?
Is it clear from the data.json code what the upper limit is on pushback operations, assuming it is somewhere between the default of 1 and the new proposed value in the patch of 64? I’ve dealt with pushback operations in tools.reader, and I know it can be difficult to analyze all of the cases – mainly curious if it was easy to tell in this case what the actual maximum number is, or whether 64 is just “big enough that it will work in more cases”.