Serializing+deserializing lists breaks their hash

Description

clojure.lang.PersistentList implements Serializable but a serialization roundtrip results in a hash of 0. This appears to be caused by ASeq marking its _hash property as transient; other collection base classes don't do this. I don't know if there is a rationale for the distinction, but the result is that serializing, then deserializing, a list results in its _hash property being 0 instead of either its previous, calculated value, or -1, which would trigger recalculation.

This means that any code that relies on a list's hash value can break in subtle ways.

Examples:

Approach: Use 0 as the "uncomputed" hash value, not -1. The auto initialized value on serialization construction will then automatically trigger a re-computation.

Alternate approaches:

  • Implement a readObject method that sets the _hash property to -1. This is somewhat complicated in the face of subclasses, etc.

Note: Also need to consider other classes that use -1 instead of 0 as the uncomputed hash value: APersistentMap, APersistentSet, APersistentVector, PersistentQueue - however I believe in those cases they are not transient and thus may avoid the issue. Perhaps they should be transient though.

Patch: clj-1766-2.patch
Screened by: Alex Miller

Environment

Tested on OS X 10.10.3, Clojure 1.6.0, Java 1.8.0_45-b14, but don't think this is a factor.

Activity

Show:
Alex Miller
August 18, 2015, 6:26 PM

Latest patch is identical, just does not remove the commented out code.

Alex Miller
August 9, 2015, 4:33 AM

I agree with Ghadi - there is no change in implementation here, just some commented (non-used) code was removed. Moving back to Screened.

Ghadi Shayban
August 8, 2015, 5:59 PM

The only change I see is the removal of a commented-out impl.

Rich Hickey
August 8, 2015, 3:46 PM

This patch both fixes the serialization problem and changes the implementation for no reason related to the problem. The implementation works in place in order to avoid head-holding, which the implementation change reintroduces. Screeners - please make sure patches contain the minimal code to address the problem and don't incorporate other extraneous 'improvements'.

andrewhr
August 4, 2015, 2:21 PM

As recomended, I followed the change-default-to-zero path.
Not only it sidesteps the serialization issue, it looks more aligned with
the semantics of transient. Of course, there is no guarantees
about how different JVM implementations will deal with it - sometimes
they will be serialized, sometimes they will not.

So, together with the patch I've made some manual tests between some versions. The script used is attached for further inspection.

Running on a 1.6 REPL has shown on the original described issue:

Running on 1.8 master. Despite of the = behavior looking ok, the
hashes are different between roundtrips, like we saw on 1.6:

Running on 1.8 after patch the hashes are correctly computed on both cases:

Please note I've dumped each serialized data into different files, so I could test the interoperability between those versions. What I found:

  • 1.6 serialization is already incompatible with 1.8, on both directions;

  • 1.8 data could be exchange between master and patched versions, not affecting version behavior (hashes are computed only on patched REPL).

Did I miss something or everything looks correct for you too? I'm open to do some more exhaustive testing if anyone could give me some directions about where to explore.

Completed
Your pinned fields
Click on the next to a field label to start pinning.

Assignee

Unassigned

Reporter

Christopher Vermilion

Labels

Approval

Ok

Patch

Code and Test

Priority

Major

Affects versions

Fix versions