Transient vectors use more memory than equivalent persistent vector

Description

Using clj-memory-meter to measure the data structures:

For us, this caused a 373.1MiB data structure to grow to 842.3MiB and OOM errors followed. We have keys like `[:a [:b :c]]` in our data structure, and by a rough estimate we have around 640,000 of these.

This code path is triggered by `into` and by extension `walk/postwalk`. So common core functions can trigger this higher memory usage. In fact, you can trigger it by doing `(walk/postwalk identity [[]])`.

Environment

None

Activity

Show:
Andy Fingerhut
December 4, 2020, 12:36 AM

Results from a quick reminding myself of the current implementation of PersistentVector and their transients shows that for vectors with at most 32 elements, all elements are stored in a tail array, and the root node always points at an array of 32 Object references. In most cases all vectors with at most 32 elements point at the same JVM object called clojure.lang.PersistentVector.EMPTY_NODE, so the fact that this array is being references by all vectors with at most 32 elements does not consume any noticeable amount of memory.

When you call transient! on such a vector, it immediately makes a copy of the JVM object pointed at by the root field in the PersistentVector object. Tree nodes “owned” by a transient vector are distinguished from tree nodes that might be shared among many vectors by making their edit field points at an AtomicReference object that is non-nil. When checking whether an entire transient vector has had persistent! called on it, or never has, it does so using the method ensureEditable() in class clojure.lang.PersistentVector$TransientVector. That check as written today would fail if transient! did not make a new root node, even for short vectors.

I could imagine a different implementation of class clojure.lang.PersistentVector$TransientVector such that it contained a new edit field that was used in place of root.edit everywhere that root.edit is used today, and when method asTransient was called on a PersistentVector instance it simply copied the reference to the existing root, that has root.edit.get()equal to null. It could continue sharing that root node until and unless it needed to copy it, in case any modification was made to the transient vector that needed to make such a change to the root node array contents.

I may create a patch with such a different implementation of class TransientVector, in order to think about it more in detail and see if there are any problems with it. So far, after about 10 minutes of thinking about it, the only issue I know of is that it would break the current implementation of the clojure.core.rrb-vector library, which depends upon many internal implementation details of PersistentVector and TransientVector. That would be easily updated to match, most easily by requiring developers to use a compatible version of that library depending upon the Clojure version. That library has bugs I do not know how to fix, anyway, so unless there is another library with similar dependencies on the details of class TransientVector, I would not worry about that library too much.

Andy Fingerhut
December 4, 2020, 3:15 PM

I created this small public git repo with some notes on this issue:

Andy Fingerhut
December 4, 2020, 3:22 PM

The attached patch clj-2594-v1.patch avoids allocating a new root node when transient is called on a persistent vector, no matter what size it is. This is achieved by adding a new field ‘edit’ to instances of the TransientVector class, and using that field in place of ‘root.edit’ everywhere in the code. As long as no changes are made to elements of the vector in the tree, only those in the tail, this reduces the memory allocation in the transient → persistent! round trip.

Assignee

Unassigned

Reporter

Dominic Monroe

Labels

Approval

Triaged

Patch

None

Affects versions

Priority

Minor