Range realization has a race during concurrent execution

Description

When a range instance is enumerated via next concurrently, some threads may not see the entire range sequence. When this happens next will return nil prematurely.

The demo above will reliably produce a failure every few executions like the one above.

The vast majority of the time, range is used either single-threaded or in a non-competing way where this scenario will never happen. This failure only occurs when two or more threads are enumerating rapidly through the same range instance.

Cause:

Each LongRange instance acts in several capacities - as a seq, a chunked seq, and a reducible, all of which represent independent enumeration strategies (multiple of which may be used by the user). LongRange holds 2 pieces of (volatile, non-synchronized) state related to chunking: _chunk and _chunkNext. That state is only updated in forceChunk(). forceChunk() uses the "racy single-check idiom" to tolerate multiple threads racing to create the chunk. That is, multiple threads may detect that the chunk has not been set (based on null _chunk) and BOTH threads will create the next chunk and write it. But both threads have good local values, compute the same next value, and set the same next values in the fields, so the order they finish is unimportant.

The problem here is that there are actually two fields, and they are set in the order _chunk then _chunkNext. Because the guard is based on _chunk, it's possible for a thread to think the chunk values have been set but _chunkNext hasn't yet been set.

Approach:

Moving the set for _chunkNext before the set for _chunk removes that narrow window of race opportunity.

Patch: clj-1914-3.patch

Thanks to Kyle Kingsbury for the initial reproducing case https://gist.github.com/aphyr/8746181beeac6a728a3aa018804d56f6

Environment

None

Activity

Show:
Alex Miller
July 27, 2016, 10:27 PM

Updated per request.

Stuart Halloway
July 27, 2016, 9:27 PM

This ticket would benefit from an explanation of the failure mode caused by the race, plus an explanation of the subtle reasoning behind the fix.

Ghadi Shayban
April 15, 2016, 5:22 AM

Good call. That's super subtle.

It appears all mutable assignment occurs in forceChunk() except for https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LongRange.java#L145 which should be OK (famous last words).

Alex Miller
April 15, 2016, 5:13 AM

It's not necessary to synchronize here - just swapping these two lines should address the race I think: https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LongRange.java#L131-L132

Completed

Assignee

Unassigned

Reporter

Ghadi Shayban

Labels

Approval

Ok

Patch

Code and Test

Priority

Critical

Affects versions

Fix versions