Range realization has a race during concurrent execution

Description

When a range instance is enumerated via next concurrently, some threads may not see the entire range sequence. When this happens next will return nil prematurely.

(defn enumerate [r] (loop [rr r c []] (let [f (first rr)] (if f (recur (next rr) (conj c f)) c)))) (defn demo [size threads] (let [r (range size) futures (doall (repeatedly threads #(future (enumerate r)))) res (doall (map deref futures))] (if (apply = res) (println "Passed") (println "Failed, counts=" (map count res))))) (demo 300 4) Failed, counts= (300 64 300 64)

The demo above will reliably produce a failure every few executions like the one above.

The vast majority of the time, range is used either single-threaded or in a non-competing way where this scenario will never happen. This failure only occurs when two or more threads are enumerating rapidly through the same range instance.

Cause:

Each LongRange instance acts in several capacities - as a seq, a chunked seq, and a reducible, all of which represent independent enumeration strategies (multiple of which may be used by the user). LongRange holds 2 pieces of (volatile, non-synchronized) state related to chunking: _chunk and _chunkNext. That state is only updated in forceChunk(). forceChunk() uses the "racy single-check idiom" to tolerate multiple threads racing to create the chunk. That is, multiple threads may detect that the chunk has not been set (based on null _chunk) and BOTH threads will create the next chunk and write it. But both threads have good local values, compute the same next value, and set the same next values in the fields, so the order they finish is unimportant.

The problem here is that there are actually two fields, and they are set in the order _chunk then _chunkNext. Because the guard is based on _chunk, it's possible for a thread to think the chunk values have been set but _chunkNext hasn't yet been set.

Approach:

Moving the set for _chunkNext before the set for _chunk removes that narrow window of race opportunity.

Patch: clj-1914-3.patch

Thanks to Kyle Kingsbury for the initial reproducing case https://gist.github.com/aphyr/8746181beeac6a728a3aa018804d56f6

Environment

None

Attachments

3
  • 15 Apr 2016, 05:37 AM
  • 15 Apr 2016, 05:16 AM
  • 15 Apr 2016, 05:11 AM

Activity

Show:

Alex MillerJuly 27, 2016 at 10:27 PM

Updated per request.

Stuart HallowayJuly 27, 2016 at 9:27 PM

This ticket would benefit from an explanation of the failure mode caused by the race, plus an explanation of the subtle reasoning behind the fix.

Ghadi ShaybanApril 15, 2016 at 5:22 AM

Good call. That's super subtle.

It appears all mutable assignment occurs in forceChunk() except for https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LongRange.java#L145 which should be OK (famous last words).

Alex MillerApril 15, 2016 at 5:13 AM

It's not necessary to synchronize here - just swapping these two lines should address the race I think: https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/LongRange.java#L131-L132

Completed

Details

Assignee

Reporter

Labels

Approval

Ok

Patch

Code and Test

Priority

Affects versions

Fix versions

Created April 15, 2016 at 5:09 AM
Updated August 19, 2016 at 6:35 PM
Resolved August 19, 2016 at 6:35 PM

Flag notifications