Alt Leaks Memory

Description

clojure.core.async/alts! leaks memory

The go macro in core.async transforms linear code into a series of callbacks that are attached to channels. Those callbacks close over any data they need. Because of this transformation, channels are what keep references alive: as long as channel C is alive, it references callback F, which references data D, and so on. alts! attaches a single callback to multiple channels. As a result, the lifetime of any closed-over data in a go block is the longest lifetime of any channel used in the alts!, regardless of whether the alts! actually selects that channel.

For example:

This example go block loops: it receives some large value, then waits for either a read-to-write signal or a one-hour timeout. If the read-to-write signal arrives, it writes the large value to a file and recurs; if the timeout fires, it just recurs. After macroexpansion, when the alts! actually runs, the same callback is attached to both the timeout channel and the read-to-write channel. This callback closes over the large value. Even if the alt chooses the read-to-write signal, the write is executed, and the code loops, the timeout keeps ticking for an hour, holding a reference to the callback, which in turn holds a reference to the large value.
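The go block described above might look roughly like the following sketch. It is a reconstruction from the prose, not the snippet from the ticket: `start-writer`, `values-ch`, `ready-ch`, and `write-to-file!` are hypothetical names chosen for illustration.

```clojure
(require '[clojure.core.async :as a])

;; Hypothetical names throughout; only the shape follows the description.
(defn start-writer
  "Takes a large value from values-ch, then waits for either a signal on
  ready-ch or a one-hour timeout. On a signal, calls write-to-file! with
  the large value. Stops when values-ch is closed."
  [values-ch ready-ch write-to-file!]
  (a/go-loop []
    (when-some [large-value (a/<! values-ch)]
      (let [one-hour (a/timeout (* 60 60 1000))
            ;; alts! registers one handler on BOTH ready-ch and one-hour.
            ;; large-value is used after this point, so the go block's
            ;; saved state (reachable from that handler) refers to it.
            [_ port] (a/alts! [ready-ch one-hour])]
        (when (= port ready-ch)
          (write-to-file! large-value))
        (recur)))))
```

Even when `ready-ch` wins, the `one-hour` channel is still registered with the timer machinery, which is the retention path the description is pointing at.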

This leads to extra GC pressure, which leads to high CPU load, which leads to tickets asking for a non-shared timeout channel so that it can be closed, releasing its reference to the callback.

I believe the correct solution is for alts! to clear the reference to the callback once it has chosen an alternative.
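The idea of clearing the callback once an alternative has been chosen can be illustrated with a toy model in plain Clojure. This is not core.async's handler implementation and not the code in the attached patches; `one-shot-handler` and `deliver-to!` are hypothetical names, and plain maps and atoms stand in for the real handler and channel types.

```clojure
;; Toy model only: maps and atoms standing in for core.async handlers.
(defn one-shot-handler
  "Wraps callback so that it can be claimed at most once. After it has
  been claimed, the wrapper no longer references the callback, so anything
  still holding the wrapper cannot keep the callback's closed-over data
  alive."
  [callback]
  (let [cell (atom callback)]
    {:active? (fn [] (some? @cell))
     :commit! (fn []
                ;; Atomically take the callback out and leave nil behind.
                (let [[old _new] (swap-vals! cell (constantly nil))]
                  old))}))

(defn deliver-to!
  "Simulates one 'channel' trying to deliver value to handler h.
  Returns true if this delivery won, false if another one already did."
  [h value]
  (if-let [f ((:commit! h))]
    (do (f value) true)
    false))
```

In this model, a long-lived timer that still holds the wrapper after another alternative has won would only be holding an empty cell, not the callback and the data it closes over.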

Environment

None

Activity

Kevin Downey
December 4, 2020, 7:22 PM
Edited

The attached async-234-test.clj is some code that demonstrates the leak. It effectively runs a loop, in a somewhat complicated way, allocating 10M every time through the loop and waiting on a long and a short timer. The 10M is leaked on every iteration because the long timer keeps it in memory.

The code prints out the heap usage once a second as it runs.
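A simplified sketch of that kind of test follows. It is not the attached async-234-test.clj: `used-heap-mb` and `run-demo` are hypothetical names, the timer durations are assumptions, and it prints heap usage once per round rather than once a second.

```clojure
(require '[clojure.core.async :as a])

(defn used-heap-mb
  "Approximate heap currently in use, in MiB."
  []
  (let [rt (Runtime/getRuntime)]
    (quot (- (.totalMemory rt) (.freeMemory rt)) (* 1024 1024))))

(defn run-demo
  "Runs `iterations` rounds. Each round allocates `n-bytes`, then waits on
  a long and a short timer. Blocks until finished and returns the number
  of rounds completed."
  [iterations n-bytes]
  (a/<!!
    (a/go-loop [i 0]
      (if (< i iterations)
        (let [payload     (byte-array n-bytes)
              long-timer  (a/timeout (* 60 60 1000)) ; assumed: one hour
              short-timer (a/timeout 50)             ; assumed: 50 ms
              _           (a/alts! [long-timer short-timer])]
          ;; payload is used after alts!, so it is live across the park.
          (println "round" i
                   "payload bytes" (count payload)
                   "heap MiB" (used-heap-mb))
          (recur (inc i)))
        i))))
```

Calling something like `(run-demo 20 (* 10 1024 1024))` on a core.async version affected by this issue would be expected to show heap usage growing, since each payload stays reachable through its long timer. Keep the round count small relative to the JVM heap size.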

Kevin Downey
December 4, 2020, 7:33 PM

alts-memory.png is a graph comparing runs of the test code against patched and unpatched core.async. The patched version still has a slight leak, which I suspect comes from the various bookkeeping bits of timeouts. Those still exist and keep ticking away, with no way to get rid of them.

I don't think it is possible to automatically get rid of those bookkeeping structures without some substantial retooling of core.async internals.

Kevin Downey
December 17, 2020, 9:36 PM

0002-ASYNC-234-add-a-nack-mechanism.patch builds on 0001-ASYNC-234-reduce-leaked-memory-on-alts-not-taken.patch

0001-ASYNC-234-reduce-leaked-memory-on-alts-not-taken.patch does as much clearing of memory as it can using existing mechanisms.

0002-ASYNC-234-add-a-nack-mechanism.patch adds a new `nack` mechanism (not used yet) which can be used to immediately signal to channels that a handler is no longer active, allowing for any resource cleanup.

Kevin Downey
December 17, 2020, 10:01 PM

0003-ASYNC-234-timeout2.patch builds on the nack mechanism from 0002.

0003 adds a new timeout2 function.

timeout2 provides the same interface as timeout, but internally it is
implemented rather differently. It returns a custom ReadPort which
doesn't add any references to the global timeouts-queue unless a
handler is registered (via take!). The handler goes on the
timeouts-queue directly, and the NackableHandler protocol is used to
remove any handlers from the queue that become inactive before being
timed out.

Kevin Downey
December 18, 2020, 11:17 PM

Updated 0002 and 0003 to remove a call to `satisfies?`.

Assignee

Kevin Downey

Reporter

Kevin Downey

Labels

None

Approval

None

Patch

Code

Priority

Major