The go macro in core.async transforms linear code into a series of callbacks attached to channels. Those callbacks close over any data they need. Because of this transformation, channels are what keep references alive: as long as channel C is alive, it references callback F, which references data D, and so on. alts! attaches a single callback to multiple channels. This means that ultimately the lifetime of any data closed over in a go block is the longest lifetime of any channel passed to an alts!, regardless of whether the alts! actually selects that channel.
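Roughly speaking (ignoring the state-machine details), the transformation turns a parked take into a callback registered on the channel. A minimal sketch of the idea, assuming a channel `c` is in scope:

```clojure
(require '[clojure.core.async :as async :refer [go <! take!]])

;; What you write:
(go (let [v (<! c)]
      (println v)))

;; is conceptually similar to attaching a callback to c.
;; The callback closes over whatever the rest of the block needs,
;; and c keeps the callback alive:
(take! c (fn [v] (println v)))
```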
This example go block loops: it receives some large value, then waits for either a read-to-write signal or a one-hour timeout. If the read-to-write signal arrives, it writes the large value to a file and recurs; if the timeout fires, it just recurs. After macroexpansion, when the alts! actually runs, the same callback is attached to both the timeout channel and the read-to-write channel. That callback closes over the large value. Even if the alts! chooses the read-to-write signal, the write executes, and the code loops, the timeout is still ticking away for an hour, holding a reference to the callback, which holds a reference to the large value.
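The loop described above might look like this (a sketch; the channel names and `write-to-file!` are illustrative, not from the attached code):

```clojure
(require '[clojure.core.async :as async :refer [go-loop <! alts! timeout]])

(defn writer-loop [data-ch read-to-write-ch]
  (go-loop []
    (let [large-value (<! data-ch)                  ; receive some large value
          [_ port]    (alts! [read-to-write-ch
                              (timeout (* 60 60 1000))])] ; one-hour timeout
      (when (= port read-to-write-ch)
        (write-to-file! large-value))               ; signal arrived: write it
      ;; Either way we recur, but the one-hour timeout channel still holds
      ;; the alts! callback, which closes over large-value.
      (recur))))
```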
This leads to extra GC pressure, which leads to high CPU load, which leads to tickets like this one asking for a non-shared timeout channel, so that users can close it and release its reference to the callback.
I believe the correct solution is for alts! to clear the reference to the callback once it has chosen an alternative.
The attached async-234-test.clj is some code that demonstrates the leak. It effectively runs a loop in a somewhat complicated way, allocating 10M each time through, waiting on a long and a short timer; the 10M is leaked on every iteration because the long timer keeps it in memory.
The code prints out the heap usage once a second as it runs.
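A minimal sketch of the kind of loop that exhibits the leak (this is not the attached file; sizes and timer durations are illustrative):

```clojure
(require '[clojure.core.async :as async :refer [go-loop alts! timeout]])

(defn leak-loop []
  (go-loop []
    (let [big (byte-array (* 10 1024 1024))            ; ~10M per iteration
          [_ port] (alts! [(timeout 10)                ; short timer, always wins
                           (timeout (* 60 60 1000))])] ; long timer
      ;; The short timeout fires and the loop continues, but the one-hour
      ;; timeout channel still holds the alts! callback, whose state
      ;; includes `big` — so each iteration's 10M stays reachable.
      (recur))))
```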
alts-memory.png is a graph showing the difference between running the test code with a patched and an unpatched core.async. The patched version still has a slight leak, which I suspect is the various bookkeeping bits of the timeouts; those still exist and are ticking away without a way to get rid of them.
I don't think it is possible to automatically get rid of those bookkeeping structures without some substantial retooling of core.async's internals.
0002-ASYNC-234-add-a-nack-mechanism.patch builds on 0001-ASYNC-234-reduce-leaked-memory-on-alts-not-taken.patch
0001-ASYNC-234-reduce-leaked-memory-on-alts-not-taken.patch does as much clearing of memory as it can using existing mechanisms.
0002-ASYNC-234-add-a-nack-mechanism.patch adds a new `nack` mechanism (not used yet) which can be used to immediately signal to channels that a handler is no longer active, allowing for any resource cleanup.
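The shape of such a mechanism might look like the following (a hypothetical sketch of the idea, not the actual patch contents; the protocol and function names here are assumptions):

```clojure
;; A handler that can be "nacked": told it will never be committed,
;; so any port holding a reference to it can release its resources.
(defprotocol NackableHandler
  (nack! [h]
    "Signal that this handler is no longer active; the port holding it
     may drop its reference and clean up."))

;; When alts! commits one alternative, it could then nack! the handlers
;; it registered on every other port, instead of leaving them behind.
```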
0003-ASYNC-234-timeout2.patch builds on the nack mechanism from 0002.
0003 adds a new timeout2 function.
timeout2 provides the same interface as timeout, but internally is implemented rather differently. It returns a custom ReadPort which doesn't add any references to the global timeouts-queue unless a handler is registered (via take!). The handler goes on the timeouts-queue directly, and the NackableHandler protocol is used to remove any handlers from the queue that become inactive before being fired.

Updated 0002 and 0003 to remove a call to satisfies?.
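Since timeout2 keeps timeout's interface, usage would be unchanged (a sketch, assuming timeout2 is exposed alongside timeout; `work-ch` is illustrative):

```clojure
(require '[clojure.core.async :as async :refer [go alts!]])

(go
  (let [[v port] (alts! [work-ch (timeout2 5000)])]
    ;; Creating the timeout2 port puts nothing on the global timeouts-queue;
    ;; only the take! performed by alts! registers a handler there, and that
    ;; handler can be removed via nack! when the other alternative wins.
    v))
```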