Add a closable-timeout function that always return a fresh channel
This is my first patch in a core Clojure project. I will try to be as clear and verbose in the problem statement,
the solution the patch brings and the value proposition.
The timeout function returns a channel that maybe unsafe to call close! on. It's because in an application that creates
several timeout channels very closely in time, the newly created timeout channels will in fact be the previously created channels
if the call to `timeout` happens less than clojure.core.async.impl.timers/TIMEOUT_RESOLUTION_MS after the previous ones (currently 10 ms).
It means that manually closing a channel causes code that creates new ones shortly after will in fact retrieves a closed channel.
Taking from it, using alt! etc hence exhibits unwanted behavior.
It happened to us when we tried to create a safer take in the context of a high traffic website (10+ millions page views/day), which uses timeout channels
against normal channels for calling external services and ensures that no go block is indefinitely waiting for a channel to clause.
Here's the code, to give more context to the discussion:
Using timeout channels (with long timeouts of 2 minutes) without closing them solved our core problem (too many TCP sockets in CLOSE_WAIT state)
but immediately caused the application to use approximately 10%+ CPU and way more memory than before.
After analyzing the problem, we concluded that leaving the timeout channels as is resulted in them continuously executing the code
that checks if their time had passed, even if the "safe take" go block had returned. And most of the time, it's not the external
services that time out or fail, they just work and respond fast.
So for us, closing (with close!) the timeout channels immediately after the alt! failed because of the TIMEOUT_RESOLUTION_MS sharing.
After the first <?, subsequent calls to <? that happens shortly after would fails since previous timeout channels were reused and already closed.
In the context of a web application it's very common to do several requests to external services (and coordinates them with core.async)
before being able to send the response. I believe many other kind of applications would behave the same way.
Our solution consists in creating another timeout function that:
doesn't try to reuse channels created in the TIMEOUT_RESOLUTION_MS window
always returns a fresh channel
doesn't add the newly created channel to the ^ConcurrentSkipListMap timeouts-map to avoid collaboration with other normal timeout channels.
Current name is closable-timeout, to be validated. It could be safe-timeout or something else.
Please note that we already implemented and deployed the solution to production. We measured precisely the impact and are
confident that this is potentially a good patch that the wider Clojure community would benefit from. What we measured:
When using the normal timeout channels (without closing them manually), memory usage (committed heap) reached our maximum level very quickly (9GB)
on all our 4 machines and stabilized at this level. Our understand is that it's because each of our requests would create
on average 4 timeout channels that would be garbage collected at least 2 minutes after.
During this 2 minutes window we would get lots of requests, so lot of garbage would be be created before being collected.
The garbage collector (G1) would use on average 3-4% CPU time working (young generation).
When using our custom timeout channels, memory usage (committed heap) immediately returned to normal (2-5GB) depending on the machine
and stabilized at this level. We believe that it's because garbage channels are collected much faster so they don't pollute the heap
and create much less data to collect for the GC. CPU time spent in collection decreased back to our normal levels of operation (less than 0.5%).
Our understanding of G1 is that one of its specialty is doing much more frequent garbage collections of small amount of data,
and doing so in a much more efficient way than previous GC. Being able to close timeout channels manually allowed our application
to align with G1 behaviour closely.
The patch includes a test that demonstrates the problem, as well as another test that just use instead the new closable-timeout channel
that doesn't exhibit the problem.
Why submitting a patch while it can be done in user-space? Because to implement this functionality we had to rely on the
clojure.core.async.impl.timers namespace which should be an implementation detail, and also had to add the channel
to the ^DelayQueue timeouts-queue from the same namespace by using the @#' trick (like this (def ^DelayQueue async-timeouts-queue @#'clojure.core.async.impl.timers/timeouts-queue)).
Hence any refactoring of core.async could break our code or anyone doing the same thing.
Things to consider:
We haven't tested with other garbage collectors, nor on the GraalVM.
We didn't change at all the current timeout implementation, understand that this TIMEOUT_RESOLUTION_MS trick is a performance optimization.
We added a note in the docstring of the normal timeout about the fact that normal timeout channels are unsafe to close manually.
We wonder if it would be better, for normal timeout channels to throw an Exception when calling close! on them since it's unsafe.
I haven't given lots of thoughts about CLJS core.async, since this repo contains only Clojure code but both CLJ and CLJS tests (no idea how the thing works...)
PS: I have no idea if I should have run some script from the script folder.
PS2: Not sure if this is an enhancement or a defect, but as it bit us hard I chose defect.
PS3: I'm open to all comments, no string attached.
$ java -version
openjdk version "1.8.0_192"
OpenJDK Runtime Environment (build 1.8.0_192-b26)
OpenJDK 64-Bit Server VM (build 25.192-b26, mixed mode)
the attached 0001-ASYNC-225-reduce-leaked-memory-on-alts-not-taken.patch replaces a direct reference to the handler (fret), with a mutable reference (an atom) and clears the atom once the alt commits, avoiding keeping references to fret (and any large data it closes over) longer then needed.
The patch doesn't add new tests, but mvn test continues to pass
if we had a notion of nacks (like cml does) I believe we could actually do referencing count on delayqueue entries and clear them before the timeout expires when all the alts a timeout channel is part of don't select the timeout.
I'll try and get a proper patch and test together but something like:
should solve a lot of these kinds of issues
This has nothing to do with the shared nature of channels, and everything to do with the way timeout channels are kept alive by the scheduler, and in turn potentially keep alive references to the fns closing over data representing go blocks.
The real solution is to patch alts to clear out that closed over data when it comits to a handler.