Unrolled small vectors
As discussed on the mailing list , this patch has two unrolled variants of vectors and maps, with special inner classes for each cardinality. Currently both grow to six elements before spilling over into the general versions of the data structures, which is based on rough testing but can be easily changed. At Rich's request, I haven't included any integration into the rest of the code, and there are top-level static create() methods for each.
The sole reason for this patch is performance, both in terms of creating data structures and performing operations on them. This can be seen as a more verbose version of the trick currently played with PersistentArrayMap spilling over into PersistentHashMap. Based on the benchmarks, which can be run by cloning cambrian-collections  and running 'lein test :benchmark', this should supplant PersistentArrayMap. Performance is at least on par with PAM, and often much faster. Especially noteworthy is the creation time, which is 5x faster for maps of all sizes (lein test :only cambrian-collections.map-test/benchmark-construction), and on par for 3-vectors, but 20x faster for 5-vectors. There are similar benefits for hash and equality calculations, as well as calls to reduce().
This is a big patch (over 5k lines), and will be kind of a pain to review. My assumption of correctness is based on the use of collection-check, and the fact that the underlying approach is very simple. I'm happy to provide a high-level description of the approach taken, though, if that will help the review process.
I'm hoping to get this into 1.7, so please let me know if there's anything I can do to help accomplish that.
Screener Notes: The approach is clear and understandable. Given the volume of generated code, I believe that best way to improve confidence in this code is to get people using it asap, and add collection-test  to the Clojure test suite. I would also like to get the generator  included in the Clojure repo. We don't need to necessarily automate running it, but would be nice to have it nearby if we want to tweak something later.
Here's a relevant issue from Google's Guava project, where they also found serious performance degradation with fixed-size collections: https://github.com/google/guava/issues/1268. It has a lot of interesting detail about how the performance breaks down.
Rich these are good insights - do you have a benchmark that you are using as representative of real world code?
I agree that it is great if we can avoid call sites becoming megamorphic, though I also believe the ship has sailed on that one already when you consider the multiple types of IPersistentVector that already exist (MapEntry, PersistentVector, SubVector plus any library-defined IPersistentVector instances such as clojure.core.rrb-vector). As a consequence, the JVM is usually not going to be able to prove that a specific IPersistentVector interface invocation is monomorphic, which is when the really big optimisations happen.
In most of the real world code that I've been working with, the same size/type of vector gets used repeatedly (Examples: iterating over map entries, working with a sequence of size-N vectors), so in such cases we should be able to rely on the polymorphic inline cache doing the right thing.
The idea of a single Tuple class for sizes 0-4 is interesting, though I can't help thinking that a lot of the performance gain from this may stem from the fact that a lot of code does stuff like (reduce conj  .....) or the transient equivalent which is a particularly bad use case for Tuples, at least from the call site caching perspective. There may be a better way to optimise such cases rather than simply trying to make Tuples faster.... e.g. calling asTransient() on a Tuple0 could perhaps switch straight into the PersistentVector implementation.
I'm pretty sure that all of my real-world applications of the tuples (via clj-tuple) have been fixed cardinality, and wouldn't have surfaced any such issue. Thanks for putting the idea through its paces.
I think everyone should back off on their enthusiasm for this approach. After much analysis, I am seeing definite negative impacts to tuples, especially the multiple class approach proposed by Zach. What happens in real code is that the many tuple classes cause call sites that see different sized vectors to become megamorphic, and nothing gets adequately optimized. In particular, call sites that will see tuple-sized and large vectors (i.e. a lot of library code) will optimize differently depending upon which they see often first. So, if you run your first tight loop on vector code that sees tuples, that code could later do much worse (50-100%) on large vectors than before the tuples patch was in place. Being much slower on large collections is a poor tradeoff for being slightly faster on small ones.
Certainly different tuple classes for arity 0-6 is a dead idea. You get as good or better optimization (at some cost in size) from a single class e.g. one with four fields, covering sizes 0-4. I have a working version of this in a local branch. It is better in that sites that include pvectors are only bi-morphic, but I am still somewhat skittish given what I've seen.
The other takeaway is that the micro benchmarks are nearly worthless for detecting these issues.
Hi Zach, thanks for that!
Have updated the results -
Google docs: https://goo.gl/khgT83
Note that I've added an extra sheet/tab to the Google doc for your own numbers at https://gist.github.com/ztellman/3701d965228fb9eda084.
Am still struggling to produce results that show any consistent+significant post-JIT benefit to either of the tuple implementations against the micro benchmarks and one larger small-vec-heavy system I had handy.
It's looking to me like it's maybe possible that the JIT's actually optimising away most of the non-tuple inefficiencies in practice?
Of course it's very possible that my results are off, or my expectations wrong. The numbers have been difficult to pin down.
It definitely helped to have a standardised reference micro benchmark to work against (https://github.com/ztellman/tuple-benchmark). Could I perhaps suggest a similar reference macro benchmark (maybe something from core.matrix, Mike?)
Might also be a good idea to define a worthwhile target performance delta for ops like these that run in the nanosecond range (or for the larger reference macro benchmark)?
Just some thoughts from someone passing through in case they're at all useful; know many of you have been deeply involved in this for some time so please feel free to ignore any input that's not helpful
Appreciate all the efforts, cheers!