subseq, rsubseq enhancements to support priority maps?

Description

See dev thread at http://groups.google.com/group/clojure-dev/browse_thread/thread/fdb000cae4f66a95.

Excerpted from Mark Engelberg's message of July 14, 2010 in that discussion thread:

It will only be possible to properly implement subseq on priority maps with a patch to the core. subseq achieves its polymorphic behavior (over sorted maps and sorted sets) by delegating much of its behavior to the Java methods seqFrom, seq, entryKey, and comparator. So in theory, you should be able to extend subseq to work for any new collection by customizing those methods.

However, the current implementation of subseq assumes that the values you are sorting on are unique, which is a reasonable assumption for the built-in sorted maps which sort on unique keys, but it is then impossible to extend subseq behavior to other types of collections. To give library designers the power to hook into subseq, this assumption needs to be removed.

Approaches:

1. A simple way is to allow for dropping multiple equal values when subseq massages the results of seqFrom into a strictly greater than sequence.

2. A better way is for subseq to delegate more of its logic to seqFrom (by adding an extra flag to the method call as to whether you
want greater than or equal or strictly greater than). This will allow for the best efficiency, and provide the greatest possibilities for future extensions.

Note: subseq currently returns () instead of nil in some situations. If the rest of this idea dies we might still want to fix that.

Patch clj-428-code-v5.patch implements approach #2 above. It preserves the current behavior of returning () instead of nil.

Patch: clj-428-code-v5.patch tests in clj-428-tests-v5.patch

Environment

None

Attachments

2

Activity

Show:

Alex Miller June 27, 2018 at 6:03 AM

So you could have for example a Subseqable interface and if subseq/rsubseq detect it, they just pass control to the collection.

Alex Miller June 27, 2018 at 6:02 AM

I think that is a helpful restatement of the issue. I feel like "fixing" subseq with the drop logic is just doing more papering over this disconnect when it would be more helpful (and more efficient) to allow collections to directly do what the user needs.

Mark Engelberg June 27, 2018 at 5:55 AM

The crux of the problem is that subseq allows <=, <, >=, > but seqFrom only allows <=, >= and then subseq "massages" the output of seqFrom into <, > in a way that drops up to one item, which doesn't work if there is more than one equal item. The reason I consider this a design flaw is that you can create an implementation of seqFrom that does exactly what the implementation requires (generating appropriate <= and >= subsequences) and subseq and rsubseq may still not work correctly.

Way back when, the thinking seemed to be, "Hey, this mismatch between seqFrom and subseq/rsubseq is kind of screwy. Let's let the collection do all the subseq/rsubseq logic in its implementation of seqFrom so that there is no need for massaging two possible subseqs into four possible."

It is possible to fix subseq and rsubseq to drop as many items as necessary so no change is required to the interface, but if the collection could do all the logic directly it may be possible for the collection to create <, > subsequences more efficiently than a fix at the subseq and rsubseq level.

I see your point about a Sorted2 interface.

Alex Miller June 27, 2018 at 5:21 AM

Let me just mention that this is the first time I ever remember looking at this ticket as it largely predates my involvement in Clojure core. I know it's been years and all that and I'm trying to actually take an honest stab at thinking through this. But I'll probably ask some dumb questions as I rewalk whatever path you've been down.

I'm definitely in agreement wrt making good collection traits able to be utilized by impls both inside and outside Clojure core.

With respect to the feasiblity of changing the protocol, I think maybe you mistook my intent there. We can certainly change / extend the protocols, we just need to do it in a way that grows (leaving existing stuff in place) rather than breaks (by changing existing method signatures). My objection is to changing the signature of Sorted.seqFrom() which breaks all existing impls - that's just a non-starter. But it's totally ok to extend Sorted to Sorted2 (or whatever more meaningful name) with a new arity of seqFrom. The subseq code can then do the best thing if a coll is Sorted2 and fall back to something else if just Sorted. Various ways to do that, but you get the intent hopefully - existing stuff continues to work without change, new stuff is leveraged if available.

Backing way up, it seems like seqFrom and friends are not rich enough to cover all the things that might be desired across different data structures. I continue to think that NavigableSet has a lot of good prior thinking (not saying it's a direct answer, but it's really the same problem area) and it would be profitable to look a little more broadly at some of the other data structures that have been added in the intervening years to see what we could enable.

Really, the whole impl of subseq is kind of gross and relies on a lot of moving parts and assumptions. Maybe that whole subseq op should actually be given to the collection to the collection to do directly.

Mark Engelberg June 27, 2018 at 1:46 AM

I believe that one important measure of a language is the degree to which its internal abstractions, used for its own data structures, are open for extension. For that reason, I have long believed this is an issue worth addressing. The assumption baked into subseq and rsubseq that Sorted collections can only have unique sorted values was surely an oversight in the initial code, one that has not only prevented priority-map from hooking into subseq and rsubseq, but also several clever data structures by Marczyk. I don't think there's something interesting beyond what is in Sorted that we need to try to understand – the problem is that implementing Sorted's seqFrom interface is not currently sufficient for subseq and rsubseq, as they are currently defined, to provide their specified behavior. You can view this as either a deficiency in Sorted or a deficiency in subseq/rsubseq.

Marco, the latest version of clojure.data.priority-map contains a patched version of subseq and rsubseq that doesn't rely on changes to the underlying Sorted interface to fix its incorrect behavior on sorted collections with duplicate sorted values. Feel free to use that implementation with priority-map or any other similar data structures you might be building.

I can't speak for the Clojure team, but I would think that if someone were to take the initiative to do extensive benchmarking on the subseq and rsubseq in clojure.data.priority-map and establish that there is no performance hit on existing sorted Clojure data structures, then they might be more inclined to move those patched versions into Clojure's core, since there'd be a clear benefit with zero downside. (Or alternatively, if there is a performance hit with respect to Clojure's existing sorted data structures, figure out how to tweak it so there is no performance penalty, which should be easy to do). The needed changes to subseq and rsubseq were simple: where they currently change a <= subsequence into a < subsequence by calling rest to drop the first item, I merely changed them to call drop-while to drop all equal items off the front of the sequence, since there might be more than one. When this issue was first raised, core team members indicated they would rather see this fixed at the protocol level, but if that is no longer feasible, I would like to see this change to subseq and rsubseq considered. It seems like an easy way to fix this longstanding issue that prevents subseq and rsubseq from working with all data structures that fulfill Sorted's seqFrom implementation.

Details

Assignee

Reporter

Patch

Code and Test

Priority

Affects versions

Created August 20, 2010 at 2:01 PM
Updated June 27, 2018 at 6:04 AM