Invalid calls to clojure.set functions return an incorrect answer rather than error

Description

There are a number of tickets concerned with the fact that the set functions in clojure.set misbehave when passed arguments that are not sets.

This set of issues include CLJ-810, CLJ-1087, CLJ-1682, and CLJ-1954

The functions affected by this are:

  1. difference

  2. intersection

  3. union

  4. subset?

  5. superset?

as these are known to produce unexpected results when passed non-set arguments.

Problem
As the above mentioned issues suggest, todays implementation of these functions leads to confusion and erroneous results when called with non-set input. The user is given no warning or indications of the error he's making.

Possible solutions

  1. Add a coercion to set on the arguments said functions

  2. Throw an exception when the arguments are not sets

  3. Handle this with clojure.spec

  4. Leave it as is

Tradeoffs

  1. Given CLJ-2362, which makes a call to set close to a noop, the coercion should not incur much of a performance penalty. It has been argued that the code might even be faster, as type hints can be given and the compiler/jit might make better choices. For the common mistakes (passing vectors/lists instead of sets) it should be backwards compatible

  2. Throwing an exception on non-set arguments would break programs which work correctly today (although by chance), such as data.diff.

  3. Handling it with clojure.spec seems like a viable option, but again, this would break data.diff if the functions were spec'ed to both receive and return sets.

  4. Leaving it as it is, and we will continue to surprise both new and old clojurists.

Evidence of this being a problem

  1. The tickets mentioned above seem to indicate that people stumble upon this often enough to file issues

  2. https://clojuredocs.org/clojure.set/superset_q#example-5b5acd38e4b00ac801ed9e39

Environment

None

Activity

Show:
Alex Miller
November 16, 2018, 7:34 PM

Before being considered, this ticket needs:

  • a good problem statement

  • an assessment of where existing code calls these functions with something other than sets

  • a table of alternatives and their tradeoffs. presumably alternatives are: add specs, add validation checks, add coercions, etc. Tradeoffs may include effects on existing callers (known or unknown), performance, etc.

Decisions to make:

  • Are existing calls (with inputs that are not sets) valid or invalid?

  • What change should be made in the functions

Patches are probably not super useful until those things are decided.

Michiel Borkent
November 17, 2018, 12:52 PM

Interesting data from a repo by Andy Fingerhut comparing the performance of:

  • versions of set functions with pre-conditions (no coercion)

  • instrumented set functions

The functions with pre-conditions are significantly faster than the instrumented ones, but not much slower than the originals.

https://github.com/jafingerhut/funjible#wait-isnt-this-what-clojurespec-is-for

Stuart Halloway
November 18, 2018, 1:40 PM

A spec test could compare inputs and outputs and not break any existing code.

Jan Rychter
February 15, 2019, 9:20 PM

I'll attempt a first (minimal) step here. Since set functions have docstrings that begin with "Return a set", one can reasonably expect that they will always return a set.

The specific case that I encountered was `(clojure.set/difference nil #{1})` returning nil. The reason why the nil appeared was because of optionality: a set operation was performed and the source data was optional in the map. Data passed spec validations (because the key was optional), and only later tripped other validations because of the nil returned by set/difference. I realize that the arguments are incorrect and I do not necessarily expect auto-coercion.

What I would expect is a post-condition error that would show up in code compiled with assertions.

Specifically, I think a `{ost [(set? %)]}` post-condition on set functions is something that can be agreed upon. It doesn't address coercion, and it avoids the discussion of what are valid arguments, it just places a restriction on the return value according to documentation.

Given what the docstrings say, I do not think there is code out there that relies on set functions returning anything else than a set.

The performance impact of a post-condition would only be present if compiled with assertions.

I realize this doesn't address all issues mentioned, but it's perhaps a way to start moving forward.

I do not think this is a "Major" issue as currently marked.

Assignee

Unassigned

Reporter

Erik Assum

Labels

None

Approval

Triaged

Patch

None

Priority

Major
Configure