Clojure reader should treat non-breaking space as whitespace character

Description

Right now, Clojure uses Character.isWhitespace(ch) || ch == ',' as the definition of whitespace in the reader. However, for obscure reasons (it was defined a long time ago), Character.isWhitespace intentionally excludes U+00A0 (no-break space), U+2007 (figure space), and U+202F (narrow no-break space). Logically, though, these characters should be treated as normal whitespace for every purpose except text formatting (e.g. the newer Character.isSpaceChar fixes that and does treat them as space characters).
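The difference between the two JDK predicates can be checked directly. The following is a minimal sketch; the class name WhitespacePredicates is made up for illustration and is not part of Clojure or the patch.

```java
// Compares Character.isWhitespace and Character.isSpaceChar on the three
// characters discussed above. The class name is hypothetical.
class WhitespacePredicates {
    // No-break space, figure space, narrow no-break space.
    static final char[] NON_BREAKING = { '\u00A0', '\u2007', '\u202F' };

    public static void main(String[] args) {
        for (char ch : NON_BREAKING) {
            // Prints the code point and what each predicate reports for it.
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    (int) ch,
                    Character.isWhitespace(ch),
                    Character.isSpaceChar(ch));
        }
    }
}
```

For each of the three characters, isWhitespace reports false and isSpaceChar reports true.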

Why this is important: if a non-breaking space is inserted by accident (e.g. by pressing Option+Space on a Mac), it will be very hard to find the source of the error in otherwise very innocent-looking code.

The attached patch implements a Util.isWhitespace method that returns true for all characters treated as whitespace by Character.isWhitespace and for those 3 exceptions. All places in the reader that referenced Character.isWhitespace are modified to call the new Util.isWhitespace instead.
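A rough sketch of what such a method could look like, based only on the description above. This is not the code from the attached patch; the class ReaderWhitespace and the helper isReaderWhitespace are hypothetical names used for illustration.

```java
// Hypothetical stand-in for the Util.isWhitespace method described above.
class ReaderWhitespace {
    // True for everything Character.isWhitespace accepts, plus the three
    // non-breaking characters it deliberately excludes.
    static boolean isWhitespace(int ch) {
        return Character.isWhitespace(ch)
                || ch == 0x00A0   // no-break space
                || ch == 0x2007   // figure space
                || ch == 0x202F;  // narrow no-break space
    }

    // Reader-level definition from the description: whitespace or comma.
    static boolean isReaderWhitespace(int ch) {
        return isWhitespace(ch) || ch == ',';
    }
}
```

Keeping the comma check separate from the character-class check mirrors the existing definition quoted in the description, where the comma is handled alongside Character.isWhitespace rather than inside it.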

Patch: clj-2207-nbsp-v3.patch

Prescreened by: Alex Miller

Environment

None

Activity

Andy Fingerhut
July 27, 2017, 4:13 PM

A quick test with Java shows that it allows non-breaking spaces inside comments and strings, but not in any of the few other places in a Java source file that I tested (e.g. in the middle of an identifier, in the middle of other whitespace).

Nikita Prokopov
July 28, 2017, 4:08 AM

> Would it be reasonable to treat nonbreaking spaces in Clojure source as an error

It might work, but why? Why is the use of some whitespace characters allowed while other characters, also whitespace, are forbidden? There is nothing magical or special about non-breaking spaces. Their intent is that they are just like normal spaces. Treating them differently would just confuse people.

Andy Fingerhut
July 28, 2017, 4:46 AM

Treating them as errors wouldn't be confusing at all – the compiler tells you where they are, and you change them to normal spaces in your Clojure source code and move on. How would that confuse people?

Nikita Prokopov
July 28, 2017, 4:52 AM

It’s like saying you can’t use the letter A in code, only in strings. What’s wrong with the letter A? Why can’t I use it? If there’s a special rule about it, there had better be a reason too.

Andy Fingerhut
July 28, 2017, 4:55 AM

Maybe I can clarify my last question a little bit. Here are 3 alternatives (not intended to be exhaustive):

1. treat non-breaking spaces and similar characters as ones that can be part of var names and symbols

2. treat them as compiler errors, unless they are in comments, strings, or regexes

3. treat them as other whitespace characters are treated.

Alternative #3 is what this ticket proposed, and it was declined.

I think alternative #1 is where Clojure is now, and my guess is that among the 3 alternatives, it is the one most likely to cause confusion for people who accidentally introduce such a character in a Clojure source file.

I believe #2 is what the Java compiler does when compiling Java source files (based only on a few quick experiments, not complete knowledge of the subject). Alex Miller mentioned taking our cue from Java, so I thought I would propose it as an alternative to #1 and #3. If a separate JIRA ticket for this idea is desirable, I'm happy to create one.
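To make the error-reporting idea in alternative #2 concrete, here is a simplified sketch of locating such characters in source text. It is an illustration only: the class NbspFinder is hypothetical, it is not part of Clojure or of any patch on this ticket, and it does not skip comments, strings, or regexes as alternative #2 would require.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports where non-breaking spaces occur in source text.
// Simplification: comments, strings, and regexes are NOT excluded here.
class NbspFinder {
    static boolean isNonBreakingSpace(char ch) {
        return ch == '\u00A0' || ch == '\u2007' || ch == '\u202F';
    }

    // Returns a "line:column" entry (both 1-based) for each non-breaking space.
    static List<String> find(String source) {
        List<String> hits = new ArrayList<>();
        int line = 1;
        int col = 1;
        for (int i = 0; i < source.length(); i++) {
            char ch = source.charAt(i);
            if (isNonBreakingSpace(ch)) {
                hits.add(line + ":" + col);
            }
            if (ch == '\n') {
                line++;
                col = 1;
            } else {
                col++;
            }
        }
        return hits;
    }
}
```

A real implementation of alternative #2 would need to track reader state so that occurrences inside comments, strings, and regexes are left alone.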

Declined

Assignee

Unassigned

Reporter

Nikita Prokopov

Labels

Approval

None

Patch

Code

Priority

Minor