Add optional predicate to string trim functions that determines if a character should be trimmed

Description

The proposal is that the trim functions (trim, triml, and trimr) would get a second arity with a function trim?:

trim? comes first to support partial.

New doc string would be:

Example test:

Similar to Python's strip - https://docs.python.org/2/library/stdtypes.html#str.strip

Approach: The proposed solution isn't very DRY but it follows the design guidelines at the top of the file, more exactly point 3:

"3. Functions take advantage of String implementation details to
write high-performing loop/recurs instead of using higher-order
functions. (This is not idiomatic in general-purpose application
code.)"

First I had a solution in which I replaced Character/isWhitespace from the current implementation by calling pred. pred was defaulted to an is-whitespace? function.
That code is of course nicer, even trim-newline could just call into trimr, removing a lot of duplication, but it adds the overhead of always calling a function, instead of calling Character/isWhitespace directly.

The only way I can see to have optimised and DRYer code is to use macros, but I don't think that it will necessary lead to nicer code.

Given the existing design style of the other functions in string.clj I felt that the best solution would be to just simply duplicate in favour of optimised code.

Environment

None

Activity

Show:
Tamas Szabo
January 27, 2016, 9:42 PM

Added new patch that renames pred to trim?

Andy Fingerhut
January 28, 2016, 12:27 AM

Note that Java, and thus Clojure/Java, uses UTF-16 encoding for strings in memory. Thus if you wanted to trim a set of Unicode code points from the beginning and/or end of a string, the API of trim? taking a single 16-bit Java character is not enough information to determine whether it should be trimmed or not.

If you want to handle that generality, it would require a more complex implementation, which checks whether the first/last character is one half of a code point that is encoded as 2 16-bit Java characters, and pass a 32-bit int to trim?, or something similar to that.

I have no objections if these API enhancements are made without enabling testing against an arbitrary Unicode code point. In the past, similar suggestions have been rejected in Clojure's built-in lib, e.g. CLJ-945

Tamas Szabo
January 28, 2016, 7:56 AM

Yes, the UTF-16 encoding and Character representing either a codepoint or a half-codepoint is a bit of a mess, isn't it?

In the Java String and Character API's the methods that accept char, handle only characters in the Basic Multilingual Plane.
trim? accepts a character, so following the same behavior it will work only for removing characters in the Basic Multilingual Plane.

I think even this would be fine, but additionally because the high/low surrogates and the BMP characters are disjoint, you could actually use the same implementation to remove Unicode code points that aren't in the BMP. You can just say that both the high and low code unit of the codepoint are "unwanted".

Ex:
𝄞 is "\uD834\uDD1E"

Andy Fingerhut
January 28, 2016, 11:11 AM

Agreed, but probably better to anti-recommend such an implementation of trimr for removing such things, because it would also remove only one UTF-16 Java character out of 2 high/low surrogates if it matched a member of the set, even if the other surrogate didn't match anything in the set, which would leave behind a malformed UTF-16 string.

Again, probably best to either not include this in the implementation at all, and at most warn about it in the docs, or to handle it in the implementation by checking for high/low surrogates in the loop(s).

Tamas Szabo
January 28, 2016, 12:00 PM

Yes, you're right. That solution won't work in all cases, so it can't be recommended.

I am slightly inclined towards having trim? accept chars and work only for removing BMP characters. This will arguably be enough for the majority of the use cases.
The other solution can be used for all use cases, but then trim? will have to accept int, or 2 chars, or a string, so trim? would be less intuitive (although closer to the real world ), and writing those trim? functions would be less user friendly.

That being said, I am happy to change the implementation to do that if it is required.

Currently, I'm not even sure if the enhancement will be accepted or rejected or what the process for that is.

Assignee

Unassigned

Reporter

Tamas Szabo

Labels

Approval

Triaged

Patch

Code and Test

Affects versions

Priority

Minor
Configure