Support Unicode characters for char-based generators
Description
Environment
Activity
gfredericks October 5, 2019 at 2:53 PM
Some ideas here: https://twitter.com/gfredericks_/status/1180307358297346053
Matthew Smith April 11, 2016 at 6:00 PM
I listed all the new generators I want to build. Basically, I want each of the normal string-based generators to have a Unicode counterpart with similar behavior. For example, keyword and symbol get ukeyword and usymbol for Unicode keywords and symbols.
Adding the apply-to from TCHECK-99 will make it easier for people to create Unicode string generators.
I expect the Unicode versions of the functions to have a very similar distribution to the current versions. The exception is the ones based on "choices", which distribute evenly across the ranges, regardless of the size of each range.
gfredericks April 10, 2016 at 11:53 PM
Are you thinking that these generators will generally have uniform distributions, and that the problem of mostly-unprintable-values is not a big enough problem to do anything special about?
Should the second group of generators include analogs for keyword, symbol, etc. as well?
I think anything that involves dozens of new generators I'll be inclined to put in a separate namespace.
Matthew Smith April 2, 2016 at 9:44 PM
;;
;; Unicode support for test.check
;;
;; Unicode support is divided into 2 sections: char based and code-point/int based
;;
;; Ranges and choices
;; Ranges are a vector of range defs
;; A range def is either
;;   a single character, or
;;   a pair (vector) of the start and end of a range
;;
;; choices is a generator that chooses from a vector of ranges. For example,
;; (choices [1 2 [100 200]])
;; would return 1, 2, and the numbers from 100 to 200. The members of the range pair, 100 and 200 in this
;; example, can be anything accepted by choose.
;;
;;
;; The char based Unicode support mirrors the normal char and string generators
;;
Standard Generator | Unicode Generator | Generates |
---|---|---|
char | uchar | valid Unicode characters (char) from \u0000 to \uFFFF. |
char-ascii | uchar-alpha | letter Unicode characters. |
(none) | uchar-numeric | digit Unicode characters. |
char-alphanumeric | uchar-alphanumeric | letter and digit Unicode characters. |
string | ustring | Unicode strings consisting of only chars. |
string-alphanumeric | ustring-alphanumeric | Unicode alphanumeric strings. |
(none) | ustring-choices | Unicode strings in the given ranges. |
namespace | unamespace | Unicode strings suitable for use as a Clojure namespace. |
keyword | ukeyword | Unicode strings suitable for use as a Clojure keyword. |
keyword-ns | ukeyword-ns | Unicode strings suitable for use as a Clojure keyword with optional namespace. |
symbol | usymbol | Unicode strings suitable for use as a Clojure symbol. |
symbol-ns | usymbol-ns | Unicode strings suitable for use as a Clojure symbol with optional namespace. |
;; Code-point or int based characters
Standard Generator | Unicode Generator | Description |
---|---|---|
string | ustring-from-code-point | Generates Unicode strings consisting of any valid code point. |
char | code-point | Generates a valid Unicode code point |
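To make the proposal more concrete, here is a minimal sketch of how `choices` and the code-point generators could be built from existing test.check primitives. The names `range-def->gen`, `code-point-sketch`, and `ustring-from-code-point-sketch` are hypothetical, and `choices` here is only one guess at the intended behavior, not the patch itself. The sketch also assumes integer range members and that "valid code point" means excluding the surrogate block D800-DFFF.

```clojure
(ns example.unicode-gen
  (:require [clojure.test.check.generators :as gen]))

(defn range-def->gen
  "Hypothetical helper: a [start end] pair becomes an inclusive gen/choose,
  anything else becomes a constant generator."
  [range-def]
  (if (vector? range-def)
    (let [[start end] range-def]
      (gen/choose start end))
    (gen/return range-def)))

(defn choices
  "Sketch of the proposed generator. gen/one-of picks each range def with
  equal probability, regardless of how many values the range covers."
  [ranges]
  (gen/one-of (mapv range-def->gen ranges)))

(def code-point-sketch
  "Assumption: valid code points are 0..0x10FFFF minus the surrogate block."
  (choices [[0 0xD7FF] [0xE000 0x10FFFF]]))

(def ustring-from-code-point-sketch
  "Builds a string by expanding each code point into one or two chars."
  (gen/fmap (fn [cps]
              (apply str (mapcat #(Character/toChars (int %)) cps)))
            (gen/vector code-point-sketch)))
```

Calling `(gen/sample (choices [1 2 [100 200]]))` should yield a mix of 1, 2, and values between 100 and 200. Because each range def is equally likely, 1 and 2 show up far more often than any individual value in the 100 to 200 range, which is the distribution difference mentioned in the discussion above.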
Matthew Smith March 7, 2016 at 1:38 PM
You make some great points. I will also review the Java Character class as it seems to have some Unicode information encoded that could be put to good use.

Currently the default char generator only covers the range from 0 to 255, but Java chars can range from \u0000 to \uFFFF. If this is of interest, I will add a patch, as I need to do this anyway.
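As a rough illustration of the request, a char generator covering the full Java char range can be sketched with `gen/choose` and `gen/fmap`. The names `uchar-sketch` and `ustring-sketch` are hypothetical and this is not the eventual patch. Note the sketch is uniform over all 65,536 values, so it will also produce unpaired surrogates and unprintable characters.

```clojure
(ns example.uchar
  (:require [clojure.test.check.generators :as gen]))

(def uchar-sketch
  "Generates any Java char with a code unit from 0 to 0xFFFF."
  (gen/fmap char (gen/choose 0 0xFFFF)))

(def ustring-sketch
  "Generates strings whose chars are drawn from uchar-sketch."
  (gen/fmap #(apply str %) (gen/vector uchar-sketch)))
```

For example, `(gen/sample uchar-sketch)` returns a handful of chars drawn from the whole range rather than only the first 256.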