Support Unicode characters for char-based generators

Description

Currently the default char generator only covers the range 0 to 255, whereas Java chars can range from \u0000 to \uFFFF. If this is of interest, I will add a patch, as I need to do this anyway.
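As a rough illustration of the request, a char generator covering the whole Java char range could be built from the existing `gen/choose` and `gen/fmap`. The name `full-range-char` is hypothetical and not part of test.check; this is a minimal sketch, not the proposed patch.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Hypothetical name: a char generator over every Java char value.
;; Note that this range includes the surrogate code units \uD800-\uDFFF,
;; which are not characters on their own.
(def full-range-char
  (gen/fmap char (gen/choose 0 0xFFFF)))

;; Draw a few sample values at the REPL.
(gen/sample full-range-char 5)
```

The distribution is uniform over all 65,536 values, so most samples will be unprintable or unassigned.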

Environment

None

Activity

gfredericks October 5, 2019 at 2:53 PM

Matthew Smith April 11, 2016 at 6:00 PM

I have listed all the new generators I want to build. Basically, I want each of the normal string-based generators to have a Unicode counterpart with similar behavior. For example, keyword and symbol get ukeyword and usymbol for Unicode keywords and symbols.

Adding the apply-to from TCHECK-99 will make it easier for people to create Unicode string generators.

I expect the Unicode versions of the functions to have a very similar distribution to the current versions. The exception is the ones based on "choices", which distributes evenly across the ranges, regardless of the size of each range.

gfredericks April 10, 2016 at 11:53 PM

Are you thinking that these generators will generally have uniform distributions, and that the problem of mostly-unprintable-values is not a big enough problem to do anything special about?

Should the second group of generators include analogs for keyword, symbol, etc. as well?

I think I'd be inclined to put anything that involves dozens of new generators in a separate namespace.

Matthew Smith April 2, 2016 at 9:44 PM

;;
;; Unicode support for test.check
;;
;; Unicode support is divided into 2 sections: char based and code-point/int based
;;
;; Ranges and choices
;; Ranges are a vector of range defs
;; A range def is either
;; A single character
;; A pair (vector) of the start and end of a range
;;
;; choices is a generator that chooses from a vector of ranges. For example,
;; (choices [1 2 [100 200]])
;; would return 1, 2, and the numbers from 100 to 200. The members of the range pair, 100 and 200 in this
;; example, can be anything accepted by choose.
;;
;;
;; The char based Unicode support mirrors the normal char and string generators
;;

| Standard Generator | Unicode Generator | Generates |
|---|---|---|
| char | uchar | valid Unicode characters (char) from \u0000 to \uFFFF |
| char-ascii | uchar-alpha | letter Unicode characters |
| | uchar-numeric | digit Unicode characters |
| char-alphanumeric | uchar-alphanumeric | letter and digit Unicode characters |
| string | ustring | Unicode strings consisting of only chars |
| string-alphanumeric | ustring-alphanumeric | Unicode alphanumeric strings |
| | ustring-choices | Unicode strings in the given ranges |
| namespace | unamespace | Unicode strings suitable for use as a Clojure namespace |
| keyword | ukeyword | Unicode strings suitable for use as a Clojure keyword |
| keyword-ns | ukeyword-ns | Unicode strings suitable for use as a Clojure keyword with optional namespace |
| symbol | usymbol | Unicode strings suitable for use as a Clojure symbol |
| symbol-ns | usymbol-ns | Unicode strings suitable for use as a Clojure symbol with optional namespace |
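The `choices` generator described in the comment above could be sketched roughly as follows. The names `choices`, `char-choices`, and `ustring-choices` follow the proposal, but the bodies are assumptions built on the existing `gen/one-of`, `gen/choose`, `gen/return`, `gen/fmap`, and `gen/vector`; they are not the code from the patch.

```clojure
(require '[clojure.test.check.generators :as gen]
         '[clojure.string :as string])

(defn choices
  "Sketch: picks one range def with equal probability, then a value inside it.
  A range def is a single value or a [start end] pair (inclusive)."
  [ranges]
  (gen/one-of
    (mapv (fn [r]
            (if (vector? r)
              (let [[lo hi] r] (gen/choose lo hi))
              (gen/return r)))
          ranges)))

(defn char-choices
  "Sketch: same as choices, but range defs are chars and the result is a char.
  Chars are converted to ints because gen/choose works on integers."
  [ranges]
  (gen/fmap char
            (choices (mapv (fn [r] (if (vector? r) (mapv int r) (int r)))
                           ranges))))

(defn ustring-choices
  "Sketch: strings whose chars all come from the given ranges."
  [ranges]
  (gen/fmap string/join (gen/vector (char-choices ranges))))

;; Usage: lowercase ASCII letters, digits, and underscore.
(gen/sample (ustring-choices [[\a \z] [\0 \9] \_]) 5)
```

Because `gen/one-of` selects among its generators uniformly, each range def is equally likely regardless of how many values it contains, which matches the distribution described for the "choices"-based generators.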

;; Code-point or int based characters

| Standard Generator | Unicode Generator | Description |
|---|---|---|
| string | ustring-from-code-point | Generates Unicode strings consisting of any valid code point |
| char | code-point | Generates a valid Unicode code point |
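The two code-point generators could look roughly like this. The names `code-point` and `ustring-from-code-point` come from the proposal; the helper `code-points->string` is hypothetical, and excluding the surrogate range U+D800 to U+DFFF is an assumption about what "valid" should mean here.

```clojure
(require '[clojure.test.check.generators :as gen])

(def code-point
  ;; Assumption: "valid" means a Unicode scalar value, so the surrogate
  ;; range U+D800..U+DFFF is filtered out of U+0000..U+10FFFF.
  (gen/such-that #(not (<= 0xD800 % 0xDFFF))
                 (gen/choose 0 Character/MAX_CODE_POINT)))

(defn code-points->string
  "Hypothetical helper: builds a String from a sequence of code points.
  Code points above U+FFFF become surrogate pairs (two chars each)."
  [cps]
  (let [sb (StringBuilder.)]
    (doseq [cp cps]
      (.appendCodePoint sb (int cp)))
    (str sb)))

(def ustring-from-code-point
  (gen/fmap code-points->string (gen/vector code-point)))

;; Usage: sample a few code points and strings at the REPL.
(gen/sample code-point 5)
(gen/sample ustring-from-code-point 5)
```

Generating ints and converting at the end avoids ever producing an unpaired surrogate, which a char-by-char generator over \u0000 to \uFFFF cannot guarantee.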

Matthew Smith March 7, 2016 at 1:38 PM

You make some great points. I will also review the Java Character class as it seems to have some Unicode information encoded that could be put to good use.
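One way the Java Character class might be put to use is to precompute the chars that satisfy a predicate such as `Character/isLetter` or `Character/isDigit` and pick from them with `gen/elements`. The helper `bmp-chars-where` is hypothetical, and the bodies of `uchar-alpha` and `uchar-numeric` below are only a sketch of how the proposed generators could be defined.

```clojure
(require '[clojure.test.check.generators :as gen])

(defn bmp-chars-where
  "Hypothetical helper: all chars in \\u0000..\\uFFFF whose code satisfies pred."
  [pred]
  (->> (range 0 0x10000)
       (filter pred)
       (mapv char)))

;; Character/isLetter and Character/isDigit are standard java.lang.Character
;; methods; the int overloads are used here.
(def letter-chars (bmp-chars-where #(Character/isLetter (int %))))
(def digit-chars  (bmp-chars-where #(Character/isDigit (int %))))

;; Sketches of the proposed generators: uniform over the matching chars.
(def uchar-alpha   (gen/elements letter-chars))
(def uchar-numeric (gen/elements digit-chars))

;; Usage: sample a few letters and digits at the REPL.
(gen/sample uchar-alpha 5)
(gen/sample uchar-numeric 5)
```

Precomputing the candidate set keeps generation cheap even for sparse classes such as digits, where filtering random chars with `gen/such-that` would usually exhaust its retries.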

Details

Created March 6, 2016 at 7:15 PM
Updated October 5, 2019 at 2:53 PM