The 32 control characters U+0000 through U+001F are never allowed in raw form in JSON strings. Any other character may be placed within the quotation marks except the two that must also be escaped: quotation mark (U+0022) and reverse solidus (U+005C).
From RFC 7159:
A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
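As a quick sketch of that rule (a hypothetical Python predicate, not the library's code):

```python
def must_escape(ch: str) -> bool:
    """True if ch may not appear raw inside a JSON string, per RFC 7159."""
    return ch == '"' or ch == '\\' or ord(ch) < 0x20

# Quotation mark, reverse solidus, and all control characters must be
# escaped; everything else may appear raw.
assert must_escape('"')
assert must_escape('\\')
assert must_escape('\x1f')
assert not must_escape('a')
```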
When :escape-unicode true (the default) is in effect, all characters outside the ASCII range 32-127 are escaped using the \uCAFE syntax (or, for the special whitespace characters, using their named escapes).
However, when :escape-unicode false is supplied to the write or write-str functions, some of the control characters are written in raw form, resulting in invalid JSON. This is improper behavior; the library should never produce JSON that violates the specification(s), no matter what options the user supplies.
This patch escapes the control characters even when :escape-unicode false is supplied.
There is a bit of special handling to exclude the named escapes in the control character range: the write-string function always escapes characters 8, 9, 10, 12, and 13 (backspace, tab, newline, form feed, carriage return), which have the short named escapes \b, \t, \n, \f, and \r and thus require special treatment.
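The resulting behavior can be sketched like this (a hypothetical Python analogue of write-string, not the actual patch): named escapes always win, any remaining control character gets a \u00XX escape regardless of the option, and other characters are escaped only when escape_unicode is true.

```python
# Characters with named JSON escapes, always written in escaped form.
NAMED = {'\b': '\\b', '\t': '\\t', '\n': '\\n', '\f': '\\f', '\r': '\\r',
         '"': '\\"', '\\': '\\\\'}

def write_string(s: str, escape_unicode: bool = True) -> str:
    out = ['"']
    for ch in s:
        if ch in NAMED:
            out.append(NAMED[ch])
        elif ord(ch) < 0x20 or (escape_unicode and ord(ch) > 0x7f):
            # Control characters are escaped even when escape_unicode
            # is False; non-ASCII only when it is True.
            out.append('\\u%04x' % ord(ch))
        else:
            out.append(ch)
    out.append('"')
    return ''.join(out)
```

With escape_unicode set to False, a raw U+0001 still comes out as \u0001 while é passes through as UTF-8; with the default True, é becomes \u00e9.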
I did not add any control character validation to the parsing functionality, following Postel's law:
[TCP] implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.
Why use :escape-unicode false at all if I'm worried about compliance? Unicode is a very good idea, and it pairs nicely with the UTF-8 encoding, which is also a very good idea: UTF-8 encodes text much more compactly than spelling out literal \uXXXX escapes. The default (:escape-unicode true) gives up that compactness in exchange for robustness, since plain ASCII is nearly impossible to mishandle, while UTF-8 can be mangled by a consumer that isn't expecting it (but every consumer should be expecting UTF-8).
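To make the size trade-off concrete, here is a rough sketch in Python (the sample string and helper are illustrative, not part of the library):

```python
def escape_non_ascii(s: str) -> str:
    # Spell every non-ASCII character as a six-character \uXXXX escape,
    # as :escape-unicode true would.
    return ''.join(ch if ord(ch) < 0x80 else '\\u%04x' % ord(ch) for ch in s)

s = "héllo wörld"
raw_len = len(s.encode("utf-8"))    # 13 bytes: accented letters take 2 bytes each
esc_len = len(escape_non_ascii(s))  # 21 bytes: accented letters take 6 bytes each
```

For text that is mostly non-ASCII the gap widens further: a CJK character costs three raw UTF-8 bytes versus six bytes escaped.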
So, in short, I want to be able to leverage UTF-8 and remain confident that I'll get valid JSON output, without having to sanitize the (unusual) control characters out of all the strings in my data.