Escape control characters 0-1F even if :escape-unicode false

Description

The 32 control characters U+0000 through U+001F are never allowed in raw form in JSON strings.

From ECMA-404:

All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F.

From RFC 7159:

A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

When :escape-unicode is true (the default), all characters outside the 32-127 range are escaped using \uXXXX syntax, e.g. \uCAFE (or, for the special whitespace cases, using named escapes).

However, when :escape-unicode false is supplied to the write or write-str functions, some of the control characters are written in raw form, resulting in invalid JSON. This is improper behavior; the library should never produce JSON that violates the specification(s), no matter what options the user supplies.
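To make the failure concrete, here is a minimal sketch in Python (not the library's language, and not its code) showing that a strict JSON parser rejects a string containing a raw control character but accepts the escaped form. It relies only on Python's standard json module; the helper name is_valid_json is hypothetical.

```python
import json

def is_valid_json(text):
    """Return True if a strict JSON parser accepts `text`."""
    try:
        # json.loads is strict by default: raw control characters
        # (code points below 0x20) inside strings are rejected.
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

raw = '"a\x01b"'         # U+0001 written in raw form inside the JSON string
escaped = '"a\\u0001b"'  # the same character written as a \u escape

print(is_valid_json(raw))      # False
print(is_valid_json(escaped))  # True
```

Any consumer using a parser this strict would fail on output that contains raw control characters.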

This patch escapes the control characters even when :escape-unicode false is supplied.

The patch takes care to leave the named escapes in the control character range alone: the write-string function already escapes the characters that have short escape names (8, 9, 10, 12, 13, i.e. \b, \t, \n, \f, \r) regardless of :escape-unicode, so the new handling only has to cover the remaining control characters.
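The intended behavior can be sketched as follows. This is an illustrative Python version of the escaping rule described above, not the patch itself. The names NAMED_ESCAPES and escape_json_string are hypothetical, and emitting lowercase \u00xx for the unnamed control characters is an assumption.

```python
# Characters with short escape names in JSON: the five named control
# characters, plus the quotation mark and reverse solidus.
NAMED_ESCAPES = {
    0x08: "\\b", 0x09: "\\t", 0x0A: "\\n", 0x0C: "\\f", 0x0D: "\\r",
    0x22: '\\"', 0x5C: "\\\\",
}

def escape_json_string(s):
    """Quote `s` as a JSON string, leaving non-ASCII characters raw."""
    out = ['"']
    for ch in s:
        cp = ord(ch)
        if cp in NAMED_ESCAPES:
            out.append(NAMED_ESCAPES[cp])  # named escape, always applied
        elif cp < 0x20:
            out.append("\\u%04x" % cp)     # remaining control characters
        else:
            out.append(ch)                 # everything else stays raw
    out.append('"')
    return "".join(out)

print(escape_json_string("tab\there \x01 caf\u00e9"))
# prints "tab\there \u0001 café" (including the surrounding quotes)
```

Only the first two branches ever produce an escape, so non-ASCII text passes through untouched while the output stays valid JSON.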

I did not add any control character validation to the parsing functionality, following Postel's law:

[TCP] implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.


Why use :escape-unicode false at all if I'm worried about compliance? Well, Unicode is a really good idea, and it pairs very nicely with the UTF-8 character encoding, which is also a really good idea. UTF-8 encodes non-ASCII text much more compactly than spelling out \uXXXX escapes. The default (:escape-unicode true) gives up that compactness. That is a trade-off: pure ASCII output is nearly impossible to screw up, whereas UTF-8 can be mishandled by a consumer that isn't expecting it (but you should be expecting UTF-8).
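As a rough illustration of the size difference, the following Python sketch compares the byte length of a short string encoded as raw UTF-8 with the same string written using \uXXXX escapes. The sample text is arbitrary and was chosen to contain only characters in the Basic Multilingual Plane, so each escape is exactly six bytes.

```python
# Sample text "café 日本語", written with escapes so the source stays ASCII.
text = "caf\u00e9 \u65e5\u672c\u8a9e"

# Raw UTF-8: ASCII is 1 byte each, é is 2 bytes, each CJK character is 3.
raw_utf8 = text.encode("utf-8")

# ASCII-only form: every non-ASCII character becomes a 6-byte \uXXXX escape.
escaped_ascii = "".join(
    ch if ord(ch) < 128 else "\\u%04x" % ord(ch) for ch in text
).encode("ascii")

print(len(raw_utf8))       # 15
print(len(escaped_ascii))  # 28
```

For text that is mostly non-ASCII, the escaped form approaches two to three times the size of the raw UTF-8 form.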

So, in short, I want to be able to leverage UTF-8 and remain confident that I'll get valid JSON output, without having to sanitize the (unusual) control characters out of all the strings in my data.

Environment

Observed on CentOS 7, Mac OS X

Status

Assignee

Unassigned

Reporter

import

Labels

Approval

None

Patch

Code and Test

Priority

Major