data.json / DJSON-28

Escape control characters 0-1F even if :escape-unicode false

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Labels:
    • Environment: Observed on CentOS 7, Mac OS X
    • Patch: Code and Test

      Description

      The 32 control characters U+0000 through U+001F are never allowed in raw form in JSON strings.

      From ECMA-404:

      All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F.

      From RFC 7159:

      A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

      When :escape-unicode is true (the default), all characters outside the 32-127 range are escaped using \uXXXX syntax (for example \uCAFE) or, for the special whitespace cases, using named escapes such as \n and \t.

      However, when :escape-unicode false is supplied to the write or write-str functions, the control characters that have no named escape are written in raw form, resulting in invalid JSON. This is improper behavior: the library should never produce JSON that violates the specifications, no matter what options the user supplies.
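To illustrate why raw control characters make the output invalid, here is a minimal sketch that uses Python's standard json module as a stand-in for any strict JSON consumer. It is not data.json itself, and the helper name is_valid_json is made up for this example.

```python
import json

def is_valid_json(text: str) -> bool:
    """Return True if Python's strict JSON parser accepts `text`."""
    try:
        json.loads(text)  # strict mode is the default
        return True
    except json.JSONDecodeError:
        return False

raw = '"a\x01b"'         # U+0001 written in raw form inside the string
escaped = '"a\\u0001b"'  # the same character written as a \u escape

print(is_valid_json(raw))      # False
print(is_valid_json(escaped))  # True
```

A consumer that follows the specifications strictly would be expected to reject the first form in the same way.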

      This patch escapes the control characters even when :escape-unicode false is supplied.

      The patch leaves out the five control characters that have named escapes (8, 9, 10, 12, 13, that is \b, \t, \n, \f, \r), because the write-string function already escapes those regardless of :escape-unicode. Only the remaining control characters need the new \uXXXX handling.

      I did not add any control character validation to the parsing functionality, following Postel's law:

      [TCP] implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.
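As a point of comparison for the "liberal in what you accept" side, Python's standard json module exposes the same choice through its strict flag. The sketch below uses that module purely as an analogy and says nothing about how data.json's parser is implemented; parse_lenient is a hypothetical helper name.

```python
import json

def parse_lenient(text: str):
    """Parse JSON while tolerating raw control characters inside strings."""
    return json.loads(text, strict=False)

raw = '"line1\x01line2"'  # contains a raw U+0001, invalid per the specifications

print(repr(parse_lenient(raw)))  # 'line1\x01line2'
```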


      Why use :escape-unicode false at all if I'm worried about compliance? Unicode is a really good idea, and it pairs very nicely with the UTF-8 character encoding, which is also a really good idea. UTF-8 encodes non-ASCII text much more compactly than spelling out \uXXXX escapes. The default (:escape-unicode true) gives up that compactness. It is a trade-off: pure ASCII output is nearly impossible to screw up when the consumer isn't expecting UTF-8 (but you should be expecting UTF-8).
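The size difference is easy to measure. The sketch below again uses Python's json module as a stand-in, where ensure_ascii plays roughly the role of :escape-unicode; encoded_size is a hypothetical helper. Python's serializer also happens to behave the way this ticket asks for, escaping control characters even when ensure_ascii is False.

```python
import json

def encoded_size(value: str, escape_unicode: bool) -> int:
    """Number of UTF-8 bytes in the JSON serialization of `value`."""
    text = json.dumps(value, ensure_ascii=escape_unicode)
    return len(text.encode("utf-8"))

sample = "日本語"
print(encoded_size(sample, escape_unicode=True))   # 20: two quotes + three 6-byte \u escapes
print(encoded_size(sample, escape_unicode=False))  # 11: two quotes + three 3-byte characters

# Control characters stay escaped even with raw UTF-8 output.
print(json.dumps("\x01", ensure_ascii=False))  # "\u0001"
```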

      So, in short, I want to be able to leverage UTF-8 and remain confident that I'll get valid JSON output, without having to sanitize the (unusual) control characters out of all the strings in my data.

    People

    • Assignee: Unassigned
    • Reporter: alex+import import