java.io URL to File coercion and encoding of non-ASCII characters

Description

clojure.java.io/resource corrupts paths that contain non-ASCII characters (which appear UTF-8 encoded in the URL), and it does so without issuing a warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

Analysis:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are ASCII hex digits, into a separate character in the result. The correct behavior is to treat a run of consecutive '%xy' escapes as a byte sequence encoded in UTF-8, in which a single Unicode code point (i.e. a 'Unicode character') is encoded with anywhere from 1 to 4 bytes.
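The difference between the two behaviors can be illustrated with a small Java sketch. This is not the code from either patch: the class and method names (PercentDecodeSketch, decodePerEscape, decodeUtf8) are hypothetical, and well-formed '%xy' escapes are assumed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the code from the CLJ-1177 patches.
class PercentDecodeSketch {

    // Faulty behavior described above: every %xy escape becomes its own char.
    static String decodePerEscape(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                // Assumes the two characters after '%' are hex digits.
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    // Intended behavior: a run of consecutive %xy escapes is collected as
    // bytes and decoded as UTF-8 in one go.
    static String decodeUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                pending.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                flush(pending, sb);
                sb.append(c);
                i++;
            }
        }
        flush(pending, sb);
        return sb.toString();
    }

    // Decode any buffered bytes as UTF-8 and append them to the result.
    private static void flush(ByteArrayOutputStream pending, StringBuilder sb) {
        if (pending.size() > 0) {
            sb.append(new String(pending.toByteArray(), StandardCharsets.UTF_8));
            pending.reset();
        }
    }

    public static void main(String[] args) {
        String escaped = "caf%C3%A9"; // UTF-8 bytes C3 A9 encode U+00E9
        // Two chars, U+00C3 and U+00A9, instead of one:
        System.out.println(decodePerEscape(escaped).length()); // prints 5
        // One char, U+00E9:
        System.out.println(decodeUtf8(escaped).length());      // prints 4
    }
}
```

In the per-escape version the two bytes of a single code point end up as two unrelated characters, which is the kind of corruption this ticket describes.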

Patch: clj-1177-patch-v2.diff

Approach:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29
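A minimal Java sketch of the workaround described above, under the assumption that only literal plus signs need protecting. The class and method names (PlusSafeDecodeSketch, decodeKeepingPlus) are hypothetical; the actual patch is written in Clojure and may differ in detail.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical illustration of the approach; not the patch itself.
class PlusSafeDecodeSketch {

    static String decodeKeepingPlus(String urlString) {
        try {
            // URLEncoder.encode("+", "UTF-8") yields "%2B".
            String encodedPlus = URLEncoder.encode("+", "UTF-8");
            // Step 1: protect literal '+' so decode does not turn it into a space.
            String protectedString = urlString.replace("+", encodedPlus);
            // Step 2: decode %xy runs as UTF-8; "%2B" comes back as '+'.
            return URLDecoder.decode(protectedString, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeKeepingPlus("a+b%20c")); // prints "a+b c"
    }
}
```

Without the first step, URLDecoder.decode("a+b%20c", "UTF-8") would return "a b c", which is the behavior the existing unit test rules out.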

Other approaches:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.

Screened by: Alex Miller

Environment

None

Activity

Andy Fingerhut
December 24, 2014, 6:15 PM

Chris, I may be missing something in your question, but this bug was due to clojure.java.io/resource returning a value that was incorrect when the resource name contained non-ASCII characters.

After getting a correct return value from clojure.java.io/resource, you can choose to call clojure.java.io/reader on it if you want to read it as text, with UTF-8, UTF-16, etc. encoding, or you can choose instead to call clojure.java.io/input-stream on it if you want to read it as a byte sequence.

However, neither of those second steps can work unless the resource can be found by name somehow.

If that doesn't address your question, please try again.

import
December 26, 2014, 11:06 AM

Comment made by: ctford

Hi Andy,

My understanding of the reason for io/resource returning a bad value is that the file path is URL-encoded in the return value, which is of class URL. This is because the Java .getResource() (http://docs.oracle.com/javase/7/docs/api/java/lang/ClassLoader.html#getResource(java.lang.String)) method called by io/resource returns a URL, so the encoding happens even before we get back to Clojure-land.

.getResourceAsStream() (http://docs.oracle.com/javase/7/docs/api/java/lang/ClassLoader.html#getResourceAsStream(java.lang.String)) is a similar method to .getResource(), but it returns an InputStream. As it doesn't return a URL, the URL-encoding that causes our issue never happens, and so nothing needs to be decoded.

As it happens, io/reader works with either an InputStream or a URL, so it happily consumes both the output of .getResource() and .getResourceAsStream().

Avoiding unwanted encoding seems like a more robust solution than encoding and decoding, especially in cases where e.g. the path appears to already have been encoded, perhaps already containing a %20.

import
December 26, 2014, 12:26 PM

Comment made by: ctford

I checked whether there would be a problem with paths already containing escape sequences e.g. "strange%20namespace.clj", but Clojure 1.6 does the right thing.

Here's a proof-of-concept for how we could use .getResourceAsStream():

Andy Fingerhut
December 27, 2014, 5:04 PM

So you are not saying that there is a bug in the current implementation in Clojure 1.6.0, but that with some new functions implemented and published as part of the API, a developer could get from a resource name to an input stream more efficiently than with the current API?

Alex Miller
December 28, 2014, 4:40 PM

I'm not sure why this discussion is here - if there is a request for enhancement, please file a new ticket that we can assess and target.

Completed

Assignee: Unassigned
Reporter: import
Labels:
Approval: Ok
Patch: Code and Test
Fix versions:
Affects versions:
Priority: Minor