URL Encoding Is Material

Lots of OkHttp and Retrofit users have reported bugs complaining that URL special characters (like +, ;, |, =, and *) weren’t encoded as they expected.

Why can’t HttpUrl just encode `;` as `%3B`?

Extra escaping is safe in most programming languages and document formats. For example, in HTML there’s no behavioral consequence to replacing a non-delimiter " character with its escape sequence &quot;. Or in JSON, it’s safe to replace the string "A" with "\u0041".

But URLs are different because URL encoding is semantic: you cannot encode a URL without changing it. This is weird!

Too Much Encoding

Suppose we’re looking up 100% on DuckDuckGo. Since the code point for % is 0x25, that character encodes as %25 and the whole URL is https://duckduckgo.com/?q=100%25.

But what if we encode the already-encoded URL of that query? We would double-encode the % as %2525 and end up searching for 100%25. Yuck: https://duckduckgo.com/?q=100%2525.
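Here’s a minimal sketch of that mistake using only the JDK. It assumes java.net.URLEncoder is a good enough stand-in for a query encoder (it targets form encoding, so it isn’t a general URL encoder), and the DoubleEncoding class name is made up for illustration.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of OkHttp.
final class DoubleEncoding {
  /** Percent-encodes a query value. Note that URLEncoder would turn a space into '+'. */
  static String encode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  /** Reverses a single layer of encoding, like a server does. */
  static String decode(String encoded) {
    return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String once = encode("100%");  // 100%25
    String twice = encode(once);   // 100%2525
    System.out.println("https://duckduckgo.com/?q=" + once);
    System.out.println("https://duckduckgo.com/?q=" + twice);

    // Decoding once leaves the double-encoded value at "100%25", not "100%".
    System.out.println(decode(twice));
  }
}
```

Nothing in the string 100%25 says whether it has already been encoded, so the caller has to keep track of that.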

Too Little Encoding

Next we’ll search for #1 on Google. We’ll encode # as %23 and get this URL: https://www.google.ca/search?q=%231.

What if we forget to encode the # in the query? Since # is used as a delimiter for the URL’s fragment, we’ll end up with an empty query and a fragment of 1: https://www.google.ca/search?q=#1.
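To see the delimiter problem in code, here’s a rough sketch that parses both URLs with the JDK’s java.net.URI. That class follows an older RFC than browsers do, so treat it as an approximation; the FragmentDelimiter class name is made up for illustration.

```java
import java.net.URI;

// Hypothetical helper class; not part of OkHttp.
final class FragmentDelimiter {
  /** Returns the decoded query, or null if the URL has none. */
  static String query(String url) {
    return URI.create(url).getQuery();
  }

  /** Returns the decoded fragment, or null if the URL has none. */
  static String fragment(String url) {
    return URI.create(url).getFragment();
  }

  public static void main(String[] args) {
    String encoded = "https://www.google.ca/search?q=%231";
    String unencoded = "https://www.google.ca/search?q=#1";

    System.out.println(query(encoded));      // q=#1
    System.out.println(fragment(encoded));   // null

    // The bare '#' ends the query early and starts the fragment.
    System.out.println(query(unencoded));    // q=
    System.out.println(fragment(unencoded)); // 1
  }
}
```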

Web servers define their own URLs

Ultimately it’s up to the web server to interpret the URLs requested of it. For example, since ; encodes as %3B, some web servers will interpret paths like /foo;bar and /foo%3Bbar to be equal. But others can interpret these differently! Both strategies have consequences for security and performance.

These two URLs differ only in whether the ; character is encoded. Click through them to see that they serve different content.

Browsers and Specs

The best URL documents are the IETF’s RFC 3986 and the WHATWG’s URL Standard. Browsers also do their thing, and I’ve built my own little catalog of what gets encoded where.

My advice

If you’re defining your own URLs, you’ll save a lot of trouble by avoiding characters like <, >, {, }, +, ^, &, |, and ;.

Avoid attempting to decode a URL without also decomposing it. Otherwise delimiters like /, ?, and # are made ambiguous.
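As a sketch of why the order matters, consider a path whose first segment contains an encoded slash. The example.com URL and the DecodeAfterDecompose class are made up for illustration, and URLDecoder stands in for a proper path decoder (it would also turn + into a space, which is wrong for paths).

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of OkHttp.
final class DecodeAfterDecompose {
  /** Wrong order: decode the whole path, then split it on '/'. */
  static List<String> decodeThenSplit(String url) {
    // URI.getPath() returns the path with percent-escapes already decoded.
    return split(URI.create(url).getPath());
  }

  /** Right order: split the still-encoded path, then decode each segment. */
  static List<String> splitThenDecode(String url) {
    List<String> result = new ArrayList<>();
    for (String segment : split(URI.create(url).getRawPath())) {
      // Caveat: URLDecoder is a form decoder, so '+' would become a space.
      result.add(URLDecoder.decode(segment, StandardCharsets.UTF_8));
    }
    return result;
  }

  /** Splits a path that starts with '/' into its segments. */
  private static List<String> split(String path) {
    return List.of(path.substring(1).split("/"));
  }

  public static void main(String[] args) {
    String url = "https://example.com/a%2Fb/c";
    System.out.println(decodeThenSplit(url)); // [a, b, c]
    System.out.println(splitThenDecode(url)); // [a/b, c]
  }
}
```

Decoding first turns two segments into three, and there’s no way to tell afterwards which slash was data and which was a delimiter.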

Servers own their URLs. If a server gives you a link to /foo;bar.html, don’t canonicalize it to /foo%3Bbar.html. It’s equally broken to do the opposite, converting /foo%3Bbar.html to /foo;bar.html.

HttpUrl a = HttpUrl.parse("https://publicobject.com/%3B.html");
HttpUrl b = HttpUrl.parse("https://publicobject.com/;.html");

// The decomposed form is decoded.
assertEquals(";.html", a.pathSegments().get(0));
assertEquals(";.html", b.pathSegments().get(0));

// And yet the encoded form is preserved!
assertEquals("/%3B.html", a.encodedPath());
assertEquals("/;.html", b.encodedPath());

Clients and servers should use a class like OkHttp’s HttpUrl to compose and encode URLs. This class explicitly separates your application data (like 100% and #1) from their encoded forms.
