Lots of OkHttp and Retrofit users have reported bugs complaining that URL special characters (like +, ;, |, =, and *) weren’t encoded as they expected.
Why can’t HttpUrl just encode ; as %3B?
Extra escaping is safe in most programming languages and document formats. For example, in HTML there’s no behavioral consequence to replacing a non-delimiter " character with its escape sequence &quot;. Or in JSON, it’s safe to replace the string "A" with "\u0041".
But URLs are different because URL encoding is semantic: you cannot encode a URL without changing it. This is weird!
Too Much Encoding
Suppose we’re looking up 100% on DuckDuckGo. Since the code point for % is 0x25, that character encodes as %25 and the whole URL is https://duckduckgo.com/?q=100%25.
But what if we encode the already-encoded URL of that query? We would double-encode the % as %2525 and end up searching for 100%25. Yuck: https://duckduckgo.com/?q=100%2525.
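The double-encoding trap above can be reproduced with nothing but the JDK. This is a minimal sketch, not OkHttp's implementation: it assumes java.net.URLEncoder as a stand-in encoder for query values, and the class and method names are made up for illustration.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class DoubleEncoding {
  /** Encodes a query value once, using the JDK's form-style encoder. */
  static String encode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  /** Decodes one layer of percent-encoding. */
  static String decode(String encoded) {
    return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String once = encode("100%");   // application data, encoded once
    String twice = encode(once);    // the mistake: encoding the encoded form

    System.out.println("https://duckduckgo.com/?q=" + once);
    System.out.println("https://duckduckgo.com/?q=" + twice);

    // The server decodes only one layer, so it sees a different search term.
    System.out.println(decode(once));
    System.out.println(decode(twice));
  }
}
```

Decoding the double-encoded value yields 100%25 rather than 100%, which is why an encoder cannot be applied blindly to text that might already be encoded.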
Too Little Encoding
Next we’ll search for #1 on Google. We’ll encode # as %23 and get this URL: https://www.google.ca/search?q=%231.
What if we forget to encode the # in the query? Since # is used as a delimiter for the URL’s fragment, we’ll end up with an empty query and a fragment of 1: https://www.google.ca/search?q=#1.
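To see the fragment delimiter take effect, here is a small sketch that parses both URLs with the JDK's java.net.URI. The helper names are hypothetical, and java.net.URI is only one parser; other parsers may differ in details, though the treatment of # as the fragment delimiter is standard.

```java
import java.net.URI;

final class FragmentDelimiter {
  /** Returns the decoded query component, or null if there is none. */
  static String query(String url) {
    return URI.create(url).getQuery();
  }

  /** Returns the decoded fragment component, or null if there is none. */
  static String fragment(String url) {
    return URI.create(url).getFragment();
  }

  public static void main(String[] args) {
    String encoded = "https://www.google.ca/search?q=%231";
    String unencoded = "https://www.google.ca/search?q=#1";

    // With %23 the whole search term stays inside the query.
    System.out.println(query(encoded) + " | " + fragment(encoded));

    // With a bare # the query value is empty and "1" becomes the fragment.
    System.out.println(query(unencoded) + " | " + fragment(unencoded));
  }
}
```

Since clients do not send the fragment to the server, the unencoded form amounts to a search for nothing.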
Web servers define their own URLs
Ultimately it’s up to the web server to interpret the URLs requested of it. For example, since ; encodes as %3B, some web servers will interpret paths like /foo;bar and /foo%3Bbar to be equal. But others can interpret these differently! Both strategies have consequences for security and performance.
These two URLs differ only in whether the ; character is encoded. Click through them to see that they serve different content.
Browsers and Specs
The best URL documents are the IETF’s RFC 3986 and the WHATWG’s URL Standard. Browsers also do their own thing, and I’ve built my own little catalog of what gets encoded where.
My advice
If you’re defining your own URLs, you’ll save a lot of trouble by avoiding characters like <, >, {, }, +, ^, &, |, and ;.
Avoid attempting to decode a URL without also decomposing it. Otherwise delimiters like /, ?, and # are made ambiguous.
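The ambiguity is easy to demonstrate with a path that contains an encoded slash. The sketch below uses a hypothetical path, /a%2Fb/c, and hypothetical helper names. It relies on the JDK's URLDecoder, which also turns + into a space, so treat it as an illustration and not as a general-purpose path decoder.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DecodeAfterSplit {
  /** Wrong order: decoding first turns %2F into a real path delimiter. */
  static List<String> decodeThenSplit(String encodedPath) {
    String decoded = URLDecoder.decode(encodedPath, StandardCharsets.UTF_8);
    return Arrays.asList(decoded.substring(1).split("/"));
  }

  /** Right order: split on the literal delimiters, then decode each segment. */
  static List<String> splitThenDecode(String encodedPath) {
    List<String> segments = new ArrayList<>();
    for (String segment : encodedPath.substring(1).split("/")) {
      segments.add(URLDecoder.decode(segment, StandardCharsets.UTF_8));
    }
    return segments;
  }

  public static void main(String[] args) {
    String path = "/a%2Fb/c"; // two segments: "a/b" and "c"
    System.out.println(decodeThenSplit(path)); // three segments
    System.out.println(splitThenDecode(path)); // two segments
  }
}
```

Decoding first produces three segments where the URL's author wrote two; decomposing first keeps the encoded slash inside its segment.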
Servers own their URLs. If a server gives you a link to /foo;bar.html, don’t canonicalize it to /foo%3Bbar.html. It’s equally broken to do the opposite, converting /foo%3Bbar.html to /foo;bar.html.
HttpUrl a = HttpUrl.parse("https://publicobject.com/%3B.html");
HttpUrl b = HttpUrl.parse("https://publicobject.com/;.html");
// The decomposed form is decoded.
assertEquals(";.html", a.pathSegments().get(0));
assertEquals(";.html", b.pathSegments().get(0));
// And yet the encoded form is preserved!
assertEquals("/%3B.html", a.encodedPath());
assertEquals("/;.html", b.encodedPath());
Clients and servers should use a class like OkHttp’s HttpUrl to compose and encode URLs. This class explicitly separates your application data (like 100% and #1) from their encoded forms.