Lots of OkHttp and Retrofit users have reported bugs complaining that URL special characters (like +, ;, |, =, or *) weren’t encoded as they expected.
Why can’t HttpUrl just encode ; as %3B?
Extra escaping is safe in most programming languages and document formats. For example, in HTML there’s no behavioral consequence to replacing a non-delimiter " character with its escape sequence &quot;. Or in JSON, it’s safe to replace the string "A" with "\u0041".
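To see that concretely, here’s a quick sketch of the JSON case (I’m using Moshi here only as an example parser; any JSON library behaves the same way): the escaped and unescaped documents decode to the same one-character string.

import com.squareup.moshi.JsonAdapter;
import com.squareup.moshi.Moshi;
import static org.junit.Assert.assertEquals;

JsonAdapter<String> adapter = new Moshi.Builder().build().adapter(String.class);

// Extra escaping changes the document's bytes, but not its value.
assertEquals("A", adapter.fromJson("\"A\""));
assertEquals("A", adapter.fromJson("\"\\u0041\""));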
But URLs are different because URL encoding is semantic: encoding or decoding a character changes which URL you have. This is weird!
Too Much Encoding
Suppose we’re looking up 100% on DuckDuckGo. Since the code point for % is 0x25, that character encodes as %25 and the whole URL is https://duckduckgo.com/?q=100%25.
But what if we encode the already-encoded URL of that query? We would double-encode the % as %2525 and end up searching for 100%25. Yuck: https://duckduckgo.com/?q=100%2525.
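Here’s a rough sketch of both outcomes with OkHttp’s HttpUrl: pass in the application data 100% and it’s encoded once; pass in the already-encoded text and it’s encoded again.

import okhttp3.HttpUrl;
import static org.junit.Assert.assertEquals;

HttpUrl base = HttpUrl.parse("https://duckduckgo.com/");

// Encoding the application data once: % becomes %25 on the wire.
HttpUrl once = base.newBuilder().addQueryParameter("q", "100%").build();
assertEquals("https://duckduckgo.com/?q=100%25", once.toString());
assertEquals("100%", once.queryParameter("q"));

// Feeding the already-encoded text back in double-encodes it,
// and the search is now for the literal string 100%25.
HttpUrl twice = base.newBuilder().addQueryParameter("q", "100%25").build();
assertEquals("https://duckduckgo.com/?q=100%2525", twice.toString());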
Too Little Encoding
Next we’ll search for #1 on Google. We’ll encode # as %23 and get this URL: https://www.google.ca/search?q=%231.
What if we forget to encode the # in the query? Since # is used as a delimiter for the URL’s fragment, we’ll end up with an empty query and a fragment of 1: https://www.google.ca/search?q=#1.
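The same contrast, sketched with HttpUrl: encode the data and #1 survives as the query; leave the # bare and it starts the fragment.

import okhttp3.HttpUrl;
import static org.junit.Assert.assertEquals;

// Encoded: the # is application data, so it becomes %23.
HttpUrl encoded = HttpUrl.parse("https://www.google.ca/search")
    .newBuilder()
    .addQueryParameter("q", "#1")
    .build();
assertEquals("https://www.google.ca/search?q=%231", encoded.toString());
assertEquals("#1", encoded.queryParameter("q"));

// Unencoded: the # acts as the fragment delimiter.
HttpUrl unencoded = HttpUrl.parse("https://www.google.ca/search?q=#1");
assertEquals("", unencoded.queryParameter("q"));
assertEquals("1", unencoded.fragment());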
Web servers define their own URLs
Ultimately it’s up to each web server to interpret the URLs requested of it. For example, since ; encodes as %3B, some web servers will treat paths like /foo;bar and /foo%3Bbar as equivalent. But others interpret them differently! Both strategies have consequences for security and performance.
These two URLs, https://publicobject.com/%3B.html and https://publicobject.com/;.html, differ only in whether the ; character is encoded. Click through them to see that they serve different content.
Browsers and Specs
The best URL documents are the IETF’s RFC 3986 and the WHATWG’s URL Standard. Browsers also do their thing, and I’ve built my own little catalog of what gets encoded where.
My advice
If you’re defining your own URLs, you’ll save a lot of trouble by avoiding characters like <, >, {, }, +, ^, &, |, and ;.
Don’t decode a URL without also decomposing it into its parts. Otherwise delimiters like /, ?, and # become ambiguous.
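For example (a sketch with java.net.URLDecoder, which also turns + into spaces, so it’s doubly unsuited to whole URLs), decoding the #1 search URL from earlier un-escapes %23 back into a bare #, silently moving the 1 into the fragment:

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import static org.junit.Assert.assertEquals;

// Decoding the whole string turns the escaped %23 back into a delimiter.
String decoded = URLDecoder.decode("https://www.google.ca/search?q=%231", StandardCharsets.UTF_8);
assertEquals("https://www.google.ca/search?q=#1", decoded);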
Servers own their URLs. If a server gives you a link to /foo;bar.html, don’t canonicalize it to /foo%3Bbar.html. It’s equally broken to do the opposite, converting /foo%3Bbar.html to /foo;bar.html.
import okhttp3.HttpUrl;
import static org.junit.Assert.assertEquals;

HttpUrl a = HttpUrl.parse("https://publicobject.com/%3B.html");
HttpUrl b = HttpUrl.parse("https://publicobject.com/;.html");

// The decomposed form is decoded.
assertEquals(";.html", a.pathSegments().get(0));
assertEquals(";.html", b.pathSegments().get(0));

// And yet the encoded form is preserved!
assertEquals("/%3B.html", a.encodedPath());
assertEquals("/;.html", b.encodedPath());
Clients and servers should use a class like OkHttp’s HttpUrl to compose and encode URLs. This class explicitly separates your application data (like 100% and #1) from their encoded forms.
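For instance, here’s a minimal sketch of composing a URL from scratch with HttpUrl.Builder: you hand it the raw application data and it takes care of the encoding.

import okhttp3.HttpUrl;
import static org.junit.Assert.assertEquals;

HttpUrl url = new HttpUrl.Builder()
    .scheme("https")
    .host("duckduckgo.com")
    .addQueryParameter("q", "100%")
    .build();

assertEquals("https://duckduckgo.com/?q=100%25", url.toString());
assertEquals("100%", url.queryParameter("q"));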