Why can’t HttpUrl just encode
Extra escaping is safe in most programming languages and document formats. For example, in HTML there’s no behavioral consequence to replacing a non-delimiter " character with its escape sequence &quot;. Or in JSON, it’s safe to replace the string
But URLs are different because URL encoding is semantic: you cannot encode a URL without changing it. This is weird!
Too Much Encoding
Suppose we’re looking up 100% on DuckDuckGo. Since the code point for
% is 0x25, that character encodes as
%25 and the whole URL is https://duckduckgo.com/?q=100%25.
But what if we encode the already-encoded URL of that query? We would double-encode the % as %2525 and end up searching for 100%25. Yuck: https://duckduckgo.com/?q=100%2525.
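The double-encoding problem can be reproduced with the JDK alone. This is a minimal sketch that uses java.net.URLEncoder as a stand-in for whatever encoder your HTTP library applies; the class and helper names are made up for illustration.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OkHttp.
final class DoubleEncoding {
  /** Percent-encodes a query value. URLEncoder also turns spaces into '+'. */
  static String encode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  /** Decodes exactly one layer of percent-encoding. */
  static String decode(String value) {
    return URLDecoder.decode(value, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String once = encode("100%");  // the % becomes %25
    String twice = encode(once);   // the % in %25 is encoded again
    System.out.println("https://duckduckgo.com/?q=" + once);
    System.out.println("https://duckduckgo.com/?q=" + twice);
    // The receiver decodes only once, so it sees a different search term.
    System.out.println(decode(twice));
  }
}
```

Encoding is not idempotent: each pass changes what the receiver will see after its single decode.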
Too Little Encoding
Next we’ll search for #1 on Google. We’ll encode # as %23 and get this URL: https://www.google.ca/search?q=%231.
What if we forget to encode the # in the query? Since # is used as a delimiter for the URL’s fragment, we’ll end up with an empty query and a fragment of 1.
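To see how a parser splits the two forms, here is a small sketch using java.net.URI from the JDK. The class and helper names are hypothetical, and other parsers (including HttpUrl) may report the pieces slightly differently.

```java
import java.net.URI;

// Hypothetical demo class: compares an encoded and an unencoded '#'.
final class UnencodedHash {
  /** Returns the decoded query, or null if the URL has none. */
  static String query(String url) {
    return URI.create(url).getQuery();
  }

  /** Returns the decoded fragment, or null if the URL has none. */
  static String fragment(String url) {
    return URI.create(url).getFragment();
  }

  public static void main(String[] args) {
    String encoded = "https://www.google.ca/search?q=%231";
    String unencoded = "https://www.google.ca/search?q=#1";

    // Encoded: the '#' is data inside the query and there is no fragment.
    System.out.println(query(encoded) + " | " + fragment(encoded));

    // Unencoded: the '#' ends the query early and starts the fragment.
    System.out.println(query(unencoded) + " | " + fragment(unencoded));
  }
}
```

With the unencoded form, the q parameter is left with an empty value and the 1 moves into the fragment, which browsers do not send to the server.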
Web servers define their own URLs
Ultimately it’s up to the web server to interpret the URLs requested of it. For example, since ; encodes as %3B, some web servers will interpret paths like /foo;bar and /foo%3Bbar to be equal. But others can interpret these differently! Both strategies have consequences for security and performance.
These two URLs differ only in whether the
; character is encoded. Click through them to see that they serve different content.
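Client libraries face the same choice as servers. As one illustration, the JDK’s java.net.URI treats the encoded and unencoded forms as different URLs even though both decode to the same path. The sketch below uses hypothetical class and helper names; it shows one library’s policy, not a rule that servers follow.

```java
import java.net.URI;

// Hypothetical demo class: is ";" the same as "%3B"? It depends who you ask.
final class SemicolonPaths {
  /** True if java.net.URI considers the two URLs equal. */
  static boolean sameUrl(String a, String b) {
    return URI.create(a).equals(URI.create(b));
  }

  /** The path with percent-escapes decoded. */
  static String decodedPath(String url) {
    return URI.create(url).getPath();
  }

  /** The path exactly as written in the URL. */
  static String rawPath(String url) {
    return URI.create(url).getRawPath();
  }

  public static void main(String[] args) {
    String plain = "https://publicobject.com/;.html";
    String escaped = "https://publicobject.com/%3B.html";

    System.out.println(sameUrl(plain, escaped));   // compared in raw form
    System.out.println(decodedPath(plain) + " vs " + decodedPath(escaped));
    System.out.println(rawPath(plain) + " vs " + rawPath(escaped));
  }
}
```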
Browsers and Specs
If you’re defining your own URLs, you’ll save a lot of trouble by avoiding characters like
Avoid attempting to decode a URL without also decomposing it. Otherwise delimiters like
# are made ambiguous.
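The order of operations matters. The sketch below contrasts decomposing before decoding with decoding before decomposing, using an encoded / in a path segment as the delimiter. The example.com URL and all class and method names are assumptions made for illustration, and URLDecoder is only an approximation of path decoding (it would also turn + into a space).

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: why a whole URL must not be decoded up front.
final class DecodeAfterDecompose {
  /** Splits the raw path on '/', then decodes each segment. Assumes a non-empty path. */
  static List<String> decomposeThenDecode(String url) {
    String rawPath = URI.create(url).getRawPath();
    List<String> segments = new ArrayList<>();
    for (String raw : rawPath.substring(1).split("/", -1)) {
      segments.add(URLDecoder.decode(raw, StandardCharsets.UTF_8));
    }
    return segments;
  }

  /** Decodes the whole URL first, then splits. Encoded delimiters become real ones. */
  static List<String> decodeThenDecompose(String url) {
    String decoded = URLDecoder.decode(url, StandardCharsets.UTF_8);
    String path = URI.create(decoded).getRawPath();
    return List.of(path.substring(1).split("/", -1));
  }

  public static void main(String[] args) {
    String url = "https://example.com/a%2Fb/c";
    System.out.println(decomposeThenDecode(url)); // two segments, the first containing a slash
    System.out.println(decodeThenDecompose(url)); // three segments: the data was lost
  }
}
```

Once the whole URL has been decoded, nothing distinguishes a / that was data from a / that separates segments.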
Servers own their URLs. If a server gives you a link to /foo;bar.html, don’t canonicalize it to /foo%3Bbar.html. It’s equally broken to do the opposite, converting /foo%3Bbar.html to /foo;bar.html.
```java
HttpUrl a = HttpUrl.parse("https://publicobject.com/%3B.html");
HttpUrl b = HttpUrl.parse("https://publicobject.com/;.html");

// The decomposed form is decoded.
assertEquals(";.html", a.pathSegments().get(0));
assertEquals(";.html", b.pathSegments().get(0));

// And yet the encoded form is preserved!
assertEquals("/%3B.html", a.encodedPath());
assertEquals("/;.html", b.encodedPath());
```
Clients and servers should use a class like OkHttp’s HttpUrl to compose and encode URLs. This class explicitly separates your application data (like
#1) from their encoded forms.