URL Encoding Guide: Percent-Encoding Rules Explained

What URL Encoding Is (And Why URLs Break)

This URL encoding guide starts with the fundamental problem: URLs were designed in 1994 for ASCII text. RFC 3986 defines a small set of "unreserved" characters that can appear anywhere in a URL without special treatment: A-Z, a-z, 0-9, and four symbols (- _ . ~). Everything else must be percent-encoded to travel safely as data in a URL: spaces, Chinese characters, emoji, and even common punctuation like & and = when it is part of a value rather than URL structure.

Percent-encoding converts each byte of a character into %XX, where XX is the byte's hexadecimal value. A space becomes %20. A forward slash becomes %2F. The Japanese character 日 in UTF-8 is three bytes (E6 97 A5), so it becomes %E6%97%A5. This is why URLs with non-ASCII text look like garbage: every non-ASCII character is 2 to 4 bytes in UTF-8, so it explodes into 6 to 12 characters of percent-encoded bytes.
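The byte-by-byte conversion described above can be sketched in a few lines. This is a minimal illustration, not a replacement for the built-in functions: percentEncodeAllBytes is a hypothetical helper that encodes every byte, including ones that would normally be left alone.

```javascript
// Hypothetical helper for illustration: percent-encode EVERY UTF-8 byte of a string.
// Real encoders skip unreserved characters; this one does not.
function percentEncodeAllBytes(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes of the string
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

const encodedSpace = percentEncodeAllBytes(" "); // "%20"
const encodedSlash = percentEncodeAllBytes("/"); // "%2F"
const encodedNichi = percentEncodeAllBytes("日"); // "%E6%97%A5"

// For a non-ASCII character, the built-in function produces the same bytes.
const builtIn = encodeURIComponent("日"); // "%E6%97%A5"
```

In real code you would call encodeURIComponent() directly; the helper only makes the byte-level mechanics visible.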

The confusing part: some characters are "reserved" (:, /, ?, #, @, &, =, +) and have special meaning in URLs. Whether you encode them depends on where they appear. A / in the path separates segments (don't encode it). A / inside a query parameter value is data (encode it). Getting this wrong is the source of about 80% of URL-related bugs I've debugged over the past decade.

The URL Encoding Rules That Actually Matter

RFC 3986 splits URLs into components: scheme (https), authority (user:pass@host:port), path (/a/b/c), query (?key=value), and fragment (#section). Each component has different encoding rules. The path allows / as a separator but must encode ? and #. The query allows ? but must encode #, and must encode & and = when they appear inside a key or value rather than as separators. The fragment can contain almost anything since it's never sent to the server.

In practice, you only need to remember three rules. Rule 1: Always encode user-supplied data before inserting it into any URL component. Rule 2: Encode the value, not the structural characters. If you're building ?name=John Doe&age=30, encode "John Doe" to "John%20Doe" but leave the ? and & alone. Rule 3: Use the right function for the right component (more on this below).

The + sign is a special headache. In query strings (application/x-www-form-urlencoded), + means space. In path segments, + is just a literal plus sign. So /search?q=C++ encodes the query as q=C%2B%2B, but /path/C++ doesn't need encoding. HTML forms submit spaces as + in query strings (a convention from 1995), which is why some decoders treat + as space and others don't. When in doubt, use %20 for spaces — it works everywhere.
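The two readings of + can be seen side by side with the standard JavaScript decoders. This is a small sketch using only built-in functions; the parameter name q is just an example.

```javascript
// decodeURIComponent follows plain percent-encoding rules: "+" stays a plus sign.
const asComponent = decodeURIComponent("a+b"); // "a+b"

// URLSearchParams follows application/x-www-form-urlencoded: "+" means space.
const asForm = new URLSearchParams("q=a+b").get("q"); // "a b"

// Encoding the plus sign as %2B removes the ambiguity for both decoders.
const safeQuery = "q=" + encodeURIComponent("C++"); // "q=C%2B%2B"
const roundTrip = new URLSearchParams(safeQuery).get("q"); // "C++"
```

If a literal plus sign has to survive a round trip through a query string, %2B is the form that every decoder reads the same way.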

Another trap: the # character. Browsers interpret everything after # as a fragment identifier and never send it to the server. If your API key contains a # and you put it in a URL without encoding, the server receives a truncated key. I once spent two days debugging an OAuth integration that failed for ~4% of users — turns out their client secrets contained # characters that got silently stripped.
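The truncation is easy to reproduce with the URL constructor. In this sketch the key value and the api.example.com host are made up for illustration.

```javascript
// apiKey is a made-up value; any secret containing "#" behaves the same way.
const apiKey = "abc#123";

// Unencoded: everything after "#" is parsed as the fragment.
const broken = new URL("https://api.example.com/data?key=" + apiKey);
const brokenQuery = broken.search; // "?key=abc"
const brokenFragment = broken.hash; // "#123"

// Encoded: the whole key stays inside the query string.
const fixed = new URL(
  "https://api.example.com/data?key=" + encodeURIComponent(apiKey)
);
const fixedKey = fixed.searchParams.get("key"); // "abc#123"
const fixedFragment = fixed.hash; // ""
```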

URL structure and what gets encoded where:

  https://user:pass@example.com:8080/path/to/page?key=value&q=hello world#section
  ├───┤   ├───────┤ ├──────────────┤├───────────┤ ├─────────────────────┤ ├─────┤
  scheme  userinfo     host:port         path              query          fragment

Characters that MUST be encoded in each component:
  Path:     space ? # [ ] % (and all non-ASCII); also / when it is data inside a segment
  Query:    space # & = + % inside keys and values (and all non-ASCII)
  Fragment: (almost nothing: browser-only, not sent to server)

Examples:
  Space in path:   /my documents/file  → /my%20documents/file
  Space in query:  ?q=hello world      → ?q=hello%20world (or ?q=hello+world)
  & in value:      ?company=AT&T       → ?company=AT%26T
  # in value:      ?color=#ff0000      → ?color=%23ff0000

encodeURIComponent vs encodeURI (The Difference That Breaks Things)

JavaScript gives you two functions and most developers pick the wrong one. encodeURI() encodes a complete URL: it leaves reserved characters (: / ? # @ & = + and a few others) alone because they're structural. encodeURIComponent() encodes a single component value: it encodes everything except letters, digits, and - _ . ! ~ * ' ( ). Using encodeURI() on a query parameter value leaves & and = unencoded, breaking your URL structure.

The rule is simple: use encodeURIComponent() for values, use encodeURI() for complete URLs (which you rarely need to encode). In practice, you almost always want encodeURIComponent(). If you're building a URL from parts, encode each value separately and concatenate with the structural characters. Never encode the entire URL after assembly — that would double-encode the %20 sequences into %2520.

Python has urllib.parse.quote() (encodes a path, leaving / unencoded by default) and urllib.parse.quote_plus() (encodes a query value, turning spaces into +). Go has url.PathEscape() and url.QueryEscape(). Every language has this split because the encoding rules genuinely differ between path and query components. Pick the right one or you'll get subtle bugs that only appear with certain input characters.

Double-encoding is the most common mistake. It happens when you encode a value, then pass it through a framework that encodes it again. The string "hello world" becomes "hello%20world" after the first encoding, then "hello%2520world" after the second (the % gets encoded to %25). If you see %25 in your URLs, someone is encoding twice. Our url-encoder tool shows you the encoding step by step so you can spot where the double-encoding happens.

// ❌ WRONG: encodeURI doesn't encode & and = in values
const query = "company=AT&T&city=New York";
encodeURI(query);
// "company=AT&T&city=New%20York" — AT&T is split into two params!

// ✅ RIGHT: encode each value separately
const params = new URLSearchParams({
  company: "AT&T",
  city: "New York",
});
params.toString();
// "company=AT%26T&city=New+York"

// ✅ Also right: manual encoding with encodeURIComponent
const url = `/search?q=${encodeURIComponent("C++ programming")}&lang=en`;
// "/search?q=C%2B%2B%20programming&lang=en"

// ❌ Double-encoding trap
const encoded = encodeURIComponent("hello world"); // "hello%20world"
encodeURIComponent(encoded); // "hello%2520world" — broken!

URL Encoding in Different Contexts

HTML forms: When you submit a form with method="GET", the browser encodes form values using application/x-www-form-urlencoded format. Spaces become +, special characters become %XX. This is the oldest encoding convention on the web (defined in HTML 2.0, 1995) and it's why + means space in query strings. With method="POST" and enctype="multipart/form-data", no URL encoding happens — binary data goes in MIME boundaries instead.

REST APIs: Most frameworks auto-decode URL parameters before your handler sees them. Express gives you req.query.name already decoded. But if you're building URLs client-side for fetch() calls, you must encode values yourself. The Fetch API only normalizes the string you give it: in browsers the URL parser turns the space in fetch("/api?name=John Doe") into %20, but it cannot know that an &, =, # or + inside a value is data. fetch("/api?company=AT&T") sends company=AT plus a stray parameter named T.
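One way to avoid hand-encoding on the client is to let the URL and URLSearchParams classes do it. This is a sketch under assumptions: buildApiUrl, the /api path, and the example.com base are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: build a request URL with every value encoded separately.
function buildApiUrl(base, params) {
  const url = new URL("/api", base);
  for (const [key, value] of Object.entries(params)) {
    // searchParams.set() encodes the key and value as form data.
    url.searchParams.set(key, value);
  }
  return url.toString();
}

const requestUrl = buildApiUrl("https://example.com", {
  name: "John Doe",
  company: "AT&T",
});
// "https://example.com/api?name=John+Doe&company=AT%26T"
```

The resulting string can be passed to fetch(requestUrl) as-is, since every value was encoded exactly once.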

Redirects and OAuth: OAuth 2.0 flows pass tokens and callback URLs as query parameters. The redirect_uri parameter is itself a URL, so it gets encoded: redirect_uri=https%3A%2F%2Fexample.com%2Fcallback. If the OAuth provider decodes it and then your app decodes it again, you get double-decoding issues. I've seen this break with redirect URIs that contain query parameters of their own — the nested ? and & get decoded prematurely.
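The nested-URL case can be sketched as follows. The authorization endpoint, client ID, and callback URL are placeholders, not values from any real provider.

```javascript
// Placeholder callback that carries query parameters of its own.
const callback = "https://example.com/callback?source=app&step=2";

// Placeholder authorization endpoint and client ID.
const authUrl = new URL("https://auth.example.com/authorize");
authUrl.searchParams.set("response_type", "code");
authUrl.searchParams.set("client_id", "my-client-id");
authUrl.searchParams.set("redirect_uri", callback);

// The nested ? & = are encoded, so redirect_uri stays a single parameter:
// "?response_type=code&client_id=my-client-id&redirect_uri=https%3A%2F%2Fexample.com%2Fcallback%3Fsource%3Dapp%26step%3D2"
const authQuery = authUrl.search;

// The receiving side decodes exactly once and gets the original callback back.
const received = new URL(authUrl.toString()).searchParams.get("redirect_uri");
```

Decoding received a second time would be harmless for this particular value, but a callback containing a literal %2F or %26 would be corrupted by it, which is why each layer should decode once.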

File paths in URLs: Windows paths with backslashes (C:\Users\file.txt) need special handling. Backslash is neither reserved nor unreserved in RFC 3986, so strictly it must be encoded as %5C, but browsers and many servers normalize it to a forward slash instead. Spaces in filenames are the classic problem: "My Documents" becomes "My%20Documents" in the URL. macOS and Linux allow almost any character in filenames (including newlines), so always encode path segments from user input.

Unicode and International Domain Names

Modern URLs can contain any Unicode character thanks to IRIs (Internationalized Resource Identifiers, RFC 3987). Browsers display Unicode in the address bar but send percent-encoded UTF-8 bytes on the wire. The URL https://example.com/日本語 is actually sent as https://example.com/%E6%97%A5%E6%9C%AC%E8%AA%9E. The browser just renders it nicely for humans.

International domain names (IDN) use a different encoding called Punycode (RFC 3492). The domain 日本語.jp is encoded as xn--wgv71a309e.jp at the DNS level. This is separate from percent-encoding — it's a way to represent Unicode in the ASCII-only DNS system. You can't percent-encode a domain name; it must be Punycode. JavaScript's URL constructor handles this automatically: new URL("https://日本語.jp").hostname returns "xn--wgv71a309e.jp".

Emoji in URLs work but are ugly when encoded. The 🎉 emoji is 4 UTF-8 bytes, becoming %F0%9F%8E%89 in a URL. Some URL shorteners and social platforms display emoji URLs nicely, but email clients and older software often break them. My advice: avoid emoji in URLs you expect people to copy-paste. Use them in fragment identifiers (#🎉) if you must — those stay client-side and don't hit encoding issues on the server.

One real-world trap: Chinese and Japanese characters in filenames uploaded to cloud storage. AWS S3 keys are UTF-8 strings, but the URLs to access them must be percent-encoded. If you store a file as "报告.pdf" and generate a presigned URL without encoding the key, the URL breaks. The AWS SDK handles this, but if you're building URLs manually (for CloudFront or custom domains), you must encode the path yourself.
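For manually built URLs, one approach is to encode each segment of the key and keep the slashes as separators. This is a sketch, assuming a plain unsigned URL: encodeKeyPath is a hypothetical helper, cdn.example.com is a placeholder host, and it does not produce presigned URLs.

```javascript
// Hypothetical helper: percent-encode each segment of a storage key,
// keeping "/" as the separator between segments.
function encodeKeyPath(key) {
  return key.split("/").map((segment) => encodeURIComponent(segment)).join("/");
}

const objectKey = "reports/2024/报告.pdf";
const objectUrl = "https://cdn.example.com/" + encodeKeyPath(objectKey);
// "https://cdn.example.com/reports/2024/%E6%8A%A5%E5%91%8A.pdf"

// Decoding the path recovers the original key.
const recoveredKey = decodeURIComponent(new URL(objectUrl).pathname.slice(1));
```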

Common URL Encoding Bugs (And How to Fix Them)

Bug 1: Spaces showing as + in paths. Your framework is using query-string encoding (application/x-www-form-urlencoded) for path segments. In paths, spaces must be %20, not +. Fix: use encodeURIComponent() for path segments, not URLSearchParams (which produces + for spaces). Or use the URL constructor which handles this correctly.

Bug 2: Broken pagination with special characters. You have ?page=2&filter=price>100 and the > breaks the URL. The filter value needs encoding: filter=price%3E100. This often surfaces when users type filter expressions into search boxes and the frontend builds URLs from their input without encoding.

Bug 3: OAuth callback URLs failing intermittently. The redirect_uri contains query parameters (?source=app) that get decoded during the OAuth flow, turning the single redirect_uri parameter into multiple parameters. Fix: always encode the entire callback URL as a single value, and verify your OAuth library doesn't decode it before comparing with the registered redirect URI.

Bug 4: API returning 404 for resources with slashes in their ID. If your resource ID is "2024/Q1" and you build the URL as /api/reports/2024/Q1, the server sees the ID as two path segments (2024 and Q1) instead of one. Fix: encode the ID with encodeURIComponent() so the slash becomes %2F: /api/reports/2024%2FQ1. Note: some web servers treat %2F in paths specially by default. Apache rejects it unless AllowEncodedSlashes is enabled, and nginx can decode it when it normalizes or proxies the request, which re-breaks things. Check your server config.
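The slash-in-ID case from Bug 4 can be sketched like this. The splitting and decoding at the end is a simplified stand-in for what a router does, not any specific framework's behavior.

```javascript
const reportId = "2024/Q1";

// Naive concatenation: the ID's slash becomes a path separator.
const naivePath = "/api/reports/" + reportId; // "/api/reports/2024/Q1"

// Encoded: the slash travels as data.
const safePath = "/api/reports/" + encodeURIComponent(reportId); // "/api/reports/2024%2FQ1"

// Simplified router: split on "/" first, then decode each segment.
const naiveSegments = naivePath.split("/").filter(Boolean).map(decodeURIComponent);
// ["api", "reports", "2024", "Q1"]
const safeSegments = safePath.split("/").filter(Boolean).map(decodeURIComponent);
// ["api", "reports", "2024/Q1"]
```

The order matters on the receiving side: splitting before decoding keeps the ID intact, while decoding the whole path first would reintroduce the extra segment.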

When URL Encoding Is Not the Answer

If you're passing large binary data (images, files) through URLs, don't percent-encode them. A 100 KB file could become a URL of up to 300 KB (each encoded byte triples in size as %XX). Use multipart form upload, or Base64-encode the data and send it in a POST body. URLs have practical length limits: Internet Explorer capped at 2,083 characters, and while modern browsers handle longer URLs, many servers and proxies reject request lines longer than about 8 KB by default.
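The size arithmetic can be checked directly. This sketch assumes the worst case for percent-encoding, where every byte needs encoding, and standard padded Base64, where every 3 bytes become 4 characters.

```javascript
const byteCount = 100 * 1024; // 100 KB of binary data = 102400 bytes

// Worst case for percent-encoding: every byte becomes "%XX" (3 characters).
const percentEncodedLength = byteCount * 3; // 307200

// Padded Base64: every group of 3 bytes becomes 4 characters.
const base64Length = Math.ceil(byteCount / 3) * 4; // 136536

const percentOverhead = percentEncodedLength / byteCount; // 3
const base64Overhead = base64Length / byteCount; // about 1.33
```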

For structured data in URLs, consider alternatives to query strings. If you need to pass a complex filter object, you could percent-encode a JSON string (?filter=%7B%22price%22%3A%7B%22gt%22%3A100%7D%7D) but that's unreadable and fragile. Better options: use path segments for simple hierarchies (/products/electronics/phones), use POST with a JSON body for complex queries, or use a dedicated query language parameter (?filter=price:>100).
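For reference, the percent-encoded JSON option looks like this in code. It is a sketch of the approach the paragraph advises against for anything complex; the filter object is an example.

```javascript
// Example filter object.
const filter = { price: { gt: 100 } };

// Serialize to JSON, then encode the whole string as one query value.
const encodedFilter = encodeURIComponent(JSON.stringify(filter));
// "%7B%22price%22%3A%7B%22gt%22%3A100%7D%7D"

const filterUrl = "/products?filter=" + encodedFilter;

// Receiving side: decode once, then parse.
const decodedFilter = JSON.parse(decodeURIComponent(encodedFilter));
```

It round-trips correctly, but the URL is unreadable and grows quickly with the size of the object.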

Don't URL-encode data that's going into a database, a file, or a non-URL context. I've seen codebases where someone encoded user input "for safety" before storing it, then decoded it on display. This leads to double-encoding bugs when the data later gets put into a URL (it gets encoded again). Encode at the boundary — right before inserting into a URL, and decode right after extracting from one. Never store encoded data.

Security note: URL encoding is not sanitization. Encoding <script> as %3Cscript%3E doesn't prevent XSS if the server decodes it before inserting into HTML. URL encoding protects URL structure, not your application. For XSS prevention, use HTML entity encoding (&lt;script&gt;) or a proper templating engine that auto-escapes. These are different encoding schemes for different purposes.
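The difference between the two schemes can be sketched as follows. escapeHtml is a minimal hypothetical helper written for this illustration; in a real application a templating engine's auto-escaping is the safer choice.

```javascript
// Hypothetical minimal HTML escaper, for illustration only.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const payload = "<script>alert(1)</script>";

// URL encoding protects the URL, but the server's decode step undoes it.
const urlEncoded = encodeURIComponent(payload); // "%3Cscript%3Ealert(1)%3C%2Fscript%3E"
const afterServerDecode = decodeURIComponent(urlEncoded); // the raw payload again

// HTML escaping is what makes the value safe to place in HTML text.
const htmlSafe = escapeHtml(afterServerDecode); // "&lt;script&gt;alert(1)&lt;/script&gt;"
```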

URL Encoding Quick Reference by Language

JavaScript: encodeURIComponent(value) for query/path values. new URLSearchParams({key: value}).toString() for building query strings (handles encoding automatically). new URL(path, base) for constructing full URLs safely. Decoding: decodeURIComponent(encoded). Never use escape()/unescape() — they're deprecated and handle Unicode incorrectly.

Python: urllib.parse.quote(string, safe="") for path segments. urllib.parse.urlencode(dict) for query strings. requests library handles encoding automatically when you pass params={"key": "value"}. For decoding: urllib.parse.unquote(encoded). The safe parameter controls which characters are NOT encoded — set it to "/" for path encoding that preserves slashes.

Go: url.QueryEscape(value) for query values (spaces become +). url.PathEscape(value) for path segments (spaces become %20). The url.URL struct with its Query() method handles building and parsing URLs correctly. Go is strict about encoding: url.Parse and net/http return an error for URLs that contain control characters or malformed percent escapes.

PHP: urlencode(value) for query values (spaces become +). rawurlencode(value) for path segments (spaces become %20). http_build_query(array) for building query strings from arrays. PHP's $_GET superglobal auto-decodes query parameters, so you never see percent-encoding in your application code unless you're reading the raw URL.