Regex Cheat Sheet: Patterns Every Developer Needs

10 min · May 11, 2026

Why Another Regex Cheat Sheet?

Every regex cheat sheet I've found online lists syntax — anchors, quantifiers, character classes — but never shows you the patterns you actually copy-paste at work. This one is different. It's organized by task: validate an email, extract a URL, enforce password rules, parse a log line. Each pattern comes with an explanation of why it works and, more importantly, where it breaks.

Regular expression syntax hasn't changed much since Perl 5 in 1994. The basics are the same across JavaScript, Python, Go, Java, and .NET. The differences are in flags (JavaScript uses /g, Python uses re.DOTALL), lookaround support (Go's RE2-based engine supports neither lookahead nor lookbehind), and Unicode handling. This cheat sheet uses JavaScript regex syntax since that's what most web developers encounter first, but most of the patterns work unchanged in any PCRE-compatible engine.

One opinion upfront: if your regex is longer than ~80 characters, you're probably better off with a parser or multiple simpler regexes chained together. I've debugged 300-character email validation regexes that still missed edge cases. Readability matters more than cleverness.

Regex Cheat Sheet: Core Syntax Quick Reference

Characters: . matches any character except newline. \d matches digits [0-9]. \w matches word characters [a-zA-Z0-9_]. \s matches whitespace (space, tab, newline). Their uppercase versions (\D, \W, \S) match the opposite. The backslash escapes special characters: \. matches a literal dot.

Quantifiers: * means zero or more. + means one or more. ? means zero or one. {3} means exactly 3. {2,5} means 2 to 5. {3,} means 3 or more. By default quantifiers are greedy (match as much as possible). Add ? to make them lazy: .*? matches as little as possible. This distinction matters when parsing HTML or extracting quoted strings.
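
To make the greedy/lazy distinction concrete, here is a small sketch. The sample strings are made up for illustration; the point is only how the same input splits differently under each quantifier.

```javascript
const html = '<b>one</b> and <b>two</b>';

// Greedy: .* runs to the end, then backs off to the LAST </b>.
const greedy = html.match(/<b>(.*)<\/b>/)[1];

// Lazy: .*? grows one character at a time and stops at the FIRST </b>.
const lazy = html.match(/<b>(.*?)<\/b>/)[1];

console.log(greedy); // "one</b> and <b>two"
console.log(lazy);   // "one"

// For quoted strings, a negated class sidesteps the question entirely:
// [^"]* can never run past a closing quote.
const quoted = 'say "hi" then "bye"'.match(/"([^"]*)"/g);
console.log(quoted); // ['"hi"', '"bye"']
```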

Anchors and boundaries: ^ matches start of string (or line with /m flag). $ matches end of string (or line with /m flag). \b matches a word boundary — the position between a \w and a \W character. Use \b to avoid partial matches: /\bcat\b/ matches "cat" but not "concatenate".

Groups and alternation: (abc) captures a group. (?:abc) groups without capturing (useful for alternation without polluting your match array). a|b matches a or b. (?=abc) is a positive lookahead — matches position before "abc" without consuming it. (?<=abc) is a positive lookbehind (not supported in all engines).

// Quick reference examples
/\d{3}-\d{4}/        // "555-1234" — US phone fragment
/\b\w+@\w+\.\w+\b/  // crude email (don't use in production)
/^https?:\/\//       // starts with http:// or https://
/(?<=\$)\d+\.\d{2}/ // "19.99" after a $ sign (lookbehind)
/"([^"]*)"/.exec(str)?.[1]  // extract content between quotes (undefined if no match)
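
A few more runnable illustrations of the anchor, boundary, and group rules above. The input strings are invented for the example.

```javascript
const text = 'cat\nconcatenate\ncat nap';

// Without /m, ^ and $ only match at the very start and end of the string.
console.log(text.match(/^cat$/g));   // null
// With /m, they also match at each line break.
console.log(text.match(/^cat$/gm));  // ["cat"]

// \b keeps "cat" from matching inside "concatenate".
const wholeWords = text.match(/\bcat\b/g);
console.log(wholeWords);             // ["cat", "cat"]

// Non-capturing group: alternation without an extra slot in the match array.
const m = 'file.jpeg'.match(/^(\w+)\.(?:jpe?g|png)$/);
console.log(m[1], m.length);         // "file" 2

// Lookahead: match digits only when followed by "px", without consuming "px".
const pixels = '12px 3em 40px'.match(/\d+(?=px)/g);
console.log(pixels);                 // ["12", "40"]
```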

Email Validation (The Honest Version)

The best-known regex that attempts full RFC-grammar validation of email addresses (written for RFC 822, the predecessor of RFC 5322) is over 6,000 characters long. Don't use it. In practice, you need a regex that catches obvious typos without rejecting valid but unusual addresses. Here's what I use: /^[^\s@]+@[^\s@]+\.[^\s@]{2,}$/. It checks for the shape something@something.tld with no spaces. That's it.

Why so simple? Because valid email addresses can contain + signs (user+tag@example.com), dots anywhere in the local part, quoted strings ("weird@chars"@example.com), and even IP addresses as domains (user@[192.168.1.1]). Any regex strict enough to reject "bad" addresses will also reject some valid ones. Gmail alone has 1.8 billion users with addresses that include dots and plus signs.

My recommendation: validate format loosely with regex, then verify the address actually exists by sending a confirmation email. That's what every serious service does. The regex is just a first-pass filter to catch typos like "user@gmailcom" (missing dot) or "user@@gmail.com" (double @).
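
As a rough sketch of that first-pass filter, using the loose pattern from this section (the function name looksLikeEmail is my own, not a standard API):

```javascript
const looseEmail = /^[^\s@]+@[^\s@]+\.[^\s@]{2,}$/;

// First-pass check only: a true result means "worth sending a
// confirmation email to", not "this mailbox exists".
function looksLikeEmail(input) {
  return looseEmail.test(input.trim());
}

console.log(looksLikeEmail('user+tag@example.com')); // true
console.log(looksLikeEmail('user@gmailcom'));        // false (missing dot)
console.log(looksLikeEmail('user@@gmail.com'));      // false (double @)
```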

One gotcha: the HTML5 input type="email" uses its own validation regex (defined in the WHATWG spec) which is different from RFC 5322. If your backend regex is stricter than the browser's, users will see the form submit successfully but get a server-side error. Test both.

URL and Path Patterns

Matching URLs in free text is surprisingly hard. The "simple" approach /https?:\/\/[^\s]+/ works for most cases but fails on URLs with parentheses (common in Wikipedia links) and matches trailing punctuation ("Visit https://example.com." captures the period). A better version: /https?:\/\/[^\s<>"]+[^\s<>".,;:!?)]/ — it excludes common trailing punctuation.

For validating a URL that a user typed into a form, don't use regex at all. Use the URL constructor: try { new URL(input) } catch { /* invalid */ }. It handles edge cases (ports, auth, fragments, IDN domains) that no reasonable regex covers. Regex is for extracting URLs from text, not validating them.

File path patterns differ by OS. Windows: /^[A-Z]:\\(?:[^\\/:*?"<>|]+\\)*[^\\/:*?"<>|]*$/ validates a path like C:\Users\docs\file.txt (add the /i flag if lowercase drive letters should pass). Unix: /^\/(?:[^\/]+\/)*[^\/]+$/ is simpler because fewer characters are illegal, though as written it rejects the bare root / and paths with a trailing slash. In practice, just check for null bytes and path traversal (../); those are the security-relevant validations.
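
Here's a minimal sketch of that null-byte and traversal check. The helper name hasUnsafePathParts is hypothetical, and this is a first filter under simple assumptions, not a complete defense; in a real service you would probably also resolve the path against a base directory.

```javascript
// Returns true if the path contains a NUL byte or a ".." segment.
function hasUnsafePathParts(p) {
  if (p.includes('\0')) return true;
  // Split on both separators so "..\" is caught as well as "../".
  return p.split(/[\\/]+/).some(segment => segment === '..');
}

console.log(hasUnsafePathParts('uploads/report.txt')); // false
console.log(hasUnsafePathParts('../etc/passwd'));      // true
console.log(hasUnsafePathParts('notes..old/file'));    // false: ".." is not a whole segment
```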

Extracting query parameters: /[?&]([^=]+)=([^&]*)/ with /g flag gives you key-value pairs. But again, URLSearchParams is better for this in JavaScript. Use regex when you're parsing logs or text that contains URLs, not when you have an actual URL object available.

// Extract all URLs from a block of text
const urlPattern = /https?:\/\/[^\s<>"]+[^\s<>".,;:!?)]/g;
const text = "Visit https://example.com/path?q=1 or http://test.org.";
text.match(urlPattern);
// ["https://example.com/path?q=1", "http://test.org"]

// Validate URL (don't use regex for this)
function isValidUrl(str) {
  try { new URL(str); return true; }
  catch { return false; }
}

// Extract path segments
"/api/v2/users/123/posts".match(/\/([^\/]+)/g);
// ["/api", "/v2", "/users", "/123", "/posts"]
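
For the query-parameter pattern mentioned in this section, a side-by-side sketch of the regex and URLSearchParams. The URL is an invented sample. Note that the regex hands back raw text, while URLSearchParams also percent-decodes values, which is one reason to prefer it when you have a real URL.

```javascript
const url = 'https://example.com/search?q=hello%20world&page=2&empty=';

// Regex extraction: fine for scraping URLs out of logs or free text.
const paramPattern = /[?&]([^=]+)=([^&]*)/g;
const fromRegex = [...url.matchAll(paramPattern)].map(m => [m[1], m[2]]);
console.log(fromRegex);
// [["q", "hello%20world"], ["page", "2"], ["empty", ""]]

// URLSearchParams: same keys, but values come back decoded.
const params = new URL(url).searchParams;
console.log(params.get('q'));    // "hello world"
console.log(params.get('page')); // "2"
```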

Password Strength Rules with Regex

Password validation is where regex actually shines — you're checking for the presence of character classes, not parsing structure. The classic "at least 8 chars, one uppercase, one lowercase, one digit, one special" translates to four separate lookaheads: /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^a-zA-Z\d]).{8,}$/

I prefer separate checks over one monster regex. It's easier to give specific feedback ("needs a number") instead of a generic "password too weak" message. Check each requirement independently: /[a-z]/.test(pw) for lowercase, /[A-Z]/.test(pw) for uppercase, /\d/.test(pw) for digits, pw.length >= 12 for length. Combine the results in application logic.

The NIST 800-63B guidelines (updated 2024) actually recommend against complexity rules. They suggest a minimum of 8 characters, checking against a breached password list, and allowing at least 64 characters. No forced special characters, no mandatory uppercase. The research shows that complexity rules lead to predictable patterns ("Password1!") while length alone produces better entropy. Our password generator tool creates random strings that satisfy any policy without falling into these patterns.
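
A length-first policy along those lines might look something like this. It's a sketch under my own assumptions: the function name is made up, and the three-entry blocklist is a stand-in for a real breached-password corpus.

```javascript
// Tiny illustrative blocklist; a real deployment would use a large breach corpus.
const blocklist = new Set(['password', '12345678', 'qwertyuiop']);

function checkLengthFirst(pw) {
  // Spread the string to count code points rather than UTF-16 units,
  // so one emoji counts as one character.
  const length = [...pw].length;
  return {
    longEnough: length >= 8,
    notBlocklisted: !blocklist.has(pw.toLowerCase()),
  };
}

console.log(checkLengthFirst('correct horse battery staple'));
// { longEnough: true, notBlocklisted: true }
console.log(checkLengthFirst('Password'));
// { longEnough: true, notBlocklisted: false }
```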

One thing regex can't check: whether the password appears in a breach database. You need to hash it and check against Have I Been Pwned's API (k-anonymity model, so you only send the first 5 chars of the SHA-1 hash). No regex in the world catches "Correct Horse Battery Staple" as weak — only a dictionary check does.

// Option A: Single regex with lookaheads
const strongPassword = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^a-zA-Z\d]).{8,}$/;
strongPassword.test("Tr0ub4dor&3"); // true

// Option B: Separate checks (recommended — better UX)
function checkPassword(pw) {
  return {
    hasLower: /[a-z]/.test(pw),
    hasUpper: /[A-Z]/.test(pw),
    hasDigit: /\d/.test(pw),
    hasSpecial: /[^a-zA-Z\d]/.test(pw),
    longEnough: pw.length >= 12,
    // NIST recommends 8 minimum, I prefer 12
  };
}

// Check against common patterns (not a substitute for breach DB)
const commonPatterns = /^(password|123456|qwerty|admin)/i;
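
The k-anonymity flow described above can be sketched without any network code. This assumes you have already computed the SHA-1 hex digest with your platform's crypto library, that you send only the 5-character prefix to the Pwned Passwords range endpoint, and that the response is a list of SUFFIX:COUNT lines. The helper names are mine, and the response body below is invented for illustration; its counts are not real data.

```javascript
// Split a SHA-1 hex digest into the part you send and the part you keep local.
function splitForRangeQuery(sha1Hex) {
  const hex = sha1Hex.toUpperCase();
  return { prefix: hex.slice(0, 5), suffix: hex.slice(5) };
}

// Scan a "SUFFIX:COUNT" response body for the locally held suffix.
function countInRangeResponse(body, suffix) {
  for (const line of body.split(/\r?\n/)) {
    const m = /^([0-9A-F]{35}):(\d+)$/i.exec(line.trim());
    if (m && m[1].toUpperCase() === suffix) return Number(m[2]);
  }
  return 0; // suffix not listed under this prefix
}

// SHA-1 of the string "password".
const { prefix, suffix } = splitForRangeQuery('5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8');
console.log(prefix); // "5BAA6" is all the server ever sees

// Made-up sample body, for illustration only.
const sampleBody = [
  '0018A45C4D1DEF81644B54AB7F969B88D65:3',
  '1E4C9B93F3F0682250B6CF8331B7EE68FD8:42',
].join('\r\n');
console.log(countInRangeResponse(sampleBody, suffix)); // 42
```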

Log Parsing and Data Extraction

This is where regex earns its keep. Parsing structured log lines is a perfect fit — the format is predictable, the data is text, and you need to extract specific fields. An Apache access log line: /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d+) (\d+)/ gives you IP, timestamp, method, path, status code, and response size in one match.

Named capture groups (available since ES2018) make log parsing much more readable. Instead of match[1], match[2], you get match.groups.ip, match.groups.timestamp. The syntax is (?<name>pattern). Use it whenever you have more than 2 capture groups — your future self will thank you.

For multi-line log entries (Java stack traces, for example), use the /s flag (dotAll) so that . matches newlines, or use [\s\S] as a cross-engine alternative. Combine with lazy quantifiers: /ERROR[\s\S]*?(?=\n\d{4}-|$)/ matches from ERROR to the next log entry's timestamp.
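
Here's that multi-line pattern applied to a small made-up log (the timestamps, messages, and file names are invented for the example). I've added the /g flag so it would pick up every ERROR entry, not just the first.

```javascript
const log = [
  '2026-05-21 10:15:32 INFO  Server started',
  '2026-05-21 10:15:40 ERROR Request failed',
  '  at handler (app.js:10)',
  '  at router (app.js:42)',
  '2026-05-21 10:15:41 INFO  Retrying',
].join('\n');

// Lazy [\s\S]*? expands until the lookahead sees a newline followed by
// a new timestamp ("2026-..."), or the end of the input.
const errorBlock = /ERROR[\s\S]*?(?=\n\d{4}-|$)/g;
const blocks = log.match(errorBlock);

console.log(blocks.length); // 1
console.log(blocks[0]);
// ERROR Request failed
//   at handler (app.js:10)
//   at router (app.js:42)
```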

Performance warning: backtracking can make regex exponentially slow on certain inputs. The pattern /(a+)+b/ on a string like "aaaaaaaaaaaaaaaaac" fails only after the engine has tried every possible way to split the a's between the inner and outer groups. That work roughly doubles with each additional a, so a few dozen a's can stall a backtracking engine for seconds or longer. If you're parsing untrusted input (user-submitted text, log files from unknown sources), use RE2-compatible patterns or set a timeout.

// Apache access log parsing with named groups
const logPattern = /^(?<ip>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) \S+" (?<status>\d+) (?<size>\d+)/;

const line = '192.168.1.1 - - [21/May/2026:10:15:32 +0000] "GET /api/users HTTP/1.1" 200 1234';
const { groups } = line.match(logPattern);
// groups.ip = "192.168.1.1"
// groups.status = "200"
// groups.path = "/api/users"

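// Extract all key=value pairs from a log line
// (the access-log line above has none, so this uses a key=value style sample)
const kvLine = 'level=error msg="upstream timeout" user.id=42';
const kvPattern = /(?<key>[\w.]+)=(?<value>"[^"]*"|\S+)/g;
const entries = [...kvLine.matchAll(kvPattern)].map(m => m.groups);
// entries[1]: key "msg", value "\"upstream timeout\"" (quotes included)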

When Regex Is the Wrong Tool

HTML parsing: Don't. The famous Stack Overflow answer about parsing HTML with regex (the one about summoning Zalgo) is funny because it's true. HTML is not a regular language — nested tags, self-closing elements, attributes with quotes inside quotes, CDATA sections, and comments all break regex. Use DOMParser in the browser or cheerio/jsdom in Node.js.

JSON validation: A regex cannot validate nested JSON because it can't count matching braces. You can check if something looks like JSON (/^\s*[{\[]/) but you can't verify it's valid. Use JSON.parse() in a try/catch. Our JSON formatter tool does this with proper error reporting.
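
A minimal sketch of that split between "looks like JSON" and "is valid JSON". The function name and the shape of the return value are my own choices. Note the regex pre-check deliberately rejects bare scalars like 42 or "text", which JSON.parse would accept.

```javascript
function parseJsonSafely(text) {
  // Cheap shape check: must start with { or [ after optional whitespace.
  if (!/^\s*[{\[]/.test(text)) {
    return { ok: false, error: 'does not start with { or [' };
  }
  // The real validation: only a parser can match up nested braces.
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

console.log(parseJsonSafely('{"a": [1, 2]}').ok); // true
console.log(parseJsonSafely('{"a": [1, 2}').ok);  // false: passes the regex, fails the parser
console.log(parseJsonSafely('hello').ok);         // false: fails the regex
```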

Arithmetic expressions: Anything with nested parentheses or operator precedence needs a parser, not a regex. If you're building a calculator or formula evaluator, look into recursive descent parsers or parser generators (like nearley.js or PEG.js).

Natural language processing: Regex can tokenize text and find simple patterns, but it can't understand grammar, handle ambiguity, or deal with context. Sentence splitting looks easy until you hit "Dr. Smith went to Washington, D.C. He arrived at 3 p.m." — every period is ambiguous. Use a proper NLP library (spaCy, compromise) for anything beyond simple pattern matching.

Regex Performance Tips and Common Pitfalls

Compile once, use many times. In JavaScript, define your regex outside the loop: const pattern = /\d+/g; not inside it. In Python, use re.compile(). The compilation cost is small but adds up over millions of iterations. In a benchmark I ran, pre-compiled regex was 3x faster than inline regex over 10 million matches.
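
One caveat when hoisting a regex in JavaScript: a shared /g regex keeps state in lastIndex between .test() and .exec() calls. The sketch below shows a hoisted pattern used safely and the surprise you get from reusing a global one. The sample lines are made up.

```javascript
const lines = ['id 42', 'no digits', 'id 7'];

// Hoisted once, no /g flag: .test() is stateless and safe to reuse.
const hasDigits = /\d+/;
const flags = lines.map(line => hasDigits.test(line));
console.log(flags); // [true, false, true]

// A shared /g regex remembers where the last match ended.
const globalDigits = /\d+/g;
const first = globalDigits.test('id 42');  // true, lastIndex is now 5
const second = globalDigits.test('id 42'); // false: the search resumed at index 5
console.log(first, second);                // true false
```

If you need /g for match or matchAll, either reset lastIndex to 0 before each .test() call or keep a separate non-global copy for yes/no checks.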

Avoid catastrophic backtracking. Patterns like /(.+)+@/ or /(a|a)+b/ can take exponential time on non-matching inputs. The rule: never nest quantifiers on overlapping patterns. If you're not sure, test your regex against pathological inputs (long strings of repeated characters that almost match). The regex101.com debugger shows backtracking steps.
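
Often the fix is just to flatten the nesting. In this sketch the two patterns accept the same strings (only the capture group differs), but the flat version gives the engine one way to match instead of exponentially many. The inputs are kept short on purpose so the risky pattern stays fast here.

```javascript
const risky = /(a+)+b/; // nested quantifiers over the same character
const safe = /a+b/;     // same accepted strings, no nested repetition

const samples = ['ab', 'aaab', 'xaab', 'aaa', 'b', 'aaac'];
const agree = samples.every(s => risky.test(s) === safe.test(s));
console.log(agree); // true

// The flat pattern fails quickly even on a long near-miss input.
const nearMiss = 'a'.repeat(5000) + 'c';
console.log(safe.test(nearMiss)); // false
```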

Use atomic groups or possessive quantifiers when available. In Java and .NET, (?>pattern) prevents backtracking into the group. JavaScript doesn't support these yet (proposal is at Stage 3 as of 2026), but you can restructure patterns to avoid the need. When you want everything up to the first digit, replace (.*)\d with ([^\d]*)\d: the character class can't match digits, so there's nothing to backtrack. The two are not interchangeable, though, because (.*)\d captures everything up to the last digit.
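
A quick check of what that restructuring does to the match itself, on an invented input. The negated class avoids backtracking, but it also changes which digit the pattern stops at, so make sure that's the behavior you want.

```javascript
const input = 'abc123';

// .* grabs the whole string, then gives back one character so \d can match.
const greedyDot = input.match(/(.*)\d/);
console.log(greedyDot[1]); // "abc12"

// [^\d]* stops at the first digit and never has to give anything back.
const negated = input.match(/([^\d]*)\d/);
console.log(negated[1]);   // "abc"
```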

For large-scale text processing (scanning gigabytes of logs), consider tools built on finite automata instead of backtracking engines. ripgrep uses Rust's regex crate (based on RE2's approach) and processes text at 2-5 GB/s. On patterns that trigger heavy backtracking it can be orders of magnitude faster than grep -P, because its default engine never backtracks.