MD5 vs SHA256: Hash Functions Explained Simply

9 min read · May 16, 2026

What Hash Functions Do (MD5 vs SHA256 in Plain English)

A hash function takes any input (a single character, a 10 GB file, an empty string) and produces a fixed-size output called a digest. MD5 always outputs 128 bits (32 hex characters). SHA-256 always outputs 256 bits (64 hex characters). The same input always produces the same output, but there's no feasible way to reverse the process and recover the input. That's the entire concept. The MD5 vs SHA-256 debate comes down to one question: do you need collision resistance?
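The fixed-size and determinism properties are easy to check for yourself. The snippet below is a minimal sketch using Node's built-in crypto module; the helper name hexDigest is made up for this example.

```javascript
const crypto = require("node:crypto");

// Hypothetical helper: hex digest of a UTF-8 string.
function hexDigest(algorithm, text) {
  return crypto.createHash(algorithm).update(text, "utf8").digest("hex");
}

// Inputs of wildly different sizes: empty, one character, one million characters.
const inputs = ["", "a", "x".repeat(1_000_000)];
for (const input of inputs) {
  const md5 = hexDigest("md5", input);
  const sha256 = hexDigest("sha256", input);
  // Digest lengths stay at 32 and 64 hex characters regardless of input size.
  console.log(input.length, md5.length, sha256.length);
}

// Same input, same output, every time.
console.log(hexDigest("sha256", "hello") === hexDigest("sha256", "hello")); // true
```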

Collision resistance means it should be computationally infeasible to find two different inputs that produce the same hash. MD5 lost this property in 2004 when Xiaoyun Wang and colleagues demonstrated practical collision attacks. Within a few years, researchers could create colliding documents (including PDFs) with different visible content but identical MD5 hashes. SHA-256 has no known collisions and, absent a cryptanalytic breakthrough, is expected to remain secure for decades.

But here's what most articles get wrong: MD5 isn't "broken" for everything. It's broken for security (digital signatures, certificates, integrity verification against malicious tampering). It's perfectly fine for non-security uses: deduplication, cache keys, checksums for accidental corruption, hash table distribution. If your threat model doesn't include an attacker deliberately crafting collisions, MD5 is fast and adequate.

How Hash Functions Work (The Mechanics)

MD5 and the SHA-2 family follow the same pattern (not every cryptographic hash does: SHA-3 and BLAKE3 are built differently): pad the input to a multiple of the block size, split it into blocks, and process each block through a compression function that mixes it with the running state. MD5 uses 512-bit blocks and 4 rounds of 16 operations each (64 steps in total) over a 128-bit state. SHA-256 also uses 512-bit blocks, with 64 rounds over a 256-bit state and a heavier round function. More mixing per block and a larger state make it harder to reverse or find collisions.
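The pad-and-split step can be sketched in a few lines. This is an illustration of SHA-256-style padding only, not a hash implementation: the function names sha256Pad and splitBlocks are made up here, and the compression function itself is left out. MD5 pads the same way but stores the length field little-endian.

```javascript
const BLOCK_BYTES = 64; // 512-bit blocks

// Illustrative SHA-256-style padding: 0x80 marker, zero fill, 64-bit bit-length.
function sha256Pad(message) {
  // Reserve 1 byte for the 0x80 marker and 8 bytes for the length field.
  const paddedLength = Math.ceil((message.length + 9) / BLOCK_BYTES) * BLOCK_BYTES;
  const padded = Buffer.alloc(paddedLength); // zero-filled
  message.copy(padded);
  padded[message.length] = 0x80;
  padded.writeBigUInt64BE(BigInt(message.length) * 8n, paddedLength - 8);
  return padded;
}

// Split the padded message into the blocks the compression function would consume.
function splitBlocks(padded) {
  const blocks = [];
  for (let offset = 0; offset < padded.length; offset += BLOCK_BYTES) {
    blocks.push(padded.subarray(offset, offset + BLOCK_BYTES));
  }
  return blocks;
}

console.log(splitBlocks(sha256Pad(Buffer.from("hello"))).length); // 1
console.log(splitBlocks(sha256Pad(Buffer.alloc(56))).length); // 2
```

A 56-byte message needs a second block because the marker byte and length field no longer fit in the first one.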

The avalanche effect is what makes hashes useful: changing a single bit in the input flips roughly 50% of the output bits. "hello" and "hellp" produce completely different hashes with no visible relationship. This means you can't deduce anything about the input from the output, and similar inputs don't produce similar outputs. Our hash-generator tool demonstrates this — try changing one character and watch the entire hash change.
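To see the avalanche effect in numbers rather than by eye, you can count how many output bits differ between two nearby inputs. A minimal sketch with Node's crypto module follows; differingBits is a helper written for this example.

```javascript
const crypto = require("node:crypto");

function sha256(text) {
  return crypto.createHash("sha256").update(text, "utf8").digest();
}

// Count the bit positions where two equal-length buffers differ.
function differingBits(a, b) {
  let count = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) {
      count += x & 1;
      x >>= 1;
    }
  }
  return count;
}

const flipped = differingBits(sha256("hello"), sha256("hellp"));
console.log(`${flipped} of 256 bits differ`);
```

For a well-behaved hash the count should land near 128, give or take a couple of dozen bits.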

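Performance varies dramatically. As rough figures on modern hardware (Intel i7-13700K), MD5 processes about 6 GB/s, SHA-256 about 2 GB/s (or 8 GB/s with SHA-NI hardware acceleration), SHA-3 about 1.5 GB/s, and BLAKE3 about 12 GB/s (it's designed for speed with SIMD). Absolute numbers depend heavily on the implementation, buffer size, and thread count, so benchmark on your own machine (openssl speed is a quick way) before relying on them. For hashing passwords, you want slowness: bcrypt at cost 12 does about 4 hashes/second on the same CPU. That's deliberate.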
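If you'd rather measure than trust published figures, a rough single-thread benchmark is only a few lines. This is a sketch, not a rigorous benchmark: the function name throughputMBps and the 64 MB workload are arbitrary choices, and results will reflect Node's OpenSSL build rather than the fastest implementation available.

```javascript
const crypto = require("node:crypto");

// Rough single-thread throughput in MB/s for one algorithm.
function throughputMBps(algorithm, totalMB = 64) {
  const chunk = Buffer.alloc(1024 * 1024, 0xab); // 1 MiB of filler bytes
  const hash = crypto.createHash(algorithm);
  const start = process.hrtime.bigint();
  for (let i = 0; i < totalMB; i++) hash.update(chunk);
  hash.digest();
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  return totalMB / seconds;
}

for (const algorithm of ["md5", "sha256", "sha512"]) {
  console.log(algorithm, Math.round(throughputMBps(algorithm)), "MB/s");
}
```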

One detail that trips people up: hash functions are deterministic over bytes, not over text. SHA-256 of the string "hello" depends on whether you encode it as UTF-8 or UTF-16, because the bytes are different. (ASCII and UTF-8 produce identical bytes for plain ASCII text like "hello", so those two hashes match; the encodings only diverge once the text contains a character such as "é", which ASCII cannot represent at all.) Always specify the encoding. In JavaScript, new TextEncoder().encode("hello") gives you UTF-8 bytes, which is the standard convention.

// Same string, different encodings = different hashes
const text = "hello";

// UTF-8 (standard): 68 65 6c 6c 6f (5 bytes)
// SHA-256: 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

// UTF-16LE: 68 00 65 00 6c 00 6c 00 6f 00 (10 bytes)  
// SHA-256: completely different hash

// Always use UTF-8 unless you have a specific reason not to
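const encoder = new TextEncoder(); // TextEncoder always emits UTF-8
const bytes = encoder.encode(text);
// digest() resolves to an ArrayBuffer; top-level await needs an ES module
const hash = await crypto.subtle.digest("SHA-256", bytes);
// Convert the raw bytes to the familiar hex string
const hex = Array.from(new Uint8Array(hash), (b) => b.toString(16).padStart(2, "0")).join("");
console.log(hex); // 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824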
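The same comparison can be run synchronously in Node, where Buffer.from lets you pick the encoding explicitly. This is a small sketch; sha256Hex is a helper name chosen for the example. Note that for plain ASCII text the "ascii" and "utf8" encodings yield the same bytes, so only the UTF-16 digest differs.

```javascript
const crypto = require("node:crypto");

function sha256Hex(bytes) {
  return crypto.createHash("sha256").update(bytes).digest("hex");
}

const text = "hello";
const utf8 = Buffer.from(text, "utf8"); // 5 bytes
const utf16 = Buffer.from(text, "utf16le"); // 10 bytes
const ascii = Buffer.from(text, "ascii"); // 5 bytes, identical to UTF-8 here

console.log(utf8.length, utf16.length, ascii.length); // 5 10 5
console.log(sha256Hex(utf8) === sha256Hex(ascii)); // true
console.log(sha256Hex(utf8) === sha256Hex(utf16)); // false
```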

When to Use Each Algorithm (Practical Guide)

File integrity (detecting accidental corruption): MD5 or CRC32 is fine. You're protecting against disk errors and network glitches, not attackers. MD5 is usually faster than SHA-256 in software (CPUs with SHA extensions can reverse that) and the 128-bit output is sufficient. Some Linux package repositories (Debian's, for example) still publish MD5 checksums alongside SHA-256 for backward compatibility. If you want extra safety at minimal cost, use SHA-256: it's fast enough for most file sizes.
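For checking a download against a published checksum, the main practical concern is not loading a huge file into memory. One possible approach, sketched with Node's fs and crypto modules, reads the file in fixed-size chunks; hashFileSync and matchesChecksum are hypothetical helper names.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");

// Hash a file in fixed-size chunks so large files never sit fully in memory.
function hashFileSync(path, algorithm = "sha256") {
  const hash = crypto.createHash(algorithm);
  const buffer = Buffer.alloc(64 * 1024);
  const fd = fs.openSync(path, "r");
  try {
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, buffer, 0, buffer.length, null)) > 0) {
      hash.update(buffer.subarray(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest("hex");
}

// Hypothetical usage: compare against a checksum published next to a download.
function matchesChecksum(path, expectedHex, algorithm = "sha256") {
  return hashFileSync(path, algorithm) === expectedHex.trim().toLowerCase();
}
```

A plain === comparison is acceptable here because the checksum is public and the goal is catching accidental corruption, not authenticating a secret.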

Content addressing and deduplication: SHA-256 is the standard whenever content can come from untrusted sources. Git uses SHA-1 (migrating to SHA-256), IPFS uses SHA-256 by default, and Docker image layers use SHA-256. The hash becomes the content's identity: if two files have the same SHA-256, they're the same file (with overwhelming probability). Don't use MD5 here, because an attacker can craft a benign file and a malicious file that share the same MD5, then swap one for the other.
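The "hash as identity" idea fits in a tiny in-memory store. The class below is a hypothetical sketch, not how Git, IPFS, or Docker implement it: it simply keys each blob by its SHA-256 digest so identical content is stored once.

```javascript
const crypto = require("node:crypto");

// Hypothetical in-memory content-addressed store keyed by SHA-256.
class ContentStore {
  constructor() {
    this.blobs = new Map();
  }

  // Store the bytes and return their content ID (the hex digest).
  put(bytes) {
    const id = crypto.createHash("sha256").update(bytes).digest("hex");
    if (!this.blobs.has(id)) this.blobs.set(id, Buffer.from(bytes)); // dedupe
    return id;
  }

  get(id) {
    return this.blobs.get(id);
  }
}

const store = new ContentStore();
const first = store.put(Buffer.from("same content"));
const second = store.put(Buffer.from("same content"));
console.log(first === second, store.blobs.size); // true 1
```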

Digital signatures and certificates: SHA-256 minimum. Public certificate authorities stopped issuing SHA-1 TLS certificates in 2016, and major browsers stopped trusting them in early 2017, around the same time Google and CWI Amsterdam published the first SHA-1 collision (the "SHAttered" attack, estimated at roughly $110,000 in cloud GPU time). For new systems, SHA-256 or SHA-3 are both good choices. Ed25519 signatures use SHA-512 internally.
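As a concrete illustration, Node's crypto module can generate an Ed25519 key pair and sign a message without you choosing a hash at all, since the scheme fixes it internally. This is a minimal sketch; the message contents are made up.

```javascript
const crypto = require("node:crypto");

const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");
const message = Buffer.from("example release notes");

// Ed25519 fixes its own hash internally, so the algorithm argument is null.
const signature = crypto.sign(null, message, privateKey);

console.log(signature.length); // 64
console.log(crypto.verify(null, message, publicKey, signature)); // true
console.log(crypto.verify(null, Buffer.from("tampered"), publicKey, signature)); // false
```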

Password hashing: NONE of the above. MD5, SHA-256, and SHA-3 are all wrong for passwords because they're too fast. Use bcrypt, scrypt, or Argon2id: deliberately slow (and, for scrypt and Argon2id, memory-hard) functions designed to resist brute-force attacks. A GPU can compute on the order of 10 billion SHA-256 hashes per second but only ~70 bcrypt hashes per second; the exact bcrypt rate depends on the cost factor you configure. See our password-generator tool for generating passwords that resist even slow-hash brute-forcing.
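Of the three recommended functions, scrypt is the one that ships in Node's standard library, so it makes for a dependency-free sketch. The "salt:hash" storage format and the function names below are assumptions for illustration, and the cost parameters are Node's defaults, which you would want to tune for production.

```javascript
const crypto = require("node:crypto");

// Sketch: salted scrypt hash stored as "salt:hash" in hex (format is an assumption).
function hashPassword(password) {
  const salt = crypto.randomBytes(16); // unique salt per password
  const derived = crypto.scryptSync(password, salt, 32);
  return `${salt.toString("hex")}:${derived.toString("hex")}`;
}

function verifyPassword(password, stored) {
  const [saltHex, hashHex] = stored.split(":");
  const expected = Buffer.from(hashHex, "hex");
  const actual = crypto.scryptSync(password, Buffer.from(saltHex, "hex"), expected.length);
  // Constant-time comparison; both buffers have the same length by construction.
  return crypto.timingSafeEqual(actual, expected);
}

const stored = hashPassword("correct horse battery staple");
console.log(verifyPassword("correct horse battery staple", stored)); // true
console.log(verifyPassword("wrong guess", stored)); // false
```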

The SHA Family: SHA-1, SHA-2, SHA-3

SHA-1 (160 bits): Theoretically weakened since 2005 and practically broken since 2017, when Google's SHAttered attack produced two different PDFs with the same SHA-1 hash. Cost: ~$110,000 in cloud GPU time in 2017, and almost certainly much less today as hardware and attacks improve. Don't use SHA-1 for anything security-related. Git still uses it but is migrating to SHA-256. Chrome and Firefox have rejected SHA-1 certificates since 2017.

SHA-2 family (SHA-224, SHA-256, SHA-384, SHA-512): The current standard. SHA-256 is the most common variant. SHA-512 is often faster in software on 64-bit processors (counterintuitively: it processes 1024-bit blocks vs SHA-256's 512-bit blocks, and 64-bit operations are native), although CPUs with SHA-256 hardware instructions such as SHA-NI reverse that. SHA-384 is SHA-512 with a different initial state and truncated output. For most purposes, SHA-256 is the default choice.

SHA-3 (Keccak, standardized 2015): A completely different internal design from SHA-2 (sponge construction vs Merkle-Damgård). It exists as a backup in case SHA-2 is broken — having two unrelated algorithms means a breakthrough against one doesn't affect the other. SHA-3 is slightly slower than SHA-2 in software but has a wider security margin. Use it if your compliance requirements specify it, otherwise SHA-256 is fine.

BLAKE3 (2020): Not a SHA variant but worth mentioning. It's roughly 3-6x faster than SHA-256 in software, parallelizable (scales with CPU cores), and has a 256-bit default output. It's used in Bao (verified streaming), the Rust ecosystem, and increasingly in content-addressed storage. The downside: it's newer and not a NIST standard, so environments that require FIPS-approved algorithms generally can't use it. For internal tooling and performance-sensitive applications, BLAKE3 is excellent.
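The digest sizes quoted for the SHA family above can be checked directly, as can the claim that SHA-384 is more than a plain truncation of SHA-512. The sketch below uses Node's crypto module; digestBits is a helper written for this example, SHA-3 availability depends on the underlying OpenSSL build, and BLAKE3 is left out because it isn't assumed to be available in Node's built-in crypto.

```javascript
const crypto = require("node:crypto");

// Output size in bits for a given algorithm name.
function digestBits(algorithm) {
  return crypto.createHash(algorithm).update("hello").digest().length * 8;
}

const available = new Set(crypto.getHashes());
for (const algorithm of ["sha1", "sha224", "sha256", "sha384", "sha512", "sha3-256"]) {
  if (!available.has(algorithm)) continue; // depends on the OpenSSL build
  console.log(algorithm.padEnd(8), digestBits(algorithm), "bits");
}

// SHA-384 is not simply the first 384 bits of SHA-512: the initial state differs.
const sha384 = crypto.createHash("sha384").update("hello").digest("hex");
const sha512 = crypto.createHash("sha512").update("hello").digest("hex");
console.log(sha512.startsWith(sha384)); // false
```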

Hash Collisions: What They Mean in Practice

A collision is two different inputs that produce the same hash output. For a 128-bit hash (MD5), the birthday paradox says you'll likely find a collision after about 2^64 attempts (~18 quintillion). For SHA-256 (256 bits), it's about 2^128 attempts (roughly 3.4 × 10^38), far more work than all the computers on Earth could do in any realistic timeframe. In practice, SHA-256 collisions will never be found by brute force.
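The birthday bounds are easier to appreciate when converted into time. The sketch below uses BigInt for the exact powers of two and assumes a hypothetical attacker computing one trillion hashes per second; that rate is an illustrative assumption, not a measured figure.

```javascript
// Generic birthday bound: about 2^(n/2) attempts for an n-bit digest.
function birthdayBound(bits) {
  return 2n ** BigInt(bits / 2);
}

// Years of nonstop hashing needed to reach the bound at a given rate.
function yearsToBound(bits, hashesPerSecond) {
  const seconds = Number(birthdayBound(bits) / hashesPerSecond);
  return seconds / 31_557_600; // seconds in a 365.25-day year
}

console.log(birthdayBound(128).toString()); // 18446744073709551616
console.log(birthdayBound(256).toString()); // 340282366920938463463374607431768211456

const rate = 10n ** 12n; // hypothetical: one trillion hashes per second
console.log(yearsToBound(128, rate)); // roughly 0.58 years
console.log(yearsToBound(256, rate)); // roughly 1e19 years
```

At that assumed rate a generic 128-bit collision search takes months, while the 256-bit search takes about a billion times the age of the universe.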

But brute force isn't the only attack. Cryptanalysis exploits mathematical weaknesses in the algorithm. MD5's compression function has structural flaws that allow finding collisions in seconds on a laptop (not 2^64 attempts, but about 2^18 — a few hundred thousand operations). SHA-1 requires about 2^63 operations (still expensive but feasible for nation-states and well-funded attackers).

What can an attacker do with a collision? They can create two documents with the same hash: one benign, one malicious. They get the benign one signed/certified, then substitute the malicious one. This is how the Flame malware (2012) used an MD5 chosen-prefix collision to forge a Microsoft code-signing certificate and pose as Windows Update. A Microsoft certificate authority signed a legitimate-looking certificate, but the attackers had a colliding certificate that validated their malware.

For non-security uses, collisions are a non-issue. If you're using MD5 as a hash table key, a collision just means two different inputs map to the same bucket, which your code handles with chaining or open addressing. For cache identifiers, an accidental collision would return the wrong entry, but the probability is negligible: any two random inputs collide with probability 1 in 2^128, and you'd need on the order of 2^64 items before an accidental collision becomes likely. The security concern is only about deliberately crafted collisions.
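To put a number on "you'll never see it", the standard birthday approximation gives the chance of at least one accidental collision among n random inputs. The function below is a sketch of that approximation (the name collisionProbability is made up), using Math.expm1 so tiny probabilities don't round to zero.

```javascript
// Approximate probability of at least one accidental collision among
// `items` random inputs hashed to `bits` bits (birthday approximation).
function collisionProbability(items, bits) {
  const exponent = -(items * (items - 1)) / (2 * 2 ** bits);
  return -Math.expm1(exponent); // 1 - e^x, accurate for tiny probabilities
}

console.log(collisionProbability(1e9, 128)); // on the order of 1e-21
console.log(collisionProbability(1e9, 32)); // effectively 1
```

A billion MD5-keyed cache entries give odds of roughly one in 10^21, whereas the same billion entries squeezed into 32 bits are certain to collide.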

HMAC: When You Need Authentication, Not Just Hashing

A plain hash detects accidental corruption but proves nothing about authenticity (that the data came from who you think): an attacker who modifies the data can simply recompute the hash. HMAC (Hash-based Message Authentication Code) solves this by mixing a secret key into the hash: HMAC(key, message) = Hash((key ⊕ opad) || Hash((key ⊕ ipad) || message)), where the key is first padded (or hashed, if too long) to the hash's block size and ipad and opad are fixed constants.
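The formula is short enough to spell out in code, which helps in seeing what the two pads actually do. The function below is an educational sketch only, written for this example and expecting Buffer arguments; real code should call crypto.createHmac, which it is checked against here.

```javascript
const crypto = require("node:crypto");

// Educational sketch of the HMAC-SHA256 construction. Use crypto.createHmac in real code.
function hmacSha256Manual(key, message) {
  const blockSize = 64; // SHA-256 block size in bytes
  const sha256 = (...parts) => {
    const hash = crypto.createHash("sha256");
    for (const part of parts) hash.update(part);
    return hash.digest();
  };

  // Keys longer than one block are hashed first, then zero-padded to the block size.
  const shortKey = key.length > blockSize ? sha256(key) : key;
  const paddedKey = Buffer.alloc(blockSize);
  shortKey.copy(paddedKey);

  // ipad and opad are the fixed bytes 0x36 and 0x5c XORed into every key byte.
  const innerKey = Buffer.from(paddedKey.map((byte) => byte ^ 0x36));
  const outerKey = Buffer.from(paddedKey.map((byte) => byte ^ 0x5c));

  return sha256(outerKey, sha256(innerKey, message));
}

const key = Buffer.from("secret");
const message = Buffer.from("payload");
const reference = crypto.createHmac("sha256", key).update(message).digest();
console.log(hmacSha256Manual(key, message).equals(reference)); // true
```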

Use HMAC when: verifying API webhook signatures (Stripe, GitHub, and Shopify all use HMAC-SHA256), creating tamper-evident tokens (JWTs signed with the HS256 algorithm use HMAC-SHA256), or validating that data hasn't been modified in transit by an attacker. The key must be kept secret: if the attacker has the key, HMAC provides no protection.

Common mistake: using Hash(key + message) instead of HMAC. This is vulnerable to length-extension attacks — an attacker who knows Hash(key + message) can compute Hash(key + message + attacker_data) without knowing the key. SHA-256 and MD5 are both vulnerable to this. HMAC's double-hashing construction prevents it. SHA-3 is not vulnerable to length extension (different internal structure), but use HMAC anyway for consistency.

In code: Node.js has crypto.createHmac("sha256", key).update(message).digest("hex"). Python has hmac.new(key, message, hashlib.sha256).hexdigest(). Never implement HMAC yourself; use your language's standard library. Verification is where most bugs creep in: comparing the received and expected values with === instead of crypto.timingSafeEqual can leak the correct HMAC byte by byte through timing differences.
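Putting the signing call and the constant-time comparison together, a webhook-style verifier might look like the sketch below. The function names and the hex-encoded signature format are assumptions for illustration; real providers each document their own header format and signed payload.

```javascript
const crypto = require("node:crypto");

function sign(secret, payload) {
  return crypto.createHmac("sha256", secret).update(payload).digest("hex");
}

// Hypothetical verifier for a hex signature received alongside a payload.
function verifySignature(secret, payload, receivedHex) {
  const expected = Buffer.from(sign(secret, payload), "hex");
  const received = Buffer.from(receivedHex, "hex");
  // timingSafeEqual throws on length mismatch, so check length first.
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(received, expected);
}

const secret = "example-shared-secret";
const payload = '{"event":"order.created"}';
const signature = sign(secret, payload);

console.log(verifySignature(secret, payload, signature)); // true
console.log(verifySignature(secret, payload + " ", signature)); // false
```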

Common Mistakes with Hash Functions

Mistake 1: Using SHA-256 for passwords. SHA-256 is fast, and that's bad for passwords. An RTX 4090 computes 22 billion SHA-256 hashes per second. An 8-character password from the full printable ASCII set (95^8, about 6.6 quadrillion combinations) falls in at most about 3.5 days, and in half that on average. Use bcrypt or Argon2id, which push that into centuries. If you're storing user passwords with SHA-256 (even with salt), migrate immediately.
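The arithmetic behind the 3.5-day figure is worth seeing once, since it lets you redo the estimate for other password lengths. This sketch uses the 22 billion hashes per second quoted above and a helper name invented for the example.

```javascript
// Days needed to try every password of a given length at a given hash rate.
function daysToExhaust(alphabetSize, length, hashesPerSecond) {
  const combinations = alphabetSize ** length;
  return combinations / hashesPerSecond / 86_400; // 86,400 seconds per day
}

const sha256PerSecond = 22e9; // RTX 4090 figure quoted above

console.log(daysToExhaust(95, 8, sha256PerSecond).toFixed(2)); // "3.49"
console.log(daysToExhaust(95, 10, sha256PerSecond) / 365.25); // about 86 years
```

Each extra character multiplies the work by 95, which is why length helps, but a slow hash multiplies it by far more.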

Mistake 2: Comparing hashes with == in a security context. String comparison short-circuits on the first different character, leaking timing information. An attacker can determine the correct HMAC one character at a time by measuring response times. Use constant-time comparison: crypto.timingSafeEqual() in Node.js, hmac.compare_digest() in Python, subtle.ConstantTimeCompare() in Go.

Mistake 3: Truncating hashes for "shorter IDs." If you take the first 8 characters of a SHA-256 hash (32 bits), your collision probability jumps to 50% at only 77,000 items (birthday paradox). I've seen this in URL shorteners and cache keys. If you need shorter identifiers, use a purpose-built short ID generator (nanoid, hashids) rather than truncating a cryptographic hash.
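The 77,000 figure comes from inverting the birthday approximation, and the same formula shows how quickly the safe range grows with each extra character. A sketch follows; the function name is made up for the example.

```javascript
// Items needed before the collision probability reaches `p` for a `bits`-bit ID.
function itemsForCollisionProbability(bits, p = 0.5) {
  return Math.sqrt(2 * 2 ** bits * Math.log(1 / (1 - p)));
}

for (const hexChars of [8, 12, 16, 32]) {
  const bits = hexChars * 4; // each hex character carries 4 bits
  const items = Math.round(itemsForCollisionProbability(bits));
  console.log(`${hexChars} hex chars: ~${items.toExponential(2)} items for a 50% collision chance`);
}
```

Eight hex characters give out at about 77,000 items, while sixteen hold up into the billions, so if you do truncate, choose the length from the item count you expect rather than from what looks tidy.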

Mistake 4: Assuming hash = encryption. Hashing is one-way — you cannot recover the input from the output. Encryption is two-way — you can decrypt with the key. If you need to store data that you'll need to read later (API keys, credit card numbers), use encryption (AES-256-GCM). If you need to verify data without storing it (passwords, integrity checks), use hashing. These are fundamentally different operations.
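The difference is clearest side by side: encryption round-trips, hashing doesn't. The sketch below uses Node's crypto module with AES-256-GCM; the helper names and the sample value are made up, and key storage (the hard part in practice) is out of scope.

```javascript
const crypto = require("node:crypto");

// Two-way: AES-256-GCM encryption can be reversed with the key.
function encrypt(key, plaintext) {
  const iv = crypto.randomBytes(12); // fresh 96-bit nonce per message
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

function decrypt(key, { iv, ciphertext, tag }) {
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // decryption fails if the data was tampered with
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

const key = crypto.randomBytes(32); // 256-bit key
const sealed = encrypt(key, "example-api-key-123");
console.log(decrypt(key, sealed)); // example-api-key-123

// One-way: a digest can be compared against, but there is nothing to decrypt.
console.log(crypto.createHash("sha256").update("example-api-key-123").digest("hex"));
```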