Understanding Character Encodings: ASCII, Unicode, and UTF-8
Have you ever opened a text file and seen garbled symbols like Ã© where an accented letter should be?
This is called Mojibake (garbled text), and it happens when the computer is confused about which "Character Encoding" to use.
But what exactly is character encoding? Let's dive into the fundamentals of how computers handle text.
1. Computers Only Know Numbers
At the lowest level, computers only understand binary: 0s and 1s.
To store text, we need a map that assigns a number to each character.
- A → 65
- B → 66
- ! → 33
This mapping system is called a Character Set.
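In Python, this character-to-number mapping is exposed directly through the built-in `ord()` and `chr()` functions, which makes the idea easy to see:

```python
# A character set maps each character to a number, and back.
print(ord("A"))   # 65
print(ord("!"))   # 33
print(chr(66))    # B
```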
2. ASCII: The American Standard (1963)
The first widely used standard was ASCII (American Standard Code for Information Interchange).
It used 7 bits to represent 128 characters:
- English letters (A-Z, a-z)
- Numbers (0-9)
- Basic punctuation
- Control codes (newline, tab)
This was great for English speakers. But what about the rest of the world?
ASCII had no way to represent accents (é, ñ), let alone Chinese or Emoji.
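You can see this limitation for yourself: Python will encode plain English to ASCII happily, but raises an error the moment a character falls outside the 128-character range.

```python
# ASCII covers only 128 characters, so encoding
# a non-English character fails outright.
print("Hi!".encode("ascii"))  # b'Hi!'

try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print("Cannot encode:", err)
```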
3. The Chaos of "Extended ASCII"
To support other languages, different regions created their own 8-bit extensions of ASCII.
- ISO-8859-1 (Latin-1): Added Western European accents.
- Windows-1252: Similar to Latin-1, but repurposes the 0x80-0x9F control range for printable characters like curly quotes and the € sign.
- Shift_JIS: For Japanese.
- EUC-KR: For Korean.
The Problem:
If you opened a file saved in Shift_JIS using ISO-8859-1, it would look like garbage. There was no single standard that covered everyone.
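This mismatch is easy to simulate. A sketch in Python: encode Japanese text as Shift_JIS, then decode the same bytes as Latin-1, exactly as a misconfigured 1990s text editor would have:

```python
# Simulate mojibake: bytes written as Shift_JIS,
# then read back as ISO-8859-1 (Latin-1).
original = "こんにちは"                 # "Hello" in Japanese
raw = original.encode("shift_jis")     # Japanese-specific byte layout
garbled = raw.decode("latin-1")        # wrong decoder -> garbage characters
print(garbled)
```

Latin-1 assigns a character to every possible byte value, so the decode never fails; it just silently produces the wrong text.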
4. Unicode: One Ring to Rule Them All
Unicode was created to solve this chaos. Its goal: Assign a unique number (Code Point) to every character in every language.
- A → U+0041
- é → U+00E9
- 한 → U+D55C
- 💩 → U+1F4A9
Unicode currently defines over 149,000 characters.
However, Unicode is just a list of numbers. It doesn't tell us how to store those numbers in bytes. That's where Encodings come in.
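Python strings are sequences of Unicode code points, so `ord()` gives you exactly the numbers in the table above:

```python
# Unicode assigns each character a code point -- just a number.
print(hex(ord("A")))    # 0x41
print(hex(ord("é")))    # 0xe9
print(hex(ord("한")))   # 0xd55c
print(hex(ord("💩")))   # 0x1f4a9
# Note: a code point alone says nothing about the bytes on disk.
```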
5. UTF-8: The Genius Solution
You might think, "Why not just use 32 bits (4 bytes) for every character?" (This is what UTF-32 does).
But that would quadruple the file size for English text, which only needs 8 bits.
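The size difference is easy to measure. Here `utf-32-le` is used rather than plain `utf-32` so Python doesn't prepend a byte-order mark to the count:

```python
text = "Hello"
# UTF-32: fixed 4 bytes per character
print(len(text.encode("utf-32-le")))  # 20
# UTF-8: 1 byte per ASCII character
print(len(text.encode("utf-8")))      # 5
```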
Enter UTF-8 (Unicode Transformation Format - 8-bit).
It is a Variable Width Encoding:
- Standard ASCII characters use 1 byte. (Backward compatible with ASCII!)
- European scripts use 2 bytes.
- Asian scripts (CJK) use 3 bytes.
- Emojis use 4 bytes.
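The tiers above can be verified directly by encoding one character from each category and counting the bytes:

```python
# One character from each UTF-8 width tier.
for ch in ["A", "é", "한", "💩"]:
    encoded = ch.encode("utf-8")
    print(ch, "->", len(encoded), "byte(s)")
# A -> 1, é -> 2, 한 -> 3, 💩 -> 4
```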
How it works
- If a byte starts with 0, it's a single-byte ASCII character.
- If it starts with 110, it begins a 2-byte sequence.
- If it starts with 1110, it begins a 3-byte sequence.
- If it starts with 11110, it begins a 4-byte sequence.
- Every continuation byte starts with 10, so a decoder that lands mid-stream can always find the start of the next character.
This brilliant design makes UTF-8 efficient for English (same size as ASCII) while capable of supporting every character in the world.
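You can watch the length prefixes at work by printing each UTF-8 byte in binary:

```python
# Print each UTF-8 byte in binary to expose the prefix bits.
for ch in ["A", "é", "한"]:
    bits = [format(b, "08b") for b in ch.encode("utf-8")]
    print(ch, bits)
# A   ['01000001']                            leading 0   -> 1-byte char
# é   ['11000011', '10101001']                leading 110 -> 2-byte char
# 한  ['11101101', '10010101', '10011100']    leading 1110 -> 3-byte char
```

Note that every byte after the first starts with 10, marking it as a continuation byte.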
Conclusion
Today, UTF-8 is the dominant standard, used by over 98% of websites.
When you save a file, always choose UTF-8 (without BOM). The BOM (Byte Order Mark) is rarely needed today and can cause issues in some software. Choosing "UTF-8" ensures your text will be readable by anyone, anywhere, on any device.
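If you're curious what the BOM actually is, it's just three bytes at the start of the file. Python's `utf-8-sig` codec writes (and strips) it, while plain `utf-8` does not:

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
print(codecs.BOM_UTF8)            # b'\xef\xbb\xbf'
print("Hi".encode("utf-8-sig"))   # b'\xef\xbb\xbfHi'  (BOM prepended)
print("Hi".encode("utf-8"))       # b'Hi'              (no BOM)
```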
Pro Tip: If you see Ã© instead of é, it usually means a UTF-8 file is being misread as Windows-1252. If you see � (the replacement character), it often means a binary file or incompatible encoding is being read as text.
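This classic symptom is reproducible in two lines, and the reverse round-trip often repairs the damage when it hasn't been compounded:

```python
# Reproduce the classic symptom: UTF-8 bytes decoded as Windows-1252.
correct = "é"
mojibake = correct.encode("utf-8").decode("windows-1252")
print(mojibake)   # Ã©

# The reverse round-trip recovers the original text.
repaired = mojibake.encode("windows-1252").decode("utf-8")
print(repaired)   # é
```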