A Deep Dive into Character Encoding for Beginners
· What is Character Encoding?
∘ Real-World Examples of Character Encoding
∘ How Encoding Works: A Step-by-Step Example
∘ For a more complex example, let’s look at “Café”:
∘ Special Characters: Spaces and New Lines
∘ Different systems handle these characters differently:
Character encoding is a fundamental concept in computing that allows computers to understand and display text.
At its core, it’s like a secret code that transforms letters, numbers, and symbols into a language that computers can comprehend — numbers.
What is Character Encoding?
Imagine you have a secret code to talk with your friends, where each letter is replaced by a special symbol or number. For example:
A = 1
B = 2
C = 3
- This is essentially what character encoding does for computers.
- Since computers can only understand numbers, character encoding acts as a special dictionary that tells the computer how to turn letters and symbols into numbers it can process, and vice versa.
- When you type a letter on your keyboard, the computer uses this “dictionary” to know which number represents that letter. And when it needs to show you text on the screen, it uses the same dictionary to turn the numbers back into letters you can read.
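As a minimal sketch of this "dictionary" idea, Python's built-in `ord` and `chr` functions perform the two lookups: character to number, and number back to character. The helper names `to_numbers` and `to_text` are hypothetical, chosen only for illustration.

```python
def to_numbers(text):
    """Look up the number (Unicode code point) for each character."""
    return [ord(ch) for ch in text]


def to_text(numbers):
    """Turn a list of numbers back into readable text."""
    return "".join(chr(n) for n in numbers)


numbers = to_numbers("ABC")
print(numbers)           # [65, 66, 67]
print(to_text(numbers))  # ABC
```

Real encodings use larger numbers than the toy A = 1 code above, but the round trip works the same way.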
Real-World Examples of Character Encoding
— ASCII (American Standard Code for Information Interchange): One of the oldest and simplest encodings. In ASCII:
- The letter ‘A’ is represented by the number 65
- The letter ‘B’ is 66
- The digit ‘0’ is 48
- The space character is 32
— UTF-8 (Unicode Transformation Format — 8-bit): A modern and flexible encoding that can represent almost every character from all writing systems in the world. In UTF-8:
- Basic Latin letters are represented the same as in ASCII
- The Chinese character ‘中’ is represented by three numbers: 228, 184, 173
- The emoji ‘😊’ is represented by four numbers: 240, 159, 152, 138
— UTF-16 (Unicode Transformation Format — 16-bit): Another way to encode characters, using bigger “chunks”. In UTF-16:
- The letter ‘A’ is represented by 65 (the same number as in ASCII, but stored in two bytes instead of one)
- The Chinese character ‘中’ is represented by a single number: 20013
- The emoji ‘😊’ is represented by two numbers: 55357 and 56842
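The numbers listed above can be checked with Python's built-in `str.encode`. This is a small sketch rather than a full tool; the helper names `utf8_numbers` and `utf16_numbers` are made up for this example, and the big-endian codec `utf-16-be` is used so that no byte order mark is added to the output.

```python
def utf8_numbers(ch):
    """Return the UTF-8 bytes of a character as a list of numbers."""
    return list(ch.encode("utf-8"))


def utf16_numbers(ch):
    """Return the UTF-16 code units (16-bit numbers) of a character."""
    data = ch.encode("utf-16-be")
    # Every UTF-16 code unit is two bytes wide, so read the bytes in pairs.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


for ch in ["A", "中", "😊"]:
    print(f"U+{ord(ch):04X}", utf8_numbers(ch), utf16_numbers(ch))
# U+0041 [65] [65]
# U+4E2D [228, 184, 173] [20013]
# U+1F60A [240, 159, 152, 138] [55357, 56842]
```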
How Encoding Works: A Step-by-Step Example
Let’s take the word “Hello” and see how it’s encoded using UTF-8:
1. Start with plain text: Hello
2. Convert each character to its Unicode code point:
H = U+0048
e = U+0065
l = U+006C
l = U+006C
o = U+006F
3. Encode these code points using UTF-8 rules:
H (U+0048) = 01001000 (binary) = 72 (decimal)
e (U+0065) = 01100101 (binary) = 101 (decimal)
l (U+006C) = 01101100 (binary) = 108 (decimal)
l (U+006C) = 01101100 (binary) = 108 (decimal)
o (U+006F) = 01101111 (binary) = 111 (decimal)
4. In UTF-8, “Hello” is represented as this sequence of bytes:
72 101 108 108 111
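The same steps can be reproduced in a few lines of Python, as a rough sketch using only built-in functions. The variable names are arbitrary. The decimal value of H is 72; 48 is the same value written in hexadecimal.

```python
text = "Hello"

# Step 2: the code point of each character, written in U+XXXX style.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Step 3: the UTF-8 bytes, shown in binary and in decimal.
encoded = text.encode("utf-8")
binary = [f"{b:08b}" for b in encoded]
decimal = list(encoded)

print(code_points)  # ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']
print(binary)       # ['01001000', '01100101', '01101100', '01101100', '01101111']
print(decimal)      # [72, 101, 108, 108, 111]
```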
For a more complex example, let’s look at “Café”:
1. Unicode code points:
C = U+0043
a = U+0061
f = U+0066
é = U+00E9
2. UTF-8 encoded:
C (U+0043) = 01000011 = 67
a (U+0061) = 01100001 = 97
f (U+0066) = 01100110 = 102
é (U+00E9) = 11000011 10101001 = 195 169
3. “Café” in UTF-8:
67 97 102 195 169
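Notice how the “é” character requires two bytes in UTF-8, while the others only need one byte each. This is because its code point, U+00E9, lies above U+007F, the end of the ASCII range that fits in a single byte.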
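To see where the two bytes for “é” come from, here is a hand-rolled sketch of the two-byte UTF-8 pattern (110xxxxx 10xxxxxx). The function `encode_two_byte` is a hypothetical helper written for this article; in real code you would simply call `str.encode("utf-8")`, which the last lines use as a cross-check.

```python
def encode_two_byte(code_point):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles two-byte code points")
    first = 0b11000000 | (code_point >> 6)         # 110xxxxx: the top 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: the low 6 bits
    return bytes([first, second])


manual = encode_two_byte(0x00E9)                # U+00E9 is "é"
print(list(manual))                             # [195, 169]
print(manual == "\u00e9".encode("utf-8"))       # True
print(list("Caf\u00e9".encode("utf-8")))        # [67, 97, 102, 195, 169]
```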
Special Characters: Spaces and New Lines
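Spaces and new lines are characters too, each with its own code point. The code points are the same in every Unicode encoding; what differs is how many bytes each one takes:
1. ASCII and UTF-8: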
— Space: ASCII 32, Binary 00100000, Hex 0x20
— New line:
- Carriage Return (CR): ASCII 13, Binary 00001101, Hex 0x0D
- Line Feed (LF): ASCII 10, Binary 00001010, Hex 0x0A
2. UTF-16 (the same code points, each stored as a two-byte unit):
— Space: U+0020
— New line: Same as ASCII/UTF-8 (LF: U+000A, CR: U+000D)
3. Unicode: Defines additional space and line break characters like:
— Non-breaking space: U+00A0
— Em space: U+2003
— Line Separator: U+2028
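The following sketch prints the bytes that these whitespace characters occupy in UTF-8 and UTF-16. The dictionary `characters` and the helper `hex_bytes` are illustrative names, and `utf-16-be` (big-endian, no byte order mark) is an arbitrary choice made to keep the output short.

```python
characters = {
    "space": "\u0020",
    "line feed": "\n",
    "carriage return": "\r",
    "non-breaking space": "\u00a0",
    "em space": "\u2003",
    "line separator": "\u2028",
}


def hex_bytes(ch, encoding):
    """Return the encoded bytes of a character as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode(encoding))


for name, ch in characters.items():
    print(f"{name}: U+{ord(ch):04X}",
          "| UTF-8:", hex_bytes(ch, "utf-8"),
          "| UTF-16:", hex_bytes(ch, "utf-16-be"))
```

The plain space takes one byte in UTF-8 (`20`) and two in UTF-16 (`00 20`), while the line separator U+2028 takes three bytes in UTF-8 (`e2 80 a8`) and two in UTF-16 (`20 28`).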
Different systems handle these characters differently:
- Unix/Linux/macOS typically use LF for new lines
- Windows typically uses CR+LF
- Programming languages often use escape sequences (\n, \r)
- HTML collapses multiple spaces unless they are inside a <pre> tag or written as non-breaking spaces (&nbsp;)
- Databases may have their own conventions
- Network protocols often use CR+LF to denote end of line
Understanding these differences is crucial when working across different platforms and systems to avoid formatting issues or data corruption.
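One common precaution when moving text between platforms is to normalize line endings before processing it. The function below is a minimal sketch of that idea; `normalize_newlines` is a hypothetical helper, not a standard library function.

```python
def normalize_newlines(text):
    """Convert Windows (CR+LF) and lone CR line endings to LF."""
    # Replace CR+LF first so the pair does not turn into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


windows_text = "first line\r\nsecond line\r\n"
print(repr(normalize_newlines(windows_text)))  # 'first line\nsecond line\n'

# The built-in str.splitlines() already understands all three conventions.
print("one\r\ntwo\nthree\rfour".splitlines())  # ['one', 'two', 'three', 'four']
```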
In conclusion, character encoding is what lets our digital devices store and exchange text in every human language. From the simplest space to the most complex emoji, it is all just numbers to a computer, and the encoding is the agreed rule that turns those numbers back into the text we read every day.