Character Encoding Systems
In the digital world, every letter, number, and symbol you see on your screen—from the "A" in "Apple" to the "@" in an email address—is ultimately represented by a series of bits, the fundamental 0s and 1s that computers understand. Character encoding is the process of converting characters (letters, numbers, symbols) into a format that computers can understand and store.
- Encoding ensures computers can store, process, and transmit text data accurately.
- Decoding reverses the process, turning binary codes back into readable text for display or use.
- The process of converting human-readable characters into these binary codes and back again is called character encoding.
A system that maps characters to numbers so computers can store and display text.
Why We Need a Character Encoding System
Every letter, number, or symbol has a unique code number. When you type 'A', the computer looks it up in its codebook, finds the number for 'A', and then converts that number into 0s and 1s. When it needs to show you 'A' on the screen, it does the reverse.
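As a minimal sketch of that lookup in both directions, using Python's built-in ord() and chr() (the snippet is an illustration, not tied to any particular encoding standard):

```python
# Illustrative round trip: character -> code number -> bits -> character
ch = 'A'
code = ord(ch)              # look up the code number for 'A': 65
bits = format(code, '08b')  # write that number as bits: '01000001'
back = chr(code)            # reverse the lookup: 65 -> 'A'
print(code, bits, back)     # 65 01000001 A
```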
Text Encoding Types
Text encoding is how computers understand our words. It turns letters, numbers, and symbols into a code that computers can read, usually as binary 1s and 0s. Over time, different encoding systems have been created to handle all the different languages and symbols we use, such as:
ASCII
Also known as the American Standard Code for Information Interchange, it is arguably the most fundamental and widely recognized character encoding. Developed in the 1960s for teletypes, it laid the groundwork for how computers worldwide communicate text. The idea is simple: assign a number to each character, for example 'A' is assigned 65, and so on.

ASCII is a 7-bit encoding, meaning it can represent 2⁷ = 128 different characters, grouped as follows:
1. Non-printable control codes, between 0 and 31.
- 8 → backspace
- 10 → new line
- 13 → carriage return
2. Lower ASCII, between 32 and 127.
- 32 → space
- 40 → (
- 65 → A
- 70 → F
- 97 → a
- 102 → f
3. Higher ("extended") ASCII, between 128 and 255; these codes are not part of the 7-bit standard and come from later 8-bit extensions.
See the complete ASCII table for the full list of codes.
Point to Know:
The 7-bit nature of ASCII limits it to English characters and a basic set of symbols. It cannot represent characters from other languages (like accented letters, Cyrillic, Arabic, Chinese, Japanese, Korean, etc.) or specialized symbols. This limitation led to the development of "extended ASCII" variants (using the 8th bit for an additional 128 characters), but these were inconsistent and caused "mojibake" (garbled text) when files were opened on systems using a different extended ASCII variant.
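A small sketch of this limitation, using Python's standard 'ascii', 'latin-1', and 'cp1251' codecs (the sample strings are hypothetical):

```python
# 7-bit ASCII handles plain English text, but not accented letters
print('Apple'.encode('ascii'))      # b'Apple'
try:
    'café'.encode('ascii')          # 'é' has no 7-bit ASCII code
except UnicodeEncodeError as err:
    print('cannot encode:', err)

# Mojibake sketch: bytes written in one "extended ASCII" variant (Latin-1)
# and read back with another (Windows-1251) come out garbled
garbled = 'café'.encode('latin-1').decode('cp1251')
print(garbled)                      # cafй
```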
Unicode: The Universal Character Set
- Unicode is a universal character standard that can encode any language or script, irrespective of device, operating system, or software.
- Unicode data can be exchanged across a variety of systems without causing data corruption.
- The Unicode Standard is completely compatible and synchronized with ISO/IEC 10646.
While Unicode defines the code points, it doesn't dictate how these code points are stored as sequences of bytes in computer memory or files. This is where Unicode Transformation Formats (UTFs) come into play.
A Unicode Transformation Format (UTF) is a standardized method for encoding Unicode characters into a sequence of bytes for storage or transmission. Unicode is a universal character set that aims to represent all written languages, symbols, and emoji, while a UTF defines how these characters are stored in binary form.
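As a rough illustration of the difference between a code point and its stored bytes, here is one character encoded with three common UTFs using Python's standard codecs (the euro sign '€' is just an arbitrary example character):

```python
ch = '€'                                # one code point: U+20AC
print(hex(ord(ch)))                     # 0x20ac
print(ch.encode('utf-8').hex(' '))      # e2 82 ac    (3 bytes)
print(ch.encode('utf-16-be').hex(' '))  # 20 ac       (2 bytes)
print(ch.encode('utf-32-be').hex(' '))  # 00 00 20 ac (4 bytes)
```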
Common UTF Encodings
UTF-16 Character Encoding
UTF-16 is a Unicode Transformation Format that uses 16-bit code units.
For example, if you want to store "BUS", then the code points are:
Character: 'B' (Code Point: U+0042)
Character: 'U' (Code Point: U+0055)
Character: 'S' (Code Point: U+0053)
In memory (big-endian byte order) it is stored as:
00 42 00 55 00 53
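A quick way to check this layout is Python's 'utf-16-be' codec (big-endian byte order, chosen here only so the bytes match the example above; this is an illustrative sketch):

```python
data = 'BUS'.encode('utf-16-be')   # big-endian UTF-16, two bytes per character
print(data.hex(' '))               # 00 42 00 55 00 53
```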
Points to Know
UTF-16 is commonly used internally by operating systems (like Microsoft Windows) and programming environments (like Java, JavaScript, and some Python versions) for text manipulation and string storage.
UTF-8:
UTF-8 is an encoding system that, like UTF-16, follows the Unicode Standard. Using one to four one-byte (8-bit) code units, it can encode all 1,112,064 valid Unicode code points.
Code points with lower numerical values are encoded with fewer bytes since they occur more frequently.
UTF-8 was designed to be ASCII compatible: the first 128 Unicode characters, which correspond one-to-one with ASCII, are encoded as a single byte with the same binary value as in ASCII, so any ASCII text is also valid UTF-8-encoded Unicode.
Example:
ASCII Character (1 byte)
Character: 'A'
- Unicode code point: U+0041 (hex 41)
- UTF-8 binary: 01000001 (same as ASCII).
- UTF-8 hex: 41
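A short sketch of this variable-width behaviour, using Python's standard UTF-8 codec (the sample characters 'é', '€', and '😀' are arbitrary illustrations):

```python
for ch in ('A', 'é', '€', '😀'):
    encoded = ch.encode('utf-8')
    print(ch, hex(ord(ch)), encoded.hex(' '), len(encoded), 'byte(s)')
# A 0x41 41 1 byte(s)
# é 0xe9 c3 a9 2 byte(s)
# € 0x20ac e2 82 ac 3 byte(s)
# 😀 0x1f600 f0 9f 98 80 4 byte(s)
```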
UTF-32:
UTF-32 is known as the 32-bit Unicode Transformation Format. It is a fixed-length encoding that uses 32 bits (4 bytes) per code point, so the number of code points in a UTF-32 string can be found simply by dividing the byte count by four.
The main advantage of UTF-32 is that Unicode code points can be directly indexed (although user-perceived characters, such as grapheme clusters and some emoji, still span multiple code points, so determining the displayed width of a string remains more complex).
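As a sketch of the fixed width and the simple code-point count, using Python's 'utf-32-be' codec (big-endian chosen only so the bytes read left to right; the sample string is arbitrary):

```python
text = 'A€😀'
data = text.encode('utf-32-be')    # exactly 4 bytes per code point
print(data.hex(' '))               # 00 00 00 41 00 00 20 ac 00 01 f6 00
print(len(data) // 4)              # 3 code points
```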