Character Encoding Systems
In the digital world, every letter, number, and symbol you see on your screen—from the "A" in "Apple" to the "@" in an email address—is ultimately represented by a series of bits, the fundamental 0s and 1s that computers understand. Character encoding is the process of converting characters (letters, numbers, symbols) into a format that computers can understand and store.
- Encoding ensures computers can store, process, and transmit text data accurately.
- Decoding reverses the process, turning binary codes back into readable text for display or use.
- The process of converting human-readable characters into these binary codes and back again is called character encoding.
A system that maps characters to numbers so computers can store and display text.
Why We Need a Character Encoding System
Every letter, number, or symbol has a unique code number. When you type 'A', the computer looks it up in its codebook, finds the number for 'A', and then converts that number into 0s and 1s. When it needs to show you 'A' on the screen, it does the reverse.
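As a minimal sketch of that lookup in both directions, using Python's built-in ord() and chr() (the snippet is an illustration, not tied to any particular encoding standard):

```python
# Illustrative round trip: character -> code number -> bits -> character
ch = 'A'
code = ord(ch)              # look up the code number for 'A': 65
bits = format(code, '08b')  # write that number as bits: '01000001'
back = chr(code)            # reverse the lookup: 65 -> 'A'
print(code, bits, back)     # 65 01000001 A
```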
Text Encoding Types
Text encoding is how computers understand our words. It turns letters, numbers, and symbols into a code that computers can read, usually as binary 1s and 0s. Over time, different encoding systems have been created to handle all the different languages and symbols we use, such as:
ASCII
Also known as the American Standard Code for Information Interchange, it is arguably the most fundamental and widely recognized character encoding. Developed in the 1960s for teletypes, it laid the groundwork for how computers worldwide communicate text. The idea is simple: assign a number to each character, for example 'A' is assigned 65, and so on.

ASCII is a 7-bit encoding, meaning it can represent 2⁷ = 128 different characters, grouped as follows:
1. Non-printable control codes, between 0 and 31.
- 8 → backspace
- 10 → new line
- 13 → carriage return
2. Lower ASCII, between 32 and 127.
- 32 → space
- 40 → (
- 65 → A
- 70 → F
- 97 → a
- 102 → f
3. Higher ("extended") ASCII, between 128 and 255; these codes are not part of the 7-bit standard and come from later 8-bit extensions.
See the complete ASCII table for the full list of codes.
Point to Know:
The 7-bit nature of ASCII limits it to English characters and a basic set of symbols. It cannot represent characters from other languages (like accented letters, Cyrillic, Arabic, Chinese, Japanese, Korean, etc.) or specialized symbols. This limitation led to the development of "extended ASCII" variants (using the 8th bit for an additional 128 characters), but these were inconsistent and caused "mojibake" (garbled text) when files were opened on systems using a different extended ASCII variant.
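A small sketch of this limitation, using Python's standard 'ascii', 'latin-1', and 'cp1251' codecs (the sample strings are hypothetical):

```python
# 7-bit ASCII handles plain English text, but not accented letters
print('Apple'.encode('ascii'))      # b'Apple'
try:
    'café'.encode('ascii')          # 'é' has no 7-bit ASCII code
except UnicodeEncodeError as err:
    print('cannot encode:', err)

# Mojibake sketch: bytes written in one "extended ASCII" variant (Latin-1)
# and read back with another (Windows-1251) come out garbled
garbled = 'café'.encode('latin-1').decode('cp1251')
print(garbled)                      # cafй
```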
Unicode: The Universal Character Set
- Unicode is a universal character standard that can encode any language or script, irrespective of device, operating system, or software.
- Unicode data can be exchanged across a variety of systems without causing data corruption.
- The Unicode Standard is completely compatible and synchronized with ISO/IEC 10646.
While Unicode defines the code points, it doesn't dictate how these code points are stored as sequences of bytes in computer memory or files. This is where Unicode Transformation Formats (UTFs) come into play.
A Unicode Transformation Format (UTF) is a standardized method for encoding Unicode characters into a sequence of bytes for storage or transmission. Unicode is a universal character set that aims to represent all written languages, symbols, and emoji, while a UTF defines how these characters are stored in binary form.
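As a rough illustration of the difference between a code point and its stored bytes, here is one character encoded with three common UTFs using Python's standard codecs (the euro sign '€' is just an arbitrary example character):

```python
ch = '€'                                # one code point: U+20AC
print(hex(ord(ch)))                     # 0x20ac
print(ch.encode('utf-8').hex(' '))      # e2 82 ac    (3 bytes)
print(ch.encode('utf-16-be').hex(' '))  # 20 ac       (2 bytes)
print(ch.encode('utf-32-be').hex(' '))  # 00 00 20 ac (4 bytes)
```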
Common UTF Encodings
UTF-16 Character Encoding
UTF-16 is a Unicode Transformation Format that uses 16-bit code units.
For example, if you want to store "BUS", then the code points are:
Character: 'B' (Code Point: U+0042)
Character: 'U' (Code Point: U+0055)
Character: 'S' (Code Point: U+0053)
In memory (big-endian byte order) it is stored as:
00 42 00 55 00 53
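A quick way to check this layout is Python's 'utf-16-be' codec (big-endian byte order, chosen here only so the bytes match the example above; this is an illustrative sketch):

```python
data = 'BUS'.encode('utf-16-be')   # big-endian UTF-16, two bytes per character
print(data.hex(' '))               # 00 42 00 55 00 53
```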
Points to Know
UTF-16 is commonly used internally by operating systems (like Microsoft Windows) and programming environments (like Java, JavaScript, and some Python versions) for text manipulation and string storage.
UTF-8:
UTF-8 is an encoding system that, like UTF-16, follows the Unicode Standard. Using one to four one-byte (8-bit) code units, it can encode all 1,112,064 valid Unicode code points.
Code points with lower numerical values are encoded with fewer bytes since they occur more frequently.
UTF-8 was designed to be ASCII compatible: the first 128 Unicode characters, which correspond one-to-one with ASCII, are encoded as a single byte with the same binary value as in ASCII, so any ASCII text is also valid UTF-8-encoded Unicode.
Example:
ASCII Character (1 byte)
Character: 'A'
- Unicode code point: U+0041 (hex 41)
- UTF-8 binary: 01000001 (same as ASCII).
- UTF-8 hex: 41
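A short sketch of this variable-width behaviour, using Python's standard UTF-8 codec (the sample characters 'é', '€', and '😀' are arbitrary illustrations):

```python
for ch in ('A', 'é', '€', '😀'):
    encoded = ch.encode('utf-8')
    print(ch, hex(ord(ch)), encoded.hex(' '), len(encoded), 'byte(s)')
# A 0x41 41 1 byte(s)
# é 0xe9 c3 a9 2 byte(s)
# € 0x20ac e2 82 ac 3 byte(s)
# 😀 0x1f600 f0 9f 98 80 4 byte(s)
```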
UTF-32:
UTF-32 is known as the 32-bit Unicode Transformation Format. It is a fixed-length encoding that uses 32 bits (4 bytes) per code point, so the number of code points in a UTF-32 string can be found simply by dividing the byte count by four.
The main advantage of UTF-32 is that Unicode code points can be directly indexed (although user-perceived characters, such as grapheme clusters and some emoji, still span multiple code points, so determining the displayed width of a string remains more complex).
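As a sketch of the fixed width and the simple code-point count, using Python's 'utf-32-be' codec (big-endian chosen only so the bytes read left to right; the sample string is arbitrary):

```python
text = 'A€😀'
data = text.encode('utf-32-be')    # exactly 4 bytes per code point
print(data.hex(' '))               # 00 00 00 41 00 00 20 ac 00 01 f6 00
print(len(data) // 4)              # 3 code points
```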