What is a Character Encoding System?
Last Updated: 19 Oct, 2021
Computers work only with binary digits, 0 and 1; they do not directly understand letters, decimal numbers, or text symbols. Encoding bridges this gap: it is the process of converting a series of characters (letters, digits, punctuation, and symbols) into a defined numeric format for storage or transmission. Text is represented in computers using encoding schemes and standards such as ASCII, ISCII, Unicode, UTF-8, and UTF-32. Other kinds of data, including numbers, photos, audio, and video, are likewise stored as binary values. For example, the letter A is represented by the number 65, because a standard encoding scheme assigns a unique code to every character, digit, and symbol. Some commonly used encoding schemes are described below:
1. ASCII: ASCII stands for American Standard Code for Information Interchange. The X3 committee of the American Standards Association (ASA) first published it in 1963 as ASA X3.4-1963, and it was revised ten times between 1967 and 1986. ASCII is a 7-bit code that defines 128 characters: letters, digits, punctuation, and control codes. Each character is usually stored in an 8-bit byte, and the 8-bit extensions that fill all 256 slots are commonly called extended ASCII. Every ASCII code is a number that the computer stores in binary. For example, the lowercase character "h" has the decimal value 104, which is "01101000" in binary.
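The "h" example can be checked with a minimal Python sketch. It relies only on the built-in `ord`, `chr`, and `format` functions; the variable names are illustrative.

```python
# Look up the numeric code of a character and convert it back again.
code = ord("h")              # integer code of the character
bits = format(code, "08b")   # the same number written as 8 binary digits
char = chr(code)             # back from the number to the character

print(code, bits, char)      # 104 01101000 h
```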
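The ASCII table is commonly divided into three sections.
- Non-printable control codes, between 0 and 31 (plus 127, the DEL character).
- Lower ASCII (the printable characters), between 32 and 126.
- Higher ASCII (extended ASCII), between 128 and 255. This range is not part of the 7-bit standard, and its contents depend on the code page in use.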
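As a rough sketch, the sections can be expressed as a small classifier. The function name `ascii_section` and the section labels are hypothetical, and the sketch assumes that code 127 (DEL) counts as a control code, as standard ASCII defines it.

```python
def ascii_section(code: int) -> str:
    """Return the section of the 0-255 table that a code value falls into.

    Illustrative helper: 127 (DEL) is treated as a control code.
    """
    if not 0 <= code <= 255:
        raise ValueError("code must be in the range 0-255")
    if code < 32 or code == 127:
        return "control"      # non-printable system codes
    if code < 127:
        return "printable"    # lower ASCII
    return "extended"         # higher (extended) ASCII


for sample in (10, 65, 127, 200):
    print(sample, ascii_section(sample))
```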
ASCII Table for characters:
Letter | ASCII Code | Letter | ASCII Code |
---|---|---|---|
a | 97 | A | 65 |
b | 98 | B | 66 |
c | 99 | C | 67 |
d | 100 | D | 68 |
e | 101 | E | 69 |
f | 102 | F | 70 |
g | 103 | G | 71 |
h | 104 | H | 72 |
i | 105 | I | 73 |
j | 106 | J | 74 |
k | 107 | K | 75 |
l | 108 | L | 76 |
m | 109 | M | 77 |
n | 110 | N | 78 |
o | 111 | O | 79 |
p | 112 | P | 80 |
q | 113 | Q | 81 |
r | 114 | R | 82 |
s | 115 | S | 83 |
t | 116 | T | 84 |
u | 117 | U | 85 |
v | 118 | V | 86 |
w | 119 | W | 87 |
x | 120 | X | 88 |
y | 121 | Y | 89 |
z | 122 | Z | 90 |
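One pattern visible in the table is that each lowercase letter sits exactly 32 positions above its uppercase form. The sketch below checks this for the whole alphabet and uses it to change case; `to_upper_ascii` is a hypothetical helper written for illustration, not a replacement for `str.upper`.

```python
import string

# Collect the distance between every lowercase/uppercase pair of letters.
offsets = {
    ord(lower) - ord(upper)
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
}
print(offsets)  # {32}


def to_upper_ascii(ch: str) -> str:
    """Uppercase a single ASCII letter by subtracting the fixed offset."""
    return chr(ord(ch) - 32) if "a" <= ch <= "z" else ch


print(to_upper_ascii("h"))  # H
```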
2. ISCII: ISCII stands for Indian Script Code for Information Interchange. It is an encoding scheme for the scripts of a wide range of Indian languages. To ease transliteration between these writing systems, ISCII uses a single, shared encoding layout for all of them.
ISCII was standardized in 1991 by the Bureau of Indian Standards (BIS). It is an 8-bit encoding with 256 code values. The first 128 values (0-127) are the same as in ASCII; the values from 128 to 255 represent characters from Indian scripts.
Advantages include:
- It covers the vast majority of Indian languages.
- The character set is simple and straightforward.
- It is possible to easily transliterate between languages.
Disadvantages include:
- A special keyboard with ISCII character keys is required.
- ISCII became obsolete because Unicode, which was created later, incorporated the ISCII characters.
3. Unicode: Computers store and process characters as numbers (bit sequences). A code page is an encoding scheme that maps each bit pattern to a character. Before Unicode, hundreds of different encoding schemes assigned numbers to the letters and characters of the world's languages. Many of them were code pages with only 256 characters, each stored in 8 bits, so no single one could cover every language. Unicode replaces them with one standard that assigns a unique number, called a code point, to every character.
- Unicode enables the creation of a single software product or website for multiple platforms, languages, and countries (without re-engineering), resulting in significant cost savings over older character sets.
- Unicode data can be used across a variety of systems without data corruption.
- Unicode is a universal standard that can encode any language or script regardless of device, operating system, or software.
- Unicode makes it possible to convert between character encoding systems. Because Unicode is a superset of all other major character encoding systems, you can convert from one encoding scheme to Unicode and then from Unicode to a different encoding scheme.
- Unicode is the most widely used character encoding standard.
- Unicode Standard versions are fully compatible and synchronized with the corresponding versions of ISO/IEC 10646, which defines the Universal Character Set. The code space has room for 1,114,112 code points (U+0000 through U+10FFFF), more than enough to represent every character and symbol in use in the world.
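The conversion path described in the list (one encoding to Unicode, then Unicode to another encoding) can be sketched in Python, whose `str` type holds Unicode text. The sample bytes are made up for illustration and are assumed to come from a system that uses ISO-8859-1 (Latin-1).

```python
# Bytes as they might arrive from a legacy system that uses Latin-1 (assumed).
legacy_bytes = b"caf\xe9"

text = legacy_bytes.decode("latin-1")   # legacy encoding -> Unicode string
utf8_bytes = text.encode("utf-8")       # Unicode string -> UTF-8 bytes

print(text)         # café
print(utf8_bytes)   # b'caf\xc3\xa9'
```

Decoding with the wrong source encoding is the usual cause of garbled text, so the first step only works when the original encoding is known.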
4. UTF-8: UTF-8 is a variable-width character encoding used for electronic communication. Using one to four one-byte (8-bit) code units, it can encode all 1,112,064 valid Unicode code points. Code points with lower numerical values, which tend to occur more frequently, are encoded with fewer bytes. UTF-8 was designed to be backward compatible with ASCII: the first 128 Unicode characters, which correspond one-to-one with ASCII, are encoded as a single byte with the same binary value as in ASCII, so any valid ASCII text is also valid UTF-8-encoded Unicode.
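A short Python sketch can illustrate both properties, the variable width and the ASCII compatibility. The sample characters are arbitrary choices, one for each possible byte length.

```python
# One sample character for each UTF-8 length: 1, 2, 3, and 4 bytes.
samples = ["A", "é", "€", "😀"]
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, width in widths.items():
    print(f"U+{ord(ch):04X} takes {width} byte(s)")

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "Hello"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True
```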
Converting Symbols to Binary:
Character | ASCII Code | Byte (binary) |
---|---|---|
A | 65 | 01000001 |
a | 97 | 01100001 |
B | 66 | 01000010 |
b | 98 | 01100010 |
Z | 90 | 01011010 |
0 | 48 | 00110000 |
9 | 57 | 00111001 |
! | 33 | 00100001 |
? | 63 | 00111111 |
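The conversion in the table can be reproduced with a minimal sketch. The helper names `to_binary` and `from_binary` are hypothetical; values are padded to 8 digits so that each one fills a whole byte.

```python
def to_binary(text: str) -> list[str]:
    """Return the 8-bit binary form of each ASCII character in text."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


def from_binary(chunks: list[str]) -> str:
    """Rebuild the text from a list of 8-bit binary strings."""
    return bytes(int(chunk, 2) for chunk in chunks).decode("ascii")


print(to_binary("A"))                   # ['01000001']
print(from_binary(to_binary("Z9!?")))   # Z9!?
```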
5. UTF-32: UTF-32 stands for 32-bit Unicode Transformation Format. It is a fixed-length encoding that uses 32 bits (4 bytes) for every Unicode code point, so the number of code points in a UTF-32 string is simply its length in bytes divided by four. The main advantage of UTF-32 is that code points can be indexed directly: finding the Nth code point is a constant-time operation, whereas a variable-length encoding requires sequential access to locate it. What users perceive as characters, such as grapheme clusters or some emojis, can span several code points and still cannot be indexed directly, so determining the displayed width of a string is more complex. As a result, UTF-32 is a simple replacement in code that steps through a string one position at a time with an incrementing integer index, as was commonly done for ASCII.
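The fixed width can be demonstrated with a short sketch. It assumes the little-endian form without a byte-order mark (Python's `utf-32-le` codec); the helper names and the sample string are illustrative.

```python
text = "héllo😀"
data = text.encode("utf-32-le")   # little-endian, no byte-order mark


def code_point_count(buf: bytes) -> int:
    """Each code point takes exactly 4 bytes, so divide the length by 4."""
    return len(buf) // 4


def nth_code_point(buf: bytes, n: int) -> str:
    """Jump straight to the n-th code point (0-based) without scanning."""
    start = 4 * n
    return buf[start:start + 4].decode("utf-32-le")


print(len(data), code_point_count(data))   # 24 6
print(nth_code_point(data, 1))             # é
```

The same lookup on UTF-8 bytes would need to walk the data from the start, because earlier characters may occupy anywhere from one to four bytes.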