0% found this document useful (0 votes)
73 views

008 What Is UTF-8 - UTF-8 Character Encoding Tutorial

UTF-8 is a character encoding system that allows representing characters as ASCII text while including international characters. It uses a variable number of bytes per character, from one to four, with ASCII characters using one byte. To use UTF-8, include the <meta charset="utf-8"> tag in the <head> of an HTML document. UTF-8 is an efficient encoding that can represent many characters while maintaining compatibility with ASCII, making it a good choice for internationalized software.

Uploaded by

Saint-Clair Lima
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
73 views

008 What Is UTF-8 - UTF-8 Character Encoding Tutorial

UTF-8 is a character encoding system that allows representing characters as ASCII text while including international characters. It uses a variable number of bytes per character, from one to four, with ASCII characters using one byte. To use UTF-8, include the <meta charset="utf-8"> tag in the <head> of an HTML document. UTF-8 is an efficient encoding that can represent many characters while maintaining compatibility with ASCII, making it a good choice for internationalized software.

Uploaded by

Saint-Clair Lima
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

What is UTF-8?

UTF-8 Character Encoding Tutorial


Quincy Larson
April 3, 2022

UTF-8 is a character encoding system. It lets you represent characters as


ASCII text, while still allowing for international characters, such as Chinese
characters.

As of the mid 2020s, UTF-8 is one of the most popular encoding systems.

To start using UTF-8, you will want to first familiarize yourself with the the
basic ASCII character set.

What is the ASCII Character Set?


ASCII uses 7-bit code points to represent 128 different characters. These
code points are divided into 95 printable characters, which include the 26
letters of the English alphabet (A to Z, both upper- and lower-case), the 10
digits (0 through 9), and a variety of punctuation and other symbols.

There are also 33 non-printable characters, which include control


characters like carriage return and line feed, as well as various other
characters that are used for things like formatting text.

UTF-8 VS ASCII – What's the Difference?


UTF-8 extends the ASCII character set to use 8-bit code points, which
allows for up to 256 different characters.

This means that UTF-8 can represent all of the printable ASCII characters,
as well as the non-printable characters.

UTF-8 also includes a variety of additional international characters, such as


Chinese characters and Arabic characters.

How to Use UTF-8 in Your Webpages – HTML UTF-8


Example
And now the easy part. You don't actually need to know how it works
(though I'll tell you in a moment.) You can configure UTF-8 Character
Encoding in your HTML code with a single line of HTML located in the
<head> section of your code:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
</head>
</html>

With that out of the way, let me explain how UTF-8 works, and why it's such
a brilliant encoding scheme.

How UTF-8 Encoding Works, and How Much Storage


Each Character Uses
When representing characters in UTF-8, each code point is represented by
a sequence of one or more bytes. The number of bytes used depends on the
code point being represented by the character. Here's a breakdown of the
usage range:

code points in the ASCII range (0-127) are represented by a single byte

code points in the range (128-2047) are represented by two bytes

code points in the range (2048-65535) are represented by three bytes

and code points in the range (65536-1114111) are represented by four


bytes. (This may seem like a lot of possible characters, but keep in mind
that in Chinese alone, there are 100,000s of characters.)

The first byte of a UTF-8 sequence is called the "leader byte". The leader
byte provides information about how many bytes are in the sequence, and
what the code point value of the character is.

The leader byte for a single-byte sequence is always in the range (0-127).
The leader byte for a two-byte sequence is in the range (194-223). The
leader byte for a three-byte sequence is in the range (224-239). And the
leader byte for a four-byte sequence is in the range (240-247).

The remaining bytes in the sequence are called "trailing bytes." The trailing
bytes for a two-byte sequence are in the range (128-191). The trailing bytes
for a three-byte sequence are in the range (128-191). And the trailing bytes
for a four-byte sequence are in the range (128-191).

You can calculate the code point value of a character by looking at the
leader byte and the trailing bytes. For a single-byte sequence, the code
point value is equal to the value of the leader byte.

For a two-byte sequence, the code point value is equal to ((leader byte -
194) * 64) + (trailing byte - 128).

For a three-byte sequence, the code point value is equal to ((leader byte -
224) * 4096) + ((trailing byte1 - 128) * 64) + (trailing byte2 - 128).

For a four-byte sequence, the code point value is equal to ((leader byte -
240) * 262144) + ((trailing byte1 - 128) * 4096) + ((trailing byte2 - 128) * 64)
+ (trailing byte3 - 128).

UTF-8 is a Sound Choice for Encoding


Again, UTF-8 is a super efficient encoding system. It can represent a wide
range of characters while still being compatible with ASCII. This makes it a
sound choice for use in internationalized software.

I hope you've found this helpful. If you want to learn more about
programming and technology, try freeCodeCamp's core coding curriculum.
It's free.
ADVERTISEMENT
freeCodeCamp.org © 2023

You might also like