Base 64 Report
1. INTRODUCTION
We live in a world of electronics and computers, in which the internet and email are
the most widely used communication media. What is the purpose of communication without
security? Communication over the internet needs security protocols, and a variety of
encoding schemes are used worldwide to support them.
This seminar presents an encoding scheme called Base64 and several of its
implementations, including PEM. Base64 is a generic term for any number of
similar encoding schemes that encode binary data by treating it numerically and translating it
into a base-64 representation. Base64 encoding schemes are commonly used when binary
data needs to be stored in or transferred over media that are designed
to deal with textual data, such as email. This
ensures that the data remains intact, without modification, during transport. Besides being
the default encoding for files sent as attachments under Multipurpose Internet Mail
Extensions (MIME), Base64 is used in a number of other applications, such as storing
complex data in XML.
PEM, UTF-7, OpenPGP and MIME are implementations that combine Base64 with
different encryption or transfer schemes. PEM was the first implementation to use Base64
in securing email.
The advantage of Base64 is that it is a simple algorithm and can be easily
implemented. It keeps data intact in transit and hides it from a casual reader, although it
uses no key and is not itself an encryption scheme.
2. ENCRYPTION
Encryption refers to algorithmic schemes that encode plain text into non-readable
form or cipher text, providing privacy. The receiver of the encrypted text uses a "key" to
decrypt the message, returning it to its original plain text form. The key is the trigger
mechanism of the algorithm. A cipher (or cypher) is a pair of
algorithms that perform the encryption and the reversing decryption. The detailed operation of a
cipher is controlled both by the algorithm and in each instance by a key. This is a secret
parameter (ideally known only to the communicants) for a specific message exchange
context. Keys are important, as ciphers without variable keys can be trivially broken with
only the knowledge of the cipher used and are therefore useless (or even counter-productive)
for most purposes. Historically, ciphers were often used directly for encryption or decryption
without additional procedures such as authentication or integrity checks.
The earliest forms of secret writing required little more than local pen and paper
analogs, as most people could not read. More literacy, or literate opponents, required actual
cryptography. The main classical cipher types are transposition ciphers, which rearrange the
order of letters in a message (e.g., 'hello world' becomes 'ehlol owrdl' in a trivially simple
rearrangement scheme), and substitution ciphers, which systematically replace letters or
groups of letters with other letters or groups of letters (e.g., 'fly at once' becomes 'gmz bu
podf' by replacing each letter with the one following it in the Latin alphabet). Simple versions
of either offered little confidentiality from enterprising opponents, and still do. An early
substitution cipher was the Caesar cipher, in which each letter in the plaintext was replaced
by a letter some fixed number of positions further down the alphabet. It was named after
Julius Caesar who is reported to have used it, with a shift of 3, to communicate with his
generals during his military campaigns, just like Excess-3 code in Boolean algebra. There is
record of several early Hebrew ciphers as well. The earliest known use of cryptography is
some carved cipher text on stone in Egypt (ca 1900 BC), but this may have been done for the
amusement of literate observers. The next oldest is bakery recipes from Mesopotamia.
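The substitution example above ('fly at once' becomes 'gmz bu podf') is a Caesar cipher with a shift of 1. A minimal sketch in Python follows; the function name caesar_shift is chosen here for illustration and does not come from any library.

```python
import string

def caesar_shift(text: str, shift: int) -> str:
    """Replace each letter by the one `shift` positions further down the alphabet."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    k = shift % 26
    # Build a translation table that rotates both alphabets by k positions.
    table = str.maketrans(lower + upper,
                          lower[k:] + lower[:k] + upper[k:] + upper[:k])
    return text.translate(table)

print(caesar_shift("fly at once", 1))    # gmz bu podf
print(caesar_shift("gmz bu podf", -1))   # fly at once
```

Characters other than the 26 Latin letters (such as spaces) pass through unchanged, which is one reason such ciphers are easy to break.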
Until the advent of the Internet, encryption was rarely used by the public, but was
largely a military tool. The development of digital computers and electronics after WWII
made possible much more complex ciphers. Furthermore, computers allowed for the
encryption of any kind of data representable in any binary format, unlike classical ciphers
which only encrypted written language texts; this was new and significant. Today, with
online marketing, banking, healthcare and other services, even the average householder is
aware of encryption. The process of hiding information is now collectively denoted by the
term cryptography. The term is derived from the Greek language: 'kryptos' means secret and
'graphos' means writing.
Base64 is a generic term for any number of similar encoding schemes that encode
binary data by treating it numerically and translating it into a base-64 representation. The
term Base64 originates from a specific MIME content transfer encoding. Base64 encoding
schemes are commonly used when binary data needs to be stored in
or transferred over media that are designed to deal with textual data. This ensures that
the data remains intact, without modification, during transport. For this reason Base64 is
commonly used in email systems. The email systems that were developed back in the time
of the ARPANET were designed to support only letters (A-Z, a-z), numbers (0-9) and some
limited punctuation marks. So in order to transfer files that can contain more than
letters and digits (e.g., a picture.jpg file), Base64 encoding is used.
Since its introduction, Base64 encoding has quickly gained popularity.
Besides being the default encoding for files sent as
attachments under Multipurpose Internet Mail Extensions (MIME), it is now
used in a number of other places, including storing complex data in XML and
implementing HTTP Basic authentication in web servers.
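As an illustration of the HTTP Basic authentication use mentioned above, the following Python sketch builds the header value from a user name and password. The helper name basic_auth_header is hypothetical; the credentials are the well-known example from the HTTP authentication RFCs.

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP 'Authorization' header for Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")

print(basic_auth_header("Aladdin", "open sesame"))
# Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Anyone who sees the header can decode it again, so Basic authentication is only safe over an encrypted connection.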
4. HISTORY AND IMPLEMENTATIONS OF BASE64
4.1 PEM (PRIVACY ENHANCED MAIL)
Privacy Enhanced Mail (PEM) is an early IETF proposal for securing email using
public key cryptography. Although PEM became an IETF proposed standard, it was never
widely deployed or used.
The first known standardized use of the encoding now called MIME Base64 was in
the Privacy-enhanced Electronic Mail (PEM) protocol, proposed by RFC 989 in 1987. PEM
defines a "printable encoding" scheme that uses Base64 encoding to transform an arbitrary
sequence of octets to a format that can be expressed in short lines of 6-bit characters, as
required by transfer protocols such as SMTP.
The current version of PEM (specified in RFC 1421) uses a 64-character alphabet
consisting of upper- and lower-case Roman alphabet characters (A–Z, a–z), the numerals (0–
9), and the "+" and "/" symbols. The "=" symbol is also used as a special suffix code. The
original specification, RFC 989, additionally used the "*" symbol to delimit encoded but
unencrypted data within the output stream.
To convert data to PEM printable encoding, the first byte is placed in the most
significant eight bits of a 24-bit buffer, the next in the middle eight, and the third in the least
significant eight bits. If there are fewer than three bytes left to encode (or in total), the
remaining buffer bits will be zero. The buffer is then used, six bits at a time, most significant
first, as indices into the string:
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
and the indicated character is output.
The process is repeated on the remaining data until fewer than four octets remain. If
three octets remain, they are processed normally. If fewer than three octets (24 bits) are
remaining to encode, the input data is right-padded with zero bits to form an integral multiple
of six bits.
After encoding the non-padded data, if two octets of the 24-bit buffer are padded-
zeros, two "=" characters are appended to the output; if one octet of the 24-bit buffer is filled
with padded-zeros, one "=" character is appended. This signals the decoder that the zero bits
added due to padding should be excluded from the reconstructed data. This also guarantees
that the encoded output length is a multiple of 4 bytes.
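The buffer-and-padding procedure described above can be sketched in Python as follows. This is a minimal illustration that assumes the whole input is held in memory; the name encode_printable is chosen here and is not taken from the PEM specification.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_printable(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)            # 0, 1 or 2 octets of zero padding
        # Place the bytes in a 24-bit buffer, most significant byte first.
        buffer = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Read the buffer six bits at a time, most significant first.
        chars = [ALPHABET[(buffer >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Characters that came only from padded octets are replaced by "=".
        if missing:
            chars[-missing:] = "=" * missing
        out.extend(chars)
    return "".join(out)

print(encode_printable(b"Hello World!"))  # SGVsbG8gV29ybGQh
print(encode_printable(b"Hello"))         # SGVsbG8=
print(encode_printable(b"blue"))          # Ymx1ZQ==
```

Because every group of up to three input bytes produces exactly four output characters, the output length is always a multiple of 4.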
PEM requires that all encoded lines consist of exactly 64 printable characters, with the
exception of the last line, which may contain fewer printable characters. Lines are delimited
by white space characters according to local (platform-specific) conventions.
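The line-length rule can be sketched as a simple split of the encoded string. The helper name wrap_pem_lines is hypothetical, and the choice of line delimiter is left to the caller, as the text above describes.

```python
def wrap_pem_lines(encoded: str, width: int = 64) -> list[str]:
    """Split an encoded string into lines of exactly `width` characters.

    Only the last line may be shorter.
    """
    return [encoded[i:i + width] for i in range(0, len(encoded), width)]

# 100 input bytes encode to 136 characters: two full lines and one short line.
lines = wrap_pem_lines("A" * 136)
print([len(line) for line in lines])  # [64, 64, 8]
```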
4.2 MIME
Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends
the format of e-mail to support:
1. Text in character sets other than ASCII
2. Non-text attachments
3. Message bodies with multiple parts
4. Header information in non-ASCII character sets
MIME's use, however, has grown beyond describing the content of e-mail to describing
content type in general, including for the web (see Internet media type). Virtually all human-written
Internet e-mail and a fairly large proportion of automated e-mail is transmitted via
SMTP in MIME format. Internet e-mail is so closely associated with the SMTP and MIME
standards that it is sometimes called SMTP/MIME e-mail.
The content types defined by MIME standards are also of importance outside of e-
mail, such as in communication protocols like HTTP for the World Wide Web. HTTP
requires that data be transmitted in the context of e-mail-like messages, although the data
most often is not actually e-mail.
The MIME (Multipurpose Internet Mail Extensions) specification lists base64 as one of
two binary-to-text encoding schemes (the other being quoted-printable). MIME's Base64
encoding is based on that of the RFC 1421 version of PEM: it uses the same 64-character
alphabet and encoding mechanism as PEM, and uses the "=" symbol for output padding in the
same way, as described in RFC 1521.
MIME does not specify a fixed length for Base64-encoded lines, but it does specify a
maximum line length of 76 characters. Additionally it specifies that any extra-alphabetic
characters must be ignored by a compliant decoder, although most implementations use a
CR/LF newline pair to delimit encoded lines. Thus, the actual length of MIME-compliant
Base64-encoded binary data is usually about 137% of the original data length, though for
very short messages the overhead can be a lot higher because of the overhead of the headers.
Very roughly, the final size of Base64-encoded binary data is equal to 1.37 times the original
data size + 814 bytes (for headers). In other words, you can approximate the size of the
decoded data with this formula:
bytes = (string_length(encoded_string) - 814) / 1.37
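The 137% figure can be reproduced with a short calculation. The sketch below assumes every encoded line is the maximum 76 characters and ends in a CR/LF pair, and it ignores the header overhead; the function name mime_base64_length is chosen for illustration.

```python
import math

def mime_base64_length(n_bytes: int, line_length: int = 76) -> int:
    """Length of the encoded body: Base64 characters plus a CR/LF pair per line."""
    chars = 4 * math.ceil(n_bytes / 3)        # 4 characters per 3 input bytes
    lines = math.ceil(chars / line_length)    # the last line may be shorter
    return chars + 2 * lines                  # 2 extra bytes (CR, LF) per line

n = 1_000_000
print(mime_base64_length(n))      # 1368424
print(mime_base64_length(n) / n)  # 1.368424
```

The factor 4/3 comes from the encoding itself, and the remaining 0.035 or so comes from the line breaks.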
4.3 UTF 7
UTF-7 (7-bit Unicode Transformation Format) is a variable-length character encoding
that was proposed for representing Unicode text using a stream of ASCII characters, for
example for use in Internet E-mail messages. UTF-7 was first proposed as an experimental
protocol in RFC 1642, A Mail-Safe Transformation Format of Unicode.
Provided certain rules are followed during encoding, UTF-7 can be sent in e-mail
without using an underlying MIME transfer encoding, but still must be explicitly identified as
the text character set. In addition, if used within e-mail headers such as "Subject:", UTF-7
must be contained in MIME encoded words identifying the character set. Since encoded
words force use of either quoted-printable or base64, UTF-7 was designed to avoid using the
= sign as an escape character to avoid double escaping when it is combined with quoted-
printable (or its variant, the RFC 2047
Some characters can be represented directly as single ASCII bytes. The first group is
known as "direct characters" and contains all 62 alphanumeric characters and 9 symbols: '
( ) , - . / : ?. The direct characters are considered very safe to include literally. The other main
group, known as "optional direct characters", contains all other printable characters in the
range U+0020–U+007E except ~ \ + and space. Using the optional direct characters reduces
size and enhances human readability but also increases the chance of breakage by things like
badly designed mail gateways and may require extra escaping when used in encoded words
for header fields. Space, tab, carriage return and line feed may also be represented directly as
single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to
ensure that these characters are used in ways that do not require further content transfer
encoding to be suitable for e-mail. The plus sign (+) may be encoded as +-.
To encode text in UTF-7, an encoder must first decide which characters to represent
directly in ASCII form, which plus signs have to be escaped as +-, and which characters to
place in blocks of Unicode characters. A simple encoder may encode all characters it
considers safe for direct encoding directly. However, the cost of coming out of a Unicode
block to represent a single character and then going directly back in is 3 to 3⅔ bytes, which
is more than the 2⅔ bytes needed to represent such a character as part of a Unicode
sequence. Each Unicode sequence must be encoded using the following procedure, then
surrounded by the appropriate delimiters.
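The effect of these rules can be observed with the "utf-7" codec that ships with Python. This is only an illustration using that codec, not a reproduction of the procedure itself; which optional characters a given encoder writes directly is its own choice.

```python
# A literal plus sign is escaped as "+-".
escaped_plus = "A+B".encode("utf-7")
print(escaped_plus)                      # b'A+-B'

# A non-ASCII character (here U+00A3, the pound sign) is placed in a block
# that starts with "+" and, before a following Base64 character, ends with "-".
pound = "\u00a31".encode("utf-7")
print(pound)                             # b'+AKM-1'

# Decoding reverses both steps.
print(escaped_plus.decode("utf-7"))      # A+B
print(pound.decode("utf-7") == "\u00a31")  # True
```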
widely chosen quality cryptographic system), many wanted to write their own software that
would interoperate with PGP. Zimmermann became convinced that an open standard for PGP
encryption was critical for them and for the cryptographic community as a whole. In July
1997, PGP Inc. proposed to the IETF that there be a standard called OpenPGP. They gave the
IETF permission to use the name OpenPGP to describe this new standard as well as any
program that supported the standard. The IETF accepted the proposal and started the
OpenPGP Working Group. The Free Software Foundation has developed its own OpenPGP-
compliant program called GNU Privacy Guard (abbreviated GnuPG or GPG). GnuPG is
freely available together with all source code under the GNU General Public License (GPL)
and is maintained separately from several Graphical User Interfaces (GUIs) that interact with
the GnuPG library for encryption, decryption and signing functions. While originally used
primarily for encrypting the contents of e-mail messages and attachments from a desktop
client, PGP products have been diversified since 2002 into a set of encryption applications
which can be managed by an optional central policy server. PGP encryption applications
include e-mail and attachments, digital signatures, laptop full disk encryption, file and folder
security, protection for IM sessions, batch file transfer encryption, and protection for files and
folders stored on network servers. There is also a WordPress plugin available, called
wp-enigform-authentication, that takes advantage of the session management features of
OpenPGP.
5. BASE 64
Base64 is a different way of interpreting bits of data in order to transmit that data over
a text-only medium, such as the body of an e-mail. An 8-bit byte can take 256 different
values, and an 8-bit character set assigns a character to each of them. However, only a
fraction of these characters are actually printable and readable when you are looking at them
onscreen, or sending them in an e-mail. We need a way to convert unreadable characters into
readable characters, do something with them (e.g., send them in an e-mail), and convert them
back to their original format.
We now know that using 16 different characters to represent each half byte is a viable
option, but not our ideal option because it is only half as space efficient as a byte. So how
else can we dice bytes up to get our goal: readable characters for any value of 0 to 255?
Instead of looking at one byte at a time, and trying to chop that byte up, take several bytes
and see what we can do with them.
Table 5.1
As you can easily see, using three bytes, we have a total of 24 bits. How else can we
chop 24 bits up? If instead of 3 bytes of 8 bits each we use 4 "clumps" of 6 bits each, what
are we left with? Each clump can take 2^6 = 64 different values. So instead of needing 3
instances of a character that can represent any of 256 different combinations, we now need
just 4 instances of a character that can represent any of 64 different combinations. The same
bits as in the above table fit into the table below.
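The regrouping of the same 24 bits can be shown in a few lines of Python. The helper name regroup and the sample bytes "Hel" are chosen here for illustration.

```python
def regroup(three_bytes: bytes) -> tuple[list[str], list[str]]:
    """Show 24 bits first as three 8-bit groups, then as four 6-bit groups."""
    bits = "".join(f"{b:08b}" for b in three_bytes)
    octets = [bits[i:i + 8] for i in range(0, 24, 8)]
    sextets = [bits[i:i + 6] for i in range(0, 24, 6)]
    return octets, sextets

octets, sextets = regroup(b"Hel")
print(octets)   # ['01001000', '01100101', '01101100']
print(sextets)  # ['010010', '000110', '010101', '101100']
```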
Table 5.2
As you might surmise, base64 is still less space efficient than using a full byte, but
instead of hex's double space usage, base64 uses only one and a third times as much space.
In other words, for every 3 bytes, you must have 4 base64 characters. All of the characters
listed above are easily readable. Other variations, usually derived from Base64, share this
property but differ in the symbols chosen for the last two values.
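The space comparison between hexadecimal and base64 can be checked with Python's standard library. The function name expansion is hypothetical, and the input length is chosen as a multiple of 3 so that no padding is involved.

```python
import base64
import binascii

def expansion(data: bytes) -> tuple[int, int]:
    """Return the encoded lengths of `data` in hexadecimal and in base64."""
    return len(binascii.hexlify(data)), len(base64.b64encode(data))

hex_len, b64_len = expansion(bytes(768))
print(hex_len)  # 1536, twice the input size
print(b64_len)  # 1024, one and a third times the input size
```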
Table 5.3
6. ENCODING INTO BASE 64
6.1 BASE 64 ENCODING ALGORITHM
6.2 EXAMPLE
Let's start with something simple, a text-to-base64 conversion. We will convert the
string "Hello World!" to a base64 representation. We will start by getting the ASCII and
binary byte values for each letter.
Table 6.1
For base64, we will be using three bytes at a time. Each ASCII character is one byte, so we
will be working with "Hel", "lo[space]", "Wor", and "ld!" separately. Let's start with the first
three characters:
010010 = 18
000110 = 6
010101 = 21
101100 = 44
5. Use each of the four decimals to look up the base64 character code.
18 = 'S'
6 = 'G'
21 = 'V'
44 = 's'
6. You now have your first three ASCII characters ("Hel") encoded as base64
("SGVs").
Follow these steps for the next 9 ASCII characters and you get the following results:
"Hel" = SGVs
"lo[space]" = bG8g
"Wor" = V29y
"ld!" = bGQh
Table 6.2
Follow the same exact steps for the first three characters as above. Your first three ASCII
characters "Hel" are the same base64 as before "SGVs". For the remaining 2 characters,
follow these steps:
2. Starting from the left, separate the bytes into 6 bit chunks as best as
possible.
As you can see, we still need two more bits for the last group, plus a whole other six
bits for the full four base64 characters. What we need is something looking like 011011
000110 1111xx xxxxxx. We can convert 011011 and 000110 to decimal just fine.
011011 = 27
000110 = 6
1111xx = what?
xxxxxx = what?
To resolve this problem, we fill the last two bits of 1111xx with 0's, so 111100 = 60.
We now have:
011011 = 27
000110 = 6
111100 = 60
xxxxxx = what?
Our base64 characters so far are "bG8". Since we are missing one complete base64
character, we add one of our special "=" characters to the end to signify that we are missing
one byte. Our complete converted base64 string is now "bG8=". So the word "Hello"
translates to "SGVsbG8=". We do the same thing for the word "blue", which is missing 2
bytes.
Table 6.3
The first three characters should be easy by now to convert. "blu" is 01100010
01101100 01110101. Translate that to 6 bit groups and you get 011000 100110 110001
110101. These convert to "Ymx1" in base64. Now you have one remaining character, "e".
We do the exact same thing as last time. "e" in binary is 01100101. When you split that into
four 6 bit groups, you get the following:
011001 = 25
01xxxx = what?
xxxxxx = what?
xxxxxx = what?
Fill the second group with 0's to be able to look it up. 011001 010000 xxxxxx xxxxxx
becomes "ZQ". Because you were missing two complete bytes, add two of our special "="
characters on the end. So the letter "e" in ASCII becomes "ZQ==". The word "blue" becomes
"Ymx1ZQ==". Note: base64 encoding was described earlier as one and one third times the
size of the byte representation. In the cases where you are missing a byte, it is actually
slightly more than this. The actual range is from exactly one and one third times the size
to one and one third times the size plus two
characters.
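The three worked examples in this section can be checked against Python's standard base64 module. The helper names encode_text and padding_count are chosen here for illustration.

```python
import base64

def encode_text(text: str) -> str:
    """Base64-encode an ASCII string and return the result as text."""
    return base64.b64encode(text.encode("ascii")).decode("ascii")

def padding_count(n_bytes: int) -> int:
    """Number of '=' characters needed for an input of n_bytes bytes."""
    return (3 - n_bytes % 3) % 3

for word in ("Hello World!", "Hello", "blue"):
    print(word, "->", encode_text(word), "padding:", padding_count(len(word)))
# Hello World! -> SGVsbG8gV29ybGQh padding: 0
# Hello -> SGVsbG8= padding: 1
# blue -> Ymx1ZQ== padding: 2
```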
7. BASE64 DECODING
We will now tackle translating from base64 characters back into normal bytes. We
will use the same mapping of values (0 through 63) to base64 characters (A-Z, a-z, 0-9, '+',
and '/'). The reverse process is relatively simple now that we know how to perform the
forward operation. Let’s start with the base64 string "YmFzZTY0IGlzIGZ1biEh". Right now,
that makes no sense. We begin the same way, by looking up the value for each base64
character.
Table 7.1
It is very important to remember that when you are encoding, you use 8 bits for each
character, and when you are decoding you use 6 bits for each character! Once again, we start
by chopping it into smaller pieces and work on each piece. When we are decoding a base64
string into normal bytes, we use 4 characters at a time instead of the 3 we used when
encoding. So our base64 string is broken up from "YmFzZTY0IGlzIGZ1biEh" into "YmFz",
"ZTY0", "IGlz", "IGZ1", and "biEh". Instead of using a number to look up a base64
character, we are now using a base64 character to look up a number. Let's start with our first
group, "YmFz".
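The reverse lookup and regrouping for one complete quartet can be sketched in Python as follows. The names VALUE_OF and decode_quartet are chosen for illustration, and the sketch deliberately ignores the "=" padding symbol.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
VALUE_OF = {ch: value for value, ch in enumerate(ALPHABET)}  # character -> 0..63

def decode_quartet(quartet: str) -> bytes:
    """Decode four base64 characters (no '=') into three bytes."""
    # Each character contributes 6 bits, giving 24 bits in total.
    bits = "".join(f"{VALUE_OF[ch]:06b}" for ch in quartet)
    # Read the same 24 bits back as three 8-bit bytes.
    return bytes(int(bits[i:i + 8], 2) for i in range(0, 24, 8))

encoded = "YmFzZTY0IGlzIGZ1biEh"
quartets = [encoded[i:i + 4] for i in range(0, len(encoded), 4)]
print(decode_quartet("YmFz"))                                       # b'bas'
print(b"".join(decode_quartet(q) for q in quartets).decode("ascii"))  # base64 is fun!!
```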
01100010 = 98
01100001 = 97
01110011 = 115
4. Use each of the three decimals to look up the ASCII character for that value.
98 = 'b'
97 = 'a'
115 = 's'
You now have your first four base64 characters ("YmFz") decoded as ASCII
("bas").
Follow these steps for the next 16 base64 characters and you get the following results:
"ZTY0" = "e64"
"IGlz" = " is"
"IGZ1" = " fu"
"biEh" = "n!!"
We know how to encode bytes when we don't have exact groups of three to work
with. But how do you decode base64 that has our special symbol, "="? It is very similar, you
just have to remember the rules that caused us to use the "=". One thing before we get started:
base64 encoded text will always be in groups of 4 base64 characters; if the number of base64
characters is not divisible by 4 with no remainder, then you have corrupted data. Let's try
decoding a base64 string that contains the "=" symbol. Our string this time will be
"Li4ub3IgbWF5YmUgbm90Lg==". The first thing we do is divide this up into groups of four
characters. "Li4ub3IgbWF5YmUgbm90Lg==" becomes "Li4u", "b3Ig", "bWF5", "YmUg",
"bm90", and "Lg==". The first five quartets are decoded in the exact same manner. We just
need to learn what to do for the last quartet, "Lg==". Remember what the "="s mean: one "="
means that we were missing one whole byte when we encoded the data, two "="s means that
we were missing two whole bytes when we encoded the data. We begin in the same way as
before.
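The two rules in this paragraph (the length must be a multiple of 4, and each "=" stands for one missing byte) can be sketched in Python. The function name decode_padded is hypothetical; the sketch treats each "=" as zero bits, decodes, and then discards the bytes that were never present.

```python
import base64

def decode_padded(encoded: str) -> bytes:
    """Decode base64 text, handling a trailing '=' or '==' explicitly."""
    if len(encoded) % 4 != 0:
        raise ValueError("corrupted data: length is not a multiple of 4")
    missing = len(encoded) - len(encoded.rstrip("="))   # 0, 1 or 2 missing bytes
    # "A" has the value 0, so replacing "=" with "A" fills the gap with zero bits.
    raw = base64.b64decode(encoded.replace("=", "A"))
    # Drop the bytes that exist only because of the padding.
    return raw[:len(raw) - missing]

print(decode_padded("Lg=="))                      # b'.'
print(decode_padded("Li4ub3IgbWF5YmUgbm90Lg=="))  # b'...or maybe not.'
```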
'L' = 11
'g' = 32
'=' = nothing
'=' = nothing
11 = 001011
32 = 100000
We know that because we had two "="s at the end, that we were missing two
complete bytes in the original data. Remember where we had to add zeros when we encoded
into base64? Those are the zeros you see in the second 8 bit group ("0000xxxx"). Because
each of these 8 bit groups represents one byte from the original data, and we know that we
are missing two whole bytes, we discard the last two 8 bit groups, "0000xxxx" and
"xxxxxxxx". So the only data we now need to worry about is the first byte, 00101110. We
convert this value to decimal.
00101110 = 46
We convert the 46 to ASCII and we get the character '.' and add this to the other data that we
have decoded.
"Li4u" = "..."
"b3Ig" = "or "
"bWF5" = "may"
"YmUg" = "be "
"bm90" = "not"
"Lg==" = "."
8. APPLICATIONS
8.1 URL APPLICATIONS
Base64 encoding can be helpful when fairly lengthy identifying information is used in
an HTTP environment. For example, a database persistence framework for Java objects might
use Base64 encoding to encode a relatively large unique id (generally 128-bit) into a string
for use as an HTTP parameter in HTTP forms or HTTP GET URLs. Also, many applications
need to encode binary data in a way that is convenient for inclusion in URLs, including in
hidden web form fields, and Base64 is a convenient encoding to render them in not only a
compact way, but in a relatively unreadable one when trying to obscure the nature of data
from a casual human observer.
Using standard Base64 in URL requires encoding of '+' and '/' characters into special
percent-encoded hexadecimal sequences ('+' = '%2B' and '/' = '%2F'), which makes the string
unnecessarily longer.
For this reason, a modified Base64 for URL variant exists, where no padding '=' is
used, and the '+' and '/' characters of standard Base64 are respectively replaced by '-' and
'_'. URL encoders/decoders are then no longer necessary and have no impact on the
length of the encoded value, leaving the same encoded form intact for use in relational
databases, web forms, and object identifiers in general.
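The URL variant can be sketched with Python's standard base64 module, which provides the '-' and '_' alphabet directly. Stripping and restoring the '=' padding is done by hand here; the helper names b64url_encode and b64url_decode are chosen for illustration.

```python
import base64

def b64url_encode(data: bytes) -> str:
    """Encode with '-' and '_' instead of '+' and '/', without '=' padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    """Restore the padding that was stripped, then decode."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)

data = b"\xfb\xff\xfe"
print(base64.b64encode(data).decode("ascii"))  # +//+
print(b64url_encode(data))                     # -__-
print(b64url_decode("ZQ"))                     # b'e'
```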
These systems use Base64 encoding to obscure the page URL, thereby providing a
measure of privacy to the users. The figures below show such a site, which uses Base64
encoding on the address.
Figure 8.1
Figure 8.2
Figure 8.1 shows a privacy protection website using Base64 encoding. Figure 8.2 shows the
Base64-encoded URL of www.google.com.
There are other variants that use '_-' or '._' when the Base64 variant string must be
used within valid identifiers for programs.
8.4 XML
XML identifiers and name tokens are encoded using two variants:
Another variant called modified Base64 for regexps uses '!-' instead of '*-' to replace
the standard Base64 '+/', because both '+' and '*' may be reserved for regular expressions
(note that '[]' used in the IRCu variant above would not work in that context).
Base64 is a different way of interpreting bits of data in order to transmit that data over
a text-only medium, such as the body of an e-mail. From this seminar we can
understand that Base64 is an economical and easy method of creating
encoded data that can be easily transmitted over the internet. Base64 encodes the
data into a format that cannot be read at a glance by a normal user, although anyone who
knows the scheme can decode it. Base64 has several advantages and applications: it
carries e-mails and attachments intact, it can be easily implemented (for example in Java
programs), and it offers a degree of privacy while browsing the internet through URL
encoding. There are different implementations of Base64 encoding. These include
UTF-7, Privacy Enhanced Mail, OpenPGP and MIME.
11. REFERENCES
2. Hack Proofing Your Network: Second Edition , David R. Mirza Ahmad &
3. "MIME: A Portable and Robust Multimedia Format for Internet Mail" by Nathaniel
S. Borenstein, for Multimedia Systems journal
4. http://www.openpgp.org/
5. http://tools.ietf.org/html/rfc1421.html
6. http://tools.ietf.org/html/rfc1422.html
7. http://tools.ietf.org/html/rfc3748.html
8. http://tools.ietf.org/html/rfc3548.html
9. http://tools.ietf.org/html/rfc4648.html
10. http://www.scribd.com/doc/6945825/Base64-Encoding-Torn-Apart.html
11. http://www.herongyang.com/encoding/Base64-Encoding.html
12. http://en.wikipedia.org/wiki/Privacy-enhanced_Electronic_Mail
13. http://en.wikipedia.org/wiki/ASCII
14. http://en.wikipedia.org/wiki/MIME
15. http://www.wisegeek.com/what-is-encryption.html