Basic concepts of Message Digest and Hash Function draft

The document explains the concepts of hash functions and message digests, highlighting their importance in information security. It details the properties, design, and popular algorithms such as MD5 and SHA, along with their applications in password storage and data integrity checks. Additionally, it discusses the limitations of Message Authentication Codes (MAC) and the MD5 algorithm's structure and processing steps.

Uploaded by

nhirak061
Basic concepts of Message Digest and Hash Function

Table of Contents
What is Hashing? Cryptography Hash functions

Algorithm? Steps Advantages of MD5

Cryptography Hash functions


Hash functions are extremely useful and appear in almost all information security applications.
A hash function is a mathematical function that converts an input value into a compressed
numerical value. The input to the hash function can be of arbitrary length, but the output is
always of fixed length.
Values returned by a hash function are called message digests or simply hash values. The
following picture illustrates a hash function −

Features of Hash Functions


The typical features of hash functions are −
 Fixed Length Output (Hash Value)
o A hash function converts data of arbitrary length to a fixed length. This process is often
referred to as hashing the data.
o In general, the hash is much smaller than the input data, hence hash functions are
sometimes called compression functions.
o Since a hash is a smaller representation of larger data, it is also referred to as
a digest.
o A hash function with an n-bit output is referred to as an n-bit hash function. Popular hash
functions generate values between 128 and 512 bits.
 Efficiency of Operation
o Generally for any hash function h with input x, computation of h(x) is a fast operation.
o Computationally, hash functions are generally much faster than symmetric encryption.
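
The fixed-length property can be checked with a minimal sketch. It assumes Python's standard hashlib module and uses SHA-256 purely as an example of an n-bit hash function; the input values are made up for illustration.

```python
import hashlib

# Inputs of very different sizes (illustrative values only).
inputs = [b"", b"a", b"hello world", b"x" * 1_000_000]

for data in inputs:
    digest = hashlib.sha256(data).digest()
    # The digest length does not depend on the input length.
    print(len(data), "bytes in ->", len(digest) * 8, "bits out")

# Collect the distinct output sizes seen across all inputs.
digest_bits = {len(hashlib.sha256(d).digest()) * 8 for d in inputs}
```

Every input, from the empty string to one million bytes, maps to a digest of the same size.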

Properties of Hash Functions


To be an effective cryptographic tool, a hash function should possess the following
properties −
 Pre-Image Resistance
o This property means that it should be computationally hard to reverse a hash
function.
o In other words, if a hash function h produced a hash value z, then it should be a
difficult process to find any input value x that hashes to z.
o This property protects against an attacker who only has a hash value and is trying to
find the input.
 Second Pre-Image Resistance
o This property means given an input and its hash, it should be hard to find a different
input with the same hash.
o In other words, if a hash function h for an input x produces hash value h(x), then it
should be difficult to find any other input value y such that h(y) = h(x).
o This property protects against an attacker who has an input value and its hash, and
wants to substitute a different value as the legitimate value in place of the
original input value.
 Collision Resistance
o This property means it should be hard to find two different inputs of any length that
result in the same hash. This property is also referred to as collision free hash
function.
o In other words, for a hash function h, it is hard to find any two different inputs x and y
such that h(x) = h(y).
o Since a hash function is a compressing function with a fixed hash length, it is impossible
for a hash function not to have collisions. Collision resistance only means
that these collisions should be hard to find.
o This property makes it very difficult for an attacker to find two input values with the
same hash.
o Also, if a hash function is collision-resistant then it is second pre-image resistant.
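
The point that collisions must exist, and are only hard to find, can be seen with a deliberately weakened hash. The sketch below assumes a toy 16-bit hash made by truncating SHA-256 (tiny_hash is a hypothetical helper, not a real algorithm); with so few possible outputs, a brute-force search finds a collision almost immediately, whereas the same search against a full-length digest is infeasible.

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> bytes:
    """Toy 16-bit hash: the first 2 bytes of SHA-256 (illustrative assumption)."""
    return hashlib.sha256(data).digest()[:2]

def find_collision():
    """Hash successive inputs until two different inputs share a hash value."""
    seen = {}  # maps hash value -> first input that produced it
    for i in count():
        msg = str(i).encode()
        h = tiny_hash(msg)
        if h in seen:
            return seen[h], msg
        seen[h] = msg

x, y = find_collision()
print(x, y, tiny_hash(x).hex())
```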

Design of Hashing Algorithms


At the heart of hashing is a mathematical function that operates on two fixed-size blocks of data
to create a hash code. This hash function forms part of the hashing algorithm.
The size of each data block varies depending on the algorithm. Typically the block sizes are from
128 bits to 512 bits. The following illustration demonstrates a hash function −
A hashing algorithm involves rounds of the above hash function, like a block cipher. Each round
takes an input of a fixed size, typically a combination of the most recent message block and the
output of the last round.
This process is repeated for as many rounds as are required to hash the entire message.
A schematic of the hashing algorithm is depicted in the following illustration −

The hash value of the first message block becomes an input to the second hash operation, the
output of which alters the result of the third operation, and so on. This effect is known as
the avalanche effect of hashing.
The avalanche effect results in substantially different hash values for two messages that differ by
even a single bit of data.
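
The avalanche effect can be observed with a small experiment. This is a sketch that assumes SHA-256 from Python's hashlib as the hash and an arbitrary example message; bit_difference is a hypothetical helper that counts differing bits.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

m1 = b"hello world"
m2 = bytes([m1[0] ^ 0x01]) + m1[1:]  # same message with a single bit flipped

d1 = hashlib.sha256(m1).digest()
d2 = hashlib.sha256(m2).digest()
changed = bit_difference(d1, d2)
print(changed, "of", len(d1) * 8, "digest bits differ")
```

Typically about half of the digest bits change, even though the inputs differ in only one bit.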
It is important to distinguish between a hash function and a hashing algorithm. The hash function
generates a hash code by operating on two blocks of fixed-length binary data.
The hashing algorithm is a process for using the hash function, specifying how the message will be
broken up and how the results from previous message blocks are chained together.
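
The split between the hash function and the hashing algorithm can be sketched as follows. Everything here is an illustrative assumption rather than a real design: toy_compress stands in for the fixed-size hash function (it simply reuses SHA-256), and toy_hash is the algorithm that breaks the message into blocks and chains the results.

```python
import hashlib

BLOCK_SIZE = 64   # bytes per message block (assumed for this toy)
STATE_SIZE = 32   # bytes of chaining state

def toy_compress(state: bytes, block: bytes) -> bytes:
    """Hash function: two fixed-size inputs in, one fixed-size value out."""
    return hashlib.sha256(state + block).digest()

def toy_hash(message: bytes) -> bytes:
    """Hashing algorithm: split the message into blocks and chain the results."""
    # Zero-pad to a whole number of blocks. Real algorithms also encode the
    # message length, otherwise b"abc" and b"abc\x00" would collide.
    if len(message) % BLOCK_SIZE:
        message += b"\x00" * (BLOCK_SIZE - len(message) % BLOCK_SIZE)
    state = b"\x00" * STATE_SIZE  # fixed starting value
    for i in range(0, len(message), BLOCK_SIZE):
        # The output of the previous round feeds into the next round.
        state = toy_compress(state, message[i:i + BLOCK_SIZE])
    return state

print(toy_hash(b"a" * 200).hex())
```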

Popular Hash Functions


Let us briefly see some popular hash functions −
Message Digest (MD)
MD5 was the most popular and widely used hash function for quite some years.
 The MD family comprises the hash functions MD2, MD4, MD5 and MD6. MD5 was published as
RFC 1321. It is a 128-bit hash function.
 MD5 digests have been widely used in the software world to provide assurance about
integrity of transferred file. For example, file servers often provide a pre-computed MD5
checksum for the files, so that a user can compare the checksum of the downloaded file to
it.
 In 2004, collisions were found in MD5. An analytical attack was reported to succeed
in only an hour using a computer cluster. This collision attack compromised
MD5, and hence it is no longer recommended for use.
Secure Hash Algorithm (SHA)
The SHA family comprises four algorithms: SHA-0, SHA-1, SHA-2, and SHA-3. Though from the
same family, they are structurally different.
 The original version, SHA-0, a 160-bit hash function, was published by the National
Institute of Standards and Technology (NIST) in 1993. It had a few weaknesses and did not
become very popular. Later, in 1995, SHA-1 was designed to correct the alleged weaknesses of
SHA-0.
 SHA-1 was for many years the most widely used of the SHA hash functions. It has been employed
in several widely used applications and protocols, including Secure Socket Layer (SSL) security.
 In 2005, a method was found for uncovering collisions in SHA-1 within a practical time frame,
making the long-term employability of SHA-1 doubtful.
 The SHA-2 family has four further variants, SHA-224, SHA-256, SHA-384, and SHA-512,
named for the number of bits in their hash values. No successful attacks have yet been
reported on the SHA-2 hash functions.
 SHA-2 is a strong hash function. Though significantly different, its basic design
still follows that of SHA-1. Hence, NIST called for new competitive hash function designs.
 In October 2012, NIST chose the Keccak algorithm as the new SHA-3 standard. Keccak
offers many benefits, such as efficient performance and good resistance to attacks.
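
The digest lengths of the SHA variants can be listed with a short sketch. It assumes the algorithm names exposed by Python's hashlib module (SHA-0 is not available there); the particular selection of names is illustrative.

```python
import hashlib

# Algorithm names as spelled by hashlib (an assumption of this sketch).
names = ["sha1", "sha224", "sha256", "sha384", "sha512", "sha3_256", "sha3_512"]

# digest_size is reported in bytes, so multiply by 8 to get bits.
sizes = {name: hashlib.new(name).digest_size * 8 for name in names}

for name, bits in sizes.items():
    print(f"{name}: {bits}-bit hash value")
```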
RIPEMD
RIPEMD is an acronym for RACE Integrity Primitives Evaluation Message Digest. This set of
hash functions was designed by the open research community and is generally known as a family of
European hash functions.
 The set includes RIPEMD, RIPEMD-128, and RIPEMD-160. There also exist 256- and 320-bit
versions of this algorithm.
 The original RIPEMD (128 bit) is based upon the design principles used in MD4 and was found to
provide questionable security. The RIPEMD 128-bit version came as a quick-fix replacement to
overcome the vulnerabilities of the original RIPEMD.
 RIPEMD-160 is an improved version and the most widely used version in the family. The
256- and 320-bit versions reduce the chance of accidental collision, but do not have higher
levels of security than RIPEMD-128 and RIPEMD-160 respectively.
Whirlpool
This is a 512-bit hash function.
 It is derived from a modified version of the Advanced Encryption Standard (AES). One of the
designers was Vincent Rijmen, a co-creator of the AES.
 Three versions of Whirlpool have been released, namely WHIRLPOOL-0, WHIRLPOOL-T,
and WHIRLPOOL.

Applications of Hash Functions


There are two direct applications of hash functions based on their cryptographic properties.
Password Storage
Hash functions provide protection to password storage.
 Instead of storing passwords in the clear, most logon processes store the hash values of
passwords in a file.
 The password file consists of a table of pairs of the form (user id, h(P)).
 The process of logon is depicted in the following illustration –
 An intruder can only see the hashes of passwords, even if he accesses the password file. He
can neither log on using a hash nor derive the password from the hash value, since the hash
function possesses the property of pre-image resistance.
Data Integrity Check
Data integrity checking is one of the most common applications of hash functions. It is used to
generate checksums on data files. This application provides assurance to the user about the
correctness of the data.
The process is depicted in the following illustration −

The integrity check helps the user to detect any changes made to the original file. It does not,
however, provide any assurance about originality. The attacker, instead of modifying the file data,
can replace the entire file, compute an altogether new hash, and send both to the receiver. This
integrity check application is useful only if the user is sure about the originality of the file.
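
This limitation can be made concrete with a small sketch. It assumes SHA-256 as the checksum and uses made-up message contents; publish and check are hypothetical helpers. A changed file is detected, but a file replaced together with a freshly computed hash passes the check.

```python
import hashlib

def publish(data: bytes):
    """Return the data together with its checksum, as a sender would."""
    return data, hashlib.sha256(data).hexdigest()

def check(data: bytes, digest: str) -> bool:
    """Recompute the checksum and compare it with the one received."""
    return hashlib.sha256(data).hexdigest() == digest

original, digest = publish(b"pay Alice 10")

print(check(original, digest))         # unmodified data passes
print(check(b"pay Alice 11", digest))  # modified data is detected

# An attacker who replaces BOTH the data and the hash is not detected.
forged, forged_digest = publish(b"pay Mallory 1000")
print(check(forged, forged_digest))
```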
The discussion above covered data integrity threats and the use of hashing to detect whether any
modification attacks have taken place on the data.
Another type of threat that exists for data is the lack of message authentication. In this threat,
the user is not sure about the originator of the message. Message authentication can be provided
using cryptographic techniques that use secret keys, as is done in the case of encryption.

Message Authentication Code (MAC)


A MAC algorithm is a symmetric key cryptographic technique to provide message authentication. To
establish the MAC process, the sender and receiver share a symmetric key K.
Essentially, a MAC is an encrypted checksum generated on the underlying message that is sent
along with the message to ensure message authentication.
The process of using MAC for authentication is depicted in the following illustration −

Let us now try to understand the entire process in detail −


 The sender uses some publicly known MAC algorithm, inputs the message and the secret
key K, and produces a MAC value.
 Similar to a hash, a MAC function also compresses an arbitrarily long input into a fixed-length
output. The major difference between a hash and a MAC is that the MAC uses a secret key during
the compression.
 The sender forwards the message along with the MAC. Here, we assume that the message
is sent in the clear, as we are concerned with providing message origin authentication, not
confidentiality. If confidentiality is required, then the message needs encryption.
 On receipt of the message and the MAC, the receiver feeds the received message and the
shared secret key K into the MAC algorithm and re-computes the MAC value.
 The receiver now checks the equality of the freshly computed MAC with the MAC received from
the sender. If they match, then the receiver accepts the message and is assured that the
message has been sent by the intended sender.
 If the computed MAC does not match the MAC sent by the sender, the receiver cannot
determine whether it is the message that has been altered or the origin that has been
falsified. As a bottom line, the receiver safely assumes that the message is not genuine.
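
The exchange above can be sketched with HMAC, one publicly known MAC construction, using Python's standard hmac module. The choice of HMAC-SHA256, the key, and the message are assumptions made for illustration; make_mac and verify are hypothetical helper names.

```python
import hashlib
import hmac

def make_mac(key: bytes, message: bytes) -> bytes:
    """Sender side: compute a MAC over the message with the shared key K."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, received_mac: bytes) -> bool:
    """Receiver side: recompute the MAC and compare it in constant time."""
    return hmac.compare_digest(make_mac(key, message), received_mac)

shared_key = b"example shared secret K"   # assumed to be agreed in advance
message = b"transfer 100 to account 42"   # sent in the clear

mac = make_mac(shared_key, message)

print(verify(shared_key, message, mac))                        # genuine message
print(verify(shared_key, b"transfer 900 to account 42", mac))  # altered message
print(verify(b"some other key", message, mac))                 # wrong key
```

Only the first check succeeds. A failed check does not tell the receiver whether the message or the origin was falsified.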
Limitations of MAC
There are two major limitations of MAC, both due to its symmetric nature of operation −
 Establishment of Shared Secret
o It can provide message authentication only among pre-decided legitimate users who have
the shared key.
o This requires establishment of the shared secret prior to use of the MAC.
 Inability to Provide Non-Repudiation
o Non-repudiation is the assurance that a message originator cannot deny any
previously sent messages and commitments or actions.
o The MAC technique does not provide a non-repudiation service. If the sender and receiver
get involved in a dispute over message origination, MACs cannot provide proof that
a message was indeed sent by the sender.
o Though no third party can compute the MAC, the sender could still deny having sent the
message and claim that the receiver forged it, as it is impossible to determine which
of the two parties computed the MAC.
Both these limitations can be overcome by using public-key-based digital signatures.
With growing public awareness of digital privacy, it is no surprise to see an increasing interest
in encryption algorithms and cybersecurity. The MD5 algorithm was one of the first hashing
algorithms to take the global stage, as a successor to the MD4 algorithm. Despite the security
vulnerabilities discovered later, MD5 remains part of the data infrastructure in a multitude of
environments.

Before diving into the main topic, it is best to go through the basic concept of hashing first.

What is Hashing?
Hashing consists of converting a general string of information into an intricate piece of data.
This scrambles the data so that the hashed value is utterly different from the original.

Hashing uses a hash function to convert standard data into an unrecognizable format. These hash
functions are a set of mathematical calculations that transform the original information into its
hashed value, known as the hash digest or simply the digest. The digest size is always the same
for a particular hash function like MD5 or SHA-1, irrespective of input size.

Hashing has two primary use cases:

 Password Verification:
It is common for websites to store user credentials in a hashed format to prevent third parties
from reading the passwords. Since hash functions always provide the same output for the same
input, the website can compare password hashes instead of storing the passwords themselves.

The entire process is as follows:

1. A user signs up to the website with a new password

2. The website passes the password through a hash function and stores the digest on the server

3. When the user tries to log in, they enter the password again

4. The website passes the entered password through the hash function again to generate a digest

5. If the newly generated digest matches the one on the server, the login is verified
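
The five steps above can be sketched as follows. This is a minimal illustration that assumes SHA-256 as the hash function and an in-memory dictionary (server_store, a hypothetical name) in place of the server's storage; the user name and passwords are made up. Production systems typically also add a per-user salt and use a deliberately slow password-hashing function, which this sketch leaves out.

```python
import hashlib
import hmac

server_store = {}  # hypothetical stand-in for the server's (user id, digest) table

def digest_of(password: str) -> str:
    """Pass the password through the hash function."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

def sign_up(user_id: str, password: str) -> None:
    # Steps 1-2: only the digest is stored, never the password itself.
    server_store[user_id] = digest_of(password)

def log_in(user_id: str, password: str) -> bool:
    # Steps 3-5: hash the entered password and compare with the stored digest.
    stored = server_store.get(user_id)
    if stored is None:
        return False
    return hmac.compare_digest(stored, digest_of(password))

sign_up("alice", "correct horse battery staple")
print(log_in("alice", "correct horse battery staple"))
print(log_in("alice", "wrong guess"))
```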

 Integrity Verification:

Files can be checked for data corruption using hash functions. As in the above scenario, hash
functions always give the same output for the same input, no matter how many times the digest is
computed.

The entire process follows this order:

1. A user uploads a file on the internet

2. They also upload the hash digest along with the file

3. When another user downloads the file, they recalculate the hash digest

4. If the digest matches the original hash value, file integrity is maintained
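
The order above can be sketched with a temporary file standing in for the upload and download. The file name, contents, chunk size, and the choice of SHA-256 are all assumptions of this sketch; file_digest is a hypothetical helper.

```python
import hashlib
import os
import tempfile

def file_digest(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "download.bin")
    with open(path, "wb") as f:
        f.write(b"example file contents")

    published = file_digest(path)          # digest uploaded with the file
    intact = file_digest(path) == published

    with open(path, "ab") as f:            # simulate corruption in transit
        f.write(b"!")
    corrupted = file_digest(path) != published

print(intact, corrupted)
```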

Now that you have a foundation in hashing, you can look at the focus of this tutorial, the
MD5 algorithm.

What is the MD5 Algorithm?


MD5 (Message-Digest Algorithm 5) is a cryptographic hash algorithm used to generate a 128-bit
digest from a string of any length. It represents the digests as 32-digit hexadecimal numbers.
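
As a quick check of these sizes, here is a minimal sketch using the MD5 implementation in Python's hashlib. The input "abc" is one of the test strings listed in RFC 1321.

```python
import hashlib

md5 = hashlib.md5(b"abc")
hex_digest = md5.hexdigest()

print(md5.digest_size * 8, "bits")  # size of the digest in bits
print(len(hex_digest), "hexadecimal digits")
print(hex_digest)  # 900150983cd24fb0d6963f7d28e17f72 per the RFC 1321 test suite
```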

Ronald Rivest designed this algorithm in 1991 to provide the means for digital signature
verification. Eventually, it was integrated into multiple other frameworks as a security measure.

The digest size is always 128 bits, and by design a minor change in the input string generates a
drastically different digest. The design also aims to make it as unlikely as possible that two
different inputs produce the same digest, an event known as a hash collision.

You will now learn the steps that constitute the working of the MD5 algorithm.
Steps in MD5 Algorithm

There are four major sections of the algorithm:

Padding Bits

When you receive the input string, you have to pad it so that its length in bits is 64 bits short
of a multiple of 512. To pad, you must append a single one (1) bit first, followed by as many
zero bits as needed to reach that length.

Padding Length

You then need to append 64 more bits to make the length of your final string a multiple of 512.
To do so, take the length of the initial input in bits and express it as a 64-bit value. On
combining the two, the final string is ready to be hashed.
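
The two padding steps can be sketched for byte-oriented input as follows. md5_pad is a hypothetical helper name; the byte 0x80 is the single 1 bit followed by seven 0 bits, and the length is stored low-order byte first, as RFC 1321 specifies.

```python
import struct

def md5_pad(message: bytes) -> bytes:
    """Pad a message the way MD5 does, assuming whole-byte input."""
    bit_length = (len(message) * 8) % (1 << 64)  # original length in bits

    # Padding bits: a single 1 bit, then 0 bits until the length is
    # 64 bits (8 bytes) short of a multiple of 512 bits (64 bytes).
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)

    # Padding length: append the original bit length as a 64-bit value.
    padded += struct.pack("<Q", bit_length)
    return padded

for size in (0, 3, 55, 56, 64):
    print(size, "bytes ->", len(md5_pad(b"a" * size)) * 8, "bits")
```

Note that padding is always added: a 56-byte message already sits at the target length, so it grows by a whole extra block.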

Initialize MD Buffer

The entire string is split into multiple blocks of 512 bits each. You also need to initialize four
different buffers, namely A, B, C, and D. These buffers are 32 bits each and are initialized as
follows (hexadecimal, with the low-order byte listed first):
A = 01 23 45 67

B = 89 ab cd ef

C = fe dc ba 98

D = 76 54 32 10
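
Because RFC 1321 lists these bytes low-order first, the values look different when written as ordinary 32-bit numbers. The following sketch shows the conversion; the variable names are illustrative.

```python
import struct

# The sixteen bytes exactly as listed above for A, B, C, and D.
listed_bytes = bytes.fromhex("01234567" "89abcdef" "fedcba98" "76543210")

# "<4I" reads four unsigned 32-bit words, low-order byte first.
A, B, C, D = struct.unpack("<4I", listed_bytes)

for name, value in zip("ABCD", (A, B, C, D)):
    print(f"{name} = 0x{value:08x}")
```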

Process Each Block

Each 512-bit block gets broken down further into 16 sub-blocks of 32 bits each. There are four
rounds of operations, with each round utilizing all the sub-blocks, the buffers, and a constant array
value.

This constant array can be denoted as T[1] -> T[64].

The sub-blocks are denoted as M[0] -> M[15].
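
Both arrays can be built in a few lines. The formula for T is taken from RFC 1321, which defines T[i] as the integer part of 4294967296 times abs(sin(i)) with i in radians; the text above does not state it. The example block contents are arbitrary, and index 0 of T is left unused so that the numbering matches T[1] -> T[64].

```python
import math
import struct

# T[1]..T[64]; T[0] is an unused placeholder to keep the 1-based numbering.
T = [0] + [int(abs(math.sin(i)) * 2**32) for i in range(1, 65)]

# One 512-bit (64-byte) block with arbitrary example contents.
block = bytes(range(64))

# Sixteen 32-bit sub-blocks M[0]..M[15], each read low-order byte first.
M = struct.unpack("<16I", block)

print(f"T[1] = 0x{T[1]:08x}, T[64] = 0x{T[64]:08x}")
print(f"M[0] = 0x{M[0]:08x}, M[15] = 0x{M[15]:08x}")
```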

For a single buffer A, the values are processed in the following order:

 It passes B, C, and D through a non-linear function.

 The result is added to the value present at A.

 It adds the sub-block value to the result above.

 Then, it adds the constant value for that particular iteration.

 A circular left shift is applied to the result.

 As a final step, it adds the value of B to the result, which is then stored in buffer A.

The steps mentioned above are run for every buffer and every sub-block. When the last block’s
final buffer is complete, you will receive the MD5 digest.
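
A single pass through the steps listed above can be sketched as one function. The names md5_step, rotl32, and round1 are hypothetical, all arithmetic is taken modulo 2^32, and round1 is the Round 1 expression from this section. The shift amount 7 and constant 0xD76AA478 used in the usage lines are the values RFC 1321 gives for the first operation.

```python
MASK = 0xFFFFFFFF  # keep every intermediate value within 32 bits

def rotl32(x: int, n: int) -> int:
    """Circular left shift of a 32-bit value by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def round1(b: int, c: int, d: int) -> int:
    """Round 1 non-linear function: (b AND c) OR ((NOT b) AND d)."""
    return ((b & c) | (~b & d)) & MASK

def md5_step(a, b, c, d, m_k, t_i, s, func):
    """Return the new value of buffer A after one operation."""
    total = (a + func(b, c, d) + m_k + t_i) & MASK  # add A, sub-block, constant
    return (b + rotl32(total, s)) & MASK            # shift, then add B

# Usage: first operation on a block whose first sub-block M[0] is zero.
new_a = md5_step(0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476,
                 m_k=0, t_i=0xD76AA478, s=7, func=round1)
print(f"0x{new_a:08x}")
```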

The non-linear process above is different for each round of the sub-block.
Round 1: (b AND c) OR ((NOT b) AND (d))

Round 2: (b AND d) OR (c AND (NOT d))

Round 3: b XOR c XOR d

Round 4: c XOR (b OR (NOT d))
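
Putting the sections together, the following is a compact sketch of the whole algorithm, checked against Python's built-in hashlib.md5. It is an illustration rather than a reference implementation. The per-operation shift amounts (S) and the order in which sub-blocks are used in rounds 2 to 4 (k) are not given in the text above and are taken from RFC 1321. The function name md5_sketch is hypothetical, and the constants are indexed from 0 here.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF
T = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]  # constants
S = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
     [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)                   # shift amounts

def rotl32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK

def md5_sketch(message: bytes) -> str:
    # Padding bits and padding length.
    bit_length = (len(message) * 8) % (1 << 64)
    data = message + b"\x80"
    data += b"\x00" * ((56 - len(data) % 64) % 64)
    data += struct.pack("<Q", bit_length)

    # Initialize the MD buffers.
    a0, b0, c0, d0 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476

    # Process each 512-bit block.
    for offset in range(0, len(data), 64):
        m = struct.unpack("<16I", data[offset:offset + 64])
        a, b, c, d = a0, b0, c0, d0
        for i in range(64):
            if i < 16:                      # Round 1
                f, k = (b & c) | (~b & d), i
            elif i < 32:                    # Round 2
                f, k = (b & d) | (c & ~d), (5 * i + 1) % 16
            elif i < 48:                    # Round 3
                f, k = b ^ c ^ d, (3 * i + 5) % 16
            else:                           # Round 4
                f, k = c ^ (b | ~d), (7 * i) % 16
            f = (f + a + m[k] + T[i]) & MASK
            a, d, c = d, c, b               # rotate the buffers
            b = (b + rotl32(f, S[i])) & MASK
        # Add this block's result to the running buffers.
        a0, b0 = (a0 + a) & MASK, (b0 + b) & MASK
        c0, d0 = (c0 + c) & MASK, (d0 + d) & MASK

    return struct.pack("<4I", a0, b0, c0, d0).hex()

for sample in (b"", b"abc", b"a" * 1000):
    print(md5_sketch(sample), md5_sketch(sample) == hashlib.md5(sample).hexdigest())
```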

With this, you conclude the working of the MD5 algorithm. You will now see the advantages
of using this particular hash algorithm.

Advantages of MD5

 Easy to Compare: Unlike the digests of the latest hash algorithm families, a 32-digit digest is
relatively easy to compare when verifying the digests.

 Storing Passwords: Passwords need not be stored in plaintext format, where they would be accessible
to hackers and malicious actors. When using digests, the database also benefits since all hash values
are the same size.

 Low Resource: A relatively low memory footprint makes it possible to integrate multiple services into
the same framework without CPU overhead.

 Integrity Check: You can detect file corruption by comparing hash values before and after transit. If
the hashes match, the file has not been corrupted in transit.

Message-digest algorithm characteristics


Message-digest algorithms, also known as hash functions, are one-way functions; they accept a
message of any size as input and produce as output a fixed-length message digest.

MD5 is the third message-digest algorithm Rivest created. MD2, MD4 and MD5 have similar
structures, but MD2 was optimized for 8-bit machines, whereas the two later
algorithms are designed for 32-bit machines. The MD5 algorithm is an extension of
MD4, which critical review found to be fast but potentially insecure. In comparison, MD5 is
not quite as fast as the MD4 algorithm, but offered much more assurance of data security.

How does MD5 work?


The MD5 message-digest hashing algorithm processes data in 512-bit blocks, each broken down into
16 words composed of 32 bits each. The output from MD5 is a 128-bit message-digest value.

Computation of the MD5 digest value is performed in separate stages that process each 512-bit
block of data along with the value computed in the preceding stage. The first stage begins with
the message-digest values initialized using consecutive hexadecimal numerical values. Each
stage includes four message-digest passes, which manipulate values in the current data block
and values processed from the previous block. The final value computed from the last block
becomes the MD5 digest for the message.

Is MD5 secure?
The goal of any message-digest function is to produce digests that appear to be random. To be
considered cryptographically secure, the hash function should meet two requirements:

1. It is computationally infeasible for an attacker to generate a message matching a specific hash value.

2. It is computationally infeasible for an attacker to create two messages that produce the same hash value.

MD5 hashes are no longer considered cryptographically secure methods and should not be used
for cryptographic authentication, according to IETF.

In 2011, IETF published RFC 6151, "Updated Security Considerations for the MD5 Message-
Digest and the HMAC-MD5 Algorithms," which cited a number of recent attacks against MD5
hashes. It mentioned one that generated hash collisions in a minute or less on a standard
notebook and another that could generate a collision in as little as 10 seconds on a 2.6 gigahertz
Pentium 4 system. As a result, IETF suggested that new protocol designs should not use MD5 at
all and that the recent research attacks against the algorithm "have provided sufficient reason to
eliminate MD5 usage in applications where collision resistance is required such as digital
signatures."

Alternatives to MD5
A major concern with MD5 is its potential for message collisions, in which two different
messages produce the same hash code. MD5 hash codes are also limited to 128 bits. This
makes them easier to breach than the hash code algorithms that followed.
Alternative hash codes to MD5 include the following.

Secure Hash Algorithm 1 (SHA-1). Developed by the U.S. government in the 1990s, SHA-1
used techniques like those of MD5 in the design of message-digest algorithms. But SHA-1
generated more secure 160-bit values when compared to MD5's 128-bit hash value lengths.
Despite this, SHA-1 had some weaknesses and did not prove to be the ultimate hashing
algorithm, either. Security concerns began to emerge, prompting companies
like Microsoft to discontinue support for SHA-1 in their software.

The SHA-2 hash code family. The more secure successor to SHA-1, and one that is widely used
today, is the SHA-2 family of hash codes. SHA-2 hash codes were created by the U.S. National
Security Agency in 2001. They represent a significant departure from SHA-1 in that the SHA-2
message digests are longer and harder to break. The SHA-2 family of algorithms
delivers hash values that are 224, 256, 384 and 512 bits in length. They are known by the names
of their message-digest lengths -- for example, SHA-224 and SHA-256.

Cyclic redundancy check (CRC) codes. CRC codes are often suggested as possible
substitutions for MD5 because both MD5 and CRC perform hashing functions, and both deliver
checksums. But the similarity ends there. A 32-bit CRC code is used to detect errors during data
transmissions so corrupted or lost data can be identified. MD5, by contrast, is a
cryptographic hash function that can also detect some data corruption but was
primarily intended for security uses such as the verification of digital signatures and
certificates.
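
The difference can be illustrated with a short sketch using zlib.crc32 and hashlib.md5 from Python's standard library. The three equal-length messages are arbitrary examples. CRC-32 has a simple algebraic structure: for equal-length messages, the CRC of their XOR equals the XOR of their CRCs, which lets someone adjust data predictably without knowing anything secret. A cryptographic hash is designed to have no such exploitable structure.

```python
import hashlib
import zlib

# Three arbitrary messages of the same length.
a, b, c = b"a" * 16, b"b" * 16, b"c" * 16
xor_abc = bytes(x ^ y ^ z for x, y, z in zip(a, b, c))

# CRC-32: a 32-bit checksum with predictable algebraic structure.
crc_is_linear = (zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)) == zlib.crc32(xor_abc)

def md5_int(data: bytes) -> int:
    """MD5 digest of the data as an integer, for the same XOR comparison."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")

# MD5: a 128-bit digest for which the same relation does not hold.
md5_is_linear = (md5_int(a) ^ md5_int(b) ^ md5_int(c)) == md5_int(xor_abc)

print("CRC-32 relation holds:", crc_is_linear)
print("MD5 relation holds:", md5_is_linear)
```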
