HTTP
Contents

- 1. Introduction
  - 1.1 History
  - 1.2 Caching and Proxies
  - 1.3 Plug-ins and Helpers
  - 1.4 Summary
- 2. Internet Protocol
  - 2.1 Introduction
  - 2.2 IP Addressing
  - 2.3 Domain Name System
- 3. Transmission Control Protocol
  - 3.1 Introduction
  - 3.2 Socket Creation
  - 3.3 Reliable Packet Transmission
- 4. HTTP/0.9 and HTTP/1.0
  - 4.1 Introduction
  - 4.2 GET
  - 4.3 Headers
  - 4.4 Responses
  - 4.5 HEAD and POST Requests
- 5. HTTP/1.1
  - 5.1 Introduction
  - 5.2 Persistent Connection
  - 5.3 Caching
  - 5.4 Other Enhancements

Appendices

- A. References
- B. HyperText Transfer Protocol Design Issues
1. Introduction

- 1.1 Basic Model
- 1.2 Caching and Proxies
- 1.3 Plug-ins and Helpers

2. Internet Protocol

- 2.1 Introduction
- 2.2 IP Addressing
- 2.3 Domain Name System
2.1 Introduction
As we have already seen, the Internet is a network of networks. The problem IP addresses is how to
achieve end-to-end communication across a potentially heterogeneous collection of intermediate networks
along the way. The Internet Protocol (IP) is the network-level protocol that underlies the Internet. The main
aim was to keep the Internet itself simple. IP defines an architecture based on sending individual packets
from one host to another. Each host has a unique name, its IP address. Each packet records where it came
from and where it is going: the IP header contains this information, and it is followed by the packet's data.
The Internet consists of a set of routers that accept packets, look at each packet's destination and decide
where to route it next. No state is retained, and it is feasible for a packet to arrive back at the same router.
Each router makes a decision for each packet based on its current understanding of the state of the Internet.
Thus packets travel independently on different paths, and they are likely to arrive at the destination host in a
different order from the one in which they left. Packets may not all follow the same route, and routes are not
necessarily of the same length. Packets may be lost on the way or be corrupted by the time they arrive. If the
Internet is congested, routers may need to store packets before forwarding them. If a router gets overloaded,
it will throw packets away in order to relieve congestion.
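To make the header concrete, here is a minimal Python sketch that decodes the addressing fields of an IPv4 header from raw packet bytes; the field offsets follow the IPv4 layout, while the function name and the selection of fields are merely illustrative:

import struct

def parse_ipv4_header(packet: bytes) -> dict:
    # The first 20 bytes of an IPv4 packet carry, among other things,
    # the source and destination addresses that routers act upon.
    version_ihl, tos, total_length = struct.unpack("!BBH", packet[:4])
    return {
        "version": version_ihl >> 4,
        "header_length": (version_ihl & 0x0F) * 4,   # in bytes
        "total_length": total_length,                # header + data
        "ttl": packet[8],                            # hop limit
        "protocol": packet[9],                       # 6 = TCP
        "source": ".".join(str(b) for b in packet[12:16]),
        "destination": ".".join(str(b) for b in packet[16:20]),
    }

Handed the first 20 bytes of any IPv4 packet, this yields the fields a router inspects, the destination address chief among them.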
By being so simple, IP can operate in a network environment that is overloaded or suffering transient failures.
Clearly for many applications, this unreliable and unordered delivery is a problem, as it is for the Web. In
consequence, sitting on top of IP is the Transmission Control Protocol (TCP), whose job is to ensure that
packets are delivered reliably and in the right order. We will talk about TCP later.
2.2 IP Addressing
IP addresses are 32-bit numbers that can be thought of as mailing addresses. Apart from some special
cases that we do not need to bother about, an IP address refers to a specific machine on a defined site. In
consequence, it was sensible to divide the address into two parts. The first part, the network part, defines the
place on the network where the specific machine sits; the second part, the host part, defines the machine at
that location. The routers on the Internet therefore only need to know the network part of the address and
can rely on the network at the site to deliver the packet to the correct machine.
An organisation like Brookes will be allocated a set of IP addresses for its machines and, in a simple
situation, a single network address. Unfortunately things are not that simple. How many bits should be
allocated to the network part of the address and how many to the host part? That depends on the size of
the site. As this varies, different Classes of site were defined early on, corresponding to large, medium and
small sites. Figure 2.1 shows the Classes available:
Class A
Really large sites of up to 16 million machines. 8 bits are used for the network address and 24 for the host.
Class B
Medium-size sites with up to 65,000 machines. 16 bits are used for the network address and 16 for the host.
Class C
Small organisations with up to 256 hosts. 24 bits are used for the network address and 8 bits for the host.
Addresses are written in dotted-decimal notation: each of the four octets is written as a decimal value, with full stops between them (for example, 192.0.2.1).
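As a sketch of how the class determines the network/host split, this Python fragment classifies an address by its first octet and separates the two parts (classes D and E are ignored for simplicity; the function name is illustrative):

import ipaddress

def classful_split(address: str):
    # The first octet decides the class, and the class decides how
    # many bits belong to the network part.
    value = int(ipaddress.IPv4Address(address))
    first_octet = value >> 24
    if first_octet < 128:      # Class A: 8-bit network, 24-bit host
        net_bits, cls = 8, "A"
    elif first_octet < 192:    # Class B: 16-bit network, 16-bit host
        net_bits, cls = 16, "B"
    else:                      # Class C: 24-bit network, 8-bit host
        net_bits, cls = 24, "C"
    host_bits = 32 - net_bits
    return cls, value >> host_bits, value & ((1 << host_bits) - 1)

print(classful_split("192.0.2.1"))   # ('C', 12582914, 1)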
3.1 Introduction
The position of the Transmission Control Protocol (TCP) in the overall scheme is shown in Figure 3.1. HTTP
expects requests to arrive, and responses likewise. TCP's role is to achieve that despite the underlying
infrastructure (IP) being unreliable and unordered.
IP can be thought of as a set of buckets transmitting information from one place to another. The aim of TCP
is to turn the individual buckets into a drainpipe [4].
TCP achieves this by establishing a unique connection between the Client and Server (the endpoints are
called sockets) and then ensuring that all packets are delivered, forcing a retransmission when packets
appear to be lost. TCP does all the work in solving the problems of packet loss, corruption and reordering
that the IP layer may have introduced.
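As a small illustration, here is a minimal Python sketch (the target host is just an example) that creates a TCP socket and connects to a server; the application sees only a reliable byte stream between two endpoints, while the handshake and retransmission machinery stay hidden inside TCP:

import socket

# Create a TCP socket and connect; the three-way handshake that
# sets up the reliable byte stream happens inside connect().
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("www.w3.org", 80))

# The connection is identified by its pair of endpoints (sockets).
print("local endpoint :", sock.getsockname())   # (client IP, ephemeral port)
print("remote endpoint:", sock.getpeername())   # (server IP, port 80)

sock.close()  # the four-packet teardown is likewise handled by TCP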
4.1 Introduction
Given the underlying TCP/IP network, it can be assumed that requests made via HTTP will arrive at the
correct Server and that, if the Server locates the page required, it will be returned correctly. If the page is not
located, an error will be received. It is on this basis that we will now look at the HTTP protocol.
The time line for HTTP is as follows:
December 1990
HTTP defined to transmit the first Web pages
January 1992: HTTP/0.9
Simple GET protocol for the Web, limits on data transfer
March 1993: HTTP/1.0 Draft
Several iterations
May 1996: HTTP/1.0 RFC 1945
Headers give information about the data transferred. Greater data transfer in both directions
January 1997: HTTP/1.1 Proposal
Supports hierarchical proxy servers, caching, and persistent connections
June 1999: HTTP/1.1 Draft Standard
Initially proposed in 1997, significant use by 1998
2001: HTTP/1.1 Standard
Date to be defined
IETF, the organisation that standardises HTTP, tends to issue a pre-release version of its standards with a /0.9
label before the full standard is given the label /1.0. As can be seen, progress has been relatively slow. It
took several years before HTTP/0.9 was finalised and another four years before HTTP/1.0 was finalised. The
latest version of HTTP is HTTP/1.1, which is in wide use even though it is not yet formally a standard. This is
generally the case with IETF: much user experience is built up before the standard is finalised. By 1998, the
world was nearly all using HTTP/1.0, with only 10% still using HTTP/0.9. Similarly, HTTP/1.1 is the main
version in use today even though the standard is not finalised.
HTTP is based on messages: those that pass from the Client to the Server are called request messages,
and those from the Server to the Client are called response messages. All request messages start with a
request line and all response messages start with a response line. Following the initial line, in both cases it
is possible to include zero or more header lines that give additional information.
HTTP/0.9 defined a single GET request and no headers. HTTP/1.0 added HEAD and POST requests and
introduced headers.
4.2 GET Request
HTTP/0.9 had a very simple GET request message: an ASCII string consisting of the message name
followed by a URL and the Carriage Return and Line Feed characters. This can be demonstrated simply by
opening a Telnet session on Port 80 of a web site as follows:
telnet www.w3.org 80
GET /www/overview.html CRLF
This will cause the W3C site to send the document www.w3.org/www/overview.html to the command line
interface. Something like the following will be returned:
<HTML>
. . .
</HTML>
telnet www.w3.org 80
GET /www/overview.html HTTP/0.9 CRLF
Early on it was not necessary to say which version of HTTP was being used. With several versions now
existing, it is best to indicate in the request which version of HTTP is being used.
When the user clicks on a link in an HTML page, asks the browser to load a page, hits the back button or
chooses a favourite site, the result in all these cases is a GET message requesting the required page, sent
by TCP to the Server once a connection has first been established. In HTTP/0.9, the document returned was
limited to 1024 characters. In consequence, most HTTP requests consisted of a single message/packet in
each direction once the connection had been established. In HTTP/0.9 and HTTP/1.0, once the server has
sent the document, the connection is dropped. No state is retained in HTTP concerning the transaction. In
HTTP/0.9 there would have been 3 packets used to set up the connection, two packets used to send the
request message (plus the acknowledgment), two to send the response and 4 to drop the connection: a total
of 11 packets for the two information packets to be sent and received. This of course assumes that no
packets were lost or retransmitted.
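The telnet exchange above can equally be scripted. This Python sketch sends an HTTP/0.9-style request (just the method, the path and CRLF, no headers) over a TCP socket and reads until the server drops the connection; note that a modern server may no longer accept so old a request form:

import socket

def get_http09(host: str, path: str) -> bytes:
    # HTTP/0.9 has no headers and no status line: the server just
    # writes the document and then drops the connection.
    with socket.create_connection((host, 80), timeout=5) as conn:
        conn.sendall(("GET " + path + "\r\n").encode("ascii"))
        document = b""
        while True:
            chunk = conn.recv(4096)
            if not chunk:        # connection dropped: end of document
                break
            document += chunk
    return document

print(get_http09("www.w3.org", "/www/overview.html")[:200])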
4.3 Headers
The general message request format was extended in HTTP/1.0 to include headers:
Request line
Headers (0 or more lines)
Blank line
Optional Message Body
For example, the GET request line can be followed by a header that modifies the request to say: only
retrieve the file if it has been modified since a given date in September 1999.
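Such a request would look something like this (the path and exact date are illustrative):

GET /index.html HTTP/1.0
If-Modified-Since: Mon, 13 Sep 1999 10:00:00 GMT

A blank line ends the headers; a GET request carries no message body.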
The format of the reply will be:
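A plausible reconstruction of such a reply (the header values are illustrative):

HTTP/1.0 200 OK
Date: Mon, 13 Sep 1999 10:00:05 GMT
Server: Apache/1.3
Last-Modified: Fri, 10 Sep 1999 16:30:00 GMT
Content-Type: text/html
Content-Length: 1579

<HTML>
. . .
</HTML>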
The response has come via HTTP/1.0. The response code is 200, which says that the document was
successfully located and follows. The headers give additional information about the Server and the
document.
Headers fall into four types: General, Request, Response and Entity.
The table below is not the complete list available with HTTP/1.0 but gives a flavour of those available. Headers
all have the general format of a name followed by a colon and then the information. Request headers only
appear in request messages and Response headers only in replies; General and Entity headers can appear
in both. Brief descriptions are:
Header             Type      Meaning
Date               General   Current time/date
Pragma             General   Request to behave in a certain way (no-cache asks proxies not to provide cached copies)
Authorization      Request   Send userid/password
From               Request   User sends email address as identification
If-Modified-Since  Request   Conditional GET. Ignore request if not modified since the given date
User-Agent         Request   Web browser name and version number
Location           Response  Redirect the request to where it can be found
Server             Response  Type of server
WWW-Authenticate   Response  Challenge a client seeking access to a resource that needs authentication
Allow              Entity    Defines methods allowed to access the resource
Content-Length     Entity    Number of bytes of data in the body
Content-Type       Entity    MIME type of the data
Content-Encoding   Entity    Decoding needed to generate the Content-Type, usually used for compression
Expires            Entity    When to discard from a cache
Last-Modified      Entity    Time when the data was last modified
4.4 Responses
The responses all start with the response line, which includes the response code. The codes are divided into
the following classes:

- 1XX: Information
- 2XX: Request successful
- 3XX: Redirection
- 4XX: Client error
- 5XX: Server error
No information responses were defined in HTTP/1.0, although the class was set up. The possible responses
are:

Response  Meaning
200       Request succeeded
202       Request accepted, processing incomplete
204       No Content, for example clicking on part of an image map that is inactive
301       Requested URL assigned a new permanent URL
302       Requested URL temporarily assigned a new URL
304       Document not modified; not changed since the modification time given in the request
400       Bad request
401       Request not accepted, needs user authentication
403       Forbidden. A reason might be given in an entity response
404       Not found, the most widely received message
500       Internal server error
501       Not implemented
502       Invalid response from gateway or upstream server
503       Service temporarily unavailable
The most frequent ones are 200 and 404.
4.5 HEAD and POST Requests
Two new requests were added in HTTP/1.0: HEAD and POST. HEAD is similar in format to GET except that
it does not expect the file to be transmitted; all it requires are the associated header lines. For example:
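A HEAD request of the kind described next might look like this (the path is illustrative):

HEAD /index.html HTTP/1.0
If-Modified-Since: Mon, 13 Sep 1999 10:00:00 GMT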
This is a request to see if the document has been modified since a specified time on 13 September 1999.
The response would be something like:
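If the document has not changed, the reply carries no body (the header values here are illustrative):

HTTP/1.0 304 Not Modified
Date: Mon, 13 Sep 1999 10:00:05 GMT
Server: Apache/1.3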
POST, the other new request, works in the opposite direction: it carries a message body from the Client to
the Server, typically the contents of an HTML form encoded as name=value pairs, such as:

name=Bob+Hopgood&children=3
5. HTTP/1.1

- 5.1 Introduction
- 5.2 Persistent Connection
- 5.3 Caching
- 5.4 Other Enhancements
5.1 Introduction
HTTP/1.1 is a major update to HTTP/1.0. A subset of the new headers in HTTP/1.1 is:
Header                   Type      Meaning
Cache-Control            General   Caching information
Connection               General   Connection management
Trailer                  General   Headers at the end of the message, used with chunking
Transfer-Encoding        General   Transformation applied to the message body, allows separate chunks to be sent
Upgrade                  General   Suggests another, newer protocol the server can handle
Via                      General   Information about intermediate servers passed on
Warning                  General   Non-catastrophic error
Proxy-Authorization      Request   Authentication with a proxy
Accept                   Request   Preferred media type
If-Match, If-None-Match  Request   Checking Entity Tags
If-Unmodified-Since      Request   Check when last modified
Expect                   Request   Expected server behaviour, e.g. can it handle a large file
Host                     Request   Resource Host, mandatory in HTTP/1.1
Max-Forwards             Request   Limits number of hops
Range                    Request   Requests part of a file
Location                 Response  Alternative place to find the file
Retry-After              Response  Time before retrying
Accept-Ranges            Response  Server can accept range requests
Proxy-Authenticate       Response  Authentication but for the proxy
ETag                     Response  Defines the ETag
Vary                     Response  Variant of the Resource
Content-Language         Entity    Language of the resource
Content-Location         Entity    Alternative location
Content-Range            Entity    Range in the resource
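Since Host is now mandatory, the smallest legal HTTP/1.1 request is a request line plus a Host header (the path and host here are illustrative):

GET /index.html HTTP/1.1
Host: www.example.org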
The design of HTTP/1.0 had several limitations once the Web started to be used in a wider context. The
simple model of setting up a connection, downloading a page, and then dropping the connection does not
make a great deal of sense once images start to be included in a document. Several GET requests are
needed to retrieve all the parts of the page. Many of these will be to the same Server, resulting in several
connections being set up and destroyed for a single page to be transmitted. TCP keeps some information
concerning bandwidth congestion around the network so that future transmissions can benefit from past
experience. Such information is lost when the connection is broken and has to be discovered again when the
connection is re-established. Browsers started to set up multiple connections at the same time to try to
improve performance for one user, while making things worse for others.
A second major problem arose once caches became commonplace. There was little information available as
to when an old, stale page can be tolerated and when an up-to-date page must be provided. It should be
possible to receive a Web page from a cache rather than the original server and be confident that it is the
most recent.
There was also the need for many minor enhancements, and we will discuss one or two of these once we
have looked at persistent connections and caching. To give a feel for the level of change: five new requests,
over 30 new headers, and another 20 response codes were added in HTTP/1.1.
5.2 Persistent Connection
A fundamental difference between HTTP/1.0 and HTTP/1.1 is that the connection is not dropped as soon as
the request has been replied to. In HTTP/1.1, the connection remains open until the Server decides that it is
no longer required. This is a heuristic decision based on the availability of resources and the use that has
been made of the connection. HTTP/1.1 also allows several GET requests to be sent over the connection
before the first response has been returned. At the server end, this allows the server to return some
documents before others, depending on whether it needs to access them from disc or whether they are
available in its own cache.
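As a rough sketch of pipelining over one persistent connection, the following Python fragment writes several requests before reading any response (the host and paths are illustrative; a real client would have to parse Content-Length or chunked encoding to split the responses apart, which is omitted here):

import socket

HOST = "www.example.org"          # illustrative host
PATHS = ["/", "/logo.png", "/style.css"]

with socket.create_connection((HOST, 80), timeout=5) as conn:
    # Pipelining: all requests go out before the first response returns.
    for path in PATHS:
        conn.sendall(("GET " + path + " HTTP/1.1\r\n"
                      "Host: " + HOST + "\r\n\r\n").encode("ascii"))
    # Ask the server to close after the final response so that
    # end-of-stream marks the end of the data.
    conn.sendall(("GET /last.html HTTP/1.1\r\nHost: " + HOST + "\r\n"
                  "Connection: close\r\n\r\n").encode("ascii"))
    data = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        data += chunk

print(len(data), "bytes received over a single connection")

All the responses come back, in order, over the one TCP connection, which is what allows TCP's congestion information to be reused across the transfers.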
As well as improving the user's perceived performance, persistent connections reduce the load on both the
network and the servers, because connections are not dropped and TCP can make more efficient use of the
network. One estimate is that downloading a similar file with HTTP/1.1 requires about 60% of the Server
resources that it does with HTTP/1.0. In hardware terms for a large site, that might mean that 20% fewer
servers are needed.
W3C has a benchmark page that originated by merging the Netscape and Microsoft home pages as they
were a few years ago, complete with a set of images; this Microscape benchmark is used to check out
performance. There were 43 images in the combined Microscape page. Four scenarios were compared.
The first uses HTTP/1.0 and opens 4 TCP connections to bring down the images with the page itself. The
second uses HTTP/1.1 but with a single connection that is persistent. The third pipelines the requests rather
than waiting for each to complete. The last test also added compression, which is available in HTTP/1.1; this
will not improve the images very much as they are already compressed, but it has some effect on the page
itself. The results are shown in Figure 5.1.
As can be seen, the four HTTP/1.0 connections are faster than the single HTTP/1.1 connection, but at the
cost of many more TCP packets being sent due to the dropping of the connections after each transfer. Once
the HTTP/1.1 connection is pipelined, the number of packets needed goes down and the performance is
significantly better. With a test containing this many images, only marginal improvement comes from the
compression.
5.3 Caching
A client can state its caching requirements in a request by using the Cache-Control header, for example:

Cache-Control: no-cache

This tells the caches along the way that the response must come from the original server rather than from a
cache, presumably because the page contains information that quickly goes out of date. The full set of
possibilities in GET requests is:
Directive       Description
no-cache        Must come from the original server
only-if-cached  Only from a cache
no-store        Do not cache the response
max-age         Age no greater than this. max-age=0 is the same as no-cache
max-stale       Can be expired, but not by more than this. max-stale=120 accepts a stale response as long as it is not stale by more than 2 minutes
min-fresh       Fresh for at least this long. min-fresh=120 means that the response has to have an expiry time at least 2 minutes in the future
no-transform    Proxy must not change the media type. Not allowed to turn a png into a gif
Support is also provided for the server to give further information when it returns the page, using the
Cache-Control header:

Directive         Description
public            Insists the page is cacheable. The cache might have thought it was in a non-cacheable class
private           Stops a response being cacheable (such as one containing a Customer ID or a password)
no-store          Stops caching of both request and response
no-cache          Insists the page is not cacheable. A proxy may cache it as long as it validates each time
no-transform      Must not transform the MIME type
must-revalidate   Strictly follow the rules. Must revalidate; if it cannot, must return a 504 or 500 error
proxy-revalidate  Proxy must follow the rules. Local browser caches do not have to
max-age           Maximum time it is fresh, overrides the Expires header
s-maxage          Maximum time it should be cached in a proxy cache
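Putting these together, a response that any cache may store and serve for up to an hour might carry headers like the following (the values are illustrative):

HTTP/1.1 200 OK
Date: Mon, 05 Mar 2001 10:00:00 GMT
Cache-Control: public, max-age=3600
ETag: "ab12cd34"
Content-Type: text/html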
Appendix A
References
There are some useful Web sites and books relevant to HTTP:
1. http://www.w3.org/Protocols/rfc1945/rfc1945
HTTP/1.0 May 1996
2. http://www.ietf.org/rfc/rfc2616.txt
HTTP/1.1 Draft Standard, June 1999
3. Web Protocols and Practice, Balachander Krishnamurthy and Jennifer Rexford
Addison Wesley, 2001.
4. The World Wide Web, Mark Handley and Jon Crowcroft
UCL Press, 1996.
5. Computer Networks (Third Edition), Andrew S Tanenbaum
Prentice-Hall, 1996.
6. High-Performance Communication Networks (2nd edition), Jean Walrand & Pravin Varaiya
Morgan Kaufmann Publishers, 2000.
Appendix B
HyperText Transfer Protocol Design Issues
Underlying protocol
There are various distinct possible bases for the protocol - we can choose:

- Something based on, and looking like, an Internet protocol. This has the advantage of being well
understood, and of existing implementations being all over the place. It also leaves open the possibility of
a universal FTP/HTTP or NNTP/HTTP server. This is the case for the current HTTP.
- Something based on an RPC standard. This has the advantage of making it easy to generate the
code, that the parsing of the messages is done automatically, and that the transfer of binary data is
efficient. It has the disadvantage that one needs the RPC code to be available on all platforms. One
would have to choose one (or more) styles of RPC. Another disadvantage may be that existing RPC
systems are not efficient at transferring large quantities of text over a stream protocol unless (like
DD-OC-RPC) one has a let-out and can access the socket directly.
- Something based on the OSI stack, as is Z39.50. This would have to be run over TCP in the internet
world.

Current HTTP uses the first alternative, to make it simple to program, so that it will catch on: conversion to
run over an OSI stack would be simple as the structure of the messages is well defined.
Idempotent?
Another choice is whether to make the protocol idempotent or not. That is, does the server need to keep any
state information about the client? (For example, the NFS protocol is idempotent, but the FTP and NNTP
protocols are not.) In the case of FTP, the state information consists of authorisation, which is not trivial to
establish every time but could be, and the current directory and transfer mode, which are basically trivial. The
proposed protocol IS idempotent.
This causes, in principle, a problem when trying to map a non-idempotent system (such as library search
systems which store "result sets" on behalf of the client) into the web. The problem is that to use them in an
idempotent way requires the re-evaluation of the intermediate result sets at each query. This can be solved
by the gateway intelligently caching result sets for a reasonable time.
Response
Suppose the response is an SGML document, with the document type a function of the status.
Possible replies one could imagine, encoded as SGML:

<!GDOC HTML>
a normal HTML document
<!/GDOC>
Status
A status is required in machine-readable format; see the 3-figure status codes of FTP, for example. Bad
status codes should be accompanied by an explanatory document, possibly containing links to further
information. A possibility would be to make an error response a special SGML document type. Some
special status codes are mentioned below.
Format
The format selected by the server
Document
The document in that format
Status codes
Success
Accompanied by format and document.
Forward
Accompanied by the new address. The server indicates a new address to be used by the client for finding
the document. The document may have moved, or the server may be a name server.
Need Authorisation
The authorisation is not sufficient. Accompanied by the address prefix for which authorisation is required.
The browser should obtain authorisation, and use it every time a request is made for a document name
matching that prefix.
Refused
Access has been refused. Sending (more) authorization won't help.
Bad document name
The document name did not refer to a valid document.
Server failure
Not the client's fault. Accompanied by a natural language explanation.
Not available now
Temporary problem - trying at a later time might help. This does not imply anything about the document
name and authorisation being valid. Accompanied by a natural language explanation.
Search fail
Accompanied by an HTML hit-list without any hits, but possibly containing a natural language explanation.