MIW Chapter 2
2.1 Web Documents. 2.2 Resource Identifiers: URI, URL, and URN. 2.3 Protocols. 2.4 Log Files. 2.5 Search Engines.
Transfer protocols:
Conventions that regulate the communication between a browser (web user agent) and a server
SGML Components
SGML documents have three parts:
Declaration: specifies which characters and delimiters may appear in the application
DTD / style sheet: defines the syntax of markup constructs
Document instance: the actual text (with the tags) of the document
HTML Background
HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA.
The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML.
HTML standards are organized by W3C: http://www.w3.org/MarkUp/
Modeling the Internet and the Web
School of Information and Computer Science University of California, Irvine
HTML Functionalities
HTML gives authors the means to:
Publish online documents with headings, text, tables, lists, photos, etc
Include spread-sheets, video clips, sound clips, and other applications directly in their documents
Link information via hypertext links, at the click of a button
Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc
HTML Versions
HTML 4.01 is a revision of the HTML 4.0 Recommendation first released on 18th December 1997.
HTML 4.01 Specification:
http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt
HTML 4.0 was first released as a W3C Recommendation on 18 December 1997.
HTML 3.2 was W3C's first Recommendation for HTML, which represented the consensus on HTML features for 1996.
HTML 2.0 (RFC 1866) was developed by the IETF's HTML Working Group, which set the standard for core HTML features based upon current practice in 1994.
Sample Webpage
HTML Structure
An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>).
The title of the document appears in the head (along with other information about the document).
The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <P>.
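The head/body split can be illustrated with a short Python sketch. It uses the standard library's html.parser module; the sample page text, the SectionParser class, and the split_sections helper are hypothetical names made up for this illustration, not part of the original slides.

```python
from html.parser import HTMLParser

# Hypothetical minimal page, mirroring the head/body layout described above.
SAMPLE = """<HTML>
<HEAD><TITLE>My first HTML document</TITLE></HEAD>
<BODY><P>Hello world!</P></BODY>
</HTML>"""


class SectionParser(HTMLParser):
    """Collect text separately for the head and the body of a document."""

    def __init__(self):
        super().__init__()
        self.section = None  # "head" or "body" while inside that section
        self.text = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <HEAD> arrives as "head".
        if tag in ("head", "body"):
            self.section = tag

    def handle_endtag(self, tag):
        if tag == self.section:
            self.section = None

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if self.section and data.strip():
            self.text[self.section].append(data.strip())


def split_sections(html):
    """Return the text found in the head and in the body."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.text


sections = split_sections(SAMPLE)
print(sections)
```

In this sketch the title text ends up under "head" and the single paragraph under "body", which matches the division described on the slide.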
HTML Hyperlink
<a href="relations/alumni">alumni</a>
A link is a connection from one Web resource to another.
It has two ends, called anchors, and a direction.
It starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document).
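As a rough sketch of the source/destination idea, the Python below extracts the href of each <a> element and resolves it against the address of the page containing it. The base address http://www.example.org/index.html and the names AnchorParser and destination_anchors are assumptions chosen for illustration; only the alumni link itself comes from the slide.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Collect the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def destination_anchors(source_url, html):
    """Return (source, destination) pairs for each link in the page."""
    parser = AnchorParser()
    parser.feed(html)
    parser.close()
    # A relative href is resolved against the address of the source page.
    return [(source_url, urljoin(source_url, href)) for href in parser.hrefs]


# Hypothetical address of the page that contains the link.
BASE = "http://www.example.org/index.html"
links = destination_anchors(BASE, '<a href="relations/alumni">alumni</a>')
print(links)
```

Each pair has a direction: the first element is the page holding the source anchor, the second is the resource the link points to.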
Resource Identifiers
URI: Uniform Resource Identifiers
URL: Uniform Resource Locators
URN: Uniform Resource Names
Introduction to URIs
Every resource available on the Web has an address that may be encoded by a URI.
URIs typically consist of three pieces:
The naming scheme of the mechanism used to access the resource (e.g., HTTP, FTP)
The name of the machine hosting the resource
The name of the resource itself, given as a path
URI Example
http://www.w3.org/TR
There is a document available via the HTTP protocol,
residing on the machines hosting www.w3.org,
accessible via the path "/TR".
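The three pieces of this example can be pulled apart with Python's standard urllib.parse module. This is only a minimal sketch; the helper name uri_pieces is an assumption introduced here.

```python
from urllib.parse import urlsplit


def uri_pieces(uri):
    """Split a URI into its scheme, host machine, and resource path."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,   # naming scheme of the access mechanism
        "host": parts.hostname,   # machine hosting the resource
        "path": parts.path,       # name of the resource, given as a path
    }


pieces = uri_pieces("http://www.w3.org/TR")
print(pieces)
```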
Protocols
Describe how messages are encoded and exchanged
Different layering architectures:
ISO OSI 7-Layer Architecture
TCP/IP 4-Layer Architecture
Registrars
Domain names ending with .aero, .biz, .com, .coop, .info, .museum, .name, .net, .org, or .pro can be registered through many different companies (known as "registrars") that compete with one another.
InterNIC at http://internic.net
Registrars Directory: http://www.internic.net/regist.html
Referrer Log: where the request originated
Agent Log: browser software making the request (e.g., a spider)
Error Log: requests that resulted in errors (e.g., 404)
Search Engines
According to the Pew Internet Project Report (2002), search engines are the most popular way to locate information online.
About 33 million U.S. Internet users query search engines on a typical day.
More than 80% have used search engines.
Search engines are measured by coverage and recency.