Lecture 3 Web Crawler Basics of HTTP
Client/Server Description
In the world of computer networking and distributed computing, the client-server architecture is a fundamental and widely used model that
defines how devices and software applications interact with each other over a network. This architecture forms the backbone of the modern
internet and plays a crucial role in how information and services are accessed, processed, and shared.
At its core, the client-server architecture is a model of communication where two distinct entities, namely the client and the server,
collaborate to accomplish various tasks. These entities are essentially software programs or hardware devices that perform specific roles in
the communication process.
Client: The client is a device or software application that initiates a request for a service or resource. It acts as the user's interface to
access data or services provided by the server. Clients send requests to servers, receive responses, and typically display or utilize the
results. Examples of clients include web browsers, email clients, and mobile apps.
Server: The server, on the other hand, is a specialized device or software application responsible for fulfilling client requests. Servers
listen for incoming requests, process them, and send back the appropriate responses. They are designed to provide specific services or
resources, such as web pages, databases, email, or file storage.
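The two roles described above can be sketched with Python's standard socket module. This is only an illustration, not part of the lecture material: the function names (run_echo_server, request_from_server, demo) and the "echo" service are made up for the example, and both sides run on the local machine.

```python
import socket
import threading


def run_echo_server(server_sock):
    """Server role: wait for one client, read its request, send a response."""
    conn, _addr = server_sock.accept()
    with conn:
        request = conn.recv(1024)          # receive the client's request
        conn.sendall(b"echo: " + request)  # fulfil it with a response


def request_from_server(host, port, message):
    """Client role: initiate the connection, send a request, return the reply."""
    with socket.create_connection((host, port), timeout=5) as client_sock:
        client_sock.sendall(message)
        return client_sock.recv(1024)


def demo():
    """Start a one-shot server on a free local port and query it once."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server_sock.listen(1)
        port = server_sock.getsockname()[1]

        server_thread = threading.Thread(target=run_echo_server, args=(server_sock,))
        server_thread.start()

        reply = request_from_server("127.0.0.1", port, b"hello")
        server_thread.join()
        return reply
```

Calling demo() plays both roles once: the server listens, the client initiates the request, and the server's response comes back to the client.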
6. Security: Security measures can be implemented at both the client and server ends to protect data and ensure secure communication.
Authentication, encryption, and access control are essential components of client-server security.
Web Services: When you browse a website, your web browser (the client) requests web pages and other resources from a web server.
The server processes these requests and sends back the requested content.
Email: Email clients (e.g., Outlook, Gmail) communicate with email servers to send and receive messages.
Databases: Client applications access database servers to retrieve and update data, commonly used in business and enterprise
environments.
File Sharing: Network-attached storage (NAS) devices allow clients to access and share files stored on a central server.
Online Gaming: Online multiplayer games often use client-server architecture to facilitate gameplay and data synchronization.
Cloud Computing: Cloud services use client-server principles to provide on-demand computing resources, storage, and applications.
HTTP Protocol:
1. Detailed overview of the HTTP (Hypertext Transfer Protocol).
Explanation of HTTP methods (GET, PUT, POST, DELETE).
Understanding request and response headers.
2. Open your terminal.
3. Type the following command to initiate a Telnet session with the web server (replace example.com with the actual website you want to
connect to):
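    telnet example.com 80
4. Once the connection is open, type the following HTTP request:
    GET / HTTP/1.1
    Host: example.com
    Connection: close

Make sure to press Enter twice after the last header. The blank line marks the end of the request headers and tells the server to process the request.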
5. The server will respond with the HTTP response, and you'll see the content displayed in your terminal.
GET / HTTP/1.1 : This is the GET request line, specifying the HTTP method, path ( / for the root), and HTTP version (HTTP/1.1).
Host: example.com : This is the Host header, which should match the target website's domain.
Connection: close : This header indicates that the connection should be closed after the response is received.
This method allows you to make a GET request using Telnet directly in your terminal without the need for a separate script.
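The same exchange can also be reproduced from Python with a plain TCP socket, which makes the structure of the request and the response explicit. The sketch below is an illustration under assumptions: the helper names (build_get_request, parse_response, fetch_raw) are invented for this example, the server is assumed to speak plain HTTP on port 80, and the body is returned as received without decoding chunked transfer encoding.

```python
import socket


def build_get_request(host, path="/"):
    """Return the bytes of the request typed into the Telnet session."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, HTTP version
        f"Host: {host}",         # Host header
        "Connection: close",     # ask the server to close after responding
        "",                      # the two empty strings make the joined text
        "",                      # end with a blank line (end of headers)
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw):
    """Split a raw HTTP response into (status_code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(status_line.split(" ")[1])  # e.g. "HTTP/1.1 200 OK"
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


def fetch_raw(host, port=80, path="/"):
    """Send the GET request over TCP and read until the server closes."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # empty read: the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

For example, parse_response(fetch_raw("example.com")) would return the status code, a dictionary of response headers, and the raw body, provided the host is reachable over plain HTTP.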
import requests
from bs4 import BeautifulSoup

# Replace with the URL you want to GET
url = "https://example.com"

try:
    # Send a GET request (the timeout keeps the script from waiting forever)
    response = requests.get(url, timeout=10)

    # Check the status code
    if response.status_code == 200:
        print("GET request successful!")

        # Parse the HTML content with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Example: extract and print the page title (a page may have no <title>)
        page_title = soup.title.string if soup.title else None
        print(f"Page Title: {page_title}")

        # You can further parse and extract data from the HTML as needed

    else:
        print(f"GET request failed with status code: {response.status_code}")

except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

except Exception as e:
    print(f"An unexpected error occurred: {e}")

# DOM Parser.
# SAX Parser.
# Xpath Parser.
In this code:
1. We import the necessary libraries, requests for making HTTP requests and BeautifulSoup for parsing HTML content.
2. Replace the url variable with the URL of the web page you want to fetch.
3. We send a GET request to that URL with requests.get().
4. We check the HTTP status code to determine whether the request was successful (status code 200).
5. If the request is successful, we parse the HTML content using BeautifulSoup. In this example, we extract and print the page title as a
demonstration.
6. If the request fails or an exception occurs, appropriate error messages are displayed.
Make sure to install the requests and beautifulsoup4 libraries if you haven't already by running:
    pip install requests beautifulsoup4
This code will fetch a web page, parse its HTML content, and print the page title. You can extend it to extract and manipulate other elements
from the page as needed.
GET/PUT/POST/DELETE Methods
1. PUT Method:
Purpose: The PUT method is used to update or replace an existing resource or create a new resource if it doesn't already exist at the
specified URI (Uniform Resource Identifier).
Key Characteristics:
Idempotent: A PUT request is idempotent, meaning that making the same request multiple times leaves the server in the same state as making it
once. Repeating the request should cause no additional side effects.
Replaces Entire Resource: When you send a PUT request, you typically send the entire representation of the resource you want to
update. The server replaces the existing resource with the new data provided.
URI Specifies the Resource: The URI in the request specifies the location of the resource you want to update.
2. POST Method:
Purpose: The POST method submits data to a specified resource for processing. Unlike PUT, POST does not replace the resource at the target
URI; instead, the server creates a new subordinate resource or processes the data in an application-specific way.
Key Characteristics:
Not Idempotent: POST requests are not idempotent. Making the same POST request multiple times may result in different outcomes or
side effects.
Data in the Request Body: POST requests often include data in the request body. This data can be in various formats, such as JSON,
XML, or form data.
URI May Not Specify the Resource: The URI in a POST request may not always specify the exact resource. It may simply specify the
endpoint or route to handle the data.
Typical Use Case: Creating a new resource on the server (e.g., adding a new user to a database).
3. DELETE Method:
Purpose: The DELETE method is used to request the removal of a resource at the specified URI.
Key Characteristics:
Idempotent: Like PUT, DELETE is idempotent. Repeated requests to delete the same resource leave the server in the same state as a single
request, although later responses may differ (for example, 404 Not Found once the resource is gone).
No Request Body: DELETE requests typically do not have a request body. The URI specifies the resource to be deleted.
Resource Deletion: When a DELETE request is successful, the resource at the specified URI is removed from the server.
In summary, PUT is used for updating or replacing resources, POST is used for creating new resources or processing data, and DELETE is
used for removing resources. Understanding the differences between these HTTP methods is crucial when designing RESTful APIs and
working with web services. Each method has its own specific use cases and behavior, so choosing the right one is essential for proper web
application functionality.
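The difference between idempotent and non-idempotent methods can be illustrated without a network at all. The sketch below models a server's resources as a dictionary. The InMemoryStore class and its method names are hypothetical, chosen only to mirror the three HTTP methods; a real server would sit behind an HTTP framework.

```python
import itertools


class InMemoryStore:
    """Toy stand-in for a server's resources, keyed by URI (hypothetical)."""

    def __init__(self):
        self.resources = {}
        self._ids = itertools.count(1)  # ids handed out to POSTed resources

    def put(self, uri, representation):
        """PUT: create or fully replace the resource at the given URI."""
        self.resources[uri] = dict(representation)

    def post(self, collection_uri, representation):
        """POST: create a new subordinate resource; the server picks its URI."""
        new_uri = f"{collection_uri}/{next(self._ids)}"
        self.resources[new_uri] = dict(representation)
        return new_uri

    def delete(self, uri):
        """DELETE: remove the resource; return whether it existed."""
        return self.resources.pop(uri, None) is not None


store = InMemoryStore()

# PUT twice with the same data: still exactly one resource.
store.put("/users/alice", {"age": 23})
store.put("/users/alice", {"age": 23})
count_after_put = len(store.resources)

# POST twice with the same data: two distinct resources are created.
first_uri = store.post("/users", {"name": "test"})
second_uri = store.post("/users", {"name": "test"})
count_after_post = len(store.resources)

# DELETE twice: the second call changes nothing, but reports "not found".
first_delete = store.delete("/users/alice")
second_delete = store.delete("/users/alice")
count_after_delete = len(store.resources)
```

Repeating the PUT leaves one resource, repeating the POST adds a second one, and repeating the DELETE leaves the store unchanged even though the second call reports that nothing was there to remove.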
Requests in Python
Free API - Huge List of Public APIs For Testing [No Key] - Apipheny
# PUT REQUEST
import requests

api_url = "https://httpbin.org/put"
data = {"name": "test", "salary": "123", "age": "23"}

try:
    # Send a PUT request with the data as a JSON body
    put_response = requests.put(api_url, json=data)

    # Check the response status code
    if put_response.status_code == 200:
        print("PUT Request Successful!")
        print(put_response.json())  # Response data, if any
    else:
        print(f"PUT Request Failed with Status Code: {put_response.status_code}")

except requests.exceptions.RequestException as e:
    print(f"An Error Occurred: {e}")
# POST REQUEST
import requests

api_url = "https://httpbin.org/post"
data = {"name": "test", "salary": "123", "age": "23"}

try:
    # Send a POST request with the data as a JSON body
    post_response = requests.post(api_url, json=data)

    # Check the response status code
    if post_response.status_code == 200:
        print("POST Request Successful!")
        print(post_response.json())  # Response data, if any
    else:
        print(f"POST Request Failed with Status Code: {post_response.status_code}")

except requests.exceptions.RequestException as e:
    print(f"An Error Occurred: {e}")
# DELETE REQUEST
import requests

api_url = "https://httpbin.org/delete"

try:
    # Send a DELETE request; no request body is needed
    delete_response = requests.delete(api_url)

    # Check the response status code
    if delete_response.status_code == 200:
        print("DELETE Request Successful!")
        print(delete_response.json())  # Response data, if any
    else:
        print(f"DELETE Request Failed with Status Code: {delete_response.status_code}")

except requests.exceptions.RequestException as e:
    print(f"An Error Occurred: {e}")
The points below describe a script that extracts the links from a web page:
Replace "https://example.com" with the URL of the web page you want to scrape.
The script sends a GET request to the specified URL using the requests library.
If the request is successful (status code 200), it uses BeautifulSoup to parse the HTML content of the page.
It then finds all anchor ( <a> ) tags in the HTML and extracts the href attribute from each one.
Make sure you have the requests and beautifulsoup4 libraries installed by running pip install requests beautifulsoup4 before
running this code.
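The link-extraction script that these points describe is not reproduced in the notes. A minimal sketch of what it might look like follows. The function names extract_links and scrape_links are assumptions made for this example, as is the 10-second timeout; only the use of requests, BeautifulSoup, and the href attribute of anchor tags comes from the description.

```python
import requests
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href attribute
    return [anchor["href"] for anchor in soup.find_all("a", href=True)]


def scrape_links(url):
    """Fetch the page at url and return the links found in it."""
    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return []

    if response.status_code != 200:
        print(f"GET request failed with status code: {response.status_code}")
        return []

    return extract_links(response.text)
```

Calling scrape_links("https://example.com") would print nothing on success and return the list of href values. Relative links are returned as written in the page, so a crawler would typically resolve them against the page URL before following them.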