Increase the speed of Web Scraping in Python using HTTPX module
Last Updated: 30 Jul, 2024
In this article, we will look at how to speed up web scraping, compared with the traditional requests module, by using the HTTPX module together with AsyncIO to fetch the requests concurrently.
You should be familiar with Python; knowledge of the Requests module or web scraping is a bonus.
Required Modules
For this tutorial, we will use four modules:
- time
- requests
- httpx
- asyncio
pip install httpx
pip install requests
time and asyncio come pre-installed, so there is no need to install them. If you plan to run the async example inside a Jupyter Notebook, also install nest_asyncio (pip install nest_asyncio).
Using the requests module to measure the time taken
First, we will fetch the URLs the traditional way, using the get() method of the requests module, and then use the time module to measure the total time consumed.
Python
import time
import requests
def fetch_urls():
    urls = [
        "https://en.wikipedia.org/wiki/Badlands",
        "https://en.wikipedia.org/wiki/Canyon",
        "https://en.wikipedia.org/wiki/Cave",
        "https://en.wikipedia.org/wiki/Cliff",
        "https://en.wikipedia.org/wiki/Coast",
        "https://en.wikipedia.org/wiki/Continent",
        "https://en.wikipedia.org/wiki/Coral_reef",
        "https://en.wikipedia.org/wiki/Desert",
        "https://en.wikipedia.org/wiki/Forest",
        "https://en.wikipedia.org/wiki/Geyser",
        "https://en.wikipedia.org/wiki/Mountain_range",
        "https://en.wikipedia.org/wiki/Peninsula",
        "https://en.wikipedia.org/wiki/Ridge",
        "https://en.wikipedia.org/wiki/Savanna",
        "https://en.wikipedia.org/wiki/Shoal",
        "https://en.wikipedia.org/wiki/Steppe",
        "https://en.wikipedia.org/wiki/Tundra",
        "https://en.wikipedia.org/wiki/Valley",
        "https://en.wikipedia.org/wiki/Volcano",
        "https://en.wikipedia.org/wiki/Artificial_island",
        "https://en.wikipedia.org/wiki/Lake"
    ]
    # Fetch each URL one after another and collect the status codes
    res = [requests.get(addr).status_code for addr in urls]
    print(set(res))

start = time.time()
fetch_urls()
end = time.time()
print("Total Consumed Time", end - start)
First we import the requests and time modules and create a function called fetch_urls(), inside which we build a list of 20 links (you can use any number of existing links). Then, in the list res, we call the get() method of the requests module for each link and read the status_code attribute of the response, so res holds the status code of every request. Finally, we print the set of res. The reason for converting the list to a set is that if every site is working, each request returns status code 200, so the set collapses to a single value; the goal is to spend as little time as possible on anything other than fetching.
Outside the function, we record the start and end times with the time() method of the time module, calling the function in between, and finally print the total time consumed.
Output:
We can see from the output that it took a total of 12.6422558 seconds.
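As an aside (not part of the original benchmark above), part of this time goes into opening a fresh connection for every request. Reusing a single requests.Session keeps connections alive between requests and usually saves some time even without concurrency. Below is a minimal sketch; the helper name fetch_urls_with_session is our own, and it assumes the same urls list used above:
Python
import time
import requests

def fetch_urls_with_session(urls):
    # Reuse one Session so TCP/TLS connections are kept alive between requests
    with requests.Session() as session:
        return {session.get(addr).status_code for addr in urls}

# Example usage, with urls being the same list of Wikipedia links as above:
# start = time.time()
# print(fetch_urls_with_session(urls))
# print("Total Consumed Time with Session:", time.time() - start)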
Using HTTPX with AsyncIO
In a Jupyter Notebook, the event loop is managed differently than in a standalone script, because the notebook already runs its own event loop. You should therefore use nest_asyncio to handle the event loop correctly in Jupyter Notebooks.
Here’s how you can modify the code to work smoothly in a Jupyter Notebook:
- Import nest_asyncio and apply it to patch the event loop.
- Ensure the asynchronous function is run properly within the notebook environment.
Python
import time
import asyncio
import httpx
import nest_asyncio
# Patch the event loop for Jupyter Notebook
nest_asyncio.apply()
async def fetch_httpx():
    urls = [
        "https://en.wikipedia.org/wiki/Badlands",
        "https://en.wikipedia.org/wiki/Canyon",
        "https://en.wikipedia.org/wiki/Cave",
        "https://en.wikipedia.org/wiki/Cliff",
        "https://en.wikipedia.org/wiki/Coast",
        "https://en.wikipedia.org/wiki/Continent",
        "https://en.wikipedia.org/wiki/Coral_reef",
        "https://en.wikipedia.org/wiki/Desert",
        "https://en.wikipedia.org/wiki/Forest",
        "https://en.wikipedia.org/wiki/Geyser",
        "https://en.wikipedia.org/wiki/Mountain_range",
        "https://en.wikipedia.org/wiki/Peninsula",
        "https://en.wikipedia.org/wiki/Ridge",
        "https://en.wikipedia.org/wiki/Savanna",
        "https://en.wikipedia.org/wiki/Shoal",
        "https://en.wikipedia.org/wiki/Steppe",
        "https://en.wikipedia.org/wiki/Tundra",
        "https://en.wikipedia.org/wiki/Valley",
        "https://en.wikipedia.org/wiki/Volcano",
        "https://en.wikipedia.org/wiki/Artificial_island",
        "https://en.wikipedia.org/wiki/Lake"
    ]
    async with httpx.AsyncClient() as httpx_client:
        # Build one request per URL and run them all concurrently
        req = [httpx_client.get(addr) for addr in urls]
        result = await asyncio.gather(*req)

start = time.time()
await fetch_httpx()  # Use 'await' to run the async function directly in a notebook cell
end = time.time()
print("Total Consumed Time using HTTPX:", end - start)
We have to use asyncio with HTTPX, otherwise we cannot send the requests concurrently. HTTPX ships with its own async client, which we use here; to use it inside a function, that function has to be asynchronous. We create the client with httpx.AsyncClient() under the alias httpx_client and use it to send requests concurrently to the same links as before. Because this is async code we have to use await: with asyncio.gather() we await all the responses and store them in result. (You could print the responses too, but since the goal is to spend as little time as possible on anything other than fetching, we do not.)
Outside the function we record the start and end times, call the coroutine with await (which works directly in a notebook cell; a standalone script would use asyncio.run() instead), and print the total time consumed.
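If you did want that check, one extra line inside fetch_httpx(), right after the asyncio.gather() call, would print the set of status codes just like the requests version (an optional addition, not part of the timed code above):
Python
        # Optional: verify that every page returned HTTP 200, as in the requests example
        print({r.status_code for r in result})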
Output:

As we can see from the output, the total time consumed dropped by nearly 6 times. The exact factor varies from run to run: if we keep sending requests to the same URLs, both the requests and the HTTPX versions finish faster than before, and the gap tends to widen further.
In this run the difference reached nearly 10 times: HTTPX with AsyncIO was nearly 10 times faster than requests.
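Finally, a bare top-level await only works inside a notebook. Outside Jupyter, the same idea would be driven by asyncio.run(), and when scraping many more pages it is usually wise to cap how many requests are in flight at once. The sketch below combines both; the helper names and the semaphore limit of 10 are our own assumptions, not part of the benchmark above:
Python
import asyncio
import time
import httpx

async def fetch_all(urls, max_concurrency=10):
    # Cap the number of requests in flight so the target site is not hammered
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(client, addr):
        async with semaphore:
            response = await client.get(addr)
            return response.status_code

    # One shared AsyncClient reuses connections across all requests
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(fetch_one(client, addr) for addr in urls))

if __name__ == "__main__":
    urls = ["https://en.wikipedia.org/wiki/Lake"]  # replace with the full list used above
    start = time.time()
    print(set(asyncio.run(fetch_all(urls))))
    print("Total Consumed Time:", time.time() - start)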