
Scrape IMDb movie ratings and details using Python and save the details of the top movies to a .csv file

Last Updated : 21 Nov, 2022

We can scrape the IMDb movie ratings and their details with the help of the BeautifulSoup library of Python. 

Modules Needed:

Below is the list of modules required to scrape data from IMDb.

  1. requests: A third-party library for making HTTP requests to a specified URL. When a request is made to a URL, it returns a response object that holds the status code, headers, and page content.
  2. html5lib: A pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers. The code in this article passes "html.parser", Python's built-in parser, to BeautifulSoup, so html5lib is only needed if "html5lib" is passed as the parser instead.
  3. bs4: Beautiful Soup is a Python library for parsing HTML and XML. Its BeautifulSoup object represents the parsed document and lets us search it. Web scraping is the process of extracting data from a website with automated tools.
  4. pandas: A library built on top of NumPy that provides data structures, such as the DataFrame, for manipulating tabular data.

The code also imports re, Python's built-in regular expression module, which needs no installation.

Approach:

Steps to implement web scraping in Python to extract IMDb movie details and ratings:

  • Import the required modules.

Python3




from bs4 import BeautifulSoup
import requests
import re
import pandas as pd


  • Access the HTML content of the webpage by requesting the URL and creating a soup object.

Python3




# Downloading IMDb Top 250 movie data
# (url is assumed here to be the IMDb Top 250 chart page)
url = 'https://www.imdb.com/chart/top/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
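
The request above does not check whether the download succeeded. As a minimal sketch, the fetch can be wrapped in a helper that sets a timeout, sends a User-Agent header (some sites reject requests that lack a browser-like one), and raises on HTTP errors. The name fetch_html, the HEADERS value, and the optional session parameter are illustrative assumptions, not part of the original code.

```python
import requests

# Hypothetical header value; any descriptive User-Agent string may be used.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; imdb-top250-tutorial)"}


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors."""
    # Use the given session if there is one, else the requests module itself.
    client = requests if session is None else session
    response = client.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the soup object could be created as BeautifulSoup(fetch_html(url), "html.parser").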


  • Extract the movie ratings and their details. Here, we select elements from the BeautifulSoup object with CSS selectors and read HTML attributes such as title and data-value.

Python3




movies = soup.select('td.titleColumn')
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value')
        for b in soup.select('td.posterColumn span[name=ir]')]
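
To see what these selectors expect, the sketch below runs them against a small hand-written HTML fragment. The fragment, including the movie name, cast, and rating, is made up for illustration and only mimics the table layout the selectors assume; the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row sample of the table layout the selectors assume.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="posterColumn"><span name="ir" data-value="9.2"></span></td>
    <td class="titleColumn">
      1.
      <a href="/title/tt0000001/" title="Jane Doe (dir.), John Roe">Example Movie</a>
      <span class="secondaryInfo">(1994)</span>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same selectors as in the step above.
movies = soup.select('td.titleColumn')
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value')
           for b in soup.select('td.posterColumn span[name=ir]')]

# Collapse whitespace to see the text that each title cell contributes.
title_text = ' '.join(movies[0].get_text().split())
print(title_text)
print(crew[0], ratings[0])
```

Each title cell yields one string holding the rank, title, and year, which is why the next step has to split that string apart.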


  • After extracting the movie details, create an empty list, store each movie's details in a dictionary, and append that dictionary to the list.

Python3




# create an empty list for storing
# movie information
list = []

# Iterating over movies to extract
# each movie's details
for index in range(0, len(movies)):

    # Separating movie into: 'place',
    # 'title', 'year'
    movie_string = movies[index].get_text()
    # Collapse whitespace, e.g. '1. Title (1994)'
    movie = ' '.join(movie_string.split())
    # Split at the first '.' only, so dots
    # inside a title are preserved
    place, rest = movie.split('.', 1)
    # The year is the parenthesised group at the end
    year = re.search(r'\((\d{4})\)\s*$', rest).group(1)
    movie_title = rest[:rest.rfind('(')].strip()
    data = {"place": place,
            "movie_title": movie_title,
            "rating": ratings[index],
            "year": year,
            "star_cast": crew[index],
            }
    list.append(data)
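
The splitting of a title cell into place, title, and year can also be done with a single regular expression. The following is a minimal sketch under the assumption that each cell has the form "rank. title (year)". The helper name split_movie_string and the sample input are hypothetical.

```python
import re

# Rank, a dot, the title, then a four-digit year in parentheses at the end.
MOVIE_PATTERN = re.compile(r'^(\d+)\.\s*(.+?)\s*\((\d{4})\)$')


def split_movie_string(raw_text):
    """Split 'rank. title (year)' into (place, title, year)."""
    # Collapse newlines and repeated spaces into single spaces first.
    text = ' '.join(raw_text.split())
    match = MOVIE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unexpected movie string: {text!r}")
    place, title, year = match.groups()
    return int(place), title, year


# Made-up input: a two-digit rank and a title that contains a dot.
print(split_movie_string("10.\n  Mr. Smith Goes to Washington\n(1939)"))
# prints (10, 'Mr. Smith Goes to Washington', '1939')
```

Matching the rank with \d+ avoids depending on how many digits the loop index has, and dots inside the title are left alone.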


  • Now our list is filled with the top IMDb movies along with their details. Next, display the list of movie details.

Python3




for movie in list:
    print(movie['place'], '-', movie['movie_title'], '('+movie['year'] +
          ') -', 'Starring:', movie['star_cast'], movie['rating'])


  • Using the following lines of code, the same data can be saved to a .csv file and used later as a dataset.

Python3




#saving the list as dataframe
#then converting into .csv file
df = pd.DataFrame(list)
df.to_csv('imdb_top_250_movies.csv',index=False)
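
If pandas is not available, the same rows can be written with Python's built-in csv module instead. This is an alternative sketch, not the approach used in this article. The helper names save_rows and load_rows, the sample row, and the temporary file path are all hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical row with the same keys the scraping loop produces.
rows = [
    {"place": "1", "movie_title": "Example Movie", "rating": "9.2",
     "year": "1994", "star_cast": "Jane Doe (dir.), John Roe"},
]


def save_rows(rows, path):
    """Write a list of dictionaries to a .csv file with a header row."""
    fieldnames = ["place", "movie_title", "rating", "year", "star_cast"]
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_rows(path):
    """Read the .csv file back into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.DictReader(handle)]


# Write to a temporary directory and read the file back as a check.
csv_path = os.path.join(tempfile.mkdtemp(), "imdb_sample.csv")
save_rows(rows, csv_path)
print(load_rows(csv_path))
```

Reading the file back is a quick way to confirm that values containing commas, such as the cast list, were quoted correctly.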


Implementation: Complete Code

Python3




from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
 
 
# Downloading IMDb Top 250 movie data
# (url is assumed here to be the IMDb Top 250 chart page)
url = 'https://www.imdb.com/chart/top/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
movies = soup.select('td.titleColumn')
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value')
        for b in soup.select('td.posterColumn span[name=ir]')]
 
 
 
 
# create an empty list for storing
# movie information
list = []

# Iterating over movies to extract
# each movie's details
for index in range(0, len(movies)):

    # Separating movie into: 'place',
    # 'title', 'year'
    movie_string = movies[index].get_text()
    # Collapse whitespace, e.g. '1. Title (1994)'
    movie = ' '.join(movie_string.split())
    # Split at the first '.' only, so dots
    # inside a title are preserved
    place, rest = movie.split('.', 1)
    # The year is the parenthesised group at the end
    year = re.search(r'\((\d{4})\)\s*$', rest).group(1)
    movie_title = rest[:rest.rfind('(')].strip()
    data = {"place": place,
            "movie_title": movie_title,
            "rating": ratings[index],
            "year": year,
            "star_cast": crew[index],
            }
    list.append(data)
 
# printing movie details with its rating.
for movie in list:
    print(movie['place'], '-', movie['movie_title'], '('+movie['year'] +
        ') -', 'Starring:', movie['star_cast'], movie['rating'])
 
 
# saving the list as a dataframe
# then converting it into a .csv file
df = pd.DataFrame(list)
df.to_csv('imdb_top_250_movies.csv',index=False)


Output: 

Along with the movie details printed in the terminal, a .csv file named imdb_top_250_movies.csv is saved in the current working directory. It contains the columns place, movie_title, rating, year, and star_cast.

 


