Newspaper: Article scraping & curation (Python)
Last Updated: 27 Nov, 2024
Newspaper is a Python module for extracting and parsing newspaper articles. It combines web scraping with advanced text-extraction algorithms to pull the useful text out of a web page, and it works particularly well on online newspaper websites. Because it relies on web scraping, sending too many requests to a newspaper website may get you blocked, so use it with care.
Installation:
pip install newspaper3k
Newspaper supports the following languages:
input code    full name
ar Arabic
da Danish
de German
el Greek
en English
it Italian
zh Chinese
......... and many more
When scraping articles from websites, especially news outlets, it’s common to encounter poorly structured or messy HTML content. This can make it difficult to extract meaningful data from the page. Fortunately, the combination of newspaper3k and lxml_html_clean offers an efficient way to clean and process web content, allowing for more accurate extraction of article text, titles, summaries, and keywords.
Installing lxml_html_clean
You can install the lxml_html_clean library using the following command:
Python
pip install lxml_html_clean
This will install the library and its dependencies, including lxml, which is an efficient and feature-rich library for processing XML and HTML in Python.
Missing punkt Data for NLP:
If you encounter an error related to missing punkt data (e.g., LookupError: Resource punkt not found), you can resolve it by downloading the necessary NLTK resources.
- Run the following code to download the required punkt tokenizer:
Python
import nltk
nltk.download('punkt')
- In case the error is related to a missing punkt_tab, try:
Python
import nltk
nltk.download('punkt_tab')
This will ensure that the tokenizer is available for sentence segmentation required by NLP functions like nlp().
Note: You don't need to import lxml_html_clean or nltk in your own script if these libraries are already installed in your environment; they are called internally. However, if you still encounter errors while running the code, try importing both libraries manually and downloading the necessary NLTK resources, such as punkt and punkt_tab, so that all required dependencies are available.
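The download step above can be made idempotent, so that a script fetches a resource only when it is missing. The following is a minimal sketch, not part of newspaper3k: ensure_nltk_resources is a hypothetical helper name, and its find and download parameters are an assumption added so the logic can be tried without NLTK installed. By default it falls back to nltk.data.find and nltk.download.

```python
def ensure_nltk_resources(names=("punkt", "punkt_tab"), find=None, download=None):
    """Download each tokenizer resource only if it is not already present.

    Returns the list of resource names that were downloaded successfully.
    """
    if find is None or download is None:
        import nltk  # imported lazily; only needed when the defaults are used
        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for name in names:
        try:
            # NLTK stores tokenizer models under the "tokenizers/" prefix
            find(f"tokenizers/{name}")
        except LookupError:
            # nltk.download returns True on success and False on failure
            if download(name):
                fetched.append(name)
    return fetched
```

Calling ensure_nltk_resources() once at the start of a script, before nlp() is used, should be enough in most setups.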
Some useful functions and attributes
To create an instance of an article
article_name = Article(url, language="language code according to newspaper")
To download an article
article_name.download()
To parse an article
article_name.parse()
To apply NLP (natural language processing) to an article
article_name.nlp()
To extract the article's text
article_name.text
To extract the article's title
article_name.title
To extract the article's summary (available after nlp())
article_name.summary
To extract the article's keywords (available after nlp())
article_name.keywords
Python
from newspaper import Article
# URL of the article you want to scrape
url = "http://timesofindia.indiatimes.com/world/china/chinese-expert-warns-of-troops-entering-kashmir/articleshow/59516912.cms"
# Create an Article object with the given URL and language (e.g., 'en' for English)
toi_article = Article(url, language="en")
# To download the article
toi_article.download()
# To parse the article (i.e., extract the content)
toi_article.parse()
# To perform Natural Language Processing (NLP) on the article (needed for summary and keywords)
toi_article.nlp()
# To extract the article's title
print("Article's Title:")
print(toi_article.title)
print("\n")
# To extract the article's full text
print("Article's Text:")
print(toi_article.text)
print("\n")
# To extract the article's summary (requires NLP)
print("Article's Summary:")
print(toi_article.summary)
print("\n")
# To extract keywords from the article (requires NLP)
print("Article's Keywords:")
print(toi_article.keywords)
Output:
Article's Title:
India China News: Chinese expert warns of troops entering Kashmir
Article's Text:
BEIJING: A Chinese expert has argued that his country's troops would be entitled to enter the Indian side of Kashmir by extending the logic that has permitted Indian troops to enter an area which is disputed by China and Bhutan This is one of the several arguments made by the scholar in an attempt to blame India for. India has responded to efforts by China to build a road in the Doklam area, which falls next to the trijunction connecting Sikkim with Tibet and Bhutan and"Even if India were requested to defend Bhutan's territory, this could only be limited to its established territory, not the disputed area, " Long Xingchun, director of the Center for Indian Studies at China West Normal University said in an article. "Otherwise, under India's logic, if the Pakistani government requests, a third country's army can enter the area disputed by India and Pakistan, including India-controlled Kashmir".China is not just interfering, it is building roads and other infrastructure projects right inside Pakistan-Occupied Kashmir (PoK), which is claimed by both India and Pakistan. This is one of the facts that the article did not mention.The scholar, through his article in the Beijing-based Global Times, suggested that Beijing can internationalize the Doklam controversy without worrying about western countries supporting India because the West has a lot of business to do with China."China can show the region and the international community or even the UN Security Council its evidence to illustrate China's position, " Long said. 
At the same time, he complained that "Western governments and media kept silent, ignoring India's hegemony over the small countries of South Asia" when India imposed a blockade on the flow of goods to Nepal in 2015.Recent actions by US president Donald Trump, which include selling arms to Taiwan and pressuring China on the North Korean issue, shows that the West is not necessarily cowered down by China's business capabilities.He reiterated the government's stated line that Doklam belongs to China, and that Indian troops had entered the area under the guise of helping Bhutan protect its territory."For a long time, India has been talking about international equality and non-interference in the internal affairs of others, but it has pursued hegemonic diplomacy in South Asia, seriously violating the UN Charter and undermining the basic norms of international relations, " he said.Interestingly, Chinese scholars are worrying about India interfering in Bhutan's "sovereignty and national interests" even though it is Chinese troops who have entered the Doklam area claimed by it."Indians have migrated in large numbers to Nepal and Bhutan, interfering with Nepal's internal affairs. The first challenge for Nepal and Bhutan is to avoid becoming a state of India, like Sikkim, " he said.
Article's Summary:
sending its troops to the disputed Doklam area +puts Indian territory at risk +BEIJING: A Chinese expert has argued that his country's troops would be entitled to enter the Indian side of Kashmir by extending the logic that has permitted Indian troops to enter an area which is disputed by China and Bhutan This is one of the several arguments made by the scholar in an attempt to blame India for.
"Otherwise, under India's logic, if the Pakistani government requests, a third country's army can enter the area disputed by India and Pakistan, including India-controlled Kashmir".China is not just interfering, it is building roads and other infrastructure projects right inside Pakistan-Occupied Kashmir (PoK), which is claimed by both India and Pakistan.
"China can show the region and the international community or even the UN Security Council its evidence to illustrate China's position, " Long said.
"Indians have migrated in large numbers to Nepal and Bhutan, interfering with Nepal's internal affairs.
The first challenge for Nepal and Bhutan is to avoid becoming a state of India, like Sikkim, " he said.
Article's Keywords:
['troops', 'india', 'china', 'territory', 'west', 'disputed', 'expert', 'indian', 'bhutan', 'kashmir', 'chinese', 'entering', 'doklam', 'area', 'warns']
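For curation, it is often more convenient to collect these attributes into a single record than to print them. The following is a minimal sketch that assumes only the four attributes shown above (title, text, summary, keywords). The names article_to_record and record_to_json are hypothetical helpers, not part of newspaper3k, and the article passed in is expected to have been through download(), parse() and nlp() already.

```python
import json


def article_to_record(article, max_text_chars=None):
    """Build a JSON-serializable dict from a processed article object.

    `article` can be any object exposing title, text, summary and keywords,
    such as a newspaper Article after download(), parse() and nlp().
    """
    text = article.text
    if max_text_chars is not None:
        # optionally truncate long article bodies
        text = text[:max_text_chars]
    return {
        "title": article.title,
        "text": text,
        "summary": article.summary,
        # sort keywords so that repeated runs give a stable ordering
        "keywords": sorted(article.keywords),
    }


def record_to_json(record):
    """Serialize a record to a JSON string."""
    # ensure_ascii=False keeps non-English article text readable
    return json.dumps(record, ensure_ascii=False, indent=2)
```

For the example above, record_to_json(article_to_record(toi_article)) would produce a JSON string that can be written to a file or stored alongside other articles.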
Handling Errors and Common Issues
Blocking by Websites:
Since newspaper3k uses web scraping, repeated requests to a website may result in your IP being blocked by the site. To avoid this, consider using the following strategies:
- Respect the website's robots.txt: Ensure that you are allowed to scrape the site.
- Limit the number of requests: Avoid hitting a website repeatedly in a short period.
- Use proxies: You can use a proxy to mask your IP if necessary.
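The first two strategies can be sketched with the standard library alone. This is a minimal illustration rather than a complete crawler: is_allowed and Throttle are hypothetical names, the "*" user agent and the two-second delay are assumptions to adjust per site, and the robots.txt rules are passed in as text so the check can be tried without network access.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt content permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a scraping loop, one would call wait() on a shared Throttle before each download() and skip any URL for which is_allowed returns False.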
Reference: Newspaper Python package on GitHub