Jsoup Free Ebook
#jsoup
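Table of Contents
About
Chapter 1: Getting started with Jsoup
    Remarks
    JavaScript support
    Download
    Versions
    Examples
Chapter 2: Formatting HTML Output
    Parameters
    Remarks
    Examples
Chapter 3: Logging into websites with Jsoup
    Examples
Chapter 4: Parsing Javascript Generated Pages
    Examples
Chapter 5: Selectors
    Remarks
    Examples
Chapter 6: Web crawling with Jsoup
    Examples
Credits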
About
You can share this PDF with anyone you feel could benefit from it. Download the latest version
from: jsoup
It is an unofficial and free Jsoup ebook created for educational purposes. All the content is
extracted from Stack Overflow Documentation, which is written by many hardworking individuals at
Stack Overflow. It is affiliated with neither Stack Overflow nor the official Jsoup project.
The content is released under Creative Commons BY-SA, and the list of contributors to each
chapter is provided in the credits section at the end of this book. Images may be copyright of
their respective owners unless otherwise specified. All trademarks and registered trademarks are
the property of their respective company owners.
Use the content presented in this book at your own risk; it is not guaranteed to be correct or
accurate. Please send your feedback and corrections to [email protected]
https://riptutorial.com/ 1
Chapter 1: Getting started with Jsoup
Remarks
Jsoup is an HTML parsing and data extraction library for Java, focused on flexibility and ease of
use. It can be used to extract specific data from HTML pages, which is commonly known as "web
scraping", as well as to modify the content of HTML pages and to "clean" untrusted HTML with a
whitelist of allowed tags and attributes.
JavaScript support
Jsoup does not support JavaScript, and, because of this, any dynamically generated content or
content which is added to the page after page load cannot be extracted from the page. If you need
to extract content which is added to the page with JavaScript, there are a few alternative options:
• Use a library which does support JavaScript, such as Selenium, which uses an actual
web browser to load pages, or HtmlUnit.
• Reverse engineer how the page loads its data. Typically, web pages which load data
dynamically do so via AJAX, so you can look at the network tab of your browser's
developer tools to see where the data is being loaded from, and then use those URLs in your
own code. See how to scrape AJAX pages for more details.
Download
Jsoup is available on Maven Central as org.jsoup:jsoup. If you're using Gradle (e.g. with Android
Studio), you can add it to your project by adding the following to your build.gradle dependencies
section:
compile 'org.jsoup:jsoup:1.8.3'
If you're using Maven, add the following to your POM's dependencies section:
<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.8.3</version>
</dependency>
Versions
1.9.2 2016-05-17
1.8.3 2015-08-02
Examples
Extract the URLs and titles of links
Jsoup can be used to easily extract all links from a webpage. In this case, we use Jsoup to
extract only the specific links we want: here, the ones in an h3 header on a page. We can also get
the text of the links.
Example output, showing each link's URL followed by its text:
http://stackoverflow.com/questions/12920296/past-5-week-calculation-in-webi-bo-4-0
Past 5 week calculation in WEBI (BO 4.0)?
http://stackoverflow.com/questions/36303701/how-to-get-information-about-the-visualized-elements-in-listview
How to get information about the visualized elements in listview?
[...]
• First, we get the HTML document from the specified URL. This code also sets the User
Agent header of the request to "Mozilla", so that the website serves the page it would usually
serve to browsers.
• Then, we use select(...) and a for loop to get all the links to Stack Overflow questions, in this
case links which have the class question-hyperlink.
• Finally, we print the text of each link with .text() and the href of the link with
attr("abs:href"). In this case, we use abs: to get the absolute URL, i.e. with the domain and
protocol included.
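The steps above can be sketched as follows. This is a minimal illustration, not the original listing: it parses an inline HTML string rather than fetching a live page, and the class name LinkLister, the sample markup, and the base URI are assumptions made for the example. To work on a real page, the Document would instead come from Jsoup.connect(url).userAgent("Mozilla").get().

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkLister {
    // Returns "absolute URL | link text" for every a.question-hyperlink in the markup.
    static List<String> listLinks(String html, String baseUri) {
        // The base URI lets Jsoup resolve relative hrefs when abs:href is requested.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a.question-hyperlink")) {
            result.add(link.attr("abs:href") + " | " + link.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup standing in for a downloaded page.
        String html =
                "<h3><a class=\"question-hyperlink\" href=\"/questions/1/first\">First question</a></h3>"
              + "<h3><a class=\"question-hyperlink\" href=\"/questions/2/second\">Second question</a></h3>"
              + "<a href=\"/about\">About</a>";
        for (String line : listLinks(html, "http://stackoverflow.com/")) {
            System.out.println(line);
        }
    }
}
```

Links without the question-hyperlink class (the "About" link here) are skipped, which is how the selector narrows the result to the links of interest.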
Selecting only the href attribute value of a link returns the relative URL.
String bodyFragment =
    "<div><a href=\"/documentation\">Stack Overflow Documentation</a></div>";
Document doc = Jsoup.parseBodyFragment(bodyFragment);
String link = doc.select("div > a").first().attr("href");
System.out.println(link);
Output
/documentation
By passing the base URI into the parse method and using the absUrl method instead of attr, we
can extract the full URL.
Document doc = Jsoup.parseBodyFragment(bodyFragment, "http://stackoverflow.com");
String link = doc.select("div > a").first().absUrl("href");
System.out.println(link);
Output
http://stackoverflow.com/documentation
Jsoup can be used to manipulate or extract data from a local file that contains HTML. filePath
is the path of a file on disk. ENCODING is the desired charset name, e.g. "Windows-31J". It may be
null, in which case Jsoup tries to determine the charset from the document itself and falls back
to UTF-8.
// load file
File inputFile = new File(filePath);
// parse file as HTML document
Document doc = Jsoup.parse(inputFile, ENCODING);
// select all <a> elements
Elements elements = doc.select("a");
Chapter 2: Formatting HTML Output
Parameters
Parameter Detail
Document.OutputSettings
outline(boolean) Enable or disable HTML outline mode.
Remarks
Jsoup 1.9.2 API
Examples
Display all elements as block
By default, Jsoup will display only block-level elements with a trailing line break. Inline elements
are displayed without a line break.
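Given the following HTML:
<select name="menu">
    <option value="foo">foo</option>
    <option value="bar">bar</option>
</select>
Parsing it (held here in the String html) and printing the document:
Document doc = Jsoup.parse(html);
System.out.println(doc.html());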
Results in:
<html>
<head></head>
<body>
<select name="menu"> <option value="foo">foo</option> <option value="bar">bar</option>
</select>
</body>
</html>
To display the output with each element treated as a block element, the outline option has to be
enabled on the document's OutputSettings.
doc.outputSettings().outline(true);
System.out.println(doc.html());
Output
<html>
<head></head>
<body>
<select name="menu">
<option value="foo">foo</option>
<option value="bar">bar</option>
</select>
</body>
</html>
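The two renderings can be compared side by side with a small program. This is a sketch under assumptions: the class name OutlineDemo and the helper render are invented for the example, and the exact whitespace of the default rendering may differ between Jsoup versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class OutlineDemo {
    // Renders the markup with outline mode switched on or off.
    static String render(String html, boolean outline) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().outline(outline);
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<select name=\"menu\"><option value=\"foo\">foo</option>"
                + "<option value=\"bar\">bar</option></select>";
        // Default rendering: inline elements share a line.
        System.out.println(render(html, false));
        System.out.println("----");
        // Outline rendering: every element starts on its own line.
        System.out.println(render(html, true));
    }
}
```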
Chapter 3: Logging into websites with Jsoup
Examples
A simple authentication POST request with Jsoup
A simple POST request with authentication data is demonstrated below. Note that the names of the
username and password fields vary depending on the website:
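A request of this kind might be built as in the following sketch. The URL, the field names "username" and "password", and the class name SimpleLogin are placeholders chosen for illustration; the example only builds the request and prints it, so that it can be run without contacting a server.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class SimpleLogin {
    // Builds (but does not send) a POST request carrying the credentials.
    static Connection buildLogin(String loginUrl, String username, String password) {
        return Jsoup.connect(loginUrl)
                .userAgent("Mozilla")
                .data("username", username)   // form field names are site-specific
                .data("password", password)
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        // Placeholder URL and credentials; replace them with real values.
        Connection login = buildLogin("http://example.com/login", "yourUsername", "yourPassword");
        System.out.println(login.request().method() + " " + login.request().url());
        // Calling login.execute() would send the request and return a
        // Connection.Response, whose cookies() can be passed to later requests.
    }
}
```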
Most websites require a much more complicated process than the one demonstrated above.
Below is an example request that will log you into the GitHub website:
[...]
        input[type=\"hidden\"]:nth-child(2)")
    .first()
    .attr("value");
[...]
System.out.println(homePage.parse().html());
In this example, we will log into the GitHub website by using the FormElement class.
// # Go to login page
Connection.Response loginFormResponse = Jsoup.connect(LOGIN_FORM_URL)
        .method(Connection.Method.GET)
        .userAgent(USER_AGENT)
        .execute();
[...]
        .execute();
System.out.println(loginActionResponse.parse().html());
All the form data is handled by the FormElement class for us (even the form method detection). A
ready-made Connection is built when invoking the FormElement#submit method. All we have to do
is to complete this connection with additional headers (cookies, user-agent, etc.) and execute it.
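How FormElement collects the form data can be tried offline. The sketch below is not the GitHub login itself: the form markup, its ids and field names, the base URI, and the class name FormFillDemo are all assumptions made for the example. It fills in two fields, lists what would be submitted, and shows the method and URL of the Connection that submit() prepares, without executing it.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

class FormFillDemo {
    // Hypothetical login form; real pages use their own ids and field names.
    static final String FORM_HTML =
            "<form id=\"login\" action=\"/session\" method=\"post\">"
          + "<input type=\"hidden\" name=\"token\" value=\"abc123\">"
          + "<input type=\"text\" name=\"login\" id=\"login_field\">"
          + "<input type=\"password\" name=\"password\" id=\"password\">"
          + "</form>";

    // Parses the form and "types" the credentials into its fields.
    static FormElement filledForm(String username, String password) {
        // The base URI is needed so that the relative action URL can be resolved.
        Document doc = Jsoup.parse(FORM_HTML, "http://example.com/");
        FormElement form = (FormElement) doc.select("form#login").first();
        form.select("#login_field").first().val(username);
        form.select("#password").first().val(password);
        return form;
    }

    public static void main(String[] args) {
        FormElement form = filledForm("alice", "secret");
        // formData() includes the hidden field without any extra work.
        for (Connection.KeyVal field : form.formData()) {
            System.out.println(field.key() + " = " + field.value());
        }
        // submit() only prepares the request; execute() would send it.
        Connection ready = form.submit();
        System.out.println(ready.request().method() + " " + ready.request().url());
    }
}
```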
Chapter 4: Parsing Javascript Generated Pages
Examples
Parsing JavaScript Generated Page with Jsoup and HtmlUnit
page.html
<html>
<head>
<script src="loadData.js"></script>
</head>
<body onLoad="loadData()">
<div class="container">
<table id="data" border="1">
<tr>
<th>col1</th>
<th>col2</th>
</tr>
</table>
</div>
</body>
</html>
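loadData.js
[...]
When the page is loaded in a browser, the script fills the table:
Col1 Col2
0.0 0.1
1.0 1.1
Using Jsoup alone:
// load source from file
Document doc = Jsoup.parse(new File("page.html"), "UTF-8");
// iterate over the rows and columns of the data table
for (Element row : doc.select("table#data tr"))
    for (Element col : row.select("td"))
        // print results
        System.out.println(col.ownText());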
Output
(empty)
What happened?
Jsoup parses the source code as delivered from the server (or in this case loaded from file). It
does not invoke client-side actions such as JavaScript or CSS DOM manipulation. In this example,
the rows and cols are never appended to the data table.
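The same effect can be reproduced without any files. In this sketch the markup, the inline script, and the class name NoScriptExecution are assumptions made for illustration; the script would add a cell in a browser, but Jsoup only stores it as text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class NoScriptExecution {
    // Counts the <td> cells Jsoup sees in the markup as delivered.
    static int countCells(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("table#data td").size();
    }

    public static void main(String[] args) {
        // Hypothetical page: in a browser, the inline script would add a row.
        String html =
                "<table id=\"data\"><tr><th>col1</th><th>col2</th></tr></table>"
              + "<script>"
              + "var tr = document.createElement('tr');"
              + "tr.appendChild(document.createElement('td'));"
              + "document.getElementById('data').appendChild(tr);"
              + "</script>";
        // The script is never run, so no cell is found.
        System.out.println(countCells(html)); // prints 0
    }
}
```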
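Loading the page with HtmlUnit first lets its scripts run; the generated HTML can then be passed
to Jsoup:
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(new File("page.html").toURI().toURL());
Document doc = Jsoup.parse(page.asXml());
for (Element col : doc.select("table#data td"))
    // print results
    System.out.println(col.ownText());
// clean up resources
webClient.close();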
Output
0.0
0.1
1.0
1.1
Chapter 5: Selectors
Remarks
A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive
(including against elements, attributes, and attribute values).
The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and
.header are equivalent).
Pattern | Matches | Example
* | any element | *
ns|E | elements of type E in the namespace ns | fb|name finds <fb:name> elements
#id | elements with attribute ID of "id" | div#wrap, #logo
.class | elements with a class name of "class" | div.left, .result
[attr] | elements with an attribute named "attr" (with any value) | a[href], [title]
[^attrPrefix] | elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets | [^data-], div[^data-]
[attr=val] | elements with an attribute named "attr", and value equal to "val" | img[width=500], a[rel=nofollow]
[attr="val"] | elements with an attribute named "attr", and value equal to "val" | span[hello="Cleveland"][goodbye="Columbus"], a[rel="nofollow"]
[attr^=valPrefix] | elements with an attribute named "attr", and value starting with "valPrefix" | a[href^=http:]
[attr$=valSuffix] | elements with an attribute named "attr", and value ending with "valSuffix" | img[src$=.png]
[attr*=valContaining] | elements with an attribute named "attr", and value containing "valContaining" | a[href*=/search/]
[attr~=regex] | elements with an attribute named "attr", and value matching the regular expression | img[src~=(?i)\.(png|jpe?g)]
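Several of the attribute patterns above can be tried against a small document. The markup and the class name AttributeSelectorDemo are made up for this sketch; the selectors themselves are taken from the Example column.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class AttributeSelectorDemo {
    // Hypothetical sample markup used to exercise the patterns.
    static final String HTML =
            "<div data-id=\"1\" class=\"result\">"
          + "<a href=\"http://example.com/a\" rel=\"nofollow\">absolute</a>"
          + "<a href=\"/search/q\">relative</a>"
          + "<img src=\"x.png\"><img src=\"y.JPG\"><img src=\"z.gif\">"
          + "</div>";

    // Returns how many elements in the sample markup match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "[^data-]", "a[rel=nofollow]", "a[href^=http:]", "img[src$=.png]",
            "a[href*=/search/]", "img[src~=(?i)\\.(png|jpe?g)]"
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

Note that the backslash of the regular expression has to be doubled inside a Java string literal.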
Examples
Selecting elements using CSS selectors
// Parse the document
Document doc = Jsoup.parse(html);
[...]
// You can also select within elements, e.g. anchors with an href attribute
// within the third paragraph. Note that select(...) returns Elements, not a single Element.
Elements links = thirdParagraph.select("a[href]");
// or the first <h1> element in the document body
Element headline = doc.select("body").first().select("h1").first();
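A self-contained version of these two selections might look as follows. The sample page, the way the third paragraph is located (by index), and the class name NestedSelectDemo are assumptions made for the sketch.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class NestedSelectDemo {
    // Hypothetical sample page.
    static final String HTML =
            "<html><head><title>Hello world!</title></head><body>"
          + "<h1>Hello there!</h1><p>First paragraph</p><p>Second paragraph</p>"
          + "<p>Third <a href=\"page.html\">paragraph</a></p></body></html>";

    // Returns the href of the first link inside the third <p>.
    static String thirdParagraphLink(Document doc) {
        Element thirdParagraph = doc.select("p").get(2);
        // Selecting within an element searches only its descendants.
        Elements links = thirdParagraph.select("a[href]");
        return links.attr("href");
    }

    // Returns the text of the first <h1> in the document body.
    static String headline(Document doc) {
        return doc.select("body").first().select("h1").first().text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(thirdParagraphLink(doc));
        System.out.println(headline(doc));
    }
}
```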
for (String twitterTag : twitterTags) {
    // display results
    System.out.printf("%s = %s%n", twitterTag, content);
}
Output
twitter:site =
twitter:site:id =
twitter:creator =
twitter:creator:id =
twitter:description = Q&A for professional and enthusiast programmers
twitter:title = Stack Overflow
twitter:image =
twitter:image:alt =
twitter:player =
twitter:player:width =
twitter:player:height =
twitter:player:stream =
twitter:app:name:iphone =
twitter:app:id:iphone =
twitter:app:url:iphone =
twitter:app:name:ipad =
twitter:app:id:ipad =
twitter:app:url:ipadt =
twitter:app:name:googleplay =
twitter:app:id:googleplay =
twitter:app:url:googleplay =
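Output of this shape could be produced by a lookup like the one sketched here. It assumes that each Twitter tag is stored as a meta element whose name attribute is the tag and whose content attribute is the value; the sample markup, the short tag list, and the class name TwitterCardDemo are likewise illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TwitterCardDemo {
    // Looks up each tag as <meta name="..."> and returns its content ("" if absent).
    static Map<String, String> readTags(String html, String... twitterTags) {
        Document doc = Jsoup.parse(html);
        Map<String, String> values = new LinkedHashMap<>();
        for (String twitterTag : twitterTags) {
            Element meta = doc.select("meta[name=\"" + twitterTag + "\"]").first();
            values.put(twitterTag, meta != null ? meta.attr("content") : "");
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical head section with two of the tags present.
        String html = "<head>"
                + "<meta name=\"twitter:title\" content=\"Stack Overflow\">"
                + "<meta name=\"twitter:description\" content=\"Q&amp;A for programmers\">"
                + "</head>";
        Map<String, String> tags =
                readTags(html, "twitter:site", "twitter:description", "twitter:title");
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            System.out.printf("%s = %s%n", tag.getKey(), tag.getValue());
        }
    }
}
```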
Chapter 6: Web crawling with Jsoup
Examples
Extracting email addresses & links to other pages
Jsoup can be used to extract links and email addresses from a webpage, and thus to build a "web
email address collector bot". First, this code uses a regular expression to extract the email
addresses, and then uses methods provided by Jsoup to extract the URLs of links on the page.
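// doc is the Document of the page being examined, obtained beforehand (not shown)
Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
Matcher matcher = p.matcher(doc.text());
Set<String> emails = new HashSet<String>();
while (matcher.find()) {
    emails.add(matcher.group());
}
// collect the href of every link on the page
Set<String> links = new HashSet<String>();
for (Element e : doc.select("a[href]")) {
    links.add(e.attr("href"));
}
System.out.println(emails);
System.out.println(links);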
This code could also be easily extended to also recursively visit those URLs and extract data from
linked pages. It could also easily be used with a different regex to extract other data.
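The email-matching step can be tried on its own with nothing but the Java standard library. In this sketch a plain string stands in for doc.text(), the class name EmailExtractor is invented, and the pattern is the one above with the hyphen moved to the end of the last character class. Like most short email patterns it is only approximate: for instance, it would also swallow a full stop that directly follows an address.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractor {
    // Approximate email pattern; not a full validator.
    private static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+");

    // Returns the distinct addresses found in the text, in order of first appearance.
    static Set<String> extractEmails(String text) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            emails.add(matcher.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        // Sample text standing in for the text of a parsed page.
        String text = "Contact alice@example.com, or bob.smith@mail.example.org today";
        System.out.println(extractEmails(text));
    }
}
```

Using a Set means an address that appears several times on a page is reported once.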
In this example, we will try to find JavaScript data which contains backgroundColor:'#FFF'. Then,
we will change the value of backgroundColor from '#FFF' to '#ddd'. This code uses the
getWholeData() and setWholeData() methods to manipulate JavaScript data. Alternatively, the html()
method can be used to get the JavaScript data.
// build the sample HTML
StringBuilder html = new StringBuilder();
html.append("<!DOCTYPE html> <html> <head> <title>Hello Jsoup!</title>");
html.append("<script>");
html.append("StackExchange.docs.comments.init({");
html.append("highlightColor: '#F4A83D',");
html.append("backgroundColor:'#FFF',");
html.append("});");
html.append("</script>");
html.append("<script>");
html.append("document.write(<style type='text/css'>div,iframe { top: 0; position:absolute; }</style>');");
html.append("</script>\n");
html.append("</head><body></body> </html>");
Output
<script>StackExchange.docs.comments.init({highlightColor:
'#F4A83D',backgroundColor:'#ddd',});</script>
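One way to perform the replacement is sketched below, assuming the script content is reached through Element.dataNodes(). The class name ScriptDataEditor, the helper rewriteScripts, and the shortened sample markup are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ScriptDataEditor {
    // Replaces `from` with `to` in the data of every <script> that contains `from`.
    static String rewriteScripts(String html, String from, String to) {
        Document doc = Jsoup.parse(html);
        for (Element script : doc.select("script")) {
            // Script content is held in DataNode children, not in text nodes.
            for (DataNode node : script.dataNodes()) {
                String data = node.getWholeData();
                if (data.contains(from)) {
                    node.setWholeData(data.replace(from, to));
                }
            }
        }
        return doc.html();
    }

    public static void main(String[] args) {
        // Shortened sample markup with a single script block.
        String html = "<html><head><script>"
                + "StackExchange.docs.comments.init({highlightColor: '#F4A83D',backgroundColor:'#FFF',});"
                + "</script></head><body></body></html>";
        System.out.println(
                rewriteScripts(html, "backgroundColor:'#FFF'", "backgroundColor:'#ddd'"));
    }
}
```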
In this example, we will extract all the web links from a website, using
http://stackoverflow.com/ for illustration. Recursion is used: each obtained link's page
is parsed for anchor tags, and each link found is submitted to the same function again.
The condition if(add && this_url.contains(my_site)) limits the results to your domain only.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
[...]
    public static void main(String[] args) {
[...]
        if (links.isEmpty()) {
            return;
        }
[...]
    }
}
The program can take a long time to run, depending on the size of the website. The above code can
be extended to extract data (like titles of pages, text, or images) from a particular website. It
is advisable to read a company's terms of use before scraping its website.
The example uses the Jsoup library to get the links; you can also get the links from
your_url/sitemap.xml.
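The recursion and the same-site condition can be sketched without network access. In the example below, an in-memory map from URL to HTML stands in for the website, so the class name SiteCrawlSketch, the map, and the example URLs are all assumptions; a real crawler would fetch each page with Jsoup.connect(url).get() instead.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SiteCrawlSketch {
    private final Map<String, String> pages;   // URL -> HTML, stands in for the network
    private final String mySite;               // only URLs containing this are followed
    private final Set<String> seen = new HashSet<>();
    private final Set<String> results = new LinkedHashSet<>();

    SiteCrawlSketch(Map<String, String> pages, String mySite) {
        this.pages = pages;
        this.mySite = mySite;
    }

    // Returns every on-site URL reachable from startUrl, each listed once.
    Set<String> crawl(String startUrl) {
        seen.add(startUrl);
        results.add(startUrl);
        visit(startUrl);
        return results;
    }

    private void visit(String url) {
        String html = pages.get(url);
        if (html == null) {
            return; // page not available
        }
        Document doc = Jsoup.parse(html, url);
        for (Element link : doc.select("a[href]")) {
            String thisUrl = link.attr("abs:href");
            boolean add = seen.add(thisUrl); // false if the URL was met before
            if (add && thisUrl.contains(mySite)) {
                results.add(thisUrl);
                visit(thisUrl);              // recurse into the linked page
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://site.test/", "<a href=\"/a\">a</a><a href=\"http://other.test/\">other</a>");
        pages.put("http://site.test/a", "<a href=\"/\">home</a>");
        System.out.println(new SiteCrawlSketch(pages, "site.test").crawl("http://site.test/"));
    }
}
```

The set of seen URLs is what stops the recursion from looping when pages link back to each other.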
Credits
S. No | Chapters | Contributors
2 | Formatting HTML Output | Zack Teater
3 | Logging into websites with Jsoup | Joel Min, JonasCz, Stephan
4 | Parsing Javascript Generated Pages | Zack Teater