Use CSS selectors to find elements
Problem
You want to find or manipulate elements using CSS selectors.
Solution
Use the Element.select(String cssQuery) and Elements.select(String cssQuery) methods:
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "https://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png
Element masthead = doc.select("div.masthead").first(); // div with class=masthead
Elements resultDivs = doc.select("h3.r > div"); // div elements that are direct children of h3.r
Elements resultAs = resultDivs.select("a"); // a elements within resultDivs
Description
jsoup elements support a CSS selector syntax for finding matching elements, which allows powerful and robust queries.
The select method is available on a Document, an Element, or an Elements collection. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.
select returns a list of elements (as an Elements object), which provides a range of methods to extract and manipulate the results.
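As a minimal sketch of what "contextual" means here, the following example selects once from a single element and once by chaining select calls. The sample HTML, the class name ContextualSelect, and the helper method names are made up for illustration; the jsoup calls used are Jsoup.parse, getElementById, select, and eachAttr.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ContextualSelect {
    // Hypothetical sample markup: one link in #nav, two links in #main.
    static final String HTML =
        "<div id='nav'><a href='/home'>Home</a></div>"
      + "<div id='main'><a href='/one'>One</a><a href='/two'>Two</a><p>No link</p></div>";

    // Select only within the #main element, not the whole document.
    static Elements mainLinks() {
        Document doc = Jsoup.parse(HTML);
        Element main = doc.getElementById("main");
        return main.select("a[href]");
    }

    // Chain: select all divs first, then select links within that result set.
    static Elements chainedLinks() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("div").select("a[href]");
    }

    public static void main(String[] args) {
        System.out.println(mainLinks().eachAttr("href"));
        System.out.println(chainedLinks().eachAttr("href"));
    }
}
```

Selecting from the #main element limits the search to its descendants, while the chained call searches under every div matched by the first select.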
See the Selector API reference for the full supported list and more details.
You can experiment with different CSS selectors on Try jsoup.
jsoup's Element.select(String cssQuery) method functions similarly to the JavaScript DOM method querySelectorAll(), and Element.selectFirst(String cssQuery) is equivalent to querySelector().
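The difference between the two methods can be sketched as follows. The sample HTML, the class name SelectFirstDemo, and its helper methods are illustrative assumptions; the example relies on selectFirst returning null when nothing matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectFirstDemo {
    // Hypothetical sample markup with two matching list items.
    static final String HTML =
        "<ul><li class='item'>First</li><li class='item'>Second</li></ul>";

    // select returns every match, like querySelectorAll.
    static int allItemsCount() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("li.item").size();
    }

    // selectFirst returns only the first match, like querySelector.
    static String firstItemText() {
        Document doc = Jsoup.parse(HTML);
        Element first = doc.selectFirst("li.item");
        return first == null ? null : first.text();
    }

    // With no match, selectFirst gives null rather than an empty list.
    static boolean missingIsNull() {
        return Jsoup.parse(HTML).selectFirst("table") == null;
    }

    public static void main(String[] args) {
        System.out.println(allItemsCount());
        System.out.println(firstItemText());
        System.out.println(missingIsNull());
    }
}
```

Because selectFirst can return null, callers that are not sure a match exists should check the result before using it.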
Selector overview
- tagname: find elements by tag, e.g. div
- #id: find elements by ID, e.g. #logo
- .class: find elements by class name, e.g. .masthead
- [attribute]: elements with attribute, e.g. [href]
- [^attrPrefix]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
- [attr=value]: elements with attribute value, e.g. [width=500] (also quotable, like [data-name='launch sequence'])
- [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
- [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
- *: all elements, e.g. *
- [*]: selects elements that have any attribute; e.g. p[*] finds paragraphs with at least one attribute, and p:not([*]) finds those with no attributes
- ns|tag: find elements by tag in a namespace prefix, e.g. dc|name finds <dc:name> elements
- *|tag: find elements by tag in any namespace prefix, e.g. *|name finds <dc:name> and <name> elements
- :empty: selects elements that have no children (ignoring blank text nodes, comments, etc.); e.g. li:empty
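A few of the attribute selectors above can be tried out with a small sketch like this one. The sample HTML, the class name AttributeSelectorDemo, and the find helper are hypothetical. Note that the backslash in the regular expression must be doubled inside a Java string literal.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class AttributeSelectorDemo {
    // Hypothetical sample markup: two anchors and three images.
    static final String HTML =
        "<a href='/path/a.html' data-id='1'>A</a>"
      + "<a name='anchor'>B</a>"
      + "<img src='logo.PNG' width='500'>"
      + "<img src='photo.jpeg'>"
      + "<img src='icon.gif'>";

    // Parse the sample and run one CSS query against the whole document.
    static Elements find(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery);
    }

    public static void main(String[] args) {
        System.out.println(find("a").size());              // by tag name
        System.out.println(find("[href]").size());         // has an href attribute
        System.out.println(find("[^data-]").size());       // attribute name starts with data-
        System.out.println(find("[width=500]").size());    // attribute equals value
        System.out.println(find("[href*=/path/]").size()); // attribute contains value
        // Regex match on the attribute value; (?i) makes it case-insensitive.
        System.out.println(find("img[src~=(?i)\\.(png|jpe?g)]").eachAttr("src"));
    }
}
```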
Selector combinations
- el#id: elements with ID, e.g. div#logo
- el.class: elements with class, e.g. div.masthead
- el[attr]: elements with attribute, e.g. a[href]
- Any combination, e.g. a[href].highlight
- ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
- parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
- siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
- siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
- el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
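The combinators above differ mainly in how far they reach, which a short sketch can make visible. The sample HTML, the class name CombinatorDemo, and the texts helper are assumptions for illustration; eachText returns the text of each matched element.

```java
import java.util.List;
import org.jsoup.Jsoup;

public class CombinatorDemo {
    // Hypothetical sample markup: one direct and one nested paragraph in
    // div.content, followed by a heading with two sibling paragraphs.
    static final String HTML =
        "<div class='content'><p>Direct</p><section><p>Nested</p></section></div>"
      + "<h1>Title</h1><p>Lead</p><p>Follow</p>";

    // Run a query and return the text of each matched element.
    static List<String> texts(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("div.content p"));   // any descendant p
        System.out.println(texts("div.content > p")); // direct child p only
        System.out.println(texts("h1 + p"));          // p immediately after h1
        System.out.println(texts("h1 ~ p"));          // every later sibling p of h1
        System.out.println(texts("h1, section"));     // elements matching either selector
    }
}
```

In this sample the descendant combinator finds both paragraphs inside div.content, while the child combinator finds only the one that is not wrapped in the section.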
Pseudo selectors
- :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
- :is(selector): find elements that match any of the selectors in the selector list; e.g. :is(h1, h2, h3, h4, h5, h6) finds any heading element
- :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
- :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
- :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
- :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
- Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc.
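The 0-based indexing of :lt, :gt, and :eq is easy to get wrong, so here is a small sketch alongside :has and :not. The sample HTML, the class name PseudoSelectorDemo, and the ids and texts helpers are hypothetical.

```java
import java.util.List;
import org.jsoup.Jsoup;

public class PseudoSelectorDemo {
    // Hypothetical sample markup: two divs and a single table row whose
    // cell text equals each cell's 0-based sibling index.
    static final String HTML =
        "<div id='a'><p>Has a paragraph</p></div>"
      + "<div id='b' class='logo'><span>No paragraph</span></div>"
      + "<table><tr><td>0</td><td>1</td><td>2</td><td>3</td></tr></table>";

    // Return the id attribute of each matched element.
    static List<String> ids(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).eachAttr("id");
    }

    // Return the text of each matched element.
    static List<String> texts(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(ids("div:has(p)"));     // divs containing a p
        System.out.println(ids("div:not(.logo)")); // divs without class logo
        System.out.println(texts("td:lt(2)"));     // sibling index 0 and 1
        System.out.println(texts("td:eq(1)"));     // sibling index exactly 1
        System.out.println(texts("td:gt(2)"));     // sibling index above 2
    }
}
```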
Text content pseudo selectors
- :contains(text): find elements that contain (directly or via children) the given normalized text. The search is case-insensitive; e.g. div:contains(jsoup)
- :containsOwn(text): find elements whose own text directly contains the given text; e.g. p:containsOwn(jsoup)
- :containsData(text): selects elements that contain the specified data (e.g. within <script>, <style>, or comments); e.g. script:containsData(jsoup)
- :containsWholeText(text): selects elements that contain the exact, non-normalized whole text (case sensitive, preserving whitespace/newlines); e.g. p:containsWholeText(jsoup The Java HTML Parser)
- :containsWholeOwnText(text): selects elements whose own text exactly matches the given non-normalized text (case sensitive); e.g. p:containsWholeOwnText(jsoup The Java HTML Parser)
- :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
- :matchesOwn(regex): find elements whose own text matches the specified regular expression
- :matchesWholeText(regex): selects elements whose entire, non-normalized text matches the specified regex; e.g. div:matchesWholeText(\d{3}-\d{2}-\d{4})
- :matchesWholeOwnText(regex): selects elements whose own non-normalized text matches the regex; e.g. span:matchesWholeOwnText(\w+)
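The distinction between an element's whole text and its own text is the part of this list most likely to surprise, so the sketch below contrasts :contains with :containsOwn and adds one :matches query. The sample HTML, the class name TextSelectorDemo, and the ids helper are made up for illustration.

```java
import java.util.List;
import org.jsoup.Jsoup;

public class TextSelectorDemo {
    // Hypothetical sample markup: the word "jsoup" appears only inside the
    // nested paragraph, not in the outer div's own text.
    static final String HTML =
        "<div id='outer'>Welcome <p id='inner'>to jsoup</p></div>"
      + "<div id='other'>Login here</div>";

    // Return the id attribute of each matched element.
    static List<String> ids(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).eachAttr("id");
    }

    public static void main(String[] args) {
        System.out.println(ids("div:contains(jsoup)"));    // text found via the child p
        System.out.println(ids("div:contains(JSOUP)"));    // the search ignores case
        System.out.println(ids("div:containsOwn(jsoup)")); // the div's own text lacks it
        System.out.println(ids("p:containsOwn(jsoup)"));   // the p's own text has it
        System.out.println(ids("div:matches((?i)login)")); // regex against element text
    }
}
```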
Structural pseudo selectors
- :root: selects the root element of the document (in HTML, the <html> element); e.g. :root
- :nth-child(an+b): selects elements with an+b-1 preceding siblings; supports expressions like 2n+1 for odd elements; e.g. tr:nth-child(2n+1)
- :nth-last-child(an+b): selects elements with an+b-1 following siblings; e.g. tr:nth-last-child(-n+2)
- :nth-of-type(an+b): selects elements based on their position among siblings of the same type; e.g. img:nth-of-type(2n+1)
- :nth-last-of-type(an+b): selects elements based on their position among siblings of the same type, counting from the end; e.g. img:nth-last-of-type(2n+1)
- :first-child: selects elements that are the first child of their parent; e.g. div > p:first-child
- :last-child: selects elements that are the last child of their parent; e.g. ol > li:last-child
- :first-of-type: selects the first element of its type among its siblings; e.g. dl dt:first-of-type
- :last-of-type: selects the last element of its type among its siblings; e.g. tr > td:last-of-type
- :only-child: selects elements that are the only child of their parent; e.g. div:only-child
- :only-of-type: selects elements that are the only element of their type among their siblings; e.g. span:only-of-type
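Unlike the indexed pseudo-selectors, the structural ones count positions from 1. A minimal sketch, assuming a made-up table and list, a hypothetical class name StructuralDemo, and a texts helper:

```java
import java.util.List;
import org.jsoup.Jsoup;

public class StructuralDemo {
    // Hypothetical sample markup: a four-row table and a three-item list.
    static final String HTML =
        "<table><tr><td>r1</td></tr><tr><td>r2</td></tr>"
      + "<tr><td>r3</td></tr><tr><td>r4</td></tr></table>"
      + "<ol><li>a</li><li>b</li><li>c</li></ol>";

    // Return the text of each matched element.
    static List<String> texts(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("tr:nth-child(2n+1)"));   // odd rows: 1st and 3rd
        System.out.println(texts("tr:nth-last-child(2)")); // second row from the end
        System.out.println(texts("li:nth-of-type(2)"));    // second li among its siblings
        System.out.println(texts("ol > li:first-child"));  // first item
        System.out.println(texts("ol > li:last-child"));   // last item
        System.out.println(texts("li:only-child"));        // none: each li has siblings
    }
}
```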