Processing XML and JSON in Python
Zdeněk Žabokrtský, Rudolf Rosa
Institute of Formal and Applied Linguistics
Charles University, Prague
NPFL092 Technology for Natural Language Processing
Zdeněk Žabokrtský, Rudolf Rosa (ÚFAL) XML & JSON in Python Techno4NLP 1 / 18
XML in Python
the two standard approaches to XML processing are supported in the
standard library:
  - xml.dom.* – a standard DOM API
  - xml.sax.* – a standard SAX API
but there’s a more pythonic API: xml.etree.ElementTree (ET for
short)
  - supports both DOM-like (i.e. all-in-memory) and SAX-like (i.e.
    event-based, streaming) processing
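A minimal sketch of the two processing styles with ET. The sample document (XML_TEXT, with its people/person tags) is invented for illustration and does not come from the slides. ET.fromstring builds the whole tree in memory, while ET.iterparse reports elements as parsing proceeds.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
XML_TEXT = "<people><person>Ada</person><person>Alan</person></people>"

# DOM-like: parse everything into an in-memory tree, then navigate it.
root = ET.fromstring(XML_TEXT)
names_in_memory = [person.text for person in root.findall("person")]

# SAX-like: handle each element when its end tag has been read.
names_streamed = []
source = io.BytesIO(XML_TEXT.encode("utf-8"))
for event, elem in ET.iterparse(source, events=("end",)):
    if elem.tag == "person":
        names_streamed.append(elem.text)
        elem.clear()  # drop the element's content once it has been handled

print(names_in_memory)
print(names_streamed)
```

Calling elem.clear() is what keeps memory use low on large inputs; without it, iterparse still ends up holding the complete tree.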
Credit: The following slides are based on an ElementTree intro by Eli
Bendersky.
ET: loading an XML doc
import xml.etree.ElementTree as ET
# 'example.xml' is a placeholder file name
tree = ET.ElementTree(file='example.xml')
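Loading can also be sketched without a file on disk. The example below assumes a small invented document (XML_TEXT) and uses an in-memory byte stream in place of a real file. ET.parse returns an ElementTree, while ET.fromstring returns the root Element directly.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
XML_TEXT = "<library><book id='b1'>Object Thinking</book></library>"

# ET.parse accepts a file name or a file object and returns an ElementTree.
tree = ET.parse(io.BytesIO(XML_TEXT.encode("utf-8")))

# ET.fromstring parses a string and returns the root Element directly.
root_from_string = ET.fromstring(XML_TEXT)

print(tree.getroot().tag)
print(root_from_string.find("book").get("id"))
```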
ET: traversing the tree
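root = tree.getroot()
# iterating over an element yields its direct children
for child in root:
    print(child.tag, child.attrib, child.text)
# iter() visits the element itself and all of its descendants
for descendant in root.iter():
    ...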
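The difference between direct children and all descendants can be seen on a small document. This is a sketch with an invented sample (XML_TEXT and its tags are assumptions, not taken from the slides).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
XML_TEXT = """<article lang="en">
  <title>Trees</title>
  <section id="s1"><para>First.</para><para>Second.</para></section>
</article>"""

root = ET.fromstring(XML_TEXT)

# Iterating over an element yields only its direct children.
child_tags = [child.tag for child in root]

# iter() walks the whole subtree in document order, starting with the element itself.
all_tags = [elem.tag for elem in root.iter()]

print(child_tags)   # ['title', 'section']
print(all_tags)     # ['article', 'title', 'section', 'para', 'para']
print(root.attrib)  # {'lang': 'en'}
```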
ET: simple searching
# iterate over all elements with the given tag, at any depth
for elem in tree.iter(tag='surname'):
    ...
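A short sketch of tag-based searching, assuming an invented sample document (XML_TEXT). It contrasts iter, which looks at any depth, with findall on a plain tag name, which looks at direct children only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
XML_TEXT = """<people>
  <person><name>Ada</name><surname>Lovelace</surname></person>
  <group>
    <person><name>Alan</name><surname>Turing</surname></person>
  </group>
</people>"""

root = ET.fromstring(XML_TEXT)

# iter("surname") finds matching elements anywhere below root.
surnames = [elem.text for elem in root.iter("surname")]

# findall("person") only considers direct children of root.
direct_persons = root.findall("person")

print(surnames)             # ['Lovelace', 'Turing']
print(len(direct_persons))  # 1
```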
ET: complex searching using XPath
# iterfind() yields the elements matching an XPath expression
for elem in tree.iterfind('*/section/figure[@id="f15"]'):
    ...
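The XPath expression from the slide can be tried on a small document. The sample below (XML_TEXT, with chapter/appendix wrappers) is an assumption made up to give the path something to match. ET supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
XML_TEXT = """<doc>
  <chapter>
    <section><figure id="f14"/><figure id="f15">Parse tree</figure></section>
  </chapter>
  <appendix>
    <section><figure id="f16"/></section>
  </appendix>
</doc>"""

tree = ET.ElementTree(ET.fromstring(XML_TEXT))

# '*' matches any child of the root; the predicate filters on an attribute value.
matches = tree.findall('*/section/figure[@id="f15"]')

# './/figure' matches figure elements at any depth.
all_figure_ids = [fig.get("id") for fig in tree.iterfind(".//figure")]

print([m.text for m in matches])  # ['Parse tree']
print(all_figure_ids)             # ['f14', 'f15', 'f16']
```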
ET: creating+storing an XML doc
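import sys
root = ET.Element('root')
new_elem = ET.SubElement(root, 'data')
tree = ET.ElementTree(root)
# encoding='unicode' is needed in Python 3 to write to a text stream
tree.write(sys.stdout, encoding='unicode')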
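A slightly fuller sketch of building and serializing a document. The attribute and text values are invented for illustration. It shows serialization to a string with ET.tostring and to a byte stream with an XML declaration.

```python
import io
import xml.etree.ElementTree as ET

# Build a tiny tree; the attribute and text values are made up.
root = ET.Element("root")
data = ET.SubElement(root, "data")
data.set("lang", "cs")
data.text = "Praha"

# Serialize to a Python str.
xml_string = ET.tostring(root, encoding="unicode")

# Serialize to bytes, including the XML declaration.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
xml_bytes = buffer.getvalue()

print(xml_string)  # <root><data lang="cs">Praha</data></root>
print(xml_bytes)
```

With a file name instead of the BytesIO buffer, the same write call stores the document on disk.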
JSON
JavaScript Object Notation
a simple text-oriented format for data exchange between a browser
and a server
inspired by JavaScript object literal syntax, but nowadays used well
beyond the JavaScript world
has become one of the most popular data exchange formats in recent
years
XML vs. JSON – a first glimpse
XML:
<?xml version="1.0"?>
<book id="123">
  <title>Object Thinking</title>
  <author>David West</author>
  <published>
    <by>Microsoft Press</by>
    <year>2004</year>
  </published>
</book>
JSON:
{
  "id": 123,
  "title": "Object Thinking",
  "author": "David West",
  "published": {
    "by": "Microsoft Press",
    "year": 2004
  }
}
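Both representations of the book can be read with the standard library. The sketch below reuses the two documents from the slide; the variable names are assumptions. One visible difference is that XML content always arrives as text, while JSON numbers arrive as numbers.

```python
import json
import xml.etree.ElementTree as ET

XML_TEXT = """<?xml version="1.0"?>
<book id="123">
  <title>Object Thinking</title>
  <author>David West</author>
  <published>
    <by>Microsoft Press</by>
    <year>2004</year>
  </published>
</book>"""

JSON_TEXT = """{
  "id": 123,
  "title": "Object Thinking",
  "author": "David West",
  "published": {"by": "Microsoft Press", "year": 2004}
}"""

book_xml = ET.fromstring(XML_TEXT)
book_json = json.loads(JSON_TEXT)

# XML: attribute values and element text are strings.
xml_year = book_xml.find("published/year").text
# JSON: the number is decoded to a Python int.
json_year = book_json["published"]["year"]

print(repr(xml_year))   # '2004'
print(repr(json_year))  # 2004
```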
JSON – a quick syntax tour
data – hierarchical structures
curly braces hold objects
  - name and value separated by colon
  - name-value pairs separated by comma
square brackets hold arrays
  - values separated by comma
whitespace (space, tab, LF, CR) around syntactic elements is ignored
BOM not allowed
no syntax for comments
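The syntax rules above can be checked against Python's json module. This is a sketch; the helper is_valid_json is a hypothetical name introduced here, and the test strings are invented.

```python
import json

def is_valid_json(text):
    """Return True if the json module accepts the text (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

# Whitespace around syntactic elements does not change the parsed value.
compact = '{"a":[1,2],"b":{"c":null}}'
spaced = '{ "a" : [ 1 , 2 ] ,\n\t"b" : { "c" : null } }'
same_value = json.loads(compact) == json.loads(spaced)

# Comments and a leading BOM are rejected.
comment_ok = is_valid_json('{"a": 1} // a comment')
bom_ok = is_valid_json('\ufeff{"a": 1}')

print(same_value, comment_ok, bom_ok)  # True False False
```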
JSON – data types
number
string
boolean
array
object
null
JSON in Python
json – a JSON API available in the standard library
API similar to that of pickle
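The parallel with pickle can be sketched side by side. The record below is invented for illustration. Both modules offer dumps/loads for in-memory data and dump/load for file objects; json produces text, pickle produces bytes.

```python
import io
import json
import pickle

# Hypothetical record, invented for illustration.
record = {"form": "Bob", "type": "firstname", "span": [0, 1]}

# In-memory serialization: json gives a str, pickle gives bytes.
json_text = json.dumps(record)
pickle_bytes = pickle.dumps(record)

from_json = json.loads(json_text)
from_pickle = pickle.loads(pickle_bytes)

# File-object variants: dump writes to a stream, load reads from it.
text_file = io.StringIO()
json.dump(record, text_file)
text_file.seek(0)
from_file = json.load(text_file)

print(type(json_text).__name__, type(pickle_bytes).__name__)  # str bytes
print(from_json == from_pickle == from_file == record)        # True
```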
json: Implicit type conversions
a JSON object goes to Python dict
a JSON array goes to Python list
a JSON string goes to Python str
a JSON number goes to Python int or float
a JSON true goes to Python True
etc.
and vice versa.
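The conversions listed above can be observed directly. The sketch below uses invented sample data and also shows that "vice versa" is not always an exact round trip: tuples come back as lists and non-string dict keys come back as strings.

```python
import json

# Decode one value of each JSON type and record the resulting Python type.
decoded = json.loads(
    '{"o": {}, "a": [], "s": "x", "i": 1, "f": 1.5, "t": true, "n": null}'
)
decoded_types = {key: type(value).__name__ for key, value in decoded.items()}
print(decoded_types)

# Encoding and decoding again does not always restore the original object.
original = {1: ("a", "b")}
round_trip = json.loads(json.dumps(original))
print(round_trip)  # {'1': ['a', 'b']}
```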
json: serializing/deserializing
import json
named_entity = {"form": "Bob", "type": "firstname", "span": [0, 1, 2]}
serialized = json.dumps(named_entity)
restored = json.loads(serialized)
json: selected serialization options
There’s some space for customizing the serialization (within the limits
given by the JSON spec):
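ensure_ascii – whether non-ASCII characters are escaped as \uXXXX
(True by default; the encoding parameter of Python 2 is gone in Python 3,
where the result is always a str)
indent – pretty-printing with the specified indent level for object
members
sort_keys – output of dictionaries sorted lexicographically by key
separators – tuple (item_separator, key_separator)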
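A sketch of how the serialization options change the output of json.dumps in Python 3. The entity below is invented for illustration. Besides indent, sort_keys and separators, it uses ensure_ascii, the Python 3 parameter that controls escaping of non-ASCII characters.

```python
import json

# Hypothetical entity, invented for illustration.
entity = {"type": "city", "form": "Plzeň", "span": [3, 4]}

# Defaults: non-ASCII characters are escaped, keys keep insertion order.
default = json.dumps(entity)

# Keep non-ASCII characters as they are and sort the keys.
readable = json.dumps(entity, ensure_ascii=False, sort_keys=True)

# The most compact form: no spaces after the separators.
compact = json.dumps(entity, ensure_ascii=False, separators=(",", ":"))

# Pretty-printing with two spaces per nesting level.
pretty = json.dumps(entity, indent=2, sort_keys=True)

print(default)
print(readable)  # {"form": "Plzeň", "span": [3, 4], "type": "city"}
print(compact)   # {"type":"city","form":"Plzeň","span":[3,4]}
print(pretty)
```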
XML vs. JSON – similarities
both XML and JSON are frequently used for data interchange
both formats are human readable (if designed properly)
both are currently supported by many programming languages
XML vs. JSON – differences
as usual, we face the trade-off of simplicity against expressiveness
with some over-simplification: JSON is a lightweight cousin of XML
JSON is slightly less verbose and simpler (and faster) to parse...
..., but currently there’s more functionality associated with the XML
standard: namespaces, referencing, validation schemas, stylesheet
transformations, query languages etc.
so there’s no clear superiority of one over the other
your final choice should depend on what you really need (and, of
course, on the system context)
XML vs. JSON – can we estimate future from history?
In the 1990s, XML was introduced as a considerably simplified
descendant of SGML.
But 20 years later SGML is still all around us, incarnated
in basically every web page.
However, does XML have such a killer app now?