XML BEGG SemiStructured Notes
What semistructured data is. Concepts of the Object Exchange Model (OEM), a model for semistructured data. Basics of Lore, a semistructured DBMS, and its query language, Lorel. Main language elements of XML. Difference between well-formed and valid XML documents. How Document Type Definitions (DTDs) can be used to define valid syntax of an XML document.
COMP 302
Valentina Tamma
Objectives
Introduction
How Document Object Model (DOM) compares with OEM. About other related XML technologies. Limitations of DTDs and how XML Schema overcomes these limitations. How RDF and RDF Schema provide a foundation for processing metadata. W3C XQuery Language. How to map XML to databases. SQL:2003 support for XML.
XML 1.0 was formally ratified by the W3C in 1998, yet it has already impacted every aspect of programming, including graphical interfaces, embedded systems, distributed systems, and database management. It is already becoming the de facto standard for data communication within the software industry, and is quickly replacing EDI systems as the primary medium for data interchange among businesses. Some analysts believe it will become the language in which most documents are created and stored, both on and off the Internet.
Semistructured Data
Data that may be irregular or incomplete and have a structure that may change rapidly or unpredictably. Semistructured data is data that has some structure, but the structure may not be rigid, regular, or complete. Generally, the data does not conform to a fixed schema (the terms schema-less and self-describing are sometimes used).
Information normally associated with schema is contained within data itself. Some forms of semistructured data have no separate schema, in others it exists but only places loose constraints on data. Unfortunately, relational, object-oriented, and object-relational DBMSs do not handle data of this nature particularly well.
Example (figure omitted)
Data in OEM is schema-less and self-describing, and can be thought of as a labeled directed graph where nodes are objects, each consisting of:
a unique object identifier (for example, &7); a descriptive textual label (street); a type (string); a value (22 Deer Rd).
A label indicates what the object represents and is used to identify the object and to convey the meaning of the object, and so should be as informative as possible. Labels can change dynamically. A name is a special label that serves as an alias for a single object and acts as an entry point into the database (for example, DreamHome is a name that denotes object &1).
An OEM object can be considered as a quadruple (label, oid, type, value). For example:
{Staff, &4, set, {&9, &10}}
{name, &9, string, "Ann Beech"}
{salary, &10, decimal, 12000}
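As a rough illustration, the three quadruples above can be held in a small Python structure. The class name OEMObject and the helper functions are assumptions made for this sketch; they are not part of OEM or Lore.

```python
from typing import NamedTuple, Union

class OEMObject(NamedTuple):
    """One OEM quadruple (label, oid, type, value); the class name is hypothetical."""
    label: str
    oid: str
    type: str
    value: Union[str, int, float, frozenset]

# The three objects from the example; a "set" value holds the oids of its children.
objects = {
    "&4": OEMObject("Staff", "&4", "set", frozenset({"&9", "&10"})),
    "&9": OEMObject("name", "&9", "string", "Ann Beech"),
    "&10": OEMObject("salary", "&10", "decimal", 12000),
}

def children(oid):
    """Return the child objects of a set-valued object, sorted by oid."""
    obj = objects[oid]
    if obj.type != "set":
        return []  # atomic objects have no outgoing edges
    return sorted((objects[c] for c in obj.value), key=lambda o: o.oid)

def describe(oid):
    """Map each child label to its value, showing that the data is self-describing."""
    return {child.label: child.value for child in children(oid)}
```

Here describe("&4") collects the labels and values of the Staff object without consulting any separate schema.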
Lore (Lightweight Object REpository) is a multi-user DBMS, supporting crash recovery, materialized views, bulk loading of files in some standard format (XML is supported), and a declarative update language. It has an external data manager that enables data from external sources to be fetched dynamically and combined with local data during query processing.
Lorel
Lorel (the Lore language) is an extension to OQL. Lorel was intended to handle:
queries that return meaningful results even when some data is absent; queries that operate uniformly over single-valued and set-valued data; queries that operate uniformly over data with different types; queries that return heterogeneous objects; queries where the object structure is not fully known.
Lorel supports declarative path expressions for traversing graph structures and automatic coercion for handling heterogeneous and typeless data. A path expression is essentially a sequence of edge labels (L1.L2...Ln), which for a given graph yields a set of nodes. For example:
DreamHome.PropertyForRent yields the set of nodes {&5, &6}; DreamHome.PropertyForRent.street yields the set of nodes containing the strings {"2 Manor Rd", "18 Dale Rd"}.
Lorel also supports general path expressions that provide for arbitrary paths:
| indicates selection; ? indicates zero or one occurrences; + indicates one or more occurrences; * indicates zero or more occurrences.
For example:
DreamHome.(Branch | PropertyForRent).street would match path beginning with DreamHome, followed by either a Branch edge or a PropertyForRent edge, followed by a street edge.
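A minimal sketch of how such path expressions could be evaluated over a labeled graph is given below. It handles plain label sequences and the | alternation only (not ?, + or *). The Branch object &2 and its street &7 are assumptions added so the alternation has something to match; the evaluate function is illustrative and is not Lore's implementation.

```python
# Edge-labelled graph: oid -> list of (label, target oid).
# &5, &6, &11, &14 follow the examples in the text; &2 and &7 are assumed.
EDGES = {
    "&1": [("Branch", "&2"), ("PropertyForRent", "&5"), ("PropertyForRent", "&6")],
    "&2": [("street", "&7")],
    "&5": [("street", "&11")],
    "&6": [("street", "&14")],
}
VALUES = {"&7": "22 Deer Rd", "&11": "2 Manor Rd", "&14": "18 Dale Rd"}
NAMES = {"DreamHome": "&1"}  # a name is an entry point into the database

def evaluate(path):
    """Evaluate a path such as 'DreamHome.(Branch | PropertyForRent).street'."""
    first, *steps = path.split(".")
    current = [NAMES[first]]
    for step in steps:
        # A step is one label, or several alternatives written as (A | B).
        labels = {part.strip() for part in step.strip("()").split("|")}
        current = [target
                   for oid in current
                   for (label, target) in EDGES.get(oid, [])
                   if label in labels]
    return current
```

Each step replaces the current set of nodes with the nodes reachable over a matching edge, which is why the result of a path expression is a set of nodes rather than a single object.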
Data in FROM clause contains objects &3 and &4. Applying WHERE restricts this set to object &4. Then apply SELECT clause.
Answer
  PropertyForRent &5
    street &11 "2 Manor Rd"
    type &12 "Flat"
    monthlyRent &13 375
    OverseenBy &4
  PropertyForRent &6
    street &14 "18 Dale Rd"
    type &15 1
    annualRent &16 7200
    OverseenBy &4
DataGuide
A dynamically generated and maintained structural summary of database, which serves as a dynamic schema. Has three properties:
conciseness: every label path in the database appears exactly once in the DataGuide; accuracy: every label path in DataGuide exists in original database; convenience: a DataGuide is an OEM (or XML) object, so can be stored and accessed using same techniques as for source database.
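One way to obtain a summary with the conciseness and accuracy properties is a subset construction, in which each DataGuide node stands for the set of source objects reachable by one label path. The sketch below assumes a small, partly hypothetical graph and is not the algorithm as implemented in Lore.

```python
def build_dataguide(edges, root):
    """Build a DataGuide in which every label path of the source appears once."""
    start = frozenset([root])
    guide = {}          # DataGuide node -> {label: DataGuide node}
    todo = [start]
    while todo:
        node = todo.pop()
        if node in guide:
            continue    # already expanded (also guards against cycles)
        targets = {}
        for oid in node:
            for label, target in edges.get(oid, []):
                targets.setdefault(label, set()).add(target)
        guide[node] = {label: frozenset(t) for label, t in targets.items()}
        todo.extend(guide[node].values())
    return start, guide

def path_exists(start, guide, labels):
    """Check a label path by visiting at most len(labels) DataGuide nodes."""
    node = start
    for label in labels:
        if label not in guide[node]:
            return False
        node = guide[node][label]
    return True

# Hypothetical fragment of the DreamHome graph: oid -> [(label, target oid)].
EDGES = {
    "&1": [("PropertyForRent", "&5"), ("PropertyForRent", "&6"), ("Staff", "&4")],
    "&5": [("street", "&11"), ("monthlyRent", "&13")],
    "&6": [("street", "&14"), ("annualRent", "&16")],
}
start, guide = build_dataguide(EDGES, "&1")
```

Both PropertyForRent objects collapse into a single DataGuide node, so the label path PropertyForRent.street is stored once even though it occurs twice in the source.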
DataGuides
Can determine whether a given label path of length n exists in the source database by considering at most n objects in the DataGuide. For example, to verify whether the path Staff.Oversees.annualRent exists, need only examine the outgoing edges of objects &19, &21, and &22 in our DataGuide. Further, the only objects that can follow Branch are those reached by the two outgoing edges of object &20.
XML
A meta-language (a language for describing other languages) that enables designers to create their own customized tags to provide functionality not available with HTML. Most documents on Web currently stored and transmitted in HTML. One strength of HTML is its simplicity. Simplicity may also be one of its weaknesses, with users wanting tags to simplify some tasks and make HTML documents more attractive and dynamic.
To satisfy this demand, vendors introduced some browser-specific HTML tags, making it difficult to develop sophisticated, widely viewable Web documents. W3C has produced XML, which could preserve the general application independence that makes HTML portable and powerful.
XML is a restricted version of SGML, designed especially for Web documents. SGML allows a document to be logically separated into two parts: one that defines the structure of the document (the DTD), the other containing the text itself. By giving documents a separately defined structure, and by giving authors the ability to define custom structures, SGML provides an extremely powerful document management system. However, SGML has not been widely adopted due to its inherent complexity.
XML attempts to provide a similar function to SGML, but is less complex and, at same time, network-aware. XML retains key SGML advantages of extensibility, structure, and validation. Since XML is a restricted form of SGML, any fully compliant SGML system will be able to read XML documents (although the opposite is not true). XML is not intended as a replacement for SGML or HTML.
Advantages of XML
Simplicity; open standard and platform/vendor-independent; extensibility; reuse; separation of content and presentation; improved load balancing; support for integration of data from multiple sources; ability to describe data from a wide variety of applications; more advanced search engines; new opportunities.
XML - Elements
Elements, or tags, are the most common form of markup. The first element must be a root element, which can contain other (sub)elements. An XML document must have exactly one root element (<STAFFLIST>). An element begins with a start-tag (<STAFF>) and ends with an end-tag (</STAFF>). XML elements are case sensitive. An element can be empty, in which case it can be abbreviated to <EMPTYELEMENT/>. Elements must be properly nested.
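The rules above can be seen by parsing a small document with Python's standard xml.etree.ElementTree module. The document content (including the empty NOK element) is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

doc = """<STAFFLIST>
  <STAFF>
    <STAFFNO>SL21</STAFFNO>
    <NOK/>
  </STAFF>
</STAFFLIST>"""

root = ET.fromstring(doc)                     # the single root element
staff = root.find("STAFF")                    # a properly nested subelement
child_tags = [child.tag for child in staff]   # children in document order
empty = staff.find("NOK")                     # <NOK/> is an empty element

# Element names are case sensitive: <staff> is not the same as <STAFF>.
lowercase_match = root.find("staff")
```

The lookup for the lowercase name finds nothing, because STAFF and staff are different element names.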
XML - Attributes
Attributes are name-value pairs that contain descriptive information about an element. Attribute is placed inside start-tag after corresponding element name with the attribute value enclosed in quotes.
<STAFF branchNo = "B005">
Could also have represented branch as subelement of STAFF. A given attribute may only occur once within a tag, while subelements with same tag may be repeated.
XML declaration: optional at start of XML document. Entity references: serve various purposes, such as shortcuts to often repeated text or to distinguish reserved characters from content. Comments: enclosed in <!-- and --> tags. CDATA sections: instruct the XML processor to ignore markup characters and pass the enclosed text directly to the application. Processing instructions: can also be used to provide information to the application.
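A short sketch of how a parser presents attributes, entity references, and CDATA sections to an application, again using the standard xml.etree.ElementTree module. The NAME and NOTE elements and their content are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<!-- a comment: dropped by the default ElementTree parser -->
<STAFF branchNo="B005">
  <NAME>Beech &amp; Co</NAME>
  <NOTE><![CDATA[salary < 15000 & rising]]></NOTE>
</STAFF>"""

staff = ET.fromstring(doc)
branch = staff.get("branchNo")    # attribute value, quotes removed
name = staff.findtext("NAME")     # &amp; is replaced by the character &
note = staff.findtext("NOTE")     # CDATA content is passed through as plain text
```

Inside the CDATA section the characters < and & need no escaping, whereas in ordinary content they must be written as entity references.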
XML Ordering
Semistructured data model described earlier assumes collections are unordered. In XML, elements are ordered. In contrast, in XML attributes are unordered.
A Document Type Definition (DTD) defines the valid syntax of an XML document. It lists the element names that can occur in the document, which elements can appear in combination with which other ones, how elements can be nested, what attributes are available for each element type, and so on. The term vocabulary is sometimes used to refer to the elements used in a particular application. The grammar is specified using EBNF, not XML. Although optional, a DTD is recommended for document conformity.
Element type declarations identify the rules for elements that can occur in the XML document. Options for repetition are:
* indicates zero or more occurrences for an element; + indicates one or more occurrences for an element; ? indicates either zero occurrences or exactly one occurrence for an element.
A name with no qualifying punctuation must occur exactly once. Commas between element names indicate they must occur in succession, in the order listed; a vertical bar (|) between names indicates that exactly one of the alternatives occurs.
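Because an element-only content model is a regular expression over child element names, the repetition rules can be sketched by translating a model into a Python regex. The STAFF content model used here is hypothetical, mixed content (#PCDATA) is not handled, and this is an illustration rather than a DTD validator.

```python
import re

OPERATORS = {"(", ")", "?", "*", "+", "|"}

def content_model_to_regex(model):
    """Translate an element-only DTD content model into a compiled regex.

    Each element name N becomes the group (?:N ), so a list of child names can
    be checked by joining them with trailing spaces and matching the whole string.
    """
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[()?*+|,]", model)
    parts = []
    for tok in tokens:
        if tok == ",":
            continue                                  # sequence is concatenation
        if tok in OPERATORS:
            parts.append("(?:" if tok == "(" else tok)
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def children_valid(model, child_names):
    """Return True if the sequence of child element names satisfies the model."""
    text = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(text) is not None

STAFF_MODEL = "(NAME, POSITION, DOB?, SALARY)"        # hypothetical declaration
```

With this model, DOB may be omitted but the remaining children must appear exactly once and in the declared order.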
Attribute list declarations identify which elements may have attributes, what attributes they may have, what values the attributes may hold, plus optional defaults. Some types:
CDATA: character data, containing any text; ID: used to identify individual elements in the document (an ID value must be a valid XML name and unique within the document); IDREF/IDREFS: must correspond to the value of ID attribute(s) for some element in the document; list of names: the values that the attribute can hold (enumerated type).
ID allows unique key to be associated with an element. IDREF allows an element to refer to another element with the designated key, and attribute type IDREFS allows an element to refer to multiple elements. To loosely model relationship Branch Has Staff:
<!ATTLIST STAFF staffNo ID #REQUIRED>
<!ATTLIST BRANCH staff IDREFS #IMPLIED>
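The check that a validating parser performs for these two declarations can be sketched by hand. ElementTree does not read DTDs, so the attribute types are assumed from the declarations above; the DREAMHOME root element and the staff numbers are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<DREAMHOME>
  <BRANCH staff="SL21 SG5"/>
  <STAFF staffNo="SL21"/>
  <STAFF staffNo="SG5"/>
</DREAMHOME>"""

def dangling_idrefs(root):
    """Return the IDREFS values on BRANCH that match no STAFF staffNo ID."""
    ids = [s.get("staffNo") for s in root.iter("STAFF")]
    if len(ids) != len(set(ids)):
        raise ValueError("ID values must be unique within the document")
    known = set(ids)
    # An IDREFS value is a whitespace-separated list of ID values.
    refs = [r for b in root.iter("BRANCH") for r in b.get("staff", "").split()]
    return [r for r in refs if r not in known]

missing = dangling_idrefs(ET.fromstring(doc))
```

An empty result means every staff reference held by a branch resolves to an existing STAFF element, which is the loose Branch Has Staff relationship described above.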
Two levels of document processing: well-formed and valid. A non-validating processor ensures an XML document is well-formed before passing information on to the application. An XML document that conforms to the structural and notational rules of XML is considered well-formed; e.g.:
the XML declaration (<?xml version="1.0"?>), if present, must appear at the very start of the document; all elements must be within one root element; elements must be nested in a tree structure without any overlap.
Validating processor will not only check that an XML document is well-formed but that it also conforms to a DTD, in which case XML document is considered valid.
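Python's standard ElementTree parser is a non-validating processor, so it can be used to sketch the well-formedness check only; it does not check a document against a DTD. The helper name is_well_formed is an assumption for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

nested = "<STAFF><STAFFNO>SL21</STAFFNO></STAFF>"
overlapping = "<STAFF><STAFFNO>SL21</STAFF></STAFFNO>"   # tags overlap
two_roots = "<STAFF/><STAFF/>"                            # more than one root element
```

Checking validity as well would require a validating processor and a DTD, which this module does not provide.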
XML APIs generally fall into two categories: tree-based and event-based. DOM (Document Object Model) is a tree-based API that provides an object-oriented view of data. The API was created by W3C and describes a set of platform- and language-neutral interfaces that can represent any well-formed XML/HTML document. It builds an in-memory representation of the document and provides classes and methods to allow an application to navigate and process the tree.
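A brief sketch of tree-based processing using xml.dom.minidom, the lightweight DOM implementation in the Python standard library. The document content is illustrative.

```python
from xml.dom import minidom

dom = minidom.parseString(
    "<STAFFLIST><STAFF branchNo='B005'><STAFFNO>SL21</STAFFNO></STAFF></STAFFLIST>"
)

root = dom.documentElement                            # root element node
staff = root.getElementsByTagName("STAFF")[0]         # navigate down the tree
staffno = staff.getElementsByTagName("STAFFNO")[0]
staffno_text = staffno.firstChild.data                # text node under STAFFNO

# Because the whole tree is in memory, it can also be modified in place.
salary = dom.createElement("SALARY")
salary.appendChild(dom.createTextNode("30000"))
staff.appendChild(salary)
```

The cost of this convenience is that the entire document must be held in memory before any processing starts.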
SAX (Simple API for XML) is an event-based, serial-access API that uses callbacks to report parsing events to the application. For example, there are events for start and end elements. The application handles these events through customized event handlers. Unlike tree-based APIs, event-based APIs do not build an in-memory tree representation of the XML document. SAX was a product of collaboration on the XML-DEV mailing list, rather than a product of W3C.
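The callback style can be sketched with the standard xml.sax module. The handler class name EventRecorder is an assumption; startElement and endElement are the callbacks defined by xml.sax.ContentHandler.

```python
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Customized event handler that records start and end element events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

handler = EventRecorder()
# The parser reads the input serially and calls the handler as it goes;
# no tree is built.
xml.sax.parseString(b"<STAFF><STAFFNO>SL21</STAFFNO></STAFF>", handler)
```

The application sees each element exactly once, in document order, and must keep any state it needs itself.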
Namespaces allow element names and relationships in XML documents to be qualified to avoid name collisions for elements that have the same name but are defined in different vocabularies. They allow tags from multiple namespaces to be mixed, which is essential if data comes from multiple sources. For uniqueness, elements and attributes are given globally unique names using a URI reference.
Namespaces
<STAFFLIST xmlns="http://www.dreamhome.co.uk/branch5/"
           xmlns:hq="http://www.dreamhome.co.uk/HQ/">
  <STAFF branchNo="B005">
    <STAFFNO>SL21</STAFFNO>
    <hq:SALARY>30000</hq:SALARY>
  </STAFF>
</STAFFLIST>
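The effect of the default namespace and the hq prefix in the example above can be seen with the standard ElementTree module, which expands every qualified name to the form {URI}local-name. The prefix b in the lookup table is an arbitrary choice for this sketch, and the namespace URIs are written as plain http:// URIs.

```python
import xml.etree.ElementTree as ET

doc = """<STAFFLIST xmlns="http://www.dreamhome.co.uk/branch5/"
           xmlns:hq="http://www.dreamhome.co.uk/HQ/">
  <STAFF branchNo="B005">
    <STAFFNO>SL21</STAFFNO>
    <hq:SALARY>30000</hq:SALARY>
  </STAFF>
</STAFFLIST>"""

# Prefixes used for lookups are local to this code; only the URIs matter.
ns = {"b": "http://www.dreamhome.co.uk/branch5/",
      "hq": "http://www.dreamhome.co.uk/HQ/"}

root = ET.fromstring(doc)
staffno = root.findtext("b:STAFF/b:STAFFNO", namespaces=ns)
salary = root.findtext("b:STAFF/hq:SALARY", namespaces=ns)
unqualified = root.find("STAFF")   # no namespace given, so nothing matches
```

Two SALARY elements from different sources would therefore remain distinguishable as long as their namespace URIs differ.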
In HTML, default styling is built into browsers as the tag set for HTML is predefined and fixed. The Cascading Stylesheet Specification (CSS) provides an alternative rendering for tags. CSS can also be used to render XML in a browser but cannot make structural alterations to a document. XSL (eXtensible Stylesheet Language) was created to define how XML data is rendered and to define how one XML document can be transformed into another document.
XSLT
A subset of XSL, XSLT is a language in both the markup and the programming sense, providing a mechanism to transform XML structure into either another XML structure, HTML, or any number of other text-based formats (such as SQL). XSLT's main ability is to change the underlying structures rather than simply the media representations of those structures, as with CSS.
XSLT is important because it provides a mechanism for dynamically changing the view of a document and for filtering data. It is also robust enough to encode business rules and it can generate graphics (not just documents) from data. It can even handle communicating with servers (scripting modules can be integrated into XSLT) and can generate the appropriate messages within the body of XSLT itself.
XPath
XPath is a declarative query language for XML that provides a simple syntax for addressing parts of an XML document. It was designed for use with XSLT (for pattern matching) and XPointer (for addressing). With XPath, collections of elements can be retrieved by specifying a directory-like path, with zero or more conditions placed on the path. It uses a compact, string-based syntax, rather than a structural XML-element-based syntax, allowing XPath expressions to be used both in XML attributes and in URIs.
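Python's standard ElementTree module supports a limited subset of XPath, which is enough to sketch directory-like paths with a condition. The staff data is hypothetical, and a full XPath processor would accept many more expressions than these.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<STAFFLIST>
  <STAFF branchNo="B005"><STAFFNO>SL21</STAFFNO><SALARY>30000</SALARY></STAFF>
  <STAFF branchNo="B003"><STAFFNO>SG37</STAFFNO><SALARY>12000</SALARY></STAFF>
</STAFFLIST>""")

# Directory-like path: STAFFNO children of STAFF children of the root.
all_numbers = [e.text for e in root.findall("./STAFF/STAFFNO")]

# The same path with a condition on an attribute.
b005 = [e.text for e in root.findall("./STAFF[@branchNo='B005']/STAFFNO")]

# '//' selects matching elements at any depth below the current node.
salaries = [e.text for e in root.findall(".//SALARY")]
```

Each expression is an ordinary string, which is what allows XPath to be embedded in XML attributes and URIs.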
XPointer
XPointer provides access to the values of attributes or content of elements anywhere within an XML document. It is basically an XPath expression occurring within a URI. Among other things, XPointer can link to sections of text, select particular elements or attributes, and navigate through elements. It can also select data contained within more than one set of nodes, which cannot be done with XPath.
XLink
XLink allows elements to be inserted into XML documents to create and describe links between resources. It uses XML syntax to create structures that can describe links similar to the simple unidirectional hyperlinks of HTML as well as more sophisticated links. There are two types of XLink: simple and extended. A simple link connects a source to a destination resource; an extended link connects any number of resources.
XHTML (eXtensible HTML) 1.0 is a reformulation of HTML 4.01 in XML 1.0 and is intended to be the next generation of HTML. It is basically a stricter and cleaner version of HTML; e.g.:
tags and attributes must be in lowercase; all XHTML elements must have an end-tag; attribute values must be quoted and minimization is not allowed; the ID attribute replaces the name attribute; documents must conform to XML rules.
WSDL Concepts
XML Schema
XML Schema is a more comprehensive method of defining the content model of an XML document. The additional expressiveness will allow Web applications to exchange XML data more robustly without relying on ad hoc validation tools.
An XML schema is the definition (both in terms of its organization and its data types) of a specific XML structure. The XML Schema language specifies how each type of element in the schema is defined and the element's data type. A schema is an XML document, and so can be edited and processed by the same tools that read the XML it describes.
Elements that do not contain other elements or attributes are of type simpleType.
<xsd:element name="STAFFNO" type="xsd:string"/>
<xsd:element name="DOB" type="xsd:date"/>
<xsd:element name="SALARY" type="xsd:decimal"/>
Elements that contain other elements are of type complexType. The list of children of a complex type is described by a sequence element.
<xsd:element name="STAFFLIST">
  <xsd:complexType>
    <xsd:sequence>
      <!-- children defined here -->
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
Cardinality
Cardinality of an element can be represented using attributes minOccurs and maxOccurs. To represent an optional element, set minOccurs to 0; to indicate there is no maximum number of occurrences, set maxOccurs to unbounded.
<xsd:element name="DOB" type="xsd:date" minOccurs="0"/>
<xsd:element name="NOK" type="xsd:string" minOccurs="0" maxOccurs="3"/>
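The Python standard library has no XML Schema validator, so the sketch below only illustrates how minOccurs and maxOccurs constrain the number of occurrences. It reads the two declarations above from a schema fragment and counts matching children; it does not check element order or data types, and the function name occurrence_errors is an assumption.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

schema = ET.fromstring("""<xsd:sequence xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="DOB" type="xsd:date" minOccurs="0"/>
  <xsd:element name="NOK" type="xsd:string" minOccurs="0" maxOccurs="3"/>
</xsd:sequence>""")

def occurrence_errors(parent):
    """Return (name, count) for children that violate minOccurs/maxOccurs."""
    errors = []
    for decl in schema.findall(XSD + "element"):
        name = decl.get("name")
        low = int(decl.get("minOccurs", "1"))      # both attributes default to 1
        high = decl.get("maxOccurs", "1")
        count = len(parent.findall(name))
        if count < low or (high != "unbounded" and count > int(high)):
            errors.append((name, count))
    return errors

ok = ET.fromstring("<STAFF><NOK>A</NOK><NOK>B</NOK></STAFF>")
too_many = ET.fromstring("<STAFF><NOK>A</NOK><NOK>B</NOK><NOK>C</NOK><NOK>D</NOK></STAFF>")
```

Since DOB has maxOccurs left at its default of 1, a STAFF element with two DOB children would also be reported.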
New data types can also be defined to create elements and attributes.
<xsd:simpleType name="STAFFNOTYPE">
  <xsd:restriction base="xsd:string">
    <xsd:maxLength value="5"/>
  </xsd:restriction>
</xsd:simpleType>
The new type has been defined as a restriction of string (to have a maximum length of 5 characters).
References
If there are many references to STAFFNO, use of references will place the definition in one place and improve the maintainability of the schema.
Groups
Both groups of elements and groups of attributes can be defined. A group is not a data type but acts as a container holding a set of elements or attributes.
<xsd:group name="StaffType">
  <xsd:sequence>
    <xsd:element name="StaffNo" type="StaffNoType"/>
    <xsd:element name="Position" type="PositionType"/>
    <xsd:element name="DOB" type="xsd:date"/>
    <xsd:element name="Salary" type="xsd:decimal"/>
  </xsd:sequence>
</xsd:group>
Constraints
XML Schema provides XPath-based features for specifying uniqueness constraints and corresponding reference constraints that will hold within a certain scope.
<xsd:unique name="NAMEDOBUNIQUE">
  <xsd:selector xpath="STAFF"/>
  <xsd:field xpath="NAME/LNAME"/>
  <xsd:field xpath="DOB"/>
</xsd:unique>
Key Constraints
Similar to uniqueness constraint except the value has to be non-null. Also allows the key to be referenced.
<xsd:key name="STAFFNOISKEY">
  <xsd:selector xpath="STAFF"/>
  <xsd:field xpath="STAFFNO"/>
</xsd:key>
Even XML Schema does not provide the support for semantic interoperability required. For example, when two applications exchange information using XML, both agree on the use and intended meaning of the document structure. Must first build a model of the domain of interest, to clarify what kind of data is to be sent from the first application to the second. However, as XML Schema just describes a grammar, there are many different ways to encode a specific domain model into an XML Schema, thereby losing the direct connection from the domain model to the Schema.
The problem is compounded if a third application wishes to exchange information with the other two. It is not sufficient to map one XML Schema to another, since the task is not to map one grammar to another grammar, but to map objects and relations from one domain of interest to another. Three steps are required:
reengineer the original domain models from the XML Schema; define mappings between the objects in the domain models; define translation mechanisms for the XML documents, for example using XSLT.
RDF is infrastructure that enables encoding, exchange, and reuse of structured meta-data. This infrastructure enables meta-data interoperability through design of mechanisms that support common conventions of semantics, syntax, and structure. RDF does not stipulate semantics for each domain of interest, but instead provides ability for these domains to dene meta-data elements as required. RDF uses XML as a common syntax for exchange and processing of meta-data.
The basic RDF data model consists of three objects:
Resource: anything that can have a URI; e.g., a Web page, a number of Web pages, or a part of a Web page, such as an XML element. Property: a specific attribute used to describe a resource; e.g., the attribute Author may be used to describe who produced a particular XML document. Statement: consists of the combination of a resource, a property, and a value.
Components known as subject, predicate, and object of an RDF statement. Example statement:
Author of http://www.dh.co.uk/staff_list.xml is John White
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://www.dh.co.uk/schema/">
  <rdf:Description about="http://www.dh.co.uk/staff_list.xml">
    <s:Author>John White</s:Author>
  </rdf:Description>
</rdf:RDF>
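The example statement can be decomposed into its subject, predicate, and object with a few lines of Python. This is a sketch that handles only literal-valued properties inside rdf:Description elements, with URIs written as plain http:// URIs; it is not a general RDF parser, and the function name triples is an assumption.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

doc = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://www.dh.co.uk/schema/">
  <rdf:Description about="http://www.dh.co.uk/staff_list.xml">
    <s:Author>John White</s:Author>
  </rdf:Description>
</rdf:RDF>"""

def triples(text):
    """Yield (subject, predicate, object) for each literal-valued property."""
    for desc in ET.fromstring(text).iter(RDF + "Description"):
        # Accept the unqualified 'about' used in the example and rdf:about.
        subject = desc.get("about") or desc.get(RDF + "about")
        for prop in desc:
            namespace, local = prop.tag[1:].split("}")
            yield subject, namespace + local, prop.text

statements = list(triples(doc))
```

The predicate is identified by the full URI formed from the namespace and the local name, so Author here cannot be confused with an Author property from another vocabulary.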
RDF Schema
RDF Schema specifies information about classes in a schema, including properties (attributes) and relationships between resources (classes). The RDF Schema mechanism provides a basic type system for use in RDF models, analogous to XML Schema. It defines resources and properties such as rdfs:Class and rdfs:subClassOf that are used in specifying application-specific schemas. It also provides a facility for specifying a small number of constraints such as cardinality.
Data extraction, transformation, and integration are well-understood database issues that rely on a query language. SQL and OQL do not apply directly to XML because of the irregularity of XML data. However, XML data is similar to semistructured data. There are many semistructured query languages that can query XML documents, including XML-QL, UnQL, and XQL. All have the notion of a path expression for navigating the nested structure of XML.
Example XML-QL
<LNAME> $L
W3C formed an XML Query Working Group in 1999 to produce a data model for XML documents, a set of query operators on this model, and a query language based on these query operators. Queries operate on single documents or fixed collections of documents, and can select entire documents or subtrees of documents that match conditions based on document content/structure. Queries can also construct new documents based on what has been selected.
Ultimately, collections of XML documents will be accessed like databases. The Working Group has produced the following documents:
XML Query (XQuery) Requirements; XQuery 1.0 and XPath 2.0 Data Model; XQuery 1.0 and XPath 2.0 Formal Semantics; XQuery 1.0: An XML Query Language; XQuery 1.0 and XPath 2.0 Functions and Operators; XSLT 2.0 and XQuery 1.0 Serialization.
XQuery
The XML Query Requirements document specifies goals, usage scenarios, and requirements for the XQuery Data Model and query language. For example:
the language must be declarative and must be defined independently of any protocols with which it is used; queries should be possible whether or not a schema exists; the language must support both universal and existential quantifiers on collections and it must support aggregation, sorting, nulls, and be able to traverse inter- and intra-document references.
XQuery derived from XML query language called Quilt, which has borrowed features from XPath, XML-QL, SQL, OQL, Lorel, XQL, and YATL. Like OQL, XQuery is a functional language in which a query is represented as an expression. XQuery supports several kinds of expression, which can be nested (supporting notion of a subquery).
XQuery path expressions use the syntax of XPath. In XQuery, the result of a path expression is an ordered list of nodes, including their descendant nodes, ordered according to their position in the original hierarchy, in top-down, left-to-right order. The result of a path expression may contain duplicate values. Each step in a path expression represents movement through the document in a particular direction, and each step can eliminate nodes by applying one or more predicates.
The result of each step is a list of nodes that serves as the starting point for the next step. A path expression can begin with an expression that identifies a specific node, such as the function doc(string), which returns the root node of the named document. A query can also contain a path expression beginning with / or //, which represents an implicit root node determined by the environment in which the query is executed.
The path expression doc("staff_list.xml")/STAFFLIST/STAFF[1]//STAFFNO consists of four steps:
the first opens staff_list.xml and returns its document node; the second uses /STAFFLIST to select the STAFFLIST element at the top; the third locates the first STAFF element that is a child of the root element; the fourth finds STAFFNO elements occurring anywhere within this STAFF element.
The path expression doc("staff_list.xml")/STAFFLIST/STAFF[@branchNo = "B005"]//LNAME consists of five steps:
the first two as before; the third uses /STAFF to select STAFF elements within the STAFFLIST element; the fourth consists of a predicate that restricts the STAFF elements to those with branchNo attribute = B005; the fifth selects LNAME element(s) occurring anywhere within these elements.
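The step-by-step evaluation described for the four-step and five-step path expressions can be imitated in Python, with each step taking the node list produced by the previous one. The document content (including the FNAME element) is hypothetical, and ElementTree stands in for an XQuery processor here.

```python
import xml.etree.ElementTree as ET

DOC = """<STAFFLIST>
  <STAFF branchNo="B005"><STAFFNO>SL21</STAFFNO>
    <NAME><FNAME>John</FNAME><LNAME>White</LNAME></NAME></STAFF>
  <STAFF branchNo="B003"><STAFFNO>SG37</STAFFNO>
    <NAME><FNAME>Ann</FNAME><LNAME>Beech</LNAME></NAME></STAFF>
</STAFFLIST>"""

root = ET.fromstring(DOC)              # steps 1 and 2: document, then STAFFLIST
staff = root.findall("STAFF")          # step 3: STAFF children of STAFFLIST

# Four-step expression: first STAFF element, then STAFFNO anywhere within it.
first_staffno = [n.text for s in staff[:1] for n in s.iter("STAFFNO")]

# Five-step expression: the predicate filters the node list, then LNAME anywhere within.
b005 = [s for s in staff if s.get("branchNo") == "B005"]
lnames = [n.text for s in b005 for n in s.iter("LNAME")]

# The same selection written as a single path in ElementTree's XPath subset.
same = [n.text for n in root.findall("./STAFF[@branchNo='B005']//LNAME")]
```

Writing the steps out separately makes it visible that a predicate only removes nodes from the current list; it never adds any.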
A FLWOR (pronounced "flower") expression is constructed from FOR, LET, WHERE, ORDER BY, and RETURN clauses. A FLWOR expression starts with one or more FOR or LET clauses in any order, followed by an optional WHERE clause, an optional ORDER BY clause, and a required RETURN clause. FOR and LET clauses serve to bind values to one or more variables using expressions (e.g., path expressions). FOR is used for iteration, associating each specified variable with an expression that returns a list of nodes. A FOR clause can be thought of as iterating over the nodes returned by its respective expression.
A LET clause also binds one or more variables to one or more expressions but without iteration, resulting in a single binding for each variable. The optional WHERE clause specifies one or more conditions to restrict the tuples generated by FOR and LET. The RETURN clause is evaluated once for each tuple in the tuple stream and the results are concatenated to form the result. The ORDER BY clause, if specified, determines the order of the tuple stream which, in turn, determines the order in which the RETURN clause is evaluated using the variable bindings in the respective tuples.
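The tuple-stream reading of a FLWOR expression can be sketched in Python, one statement per clause. The staff data and the query in the comment are hypothetical, and the code mimics the evaluation order only; it is not an XQuery engine.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<STAFFLIST>
  <STAFF><STAFFNO>SL21</STAFFNO><SALARY>30000</SALARY></STAFF>
  <STAFF><STAFFNO>SG37</STAFFNO><SALARY>12000</SALARY></STAFF>
  <STAFF><STAFFNO>SG14</STAFFNO><SALARY>18000</SALARY></STAFF>
</STAFFLIST>""")

# Rough XQuery counterpart (hypothetical):
#   for $S in doc("staff_list.xml")//STAFF
#   let $sal := xs:decimal($S/SALARY)
#   where $sal > 15000
#   order by $sal descending
#   return $S/STAFFNO

# FOR iterates over the nodes; LET adds one binding per iteration.
tuples = [(s, float(s.findtext("SALARY"))) for s in root.iter("STAFF")]
# WHERE discards tuples that fail the condition.
tuples = [(s, sal) for (s, sal) in tuples if sal > 15000]
# ORDER BY reorders the tuple stream.
tuples.sort(key=lambda t: t[1], reverse=True)
# RETURN is evaluated once per remaining tuple; the results are concatenated.
result = [s.findtext("STAFFNO") for (s, sal) in tuples]
```

Each element of tuples pairs the FOR variable with the LET variable, which is the role the tuple stream plays in the description above.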
Note, predicate seems to compare an element (SALARY) with a value (15000). In fact, = operator extracts typed value of element resulting in a decimal value in this case, which is then compared with 15000.
= operator is a general comparison operator. XQuery also defines value comparison operators (eq, ne, lt, le, gt, ge), which are used to compare two atomic values. If either operand is a node, atomization is used to convert it to an atomic value. If we try to compare an atomic value to an expression that returns multiple nodes, then a general comparison operator returns true if any value satisfies predicate; however, value comparison operator would raise an error.
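The difference between the two kinds of comparison can be modelled with Python lists standing in for sequences of already atomized values. The function names are assumptions, type promotion and the casting of untyped values are ignored, and an empty operand is modelled as returning None (the empty sequence).

```python
def general_eq(left, right):
    """General comparison (=): true if ANY pair of items is equal."""
    return any(l == r for l in left for r in right)

def value_eq(left, right):
    """Value comparison (eq): each operand must be a single atomic value."""
    if not left or not right:
        return None   # an empty operand yields the empty sequence
    if len(left) > 1 or len(right) > 1:
        raise TypeError("value comparison requires single atomic values")
    return left[0] == right[0]

salaries = [12000, 30000]   # e.g. the atomized values of several SALARY nodes
```

Here general_eq(salaries, [30000]) is true because one of the salaries matches, whereas value_eq(salaries, [30000]) raises an error because the left operand contains more than one value.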
List branches with at least one member of staff with salary > 15,000.
<BRANCHESWITHLARGESALARIES>
{
  for $B in distinct-values(doc("staff_list.xml")//@branchNo)
  let $S := doc("staff_list.xml")//STAFF[@branchNo = $B]
  where some $sal in $S/SALARY satisfies ($sal > 15000)
  return <BRANCHNO>{ $B }</BRANCHNO>
}
</BRANCHESWITHLARGESALARIES>
The XML Information Set (Infoset) is an abstract description of the information available in a well-formed XML document that meets certain XML namespace constraints. The XML Infoset is an attempt to define a set of terms that other XML specifications can use to refer to the information items in a well-formed (although not necessarily valid) XML document. It does not attempt to define a complete set of information, nor does it represent the minimal information that an XML processor should return to an application. It also does not mandate a specific interface or class of interfaces (although the Infoset presents information as a tree).
An XML document's information set consists of two or more information items. An information item is an abstract representation of a component of an XML document such as an element, attribute, or processing instruction. Each information item has a set of associated properties; e.g., the document information item properties include: [document element]; [children]; [notations]; [unparsed entities]; [base URI]; [character encoding scheme]; [version]; and [standalone].
XML Infoset contains no type information. To overcome this, XML Schema specifies an extended form of XML Infoset called Post-Schema Validation Infoset (PSVI). In PSVI, information items representing elements and attributes have type annotations and normalized values that are returned by an XML Schema processor. PSVI contains all information about an XML document that a query processor requires.
The XQuery 1.0 and XPath 2.0 Data Model defines the information contained in the input to an XSLT or XQuery Processor. It also defines all permissible values of expressions in XSLT, XQuery, and XPath. The Data Model is based on the XML Infoset, with the following new features:
support for XML Schema types; representation of collections of documents and of simple and complex values.
It was decided to make XPath a subset of XQuery. The XPath spec shows how to represent the information in the XML Infoset as a tree structure containing seven kinds of nodes (document, element, attribute, text, comment, namespace, or processing instruction), with the XPath operators defined in terms of these seven nodes. To retain these operators while using the richer type system provided by XML Schema, XQuery extended the XPath data model with the additional information contained in the PSVI.
The Data Model is a node-labeled, tree-constructor representation, with the notion of node identity to simplify the representation of reference values (such as IDREF, XPointer, and URI values). An instance of the data model represents one or more complete documents or document parts, each represented by its own tree of nodes. Every value is an ordered sequence of zero or more items, where an item can be an atomic value or a node. An atomic value has a type, either one of the atomic types defined in XML Schema or a restriction of one of these types. When a node is added to a sequence its identity remains the same. Thus, a node may occur in more than one sequence and a sequence may contain duplicate items.
Root node representing XML document is a document node and each element in document is represented by an element node. Attributes represented by attribute nodes and content by text nodes and nested element nodes. Primitive data in document is represented by text nodes, forming the leaves of the node tree. Element node may be connected to attribute nodes and text nodes/nested element nodes. Every node belongs to exactly one tree, and every tree has exactly one root node. Tree whose root node is document node is referred to as a document and a tree whose root node is some other kind of node is referred to as a fragment.
Information about nodes is obtained via accessor functions that can operate on any node. Accessor functions are analogous to an information item's named properties. These functions are illustrative and intended to serve as a concise description of the information that must be exposed by the Data Model. Data Model also specifies a number of constructor functions whose purpose is to illustrate how nodes are constructed.
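A rough feel for the node tree and its accessors can be had with Python's standard `xml.dom.minidom` module. This is a sketch only: minidom is a DOM implementation, not the XQuery Data Model, and the helper names (`node_kind`, `node_name`, `children`, `attributes`, `tree_root`, `is_document`) and the sample document are assumptions chosen to mirror the text.

```python
from xml.dom import minidom, Node

KINDS = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.TEXT_NODE: "text",
}


def node_kind(node):
    """Accessor: the kind of a node, as a string."""
    return KINDS.get(node.nodeType, "other")


def node_name(node):
    """Accessor: the name of an element or attribute node, else None."""
    return node.nodeName if node_kind(node) in ("element", "attribute") else None


def children(node):
    """Accessor: child nodes (text and nested elements) in document order."""
    return list(node.childNodes)


def attributes(node):
    """Accessor: attribute nodes of an element; empty for other kinds."""
    if node_kind(node) != "element":
        return []
    attrs = node.attributes
    return [attrs.item(i) for i in range(attrs.length)]


def tree_root(node):
    """Walk up parent links to the single root of the tree holding node."""
    while node.parentNode is not None:
        node = node.parentNode
    return node


def is_document(node):
    """A tree rooted at a document node is a document; otherwise a fragment."""
    return node_kind(tree_root(node)) == "document"


doc = minidom.parseString('<staff branchNo="B005"><name>John White</name></staff>')
staff = doc.documentElement
name = children(staff)[0]
leaf = children(name)[0]             # text node holding the primitive data
detached = doc.createElement("note")  # element with no parent: a fragment root
```

In this sketch `is_document(name)` holds because walking up from `name` reaches the document node, while `detached` is the root of its own tree and so forms a fragment.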
Goal is to complement the XPath/XQuery spec by defining the meaning of expressions with mathematical rigor. A rigorous formal semantics clarifies the intended meaning of the English specification, ensures that no corner cases are left out, and provides a reference for implementation. Provides implementors with a processing model and a complete description of the language's static and dynamic semantics.
Parsing: ensures input expression is an instance of the language defined by the grammar rules and then builds an internal parse tree.
Normalization: converts expression into an XQuery Core expression.
Static type analysis (optional): checks whether each (core) expression is type-safe and, if so, determines its static type. If expression is not type-safe, a type error is raised; otherwise, parse tree is built with each subexpression annotated with its static type.
Dynamic evaluation: computes value of the expression from parse tree. May result in a dynamic error, either a type error (if static type analysis has not been done) or a non-type error.
Takes full XQuery expression and transforms it into an equivalent expression in the core XQuery. Written as follows:
[Expr]_Expr == CoreExpr
States that Expr is normalized to CoreExpr (the _Expr subscript indicates an expression; other subscripts are possible, e.g. _Axis).
FLWOR expression covered by two sets of rules; first splits expression at clause level then applies further normalization to each clause:
[(ForClause | LetClause | WhereClause | OrderByClause) FLWORExpr]_Expr
  == [(ForClause | LetClause | WhereClause | OrderByClause)]_FLWOR([FLWORExpr]_Expr)
[(ForClause | LetClause | WhereClause | OrderByClause) RETURN Expr]_Expr
  == [(ForClause | LetClause | WhereClause | OrderByClause)]_FLWOR([Expr]_Expr)
Second set applies to FOR and LET clauses and transforms each into a series of nested clauses, each of which binds one variable. For example, for the FOR clause we have:
[FOR varRef1 TypeDec1? PositionalVar1? IN Expr1, ..., varRefn TypeDecn? PositionalVarn? IN Exprn]_FLWOR(Expr)
  == FOR varRef1 TypeDec1? PositionalVar1? IN [Expr1]_Expr RETURN
       ...
         FOR varRefn TypeDecn? PositionalVarn? IN [Exprn]_Expr RETURN Expr
WHERE clause normalized to IF expression that returns an empty sequence if condition is false and normalizes result:
[WHERE Expr1]_FLWOR(Expr) == IF ([Expr1]_Expr) THEN Expr ELSE ( )
Normalization - Example
FOR $i IN $I, $j IN $J
LET $k := $i + $j
WHERE $k > 2
RETURN ($i, $j)

is normalized to:

FOR $i IN $I RETURN
  FOR $j IN $J RETURN
    LET $k := $i + $j RETURN
      IF ($k > 2) THEN ($i, $j) ELSE ( )
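The rewriting in this example can be mimicked with a small Python sketch. It is not an XQuery processor: the tuple-based expression encoding, the `normalize` and `evaluate` functions, and the use of Python lambdas for subexpressions are all assumptions made for illustration, and the multi-variable FOR is assumed to have already been split into one entry per variable.

```python
def normalize(clauses, ret):
    """Rewrite a flat list of FLWOR clauses into nested single-binding expressions."""
    if not clauses:
        return ret
    head, rest = clauses[0], clauses[1:]
    body = normalize(rest, ret)
    if head[0] == "for":    # each FOR binds exactly one variable
        _, var, source = head
        return ("for", var, source, body)
    if head[0] == "let":    # each LET binds exactly one variable
        _, var, value = head
        return ("let", var, value, body)
    if head[0] == "where":  # WHERE becomes IF cond THEN body ELSE ( )
        _, cond = head
        return ("if", cond, body, ("empty",))
    raise ValueError(f"unknown clause: {head[0]}")


def evaluate(expr, env):
    """Evaluate a normalized expression to a flat list (a sequence of items)."""
    tag = expr[0]
    if tag == "for":
        _, var, source, body = expr
        out = []
        for item in source(env):
            out.extend(evaluate(body, {**env, var: item}))
        return out
    if tag == "let":
        _, var, value, body = expr
        return evaluate(body, {**env, var: value(env)})
    if tag == "if":
        _, cond, then, other = expr
        return evaluate(then if cond(env) else other, env)
    if tag == "empty":
        return []
    if tag == "return":
        return list(expr[1](env))
    raise ValueError(f"unknown expression: {tag}")


# FOR $i IN $I, $j IN $J LET $k := $i + $j WHERE $k > 2 RETURN ($i, $j)
clauses = [
    ("for", "i", lambda e: e["I"]),
    ("for", "j", lambda e: e["J"]),
    ("let", "k", lambda e: e["i"] + e["j"]),
    ("where", lambda e: e["k"] > 2),
]
ret = ("return", lambda e: (e["i"], e["j"]))

core = normalize(clauses, ret)
result = evaluate(core, {"I": [1, 2], "J": [1, 2]})
```

With `$I` and `$J` both bound to (1, 2), only the pairs whose sum exceeds 2 survive, so `result` is the flat sequence 1, 2, 2, 1, 2, 2.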
XQuery is strongly typed so types of values and expressions must be compatible with context in which they are used. After normalization static type analysis may optionally be performed. Static type of an expression is defined as most specific type that can be deduced for that expression by examining the query only, independent of the input data. Useful for detecting certain types of error early in development. Also useful for optimizing query execution; e.g. may be able to conclude that result of query is an empty sequence.
Based on set of inference rules used to infer static type of each expression, based on static types of its operands. Bottom-up process, starting at leaves of expression tree containing simple constants and input data whose type can be inferred from schema of input document. Inference rules used to infer static types of more complex expressions at next level of tree until entire tree processed. Type error raised if static type of some expression is inappropriate.
Static typing takes a static environment and an expression and infers a type. Written as:
statEnv |- Expr : Type
States that in environment statEnv, expression Expr has type Type. This is called a typing judgment (a judgment expresses whether a property holds or not). Inference rule written as a collection of premises and a conclusion; for example:
statEnv |- Expr1 : xsd:boolean    statEnv |- Expr2 : Type2    statEnv |- Expr3 : Type3
--------------------------------------------------------------------------------------
statEnv |- IF Expr1 THEN Expr2 ELSE Expr3 : (Type2 | Type3)
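The bottom-up flavour of static typing, and the IF rule in particular, can be sketched in Python. This is a toy type checker under stated assumptions: expressions are encoded as tuples, a type is modelled as a frozenset of atomic type names so that a union such as (Type2 | Type3) is a set union, and `static_type` is a hypothetical function name.

```python
BOOLEAN = frozenset({"xsd:boolean"})
INTEGER = frozenset({"xsd:integer"})
STRING = frozenset({"xsd:string"})


def static_type(expr, stat_env):
    """Infer the static type of expr from its operands, without evaluating it."""
    tag = expr[0]
    if tag == "const":  # leaf: type follows from the literal itself
        value = expr[1]
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            return BOOLEAN
        if isinstance(value, int):
            return INTEGER
        if isinstance(value, str):
            return STRING
        raise TypeError(f"unsupported literal: {value!r}")
    if tag == "var":    # leaf: type comes from the static environment
        return stat_env[expr[1]]
    if tag == "if":     # premises: condition is boolean, branches have some types
        _, cond, then, other = expr
        if static_type(cond, stat_env) != BOOLEAN:
            raise TypeError("IF condition must have static type xsd:boolean")
        # conclusion: the IF expression has type (Type2 | Type3)
        return static_type(then, stat_env) | static_type(other, stat_env)
    raise TypeError(f"unknown expression: {tag}")


stat_env = {"flag": BOOLEAN}
inferred = static_type(("if", ("var", "flag"), ("const", 1), ("const", "a")), stat_env)
```

Here `inferred` is the union of xsd:integer and xsd:string, obtained by looking only at the expression and the static environment, never at input data.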
Dynamic Evaluation
All implementations of XQuery must support dynamic typing, which checks during dynamic evaluation that type of a value is compatible with context in which it is used. Type error raised if an incompatibility is detected. Based on judgments, called evaluation judgments:
dynEnv |- Expr => Value
States that in dynamic environment dynEnv, the evaluation of expression Expr yields value Value.
Inference rule is written as collection of hypotheses (judgments) and a conclusion, written respectively above and below a dividing line. Consider logical expressions:

dynEnv |- Expr_i => false    1 <= i <= 2
----------------------------------------
dynEnv |- Expr1 AND Expr2 => false

dynEnv |- Expr_i => RAISES Error    1 <= i <= 2
-----------------------------------------------
dynEnv |- Expr1 AND Expr2 => RAISES Error
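Consider an AND expression whose left-hand operand raises an error (divide by zero) and whose right-hand operand evaluates to false. Both rules apply, so the outcome depends on evaluation order. If the left-hand expression is evaluated first it will raise an error and the overall expression will raise an error (no need to evaluate the right-hand expression). Conversely, if the right-hand expression is evaluated first, the overall expression will evaluate to false (no need to evaluate the left-hand expression).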
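The order dependence described here can be reproduced with a short Python sketch. The names `DynamicError`, `eval_and`, `divides_by_zero`, and `is_false` are hypothetical, and the two operands are stand-ins chosen to match the situation in the text (one operand divides by zero, the other is false).

```python
class DynamicError(Exception):
    """Stands in for a dynamic (non-type) error raised during evaluation."""


def eval_and(left, right, left_first=True):
    """Evaluate left AND right, choosing which operand to evaluate first.

    Operands are zero-argument callables so that evaluation can be deferred.
    If the first operand yields false, the other is never evaluated.
    """
    first, second = (left, right) if left_first else (right, left)
    if not first():  # may raise DynamicError
        return False
    return bool(second())


def divides_by_zero():
    """Operand whose evaluation raises a dynamic error."""
    try:
        return (1 // 0) == 1
    except ZeroDivisionError as exc:
        raise DynamicError("division by zero") from exc


def is_false():
    """Operand whose evaluation yields false."""
    return 2 == 1


# Right-hand operand first: the result is false and no error is seen.
right_first_result = eval_and(divides_by_zero, is_false, left_first=False)

# Left-hand operand first: the error surfaces before the right side is looked at.
try:
    eval_and(divides_by_zero, is_false, left_first=True)
    left_first_outcome = "value"
except DynamicError:
    left_first_outcome = "error"
```

Both outcomes are permitted by the two inference rules, which is why an implementation is free to pick either evaluation order.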
Need to handle XML that:
may be strongly typed, governed by XML Schema;
may be strongly typed, governed by another schema language, such as a DTD or RELAX NG;
may be governed by multiple schemas, or one schema may be subject to frequent change;
may be schema-less;
may contain marked-up text with logical units of text (such as sentences) that span multiple elements;
has structure, ordering, and whitespace that may be significant;
may be subject to update as well as queries based on context and relevancy.
In past the XML would have been stored in an attribute whose data type was CLOB. More recently, some systems have a new native XML data type (e.g. XML or XMLType). Raw XML stored in serialized form, which makes it efficient to insert documents into database and retrieve them in their original form. Relatively easy to apply full-text indexing to documents for contextual and relevance retrieval. However, question about performance of general queries and indexing, which may require parsing on-the-fly. Also, updates usually require entire XML document to be replaced with a new document.
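The trade-off of storing raw XML in a single attribute can be sketched with Python's built-in `sqlite3` and `xml.etree.ElementTree` modules. This is an illustration under assumptions: an SQLite TEXT column stands in for a CLOB, and the table name `staff_docs`, the sample document, and the helper `names_with_salary_over` are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# A TEXT column stands in for a CLOB attribute holding the serialized document.
conn.execute("CREATE TABLE staff_docs (doc_id INTEGER PRIMARY KEY, body TEXT)")

doc = "<staff><name>John White</name><salary>30000</salary></staff>"
conn.execute("INSERT INTO staff_docs (body) VALUES (?)", (doc,))

# Retrieval in original form is a plain column read.
stored = conn.execute(
    "SELECT body FROM staff_docs WHERE doc_id = 1"
).fetchone()[0]


def names_with_salary_over(conn, threshold):
    """A general query has to parse every stored document on the fly."""
    hits = []
    for (body,) in conn.execute("SELECT body FROM staff_docs ORDER BY doc_id"):
        root = ET.fromstring(body)
        if int(root.findtext("salary")) > threshold:
            hits.append(root.findtext("name"))
    return hits


# An update replaces the whole serialized document, not one element inside it.
updated = doc.replace("30000", "32000")
conn.execute("UPDATE staff_docs SET body = ? WHERE doc_id = 1", (updated,))
```

Inserting and reading back are cheap, but the salary query cannot use the database's own indexes and must parse each document, which is the performance concern raised above.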
Schema-Independent Representation
XML decomposed (shredded) into its constituent elements and data distributed over number of attributes in one or more relations. Storing shredded documents may make it easier to index values of some elements, provided these elements are placed into their own attributes. Also possible to add some additional data relating to hierarchical nature of the XML, making it possible to recompose original structure and ordering, and to allow the XML to be updated. With this approach also have to create an appropriate database structure.
Could use DOM to represent structure of XML data. Since XML is a tree structure, each node may have only one parent. The rootID attribute allows a query on a particular node to be linked back to its document node. While this is schema independent, recursive nature of structure can cause performance problems when searching for specific paths. To overcome this, create denormalized index containing combinations of path expressions and a link to node and parent node.
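A minimal sketch of such a schema-independent layout, using Python's `sqlite3` and `xml.etree.ElementTree`, might look as follows. Only `rootID` is named in the text; the other column names, the `path_index` table, the `shred` function, and the sample document are assumptions. As a simplification, element text is stored on the element's own row rather than as a separate text node, and attributes are ignored.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE node (
           nodeID   INTEGER PRIMARY KEY,
           nodeType TEXT,
           nodeName TEXT,
           nodeData TEXT,
           parentID INTEGER,  -- each node has at most one parent
           rootID   INTEGER   -- links any node back to its document node
       )"""
)
# Denormalized index: full path, plus links to the node and its parent.
conn.execute("CREATE TABLE path_index (path TEXT, nodeID INTEGER, parentID INTEGER)")


def shred(conn, xml_text):
    """Store one document as node rows and path_index rows; return its rootID."""
    cur = conn.execute("INSERT INTO node (nodeType) VALUES ('document')")
    root_id = cur.lastrowid
    conn.execute("UPDATE node SET rootID = ? WHERE nodeID = ?", (root_id, root_id))

    def add(elem, parent_id, parent_path):
        path = f"{parent_path}/{elem.tag}"
        text = (elem.text or "").strip() or None
        cur = conn.execute(
            "INSERT INTO node (nodeType, nodeName, nodeData, parentID, rootID) "
            "VALUES ('element', ?, ?, ?, ?)",
            (elem.tag, text, parent_id, root_id),
        )
        node_id = cur.lastrowid
        conn.execute(
            "INSERT INTO path_index VALUES (?, ?, ?)", (path, node_id, parent_id)
        )
        for child in elem:
            add(child, node_id, path)

    add(ET.fromstring(xml_text), root_id, "")
    return root_id


doc_id = shred(
    conn,
    "<stafflist><staff><name>John White</name></staff>"
    "<staff><name>Ann Beech</name></staff></stafflist>",
)

# A path lookup uses the index instead of walking parent links recursively.
rows = conn.execute(
    "SELECT n.nodeData, n.rootID FROM path_index p "
    "JOIN node n ON n.nodeID = p.nodeID "
    "WHERE p.path = '/stafflist/staff/name' ORDER BY n.nodeID"
).fetchall()
names = [data for data, _ in rows]
```

The same two tables can hold any document regardless of its schema, and the `rootID` on each matching row identifies the document the node came from.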
SQL/XML standard does not define any rules for the inverse process; i.e., shredding XML data into an SQL form, with some minor exceptions.
SQL/XML Operators
XMLELEMENT, to generate an XML value with a single element as a child of its root item. Element can have attributes specified via XMLATTRIBUTES subclause. XMLFOREST, to generate an XML value with a list of elements as children of a root item. XMLCONCAT, to concatenate a list of XML values. XMLPARSE, to perform a non-validating parse of a character string to produce an XML value. XMLROOT, to create an XML value by modifying the properties of the root item of another XML value. XMLCOMMENT, to generate an XML comment. XMLPI, to generate an XML processing instruction.
SQL/XML Functions
XMLSERIALIZE, to generate a character or binary string from an XML value; XMLAGG, an aggregate function, to generate a forest of elements from a collection of elements.
SQL/XML also defines mapping from tables to XML documents. Mapping may take as its source an individual table, all tables in a schema, or all tables in a catalog. Standard does not specify syntax for the mapping; instead it is provided for use by applications and as a reference for other standards. Mapping produces two XML documents: one that contains mapped table data and other that contains an XML Schema describing the first.
Number of issues had to be addressed to map SQL identifiers to XML Names: range of characters that can be used within an SQL identifier larger than range for an XML Name; SQL delimited identifiers (identifiers within double-quotes), permit arbitrary characters to be used at any point in identifier; XML Names that begin with XML are reserved; XML namespaces use : to separate namespace prefix from local component. Resolved using escape notation that changes unacceptable characters in XML Names into sequence of allowable characters based on Unicode values (_xHHHH_).
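The escape notation can be approximated in a few lines of Python. This is a simplified sketch, not the standard's exact algorithm: the function name `sql_identifier_to_xml_name` is hypothetical, the test for characters allowed in an XML Name is approximated with Python's `str.isalpha` and `str.isalnum`, and an underscore followed by `x` is also escaped on the assumption that the result must stay unambiguous.

```python
def _escape(ch):
    """Render one character as _xHHHH_ using its Unicode code point."""
    return f"_x{ord(ch):04X}_"


def sql_identifier_to_xml_name(identifier):
    """Map an SQL identifier to an XML Name, escaping unacceptable characters."""
    out = []
    for i, ch in enumerate(identifier):
        if ch == ":":
            # ':' separates namespace prefix from local part in XML
            out.append(_escape(ch))
        elif ch == "_" and identifier[i + 1:i + 2] == "x":
            # keep a literal "_x" from being mistaken for an escape sequence
            out.append(_escape(ch))
        elif i == 0 and identifier[:3].lower() == "xml":
            # names beginning with "xml" are reserved
            out.append(_escape(ch))
        elif i == 0 and not (ch.isalpha() or ch == "_"):
            # approximation of the XML name-start character rule
            out.append(_escape(ch))
        elif i > 0 and not (ch.isalnum() or ch in "_-."):
            # approximation of the XML name character rule
            out.append(_escape(ch))
        else:
            out.append(ch)
    return "".join(out)


plain = sql_identifier_to_xml_name("salary")
spaced = sql_identifier_to_xml_name("staff no")  # delimited identifier with a space
reserved = sql_identifier_to_xml_name("xmlData")
colon = sql_identifier_to_xml_name("a:b")
```

An ordinary identifier such as `salary` passes through unchanged, while the space, the reserved `xml` prefix, and the colon are each replaced by an `_xHHHH_` sequence.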
SQL/XML maps each SQL data type to closest match in XML Schema, in some cases using facets to restrict acceptable XML values to achieve closest match. For example:
SMALLINT mapped to a restriction of xsd:integer with minInclusive and maxInclusive facets set. CHAR mapped to restriction of xsd:string with facet length set. DECIMAL mapped to xsd:decimal with precision and scale set.
Create root element named after table with <row> element for each row. Each row contains a sequence of column elements, each named after corresponding column. Each column element contains a data value. Names of table and column elements are generated using fully escaped mapping from SQL identifiers to XML Names. Must also specify how nulls are to be mapped, using absent (column with null would be omitted) or nil.
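The table mapping, including the two ways of handling nulls, can be sketched with Python's `xml.etree.ElementTree`. The function `table_to_xml`, the sample table, and its column names are hypothetical; names are assumed to have already been escaped into valid XML Names, and the nil option is rendered here with the `xsi:nil` attribute from the XML Schema instance namespace.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("xsi", XSI)


def table_to_xml(table_name, columns, rows, nulls="absent"):
    """Build a root element named after the table, with one <row> per table row."""
    if nulls not in ("absent", "nil"):
        raise ValueError("nulls must be 'absent' or 'nil'")
    root = ET.Element(table_name)
    for values in rows:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            if value is None:
                if nulls == "absent":
                    continue  # column element is omitted entirely
                # column element is present but flagged as nil
                ET.SubElement(row, column, {f"{{{XSI}}}nil": "true"})
            else:
                ET.SubElement(row, column).text = str(value)
    return root


columns = ["STAFFNO", "FNAME"]
rows = [("SL21", "John"), ("SG37", None)]

with_absent = table_to_xml("STAFF", columns, rows, nulls="absent")
with_nil = table_to_xml("STAFF", columns, rows, nulls="nil")
serialized = ET.tostring(with_nil, encoding="unicode")
```

With `absent`, the second row has no FNAME element at all; with `nil`, the element is present but empty and carries the nil flag, which lets a consumer distinguish a null from a missing column.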
Two types:
text-based, which stores XML as text, e.g. as a file in a file system or as a CLOB in an RDBMS; model-based, which stores XML in some internal tree representation, e.g., an Infoset, PSVI, or DOM representation, possibly with tags tokenized.