On Supporting Structure-Agnostic Queries for XML

Won Kim, Cyber Database Solutions, Austin, Texas, USA
Wol Young Lee and Hwan Seung Yong, Department of Computer Science and Engineering, Ewha Women’s University, Seoul, Korea

COLUMN

PDF Version

Abstract

XML has rapidly emerged as the standard for the interchange of data in numerous application areas. Many software vendors now offer XML servers to store and retrieve XML documents. To allow for content-based queries, query languages have also been designed. The query languages require the users to know the structure of the XML documents and specify search conditions on the structure. This navigational access to XML documents is a natural consequence of the hierarchical structure of XML. However, it is also desirable to allow the users to formulate structure-agnostic queries against XML documents, to complement the current navigation-based queries. We have developed a query processor to evaluate structure-agnostic XML queries. In this article, we motivate the need for structure-agnostic, that is, non-navigational, queries against XML, and offer it as a worthy subject of further research and development.

1 INTRODUCTION

XML [XML00] provides simple yet flexible ways to represent the structure and contents of arbitrary documents. An XML document is constructed based on a few simple rules. The core feature of an XML document is the “element”. An element has a start tag (element name), data value, and an end tag. Further, an element can have attributes, can have children elements, and can be empty. The fact that an element can have children elements is of course what allows it to express hierarchical structures of arbitrary documents. The simplicity and flexibility of XML have made it possible for XML to be adopted as the basis of data interchange standards in a wide variety of application areas, including electronic business (ebXML, X12, Rosetta Net), financial services (FIXML, FPML), molecular biology (bioML, SBML, CellML, RNAML), chemistry (CML), multimedia (SMIL, SVG, MPEG-7), scientific research (RIXML), metadata management (XMI), web services (WSDL, UDDI), data mining (PMML), etc.

Many XML servers are currently on the market. Some store XML documents in XML-optimized file systems, while others use relational databases (by “shredding” XML documents into rows and columns in tables). XML database servers on the market include Oracle XML DB [Oracle]. DB2 XML Extender [IBM], SQL server 2000 [Hhowlett and Jennings], BEA Systems, Actuate/Nimble, Poet, BlueStone, eXcelon, IPEDO, XAware, etc. It is of course necessary for the users to be able to issue “content-based” queries against XML documents, that is, queries in which search conditions are specified as predicates (e.g., director = “Luc Besson”, to retrieve movies directed by “Luc Besson” from a database of “movie” XML documents).

There have been many proposals for a query language for semi-structured data in general, and XML in particular. These include Lorel [Abiteboul, et al], XML-QL [Deutsch et al]. X-Path [Xpath], XSL [XSL], YATL, Quilt, and XQuery [XQuery]. These query languages require the users to know the structure of XML documents, including all the element and attribute names, data types of the data values, and the hierarchical structure of the elements. Since a search condition needs to involve an element/attribute name, which appears somewhere in the hierarchical structure of an XML document, these query languages force the users to specify the navigational access paths to the elements/attributes that are involved in search conditions. Insofar as XML gives rise to hierarchical structure, and navigational access to selected parts of an XML document is a natural access pattern, the navigational content-based query languages clearly make sense. However, we believe that structure-agnostic, that is, non-navigational, content-based query languages for XML, and general semi-structured data, are also desirable. In other words, we believe that both navigational and non-navigational query languages are necessary to support all query needs against XML databases.

2 RATIONALE

A brief review of the history of generational shifts in database systems can provide a helpful insight to understand the necessity of both types of query facility for XML. The transition from hierarchical and network database systems to relational database systems took place in the 1980s. The hierarchical and network models of data, in which a record contains physical pointers to other records, made sense for applications that modeled data that naturally forms hierarchical structure, such as the bill of materials. Applications had to access the records by navigation, going from one record to the next records by following the pointers embedded in the first record. However, the use of physical pointers was problematic, since the pointers had to be changed in all the records if the records to which they pointed changed their physical locations. (Further, if the data structures used to store the database changed, all the application logic that made explicit use of the knowledge of the data structures had to be changed.) The relational model of data explicitly excludes the use of pointers, both logical and physical, and instead has a record in one table to be “linked” dynamically to a record in another table if the two records have a common value in some of the user-specified fields. This dynamic “linking” of records in two or more tables is called join. The relational model of data offers a high degree of “data independence” and makes “ad hoc content-based” queries possible. “ad hoc content-based” queries mean queries in which the users only specify the search conditions on selected columns (and output columns), but not how the data is to be located and accessed (e.g., navigational paths) or how the query is to be processed (e.g., using specific access methods such as indexes or hashing or sorting). The onus of accessing the data and processing the queries falls entirely on the database system. To support ad hoc queries, relational database systems had to develop the query optimizer and query processor modules. The query optimizer computes the estimated costs of all feasible plans of evaluating a given query, examining different orderings of tables when joining them and all the access methods at its disposal. The least-cost plan (which typically consists of a sequence of query-processing steps) selected by the query optimzier is passed to the query processor, which then executes each step of the plan. However, relational database systems make modeling and querying of data that is naturally hierarchical in structure awkward and inefficient. The object-relational model of data, which combines the relational model of data and an object-oriented model of data, reintroduced the pointer concept of hierarchical and network database systems in the form of object identifiers. An object identifier is a logical pointer, not a physical pointer, and makes it possible to naturally model and access hierarchical structures of data. Object-relational database systems represent a combination of the best of both the hierarchical database systems and relational database systems by allowing both “ad hoc” content-based queries and navigational queries.

XML has come about completely outside the database systems domain. The hierarchical structure of elements in XML certainly called for query languages that would support navigational access to different parts of XML documents at different levels of hierarchy. However, just as the users of database systems have discovered over the past 30 years, we believe that users of XML databases will benefit significantly from structure-agnostic, that is, non-navigational, content-based queries. Simply put, a good segment of XML users would want to specify in their queries only the search conditions (and output elements), without having to know and specify the detailed hierarchical structure of the XML documents or have to give hints for efficiently processing the queries.

3 ILLUSTRATIONS

To make the concept concrete, suppose that we have XML documents on movies, represented as in Figure 1a. If we are to search for movie titles whose genre is ‘action’, release year is ‘1994’, and whose stars include ‘Jean Reno’, we would like to be able to state the search conditions simply as

genre = “action” and year = “1994” and actor = “Jean Reno”

regardless of the structure of the XML documents. Then a non-navigational content-based query (in Xquery) that includes the above search conditions would look something like

for $t in doc()//title

“action” and year = “1994” and actor = “Jean Reno”

return $t

The same query, when expressed in navigational style, would look like

for $m in doc ()//movie

where $m/genre[text() = “action”] and $m/year[text() = “1994”] and $m/actor

[text() = “Jean Reno”]

return <title>$m/title/text()</title>

Figure 1a. an XML document about a movie (using elements)

Of course, to be able to formulate non-navigational content-based queries against XML documents, the users would need to know at least the names of the elements/attributes and data types for the data values. Just as was the case with relational database systems, the onus of computing all possible paths along the XML documents’ hierarchical structure would fall on the query optimizer and query processor of the XML server.

There are a few reasons that we believe that it is imperative that XML users be able to formulate non-navigational content-based queries, besides the obvious observation that it can be tedious to have to specify the navigational paths in a query. First, the flexibility of XML allows users to specify the same contents in a number of different ways, and as such it becomes problematic for the users to have to content with all the disparate ways in which contents may be represented in XML. At the risk of belaboring the point, let us consider Figures 1b and 1c. These figures show how the same simple contents, represented in Figure 1a, can be represented in different ways, using element hierarchies, and attributes. (There are various additional ways.) The same structure-agnostic query shown previously would work for each of these alternate representations. The navigational query for Figure 1b would look like

for $m in doc ()//movie

where $m//genre[text() = “action”] and $m//year[text() = “1994”] and

$m//actor [text() = “Jean Reno”]

return <title>$m//title/text()</title>

(* We note that the above query would also work as navigational query for Figure 1a. *)

Figure 1b. The same XML document about a movie (using an element hierarchy)

For Figure 1c, the navigational query would look like

for $m in doc ()//movie
where $m/*[@genre = “action”] and $m/*[@year = “1994”] and

$m/*[actor = “Jean Reno”]

return <title>$m/*/@title</title>

Figure 1c. the same XML document about a movie (using attributes)

Further, the same contents may be organized in a number of different hierarchical structures. Figures 2a and 2b represent the XML document of Figures 1 using hierarchical structures. Again the same structure-agnostic query shown previously would work for each of these alternate representations. However, the navigational query for Figure 2a would look like

for $y in doc ()//year

where $y[text() = “1994”] // genre[text() = “action”] // actor[text()= “Jean Reno”]

return $y//title/text()</title>

Figure 2a. the same XML document about a movie (using an element hierarchy)

For Figure 2b, the navigational query would look like

for $g in doc ()//genre

where $g[@* = “action”] // year[*= “1994”] // actor[text() = “Jean Reno”]

return <title>$g//title/text()</title>

Figure 2b. The same XML document about a movie (using an element hierarchy and attributes)

4 CONCLUDING REMARKS

In this article, we discussed why structure-agnostic or non-navigational content-based query facilities will complement the current navigation-based query languages for XML. To make the discussion concrete, we showed some examples of how many different ways in which the same contents may be represented in XML. It is obvious that once we consider “similar” contents, rather than the exact same contents, the problem becomes compounded. To retrieve “similar” XML documents that satisfy certain search conditions, one needs to consider non-identical XML documents that nonetheless contain some common elements and attributes, in a wide variety of hierarchical structures. The problem becomes even more difficult in a distributed, multiple-database environment, including the Web. Although XML is being touted as “the” standard for data interchange, it will be used as the basis of data interchange standards for a wide variety of different application areas, each with its own vocabulary and data semantics. In other words, there will not be a single universal XML data interchange standard. As such, even if different segments of the XML users adopt particular data interchange standards, the need to integrate XML documents across multiple segments will drive the need for non-navational content-based query facilities.

We have designed and implemented a non-navigational content-based query language and a query processor for the language. We will report details of our implementation and experiments in a forthcoming paper. Meanwhile, we believe that non-navigational query support for XML, in particular, query optimization and query processing, is an important area for further research, and would like to encourage the XML data management research community to follow up on our research.

REFERENCES

[XSL] S. Adler, A. Berglund, J. Caruso, S. Deach, T. Graham, P. Grosso, E. Gutentag, A. Milowski, S. Parnell, J. Richman, and S. Zilles: Extensible Stylesheet Language (XSL) Version 1.0, W3C Proposed Recommendation Aug. 2001

[Abit1997] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. L. Wiener, “The Lorel Query Language for Semistructured Data”, International Journal on Digital Libraries, Apr. 1997

[Oracle] S. Banerjee, Oracle XML DB, Oracle Corporation Technical White Paper. release 9.2, Jan. 2002

[XQuery] S. Boag, D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie, J. Siméon, XQuery 1.0: An XML Query Language, W3C Working Draft 16 Aug. 2002

[Deut1998] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu, XML-QL: A query language for XML, Submitted to the W3C 19 Aug. 1998

[Howl2002] S. Howlett and D. Jennings: “SQL Server 2000 and XML: Developing XML-enabled data solutions for the web”, MSDN magazine, Jan. 2002

[IBM] IBM Corporation, DB2 XML Extender, IBM Corporation, 2000

[XML] W3C Consortium, XML1.0 (Second Edition), W3C Recommendation 06 Oct. 2000, available at http://www.w3.org/TR/REC-xml

[XPath] W3C Consortium, XML Path Language (XPath) Version 1.0, W3C Recommendation 16 Nov. 1999

About the authors

		Won Kim is President and CEO of Cyber Database Solutions (http://www.cyberdb.com) and MaxScan (www.maxscan.com) in Austin, Texas, USA. He is also Dean of Ewha Institute of Science and Technology, Ewha Women's University, Seoul. Korea. He is Editor-in-Chief of ACM Transactions on Internet Technology (http://www.acm.org/toit), and Chair of ACM Special Interest Group on Knowledge Discovery and Data Mining (http://www.acm.org/sigkdd). He is the recipient of the ACM 2001 Distinguished Service Award.
		Wol-Young Lee is a Ph.D. candidate in the Department of Computer Science and Engineering at Ewha Women’s University in Seoul, Korea. Her research interests include programming languages, XML, and database systems). Her email is wylee at ewha.ac.kr.
		Hwan-Seung Yong is associate professor with the Department of Computer Science and Engineering at Ewha Women’s University in Seoul, Korea. His research interests include multimedia database systems, data mining and bioinformatics, Internet computing. He received a Ph.D. degree in computer engineering, Seoul National University. His email is hsyong at ewha.ac.kr.

Cite this column as follows: Won Kim, Wol Young Lee, Hwan Seung Yong: “On Supporting Structure-Agnostic Queries for XML”, in Journal of Object Technology, vol. 3, no. 7, July-August 2004, pp. 27-35. http://www.jot.fm/issues/issue_2004_07/column3

Previous column

Next column