Update: From MARC to XML Database
by Kevin S. Clarke (ksclarke At stanford Dot edu)
[SLIDE 2] Most library system vendors now offer Web-based versions of their public access catalogs, but do patrons, who prefer to search integrated online resources, choose the OPAC or the Internet? To investigate better integrating library and non-library generated data, Lane Medical Library converted our 200,000+ bibliographic and authority records into XML, the eXtensible Markup Language. [For those of you who don't know] XML is a content markup meta-language designed to store and display documents on the World Wide Web. By separating content from presentation, XML enables librarians to create library information that can be more easily integrated with other Web resources.
Last year, Lane released XMLMARC, software to convert MARC records into XML. This year, we investigated several storage and retrieval options for our XML records. Among the options we considered were a relational database, an object-oriented database, a native XML database, and two XML-enabled free-text search engines. The goal of our investigation is to match the complex data structure of our bibliographic records to the most appropriate data storage model. At the same time, our storage solution needs to be able to provide quick access to large stores of bibliographic information. Today, I will present our experiences with each of these databases and discuss a possible future for the storage of library data.
Data Models for Storage and Retrieval
[SLIDE 3] Whether XML should be used for storage or just as a transport format is open to debate. Many believe its hierarchical structure is ill suited for storage while others, encouraged by its potential for modeling complex relationships, are developing new databases to store XML natively. Some of the databases we investigated over the past year treat XML as a storage format; others treat it as a transfer format. To better understand each approach, I will start with a few theoretical models for storing data. After that, I will detail our experiences with Oracle, Tamino, the XML Query Engine, Ozone, and Inktomi's search software.
[SLIDE 4] There are a variety of ways to describe databases that store XML. XML:DB, a group formed to develop an application programming interface for XML databases, describes XML databases by their fundamental design. Using this scheme, there are three types of databases: Native XML Databases, XML Enabled Databases, and Hybrid XML Databases.
Native XML Databases have the storage of XML as their fundamental purpose. All data access is through XML and its related standards: XPath, SAX, DOM, and XQL. Tamino (a hierarchical database) and dbXML (a semi-structured database) are examples of Native XML Databases.
XML Enabled Databases have a mapping layer that manages the storage and retrieval of XML records. When using an XML Enabled Database, retrieved data is not guaranteed to have originated in XML form. XML Enabled Databases may allow for the manipulation of data through XML-specific technologies or through traditional data manipulation standards like SQL. Examples of XML Enabled Databases include the relational databases Oracle and SQL Server.
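To make the idea of a mapping layer concrete, here is a minimal sketch in Python. It uses SQLite as a stand-in for a relational database; the table name, column names, element names, and sample row are all hypothetical and are not taken from Oracle, SQL Server, or the Medlane project. The point is only that the XML handed back to the caller is assembled from rows and never existed as an XML document in the database.

```python
import sqlite3
import xml.etree.ElementTree as ET

def rows_to_xml(conn):
    """Toy mapping layer: build an XML document from relational rows."""
    root = ET.Element("records")
    query = "SELECT id, title, author FROM record ORDER BY id"
    for rec_id, title, author in conn.execute(query):
        rec = ET.SubElement(root, "record", id=str(rec_id))
        ET.SubElement(rec, "title").text = title
        ET.SubElement(rec, "author").text = author
    return ET.tostring(root, encoding="unicode")

# Hypothetical table and sample row, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record (id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
)
conn.execute(
    "INSERT INTO record (title, author) VALUES (?, ?)",
    ("Medical Informatics", "Sample, Author"),
)

xml_out = rows_to_xml(conn)
print(xml_out)
```

A real mapping layer would also handle the reverse direction, taking an incoming XML record apart into rows.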
The last type of XML database, as described by the XML:DB group, is the Hybrid XML Database. Hybrid databases are databases that can be treated as Native XML databases or as XML Enabled Databases depending on the requirements of the application. Ozone, an open source object-oriented database, is an example of this kind. It can either store XML natively in its document object model form or can be used to store objects that are then mapped to XML output.
[SLIDE 5] Another way of describing databases is to describe the main trends in general data storage. These trends include: hierarchical data modeling, relational data modeling, object-oriented data modeling, and flat-file storage. Some of you may be surprised by the idea that you can store data in a file system. File systems, after all, store documents. Keep in mind, however, that, with the development of XML, the distinction between data and documents has blurred. Still, effectively storing XML data in a file system requires an XML indexer and, in the best-case scenario, uses a journaling file system like ReiserFS.
[SLIDE 6] At this stage in XML's development, most XML data is stored in individual documents in a traditional file system. There are reasons for this, of course. The first is that working with flat files is the easiest way to get started with XML. There are no databases to maintain, no servers to set up, just documents with XML data. If an XML indexer is used, the speed of querying the data is faster than it would be otherwise, but small numbers of XML files can be queried one by one if necessary. Most people learning XML start by creating individual documents and only consider moving to a traditional database when their collection grows beyond their ability to manage, or when they discover a need to frequently update their documents.
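Querying a small number of files one by one can be sketched as follows. This is an illustration only, assuming Python's standard XML parser and two made-up sample records written to a temporary directory; the function name and element names are hypothetical.

```python
import tempfile
from pathlib import Path
import xml.etree.ElementTree as ET

def search_files(directory, element_path, needle):
    """Parse every .xml file in turn and return the names of matching files."""
    hits = []
    for path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(str(path)).getroot()
        texts = [(el.text or "").lower() for el in root.findall(element_path)]
        if any(needle.lower() in text for text in texts):
            hits.append(path.name)
    return hits

# Two made-up records, one document per file.
workdir = tempfile.mkdtemp()
samples = {
    "rec1.xml": "<record><title>Clinical Anatomy</title></record>",
    "rec2.xml": "<record><title>Medical Physiology</title></record>",
}
for name, body in samples.items():
    Path(workdir, name).write_text(body, encoding="utf-8")

matches = search_files(workdir, ".//title", "anatomy")
print(matches)
```

Every query re-parses every file, which is why this approach stops being practical as a collection grows and an indexer or database becomes attractive.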
[SLIDE 7] If you choose to use a database, you might select a hierarchical database. Hierarchical databases contain a strictly defined tree of data nodes. These data nodes can contain data or other nodes. This may sound familiar to those of you already working with XML, with the exception that you probably think of 'nodes' as 'elements'. Unlike a hierarchical database, however, an XML database needs to be able to search across an unknown structure of elements.
[SLIDE 8] An advantage to using a hierarchical database or, in the case of Tamino, a native XML database built from a hierarchical database, is that getting data in and out does not require the deconstruction and reconstruction of the XML structure. Simply supply the database with a DTD or XML Schema and start inputting your records. You do not need to create tables or normalize your data like you would with a traditional relational database. The structure provided by the DTD or XML Schema is the structure the database will use.
[SLIDE 9] Another database that models XML's structure well is the object-oriented database. In fact, DOM, the document object model, which many XML developers use to process their XML, is a native object model that can be stored directly in many object-oriented databases. When DOMs are stored in an object-oriented database, PDOMs, or persistent DOMs, are created. PDOMs can be manipulated in the same way that DOMs can. To search a PDOM stored in Ozone, for example, one would perform an XPath query on the XML object known, in Ozone’s language, as an XML Container.
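For those who have not worked with the DOM, here is a small sketch of what manipulating one looks like, using Python's built-in minidom. It shows only an in-memory DOM; it does not use Ozone or its API, and nothing here is persisted. The record content is made up.

```python
from xml.dom.minidom import parseString

# Parse a made-up record into a DOM tree of node objects.
doc = parseString("<record><title>Clinical Anatomy</title></record>")

# Navigate the tree: find the <title> element and read its text node.
title = doc.getElementsByTagName("title")[0]
title_text = title.firstChild.data

# Modify the tree: append a new <note> element to the root.
note = doc.createElement("note")
note.appendChild(doc.createTextNode("Reserve desk"))
doc.documentElement.appendChild(note)

# Serialize the modified tree back to XML text.
serialized = doc.documentElement.toxml()
print(serialized)
```

A persistent DOM offers the same kind of node-level operations, with the difference that the node objects live in the database rather than in a program's memory.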
[SLIDE 10] Compared with hierarchical or object-oriented databases, relational databases are considerably more complex to use. Despite this, they are the most commonly used kind of database because of the power that dissecting, or normalizing, your data can give you. In its simplest form, a relational database consists of a collection of tables, each with a fixed number of columns and any number of rows. Each row has a primary key, an identifier (often numeric) that guarantees its uniqueness. Often tables contain foreign keys that correspond to primary keys in other tables. In this slide you will see a very simple representation of data stored in a relational database.
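The relationship between primary and foreign keys can be sketched in a few lines. The example below uses SQLite from Python; the table names, column names, and sample rows are assumptions chosen for illustration and do not reflect Lane's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce foreign keys
conn.executescript("""
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,   -- primary key: unique per row
    name      TEXT NOT NULL
);
CREATE TABLE title (
    title_id  INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    author_id INTEGER NOT NULL REFERENCES author(author_id)  -- foreign key
);
INSERT INTO author VALUES (1, 'Osler, William');
INSERT INTO title  VALUES (10, 'The Principles and Practice of Medicine', 1);
INSERT INTO title  VALUES (11, 'Aequanimitas', 1);
""")

# Reassemble the normalized data by joining on the key relationship.
rows = conn.execute("""
    SELECT t.title, a.name
    FROM title AS t
    JOIN author AS a ON a.author_id = t.author_id
    ORDER BY t.title_id
""").fetchall()

for title, name in rows:
    print(title, "/", name)
```

The author's name is stored once and shared by both titles, which is the payoff of normalizing. The cost is that every record has to be taken apart on the way in and joined back together on the way out.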
[SLIDE 11] Lane Library looked at each of these types of storage and retrieval. To evaluate flat-files, we tested the XML Query Engine and Inktomi’s search software. To map our data into a relational database, we used Oracle (with and without iFS, Oracle’s Internet File System). We used Ozone to store and retrieve our XML in its native DOM form. And, to evaluate a hierarchical database, we used Tamino. We had mixed results with each of these approaches, but made enough progress, I believe, to decide on a method of storing and retrieving our XML.
[SLIDE 12] The XML Query Engine was the first flat file XML indexer that we evaluated. One disadvantage to using XQE is that, at this stage in the product’s development, some programming is required. Another possible disadvantage is that XQE’s index is stored entirely in memory. As a result, it is limited to 32 thousand documents and will not persist if the engine is shut down. There are advantages, though, to maintaining the index in memory. First, XQE is incredibly fast. Also, since the XML Query Engine works with native XML by default, we did not experience any difficulty getting it to index our XML.
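The trade-off of an in-memory index can be illustrated with a toy example. The sketch below is not XQE and does not use its API; it is a hypothetical element-aware word index held in a Python dictionary, built from two made-up records.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def build_index(docs):
    """Map (element name, lowercase word) to the set of matching document ids."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for el in ET.fromstring(xml_text).iter():
            for word in (el.text or "").lower().split():
                index[(el.tag, word)].add(doc_id)
    return index

# Made-up sample documents keyed by a hypothetical document id.
docs = {
    "rec1": "<record><title>Clinical Anatomy</title><subject>Anatomy</subject></record>",
    "rec2": "<record><title>Medical Physiology</title></record>",
}

index = build_index(docs)

# A lookup is a single dictionary access; no file is parsed at query time.
hits = sorted(index[("title", "anatomy")])
print(hits)
```

Lookups are fast because everything lives in memory, but the dictionary disappears when the program exits and has to be rebuilt from the source files on the next start.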
[SLIDE 13] During the course of the year, we also experimented with Inktomi’s search engine. As with the XML Query Engine, indexing and retrieving our XML files with Inktomi’s product was easy. Though Inktomi maintains its index on disk, we found it very fast when performing simple queries of our XML documents. One disadvantage to using Inktomi’s search software is that the program only pairs XML elements to query fields. As a result, there is no support for complex XQL querying of the kind the XML Query Engine provides. However, unlike the XML Query Engine, Inktomi’s search software can be used to index and search unstructured text in conjunction with files marked up in XML.
[SLIDE 14] Relational databases are often the first choice for storing XML data because they are the first choice for storing most kinds of formally defined data. For the Medlane project, we investigated natively storing our XML in Oracle CLOBs. We also investigated storing our XML in Oracle tables by breaking it down into manageable data chunks. To accomplish the complex mapping between XML elements and relational database tables we used the Internet File System, an object-table mapping tool supplied by Oracle.
[SLIDE 15] Of the two methods, we were more successful with the native XML approach. By storing our XML in CLOBs, we are able to use Intermedia, Oracle’s XML indexer, to provide element and attribute level access to our bibliographic records. Disadvantages to using CLOBs to store our bibliographic data include having to parse through a document each time we want to retrieve it and having to lock the whole document when we want to edit it.
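The two disadvantages mentioned here can be seen in a small sketch. It uses a SQLite TEXT column from Python as a stand-in for an Oracle CLOB; the table, the helper functions, and the sample record are hypothetical, and Oracle's indexing is not modeled.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# The whole XML document is stored as one large text value per row.
conn.execute("CREATE TABLE bib (id INTEGER PRIMARY KEY, xml_doc TEXT)")
conn.execute(
    "INSERT INTO bib VALUES (1, ?)",
    ("<record><title>Clinical Anatomy</title></record>",),
)

def load_record(conn, rec_id):
    """Fetch the stored text and parse it; this happens on every retrieval."""
    (xml_doc,) = conn.execute(
        "SELECT xml_doc FROM bib WHERE id = ?", (rec_id,)
    ).fetchone()
    return ET.fromstring(xml_doc)

def set_title(conn, rec_id, new_title):
    """Changing one element means rewriting the entire stored document."""
    root = load_record(conn, rec_id)
    root.find("title").text = new_title
    conn.execute(
        "UPDATE bib SET xml_doc = ? WHERE id = ?",
        (ET.tostring(root, encoding="unicode"), rec_id),
    )

set_title(conn, 1, "Clinical Anatomy, 2nd ed.")
current_title = load_record(conn, 1).findtext("title")
print(current_title)
```

Because the database sees each document as a single value, an edit to one element replaces the whole value, which is why the whole document has to be locked during the change.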
[SLIDE 16] Our experience with Oracle’s iFS was not as positive. To map our XML into relational database tables, iFS required us to create a descriptive text file for every XML element. Our DTD for authority records, which is considerably less complex than our DTD for bibliographic records, required over 130 of these iFS data description files. As a result, we did not attempt to map our bibliographic DTD using iFS.
[SLIDE 17] After experimenting with Oracle, we built an Ozone version of our database. Ozone stores DOMs natively so we did not need to deconstruct our records like we did with Oracle. Working with Ozone was much easier. However, the number of objects created when storing our bibliographic records in the database was discouraging. Five thousand of our bibliographic records created over a million objects in the database. Since the working set of objects in an object-oriented database should fit in memory, we decided that Ozone is not suited for the Medlane project.
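To see why object counts grow so quickly, consider how many DOM nodes even a tiny record produces. The sketch below counts them with Python's minidom; the one-field record is made up, the counting rule (one object per element, text node, and attribute) is an assumption for illustration, and Ozone's own object accounting may differ.

```python
from xml.dom.minidom import parseString

def count_nodes(node):
    """Count this node, each of its attributes, and all of its descendants."""
    total = 1
    if node.attributes is not None:   # only element nodes carry attributes
        total += node.attributes.length
    for child in node.childNodes:
        total += count_nodes(child)
    return total

# A made-up record with a single field and subfield.
sample = (
    '<record><datafield tag="245">'
    '<subfield code="a">Clinical anatomy</subfield>'
    '</datafield></record>'
)

node_count = count_nodes(parseString(sample).documentElement)
print(node_count)
```

One field with one subfield already accounts for six objects under this rule. Our figures of five thousand records and over a million objects work out to more than two hundred objects per record on average.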
[SLIDE 18] Tamino was one of the last databases that we evaluated. As a native XML database, Tamino had no difficulty loading records based on our DTD. We were able to supply XQL queries in the form of a URL and access any document in the database through the same method. Unfortunately, because Tamino would not run on 64-bit Solaris, we were only able to evaluate it on a student PC running NT. As a result, we were not able to get a feel for how it might perform in a real-world situation. In addition, Tamino seems to lack a timeout function, so invalid XQL searches often hang the system.
[SLIDE 19] What will Lane use for our future investigations? We are encouraged by the ease of working with native XML databases. The idea that databases should conform to the data rather than making the data conform to the database is one that has great appeal when working with XML. XML, by its nature, is flexible, so why should we use a database that isn’t? We plan to continue to store our XML records in Oracle CLOBs and watch for the development of new native XML databases. One, in particular, that looks promising is dbXML. DbXML is a semi-structured native XML database that integrates many XML technologies into its data store, including XLinks. DbXML also appeals to us at the Medlane Project because it is open source software.
[SLIDE 20] So where to now? The Medlane Project will continue to investigate how XML can be used to benefit libraries. In particular we need a visual query construction interface that utilizes any supplied DTD or XML Schema. We are also interested in creating structured indexes (built with XML) to simplify the process of browsing a library collection. Perhaps most importantly, however, we are considering writing an XML editor specifically designed for library staff. Currently, most XML editors require that the user know XML.
[SLIDE 21] If you are interested in learning more about our project, in learning more about how XML can be used by librarians in general, or in converting your own MARC records into XML, please visit our website. This presentation will also be posted there once I return.
Thank you. Questions?