Adding Luster to Librarianship: XML as an Enabling Technology

MLGSCA/NCNMLG Joint Meeting

2002-01-31
Scottsdale, AZ
Dick R. Miller

[Slide 1 Title]

Thank you Kay [Deeney] … Good morning!

I appreciate this opportunity to share my understanding of XML, the eXtensible Mark-up Language, and hope that some of my enthusiasm for its potential will rub off on you.

[Slide 2 Redwood City]

I hail from Redwood City, California, where the "Climate is Best by Government Test"-- at least, according to the sign. However, as you can see from this recent photo, it can be a little drearier in winter than here in Scottsdale. Before I get started, there's someone I need especially to acknowledge-- a certain newly elected MLA board member.

[Slide 3 Jerry]

Ever since Jerry persuaded me to do this presentation, I've found ways to avoid working on it. These are just a few examples, from over the past several months, of activities that I convinced myself were conveniently more urgent than working on this presentation. Somehow, Jerry neglected to mention all the fringe benefits of speaking here today.

[Slide 4 Writer's Block]

As if these weren't enough, not too long ago, I got "Writer's block", literally! I thought this cubical book entitled "Writer's block" was a nifty, physical example of how the lack of context or mark-up can lead to inaccurate communication and/or information retrieval.

[Slide 5 Exploration]

Today, I will explore four areas, hoping that you may:

  1. Learn some of the key reasons why XML is of strategic importance to libraries, especially regarding our Web presence.
  2. Better understand XML's main features, and how they underpin this.
  3. Appreciate the more difficult, yet critical, role of a schema (or template) in managing XML documents. To illustrate this, I will provide an overview of work-in-progress at Lane in using XML to define bibliographic and authority data, in other words, recasting MARC data into XML.
  4. And, lastly, consider how XML can help polish the librarian's and library's image. I believe libraries can meld technological advantages afforded by XML with our unique traditional strengths to create and seize future opportunities, lest we become a lackluster profession in an increasingly Web-oriented world.

[Slide 6 Extra, Extra]

Have you noticed that X is showing up everywhere?

There is a lot of hype associated with the introduction of any new technology. It almost seems that X is today's mantra. I wonder if Roentgen's discovery of X-Rays in 1895 was a seminal event in X's rise to its extreme prominence nowadays?

[Slide 7 Mac OS X]

There is the shimmering blue X of Mac OS X [ten].

[Slide 8 Xbox]

A video game system that "blurs the lines between fantasy and reality."

[Slide 9 X-Files]

A veteran television series we would be advised to keep undercover.

[Slide 10 X-Prize]

And even an X-prize for potential space travel.

[Slide 11 Fractal Extreme]

I thought Fractal Extreme was a beautiful example of the "extreme" genre.

[Slide 12 Alt-X]

Then, there's Alt-X, "where the digerati meet the literati."

[Slide 13 Alt-X Press]

Their ebook Press page would make a great segue to E, but let's not go there.

[Slide 14 X in Sky]

I was just a bit surprised the other day when I walked out on the patio and looked up to see this X in the sky.

[Slide 15 XML Hype]

Not immune to hype, XML is showing up nearly as profusely as X itself. However, I am convinced that when you strip the hype away, XML has a powerful, elegant simplicity that will be of long-term value to libraries (and to the Web in general). Those of you who know me, realize that I am not prone to endorse fads.

[Slide 16 Exasperation]

In 1999, I characterized my frustrations in coping with proliferating digital resources. These included:

  • Klugey intersystem linkages
  • A multitude of incompatible interfaces
  • Users' known reluctance to search multiple systems
  • Limited integrated library system interface flexibility
  • And, confusion in bibliographic control. Should we concentrate on making HTML lists, adding records in our Web catalogs, or both?

I believe that "library information", in MARC and in proprietary ILS formats, has been segregated too long from mainstream Web resources. Having a Web catalog isn't good enough any more, as users are likely to search other, more comprehensive resources first, and reluctantly turn to separate catalogs. How does XML address such limitations?

[Slide 17 Expansive]

XML is essentially a standard for creating standards, and thus a universal format for data and document exchange. It has been called the lingua franca of the Information Age. It offers the power, precision, control, and flexibility that should appeal to librarians at the gut level. [Next]

  • It is a key component of the open source movement that Scott [Garrison] has discussed, as it is not proprietary; anyone can use it to devise their own data structures. This has often been cited as the basis for cost savings for those who agree on a common structure.
  • Extensibility simply means that XML structures are relatively easy to expand. Libraries and vendors could add their own specifics to a core standard defined by the profession. This also makes it easier for standards to evolve over time, which they must in order to remain relevant.
  • Perhaps most valuable is the separation of content from display. This lets us divide and conquer. XML data can be readily re-used for many purposes, with just a little planning for the modularity. For example, reference folks may want to format a bibliography differently than done in the catalog. We should be able to display serials and books differently, unlike in current ILS systems where one size must fit all.
  • Linking techniques go beyond simple hyperlinking. They're part of the overall design of XML, and allow you to link to a specific location in a document.
  • Platform neutrality means that XML is the same whether you're using a Mac, PC, or Unix, eliminating a whole slew of technical problems.
  • XML's fixed character set is Unicode, which provides a single, comprehensive character set for all languages. This and platform neutrality underlie the promise of data longevity (or future-proofing) as hardware, software, and network protocols continue to change. This is also key to internationalization, allowing non-Roman characters to be treated just like ASCII ones. You do, however, need to have fonts to support the display.
  • Lastly, XML can be used as a database interface, using a traditional database system in conjunction with XML input and output.
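The separation of content from display can be sketched in a few lines of Python using the standard library's ElementTree module. This is only an illustration: the element names and the author "Jane Doe" are made up, not part of any published schema. The same record is rendered once as a catalog-style display and once as a one-line citation.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; element names and values are illustrative only.
RECORD = """<record>
  <title>Writer's block</title>
  <surname>Doe</surname>
  <forename>Jane</forename>
  <year>2001</year>
</record>"""

def catalog_display(rec):
    """Label-per-line view, as a catalog screen might show it."""
    return "\n".join(f"{child.tag.capitalize()}: {child.text}" for child in rec)

def citation_display(rec):
    """One-line view, as a bibliography might show it."""
    return (f"{rec.findtext('surname')}, {rec.findtext('forename')}. "
            f"{rec.findtext('title')}. {rec.findtext('year')}.")

rec = ET.fromstring(RECORD)
print(catalog_display(rec))
print(citation_display(rec))
```

The data is stored once; each display function plays the role that a separate stylesheet would play in a real XML workflow.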

[Slide 18 PubMed XML]

NLM is the preeminent example of this approach, providing an XML server for the millions of PubMed records. [Next] Using this pull down menu, you can select XML and display the record in XML, [Next] now NLM's principal format. NLM has also made MeSH available in XML, and the PubMed Central repository is using XML for markup of full-text articles.

[Slide 19 Google]

To underscore how XML's content markup has strategic potential, consider that even the best Web search engines are hampered by HTML's focus on appearance. A search in Google does retrieve Richard W. Price, but with a lot of false drops. If XML were more broadly deployed, it would be possible to distinguish between prices in dollars and people named Price. We've lowered a lot of standards for Web convenience, and libraries should strive toward more sophisticated retrieval, [Next] as the simple example of surnames first in library records illustrates.
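A small Python sketch may make the point concrete. The three documents and their element names are invented for illustration. A plain keyword search for "price" retrieves a false drop, while a search restricted to a surname element does not.

```python
import xml.etree.ElementTree as ET

# Invented sample documents; element names are assumptions for illustration.
DOCS = """<documents>
  <doc id="1"><surname>Price</surname><forename>Richard W.</forename></doc>
  <doc id="2"><title>Drug price guide</title></doc>
  <doc id="3"><item>Stethoscope</item><price currency="USD">45.00</price></doc>
</documents>"""

root = ET.fromstring(DOCS)

def keyword_hits(word):
    """Plain text search: matches the word anywhere in a document's text."""
    return [d.get("id") for d in root.iter("doc")
            if word.lower() in "".join(d.itertext()).lower()]

def surname_hits(name):
    """Markup-aware search: matches only the content of surname elements."""
    return [d.get("id") for d in root.iter("doc")
            if any(s.text == name for s in d.iter("surname"))]

print(keyword_hits("price"))   # the person, plus a false drop
print(surname_hits("Price"))   # the person only
```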

[Slide 20 Seaboard 1]

Another example is that of Seaboard Air Line. When Lindbergh flew the Atlantic in 1927, airline stocks rose dramatically in price, including Seaboard's-- [Next] despite the fact that Seaboard Air Line was a railroad company.

[Slide 21 Seaboard 2]

Markup indicating the category an organization belongs to could help eliminate such confusion. Much later, in 1946, an airfreight company, Seaboard & Western Airlines, was founded. It changed names to Seaboard World Airlines in 1961. Organization of information to account for time factors and name changes would also help improve search precision. Is it really acceptable to not be able to limit a search by date?

[Slide 22 Beware of Dog]

Names and information can be misleading in many ways, on the Web or not. Does this look too staged? Libraries, at least, have a reputation for trust. By combining selected new Web technologies, such as XML, with our good reputation, we may be able to contribute to solving such problems, and thus be thought of as even more reliable information resources.

[Slide 23 XML Basics]

Let's review the key aspects of XML. (This provides inspiration, when I look up from my desk, when coping with flawed information storage and retrieval systems.)

[Slide 24 Extensible Syntax]

XML is simply a syntax for marking up documents or records to delineate different kinds of information, unlike HTML which mostly marks up what a document should look like. [Next] An XML document usually begins with a line, or declaration, that says which version of XML and which encoding standard have been used. Other than that, there are three main building blocks. [Next]

  • Elements identify what a particular chunk of data has been named, for example 'title', 'color', 'price', whatever you choose to identify. You might think of these as akin to fields of a record. This is also why XML has been called self-describing.
  • Attributes allow you to describe properties of an element, information about a particular element, discretely from the element itself, for example that a MeSH heading is 'primary' or 'secondary'. One of the challenges of XML is deciding what should be an element versus what should be an attribute.
  • Entities permit components of a document to be named and stored separately. They must be predefined and can be spotted easily in markup as they begin with an ampersand [&]. However, this means that ampersands in a document must be disguised. XML has predefined entity references for this purpose. In addition, they serve as a shorthand method of referencing the same information in different places, and function like a placeholder, so that non-XML data, such as images, can be incorporated into an XML document.
  • There are also predefined Comments tags [<!--], which are useful.
  • Remember, XML differs from HTML in marking up content, not appearance, and allowing you to define the tags.
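Here is a minimal sketch of these building blocks, parsed with Python's standard ElementTree module. The 'heading' and 'descriptor' element names and the sample value are assumptions chosen for illustration. The sample contains a declaration, a comment, an element with an attribute, and the predefined entity reference that disguises an ampersand.

```python
import xml.etree.ElementTree as ET

# Bytes are used so the parser honors the encoding named in the declaration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- A comment: skipped by the default parser -->
<heading level="primary">
  <descriptor>Ear, Nose &amp; Throat</descriptor>
</heading>"""

heading = ET.fromstring(SAMPLE)
descriptor = heading.find("descriptor")

print(heading.tag)           # the element's name
print(heading.get("level"))  # the value of its 'level' attribute
print(descriptor.text)       # &amp; is decoded back to a plain ampersand
```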

[Slide 25 Hierarchy]

XML's power comes from the simplicity of its object-oriented, hierarchical structure. Each element in this hierarchy can function as an object. Thus, you can reference a Topic, without worry that it consists of two other objects. By applying basic concepts, you can build easily understood, yet complex structures. This simplified diagram shows how related elements are grouped in a hierarchy, or an inverted tree, with one root, branches, and leaves representing actual data values. XML requires a "root" element and that all subordinate elements nest properly under it.

Now, let's look at an example to see how basic XML looks.

[Slide 26 Example 1]

Here I've chosen a root element called BibliographicRecord. Note that an element name cannot contain spaces; sometimes an underscore is used to represent a space. Pairs of start and stop tags delineate each element. The angle bracket introduces the element.

  • We then have three attributes, 'control', 'created', and 'updated', which represent properties of our record. The value of each appears in quotation marks. That these are on separate lines is for clarity and not required. The closing angle bracket signifies the end of our start tag for BibliographicRecord.
  • Next we begin a title element with a single attribute, 'level', and its value of 'primary'. Note that attributes are always embedded in the start tag for the element.
  • Then we have the value of the title element.
  • And its stop tag with the distinctive preceding slash. The start and stop tags are on separate lines simply for clarity. Note that Title cannot be uppercase here and then lowercase here. Case matters.
  • Next, we have a 'person' element with one attribute, Role. This is known as a container element because it contains, or nests, other elements.
  • These are 'surname', 'forename', and 'dates.' In this case, note that the attribute value 'Author' may apply to all the elements within the container, while the 'Exact' attribute with value 'Yes' applies to the 'Dates' element alone.
  • And here's the closing tag for our 'Person' element.

[Slide 27 Example 2]

In the same manner, we have a topic, which is MeSH and primary.

  • It contains two other elements, descriptor and qualifier.
  • And another topic, which is MeSH and secondary, and its subelements.
  • Similarly, I've chosen to include a local 'Type' element to record that this is a chapter.
  • And finally, we conclude with the closing tag for the root element.

That's basically it. It is clear enough that people can read it, and precise enough that computers can read it. Documents following this syntax are said to be well-formed. But, what makes XML really powerful is the suite of tools, which can manipulate the marked up content.
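The record walked through on these two slides might look roughly like the sketch below. The values, the capitalization of names, and the 'scheme' attribute are assumptions, since the slide content is not reproduced in this text. The Python code simply asks a parser whether the text is well-formed, and shows that changing the case of one stop tag is enough to break it.

```python
import xml.etree.ElementTree as ET

# Reconstruction for illustration only; all values are placeholders.
RECORD = """<BibliographicRecord control="X0001"
                     created="2001-11-05"
                     updated="2002-01-15">
  <title level="primary">Placeholder chapter title</title>
  <person role="Author">
    <surname>Doe</surname>
    <forename>Jane</forename>
    <dates exact="Yes">1950-</dates>
  </person>
  <topic scheme="MeSH" level="primary">
    <descriptor>Placeholder descriptor</descriptor>
    <qualifier>placeholder qualifier</qualifier>
  </topic>
  <type>Chapter</type>
</BibliographicRecord>"""

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(RECORD))

# Case matters: a stop tag spelled 'Title' does not close a 'title' element.
broken = RECORD.replace("</title>", "</Title>")
print(is_well_formed(broken))
```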

[Slide 28 Extended Family]

  • XML, [Next] recommended by the World Wide Web Consortium in 1998, describes our content markup in as much detail as we choose, but doesn't actually do anything else.
  • Remember I mentioned separation of content and display? XSL provides the mechanism for formatting, filtering, sorting, etc. for display purposes. The same document can be displayed in as many different ways as different stylesheets are written, and as precisely as the XML document itself is defined.
  • Since Web browsers don't fully support XML yet, it is often converted on the fly to HTML for display using XSL Transformations (XSLT). XSLT can also be used when you want to convert your "old" XML documents to a new XML structure.
  • Last year, XLink joined the ranks to provide a standard for hyperlinking documents. Its sophistication supports a single link to reference multiple related documents. (Another tool, XPointer, permits linking into specific parts of a document.)
  • XML namespaces prevent ambiguities in unrelated documents, which contain otherwise duplicate element names. This would allow you to keep an element named 'Price' for library materials and another named 'Price' for library supplies discrete, when combining materials and supplies in one document.
  • XHTML is a version of HTML that conforms to XML. It handles display, but is more rigorous.
  • XQuery is a language for searching groups of XML documents.

These tools complicate the picture, but illustrate how separately addressing the different features allows each to be optimized. You don't need to know all of these to take advantage of XML.
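The namespace idea can be illustrated with a short Python sketch. The namespace URIs, prefixes, and amounts below are invented for the example. Two elements share the local name 'Price', yet each can be retrieved unambiguously because it belongs to a different namespace.

```python
import xml.etree.ElementTree as ET

# The example.org URIs and the 'mat'/'sup' prefixes are hypothetical.
ORDER = """<order xmlns:mat="http://example.org/materials"
       xmlns:sup="http://example.org/supplies">
  <mat:Price>120.00</mat:Price>
  <sup:Price>8.50</sup:Price>
</order>"""

NS = {"mat": "http://example.org/materials",
      "sup": "http://example.org/supplies"}

order = ET.fromstring(ORDER)
materials_price = order.findtext("mat:Price", namespaces=NS)
supplies_price = order.findtext("sup:Price", namespaces=NS)

print(materials_price)
print(supplies_price)
```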

[Slide 29 Flexibility]

One of the best features of XML is its flexibility. [Next] Unfortunately, this is a double-edged sword. [Next] It becomes increasingly difficult to share when each of us defines the same information differently. For XML to achieve its potential in any particular field, the stakeholders in that field need to agree upon standards.

[Slide 30 Exertion]

Such standards for XML are known as [Next] document models, DTDs (document type definitions), and schemas, which add data typing, indicating for example, that a number is a price rather than a quantity. Essentially, these define a hierarchy of elements, including allowable elements, attributes, and whether they are optional, repeatable, etc. They are used to validate documents, which claim to adhere to the standard. Such compliant documents are said to be valid. Standards have been developed for biosequence data, mathematics, music, and there's even VoiceXML.
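To give a feel for what validation checks, here is a deliberately simplified Python sketch. It is not a DTD or schema processor; real validation would use a validating parser. The rule table and element names are hypothetical, and only cover which child elements are allowed, required, or repeatable.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: child name -> (minimum, maximum) occurrences.
# A maximum of None means the element is repeatable without limit.
RULES = {"title": (1, 1), "person": (0, None), "topic": (0, None)}

def violations(record_xml):
    """List the ways a record's children break the RULES table."""
    root = ET.fromstring(record_xml)
    problems = []
    counts = {}
    for child in root:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        if child.tag not in RULES:
            problems.append(f"unexpected element: {child.tag}")
    for name, (low, high) in RULES.items():
        n = counts.get(name, 0)
        if n < low:
            problems.append(f"missing required element: {name}")
        if high is not None and n > high:
            problems.append(f"too many {name} elements: {n}")
    return problems

print(violations("<record><title>A</title><topic>B</topic></record>"))
print(violations("<record><topic>B</topic><price>3</price></record>"))
```

A record with no violations corresponds loosely to a valid document; a well-formed record can still fail this kind of check.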

This is easier said than done. XML itself is fairly simple, as we've seen. Analyzing data and reaching agreement are the difficult part, where knowing the content is as important as knowing XML. As part of the Medlane Project at Lane, we are developing a schema for bibliographic and authority data. We plan to offer this as a framework for discussion, enhancement, and refinement in hopes that libraries will recognize the value of a Web-oriented format, and adopt such a schema. We anticipate that there would be designated elements to accommodate both vendor-specific and library-specific extensions to an agreed core of standard elements. While many issues are yet to be explored, I hope that the following sneak preview will help illustrate the scope and complexity of such data-- sometimes underestimated by those outside the library community. In contrast, the current standard, MARC, has become excessively complex in its 40 years of evolution. Its shortcomings are increasingly evident as web-delivery of information burgeons; it is difficult to integrate MARC records with other Web-oriented documents.

[Slide 31 Record]

I do not intend to diminish MARC's stature. Indeed, the years of debate and honing are crucial to the review that we're undertaking. In order to narrow our focus, let's ignore sets of records. For any given RECORD, we have identified 10 Principal Elements, which appear to accommodate all bibliographic and authority data. A significant problem is deciding how to handle each kind of data. For example, is the academic course, Biochemistry 101, a WORK or an EVENT? Basically, WORK represents bibliographic titles and the other nine represent types of authority data. Each RECORD would contain one of the Principal Elements. (This is the schema, or rules, for constructing a record that I'm referring to.) In addition, each RECORD would have a CONTROL element, and up to 10 RELATIONS elements, which parallel each of the Principal Elements. I will briefly review each of these 12 to illustrate their scope.

[Slide 32 Control]

The CONTROL element would contain all elements that relate to the RECORD itself, rather than to the content the RECORD represents. This would include IDs and which organization created it; the date a record is created, maintained, etc.; what kind of record it is, perhaps Original or Derived from a resource file; what language the record is in; etc. The date of publication, for example, unlike in MARC fixed fields, would not go here as it refers to the WORK, not the RECORD.

[Slide 33 Work]

The WORK element is likely what you think of as the result of cataloging. It is more narrowly defined here. It encompasses books, serials, and collections. It includes text, audio, video, software, maps, etc.-- approximately the MARC formats. The emphasis, however, is on the title of the WORK. A formal, or umbrella title, including edition and date serves as an anchor for the WORK. Other variant titles would be represented separately. We are also considering a Versions element, so that one WORK element could clearly delineate more than one version. This would address the problem of print and digital serials, which are identical in many regards and don't seem to merit separate records. Holdings, and usage or licensing restrictions would be associated with particular versions. These, and physical items, could be defined separately via XML entities, which I mentioned earlier.

[Slide 34 Nom]

The other nine Principal Elements are roughly equivalent to Authority records. Each supports synonyms or variants to handle equivalency.

NOM, as in nom de plume, is broadly interpreted to include names of real and fictional people, and even named animals. It excludes scientific names. The emphasis is on the name itself, referring to a specific being. Mickey, for example, could refer to a well-known mouse, a famous baseball player, or to our Pomeranian-- who thinks he's a person. Similar to Versions for WORK, pseudonyms may be designated under a single NOM, yet referenced individually to identify which works are associated with each pseudonym. Unlike MARC, forename and surname would be separate elements; I was very pleased that Medline has just introduced this distinction.

[Slide 35 Org]

ORG refers to named organizations, governmental jurisdictions, etc. It excludes EVENTs, which are treated separately. The distinction between a government and the geographic area that it governs is problematic in that ORG encompasses the government, but another of our Principal Elements, LOC, covers geography. If they are defined separately, is the resulting partial redundancy feasible?

[Slide 36 Event]

EVENT expands upon MARC's meeting name to also include all sorts of named happenings, which are usually treated as subjects.

Each of the Principal Elements has its own set of issues. The most obvious one with EVENT relates to meetings of organizations. We are considering proposing a qualifier element, which would permit the 'value' of one of the Principal Elements to modify another. For example, a Symposium of the Medical Library Association, could be marked up as an EVENT, with the EVENT name 'Symposium' qualified by the ORG name, 'Medical Library Association'. A separate ORG for the Medical Library Association alone, as sponsor, would be in order. The intent is to define a crisp listing of EVENTs, which often have varying and thus confusing names.

[Slide 37 Topic]

TOPIC includes the usual topical or conceptual headings and subheadings. There are many issues here as well. To keep things as simple as possible without losing information, a person as subject would not go here. An XML attribute of 'subject' could be defined for NOM, ORG, EVENT, and potentially other Principal Elements, to indicate that they have precedence over TOPIC. Some medical libraries, including NLM and Lane, have eliminated use of form, language, and geographic subheadings internally, preferring to rely on coordinate indexing for retrieval. In looking at LC subjects, the qualifier element I mentioned before may help resolve such issues. For example, History is clearly a topic, but one of limited value in a library with large historical collections. Rules could be developed to accommodate qualifying a TOPIC by location (LOC) and time (CHRON, which is coming up).

[Slide 38 Type]

History implies time and place. It does not imply form and genre, or TYPE. Separating TYPE for the name of categories is extremely useful in cataloging. In retrieval, it is important to distinguish whether something is an X-ray versus is about X-rays. Lane is using this concept extensively to indicate, for example in an authority record, that Stanford School of Medicine is a Medical School, and in a bibliographic record, that Dorland's is a Dictionary. The distinction between TYPE and TOPIC is simple, yet powerful, and deserves wider application.

[Slide 39 Lang]

LANG covers languages, including artificial ones, but excludes computer languages, which are considered WORKs. Incidentally, access to languages should not be obscured by cryptic codes.

[Slide 40 Chron]

CHRON governs dates and time. MARC has multiple formats for dates in various places. Lane's approach has been to define the same element only once, and then reference this in other parts of the schema. Thus, the date of publication, dates associated with a personal name, and date a record was created, would all share an identical structure. Paired CHRON values would indicate ranges. Attributes would indicate which kind of date (copyright, publication, birth, etc.) is involved. The resultant hierarchy of chronological access points would provide improved retrieval in this neglected area. Our definition of CHRON follows the ISO standard, so that an XML Style Sheet can display dates and times consistently.
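As a rough sketch of why a single date structure helps, the Python below reads several dates that share one hypothetical 'chron' element, distinguished only by a 'kind' attribute. The element name, the attribute values, and the use of start and end kinds to express a paired range are all assumptions, not the Medlane definition. Because every value is an ISO-style date, one routine can parse, compare, and display all of them.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical markup: one 'chron' structure reused for every kind of date.
RECORD = """<record>
  <chron kind="publication">2001-06-15</chron>
  <chron kind="birth">1950-03-02</chron>
  <chron kind="coverage-start">1990-01-01</chron>
  <chron kind="coverage-end">1999-12-31</chron>
</record>"""

def dates_by_kind(record_xml):
    """Map each date's kind to a parsed date object."""
    root = ET.fromstring(record_xml)
    return {c.get("kind"): date.fromisoformat(c.text) for c in root.iter("chron")}

def us_display(d):
    """One of many possible displays generated from the same stored value."""
    return f"{d.month}/{d.day}/{d.year}"

dates = dates_by_kind(RECORD)

def covered(d):
    """True if a date falls within the record's paired coverage range."""
    return dates["coverage-start"] <= d <= dates["coverage-end"]

print(us_display(dates["publication"]))
print(covered(date(1995, 5, 1)))
```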

[Slide 41 Loc]

Only three more to go! LOC encompasses geographic locations and topographical features. It may also be the appropriate place for buildings or structures as they are in fixed positions. We still have the issue of government versus geographic location to resolve! Addresses fit here, but merit their own schema. Others may have already produced an adequate one, which could be incorporated by entity reference.

[Slide 42 Lex]

LEX is a concept that I've been exploring for some time. Its premise is that keyword searching should be supported more effectively, and encompasses both words and phrases. Linked clusters of variants could provide a significant improvement in this popular type of search. Lane selectively records variants, such as British spellings and slang, and cross-references between words that may be difficult to recall at a busy reference desk.

[Slide 43 Lex example]

In this example, bloodletting, with and without the hyphen, would be a useful automatic retrieval, while a searcher may appreciate being apprised of the other related words. Bleeding could reference hemorrhage and so forth, Roget-style.

  • Plain keyword searching can be misleading. In this cluster the American noun retrieves only 60% of the hits in PubMed; the American adjective adds another 20%; and the British spellings 20% more. Do users realize exactly what they are retrieving?
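The cluster idea can be sketched in Python as follows. The data structure, the function names, and the three sample titles are hypothetical; only the bloodletting and bleeding/hemorrhage examples come from the talk. Exact variants are searched automatically, while related words are merely offered to the searcher.

```python
# Hypothetical LEX clusters: 'variants' are searched automatically,
# 'related' words are only suggested to the searcher.
LEX = {
    "bloodletting": {"variants": ["bloodletting", "blood-letting"],
                     "related": ["bleeding"]},
    "bleeding": {"variants": ["bleeding"],
                 "related": ["hemorrhage"]},
}

# Invented sample titles to search against.
DOCS = {
    1: "A history of blood-letting",
    2: "Bloodletting in the 18th century",
    3: "Bleeding disorders",
}

def expand(term):
    """Return (terms to search automatically, terms to suggest)."""
    key = term.lower().replace("-", "")
    cluster = LEX.get(key)
    if cluster is None:
        return [term.lower()], []
    return cluster["variants"], cluster["related"]

def search(term):
    """Find documents containing any automatic variant of the term."""
    variants, _ = expand(term)
    return sorted(doc_id for doc_id, text in DOCS.items()
                  if any(v in text.lower() for v in variants))

print(search("bloodletting"))     # both spellings are retrieved
print(expand("bloodletting")[1])  # related words offered as suggestions
```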

[Slide 44 Relations 1]

Lastly, the RELATIONS element. Our current thinking is that any Principal Element may be related to any other Principal Element and to other instances of itself as well. A WORK to WORK relation would be a bibliographic one, such as one serial title continuing another; a good example of a TOPIC to TOPIC one, or topical relation, is one MeSH term being broader than another. Then, there are the RELATIONS between the different Principal Elements. The most familiar of these are traditional access points, for example, a particular MeSH heading being a Topic of a particular WORK.

[Slide 45 Relations 2]

Here are our 10 Principal Elements. [Next] Each would have a parallel RELATIONS element. Details are subject to change as this is very much a work in progress.

[Slide 46 Expansion]

This slide represents all the potential relations that could occur in this schema. Some of these may seem unnecessary; however, more are useful than we may immediately recognize. [more]

  • Typically, we have a person as the author of a work.
  • Publication date relates the work to a time.
  • The publisher represents an organizational relation to a work.
  • And, we assign various topics, associating them with a work.
  • However, consider eponyms: Alzheimer Disease includes an implicit nominal relation from Dr. Alzheimer to the topic.
  • An acronym could identify a lexical relation to a topic.
  • Recording that someone is a Spanish speaker represents a linguistic relation to the person-- of value when seeking a translator for a patient. Potentially, this could provide a mechanism to restrict retrieval to works in languages that a user can read.

Marking up records to emphasize relations consistently could enhance future information retrieval in unforeseen ways. Lane has begun recording such relationships. To keep them from becoming chaotic, we plan to extend authority control to a 'type of relation' element.
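One way to picture authority control over relation types is the Python sketch below. The markup, the 'type' attribute, and the controlled list are hypothetical and do not reflect the actual Medlane schema. The code flags any relation whose type is not on the agreed list.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled list of relation types.
RELATION_TYPES = {"author", "publisher", "topic", "eponym", "speaks"}

# Hypothetical record: each child of 'relations' points at a Principal Element.
RECORD = """<record>
  <work>Placeholder title</work>
  <relations>
    <nom type="author">Doe, Jane</nom>
    <topic type="topic">Alzheimer Disease</topic>
    <chron type="published">2001</chron>
  </relations>
</record>"""

def uncontrolled_relations(record_xml):
    """Return (element, type) pairs whose type is not on the controlled list."""
    root = ET.fromstring(record_xml)
    return [(rel.tag, rel.get("type")) for rel in root.find("relations")
            if rel.get("type") not in RELATION_TYPES]

print(uncontrolled_relations(RECORD))
```

Here 'published' is reported because it is not on the list, which is the kind of drift that a controlled 'type of relation' element is meant to prevent.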

Our nascent schema appears to form a neat package, with an orderly web of the various fundamental kinds of information. It has the potential to open the bibliographic apparatus to all concerned, rather than just catalogers, but could also reinvigorate cataloging. It is incomplete, and needs further testing and scrutiny in the context of related library information.

[Slide 47 Context]

This slide depicts our schema in the center and sketches a few relationships between potential schemas in other areas. Addresses is a good example of an element that ideally would only be defined once and shared by other schemas via entity references. A vendor schema might incorporate our bibliographic ORG element, adding elements needed for Acquisitions, such as discount, shipping options, etc. [Incidentally, vendors have been experimenting with XML for electronic communications or EDI.] Likewise, library users' names could follow the same syntax of our bibliographic NOM. For ILL, why don't we add those difficult to verify transactions not in Medline to our collective bibliographic trove? All library information is connected, and should be integrated. The beauty and efficiency of XML is that we only need learn a single syntax, and define a particular element only once.

[Slide 48 XMLMARC]

Lane is planning to develop a more user-friendly interface for our XMLMARC software and will include our new schema as the default for converting MARC to XML to encourage experimentation. We believe the work up front in carefully defining the schema will pay off repeatedly in the future.

[Slide 49 Side-by-Side]

This screenshot shows MARC on the left and the resultant, automatically created XML on the right. Our software is only one of many tools available for working with XML. Many are free, and they are getting easier to use all the time.

[Slide 50 Libraries' Advantage]

Libraries are in a unique position to take advantage of XML. This stems from a tradition of purveying information selected with quality in mind, but not limited by source or producer, and dovetails with our users’ known reluctance to search multiple interfaces. Today’s digital fragmentation, especially in a domain such as medicine, can best be addressed at the library level, especially as libraries generally lack a profit motive. While much work remains, libraries are inherently well-positioned to integrate disparate resources. XML and open source software can serve as enabling technologies that allow us to participate more effectively in defining our future. The recent dot.com excesses provide us a unique opportunity to catch up technologically in a more reasoned fashion.

[Slide 51 Excellence]

Libraries can adopt the unifying technical infrastructure of XML far more easily than our users can be convinced by would-be competitors that they too share our high standards and values of impartiality, trust, confidentiality, thoroughness, and a service orientation. Sharing an infrastructure is the first step toward a distributed, integrated international resource, the sum of which would be far more valuable than its parts.

[Slide 52 Excitement Next]

Non-librarians are using XML and the Web to develop resources that would serve our users better if they were integrated with other library resources rather than being standalone resources. Thus, our prospects are excellent in these exciting times.

  • If you haven't already, experiment with XML.
  • Try your hand at writing a style-sheet to manipulate the appearance of an XML document.
  • Consider using XML on your website.
  • Participate in the development or review of the many needed schemas for library information.
  • And when reliable schemas have been demonstrated, support their rapid adoption as a replacement for MARC.
  • Don't be reticent; after all, you don't have to be a spring chicken to take a chance on XML.

[Slide 53 Stump]

It really can be invigorating. And, if you are more uncomfortable out on a limb than Shirley [MacLaine], [Next] you can always move the tree, so you're not so far out. (By the way, that's one of Jerry's stumps!) So, we of the Medlane Project ask--

[Slide 54 Got XML?]

Got XML?

[Slide 55 Exceptional]

I would like to acknowledge these folks who have been influential in my career.

[Slide 56 Extraordinary]

And, these too, without whom I would not have been here today.

[Slide 57 Strategic]

All you need is a little elbow grease to bring out the shine!

[Slide 58 Extensible]

If you can't find what you need on our website, try the XML4Lib discussion list, or email me. Thank you.