|
Thank you Kay [Deeney] … Good morning! I appreciate this opportunity to share my understanding of XML, the eXtensible Mark-up Language, and hope that some of my enthusiasm for its potential will rub off on you. |
|
|
I hail from Redwood City, California, where the "Climate is Best by Government Test"-- at least, according to the sign. However, as you can see from this recent photo, it can be a little drearier in winter than here in Scottsdale. Before I get started, there's someone I especially need to acknowledge-- a certain newly elected MLA board member. |
|
|
Ever since Jerry persuaded me to do this presentation, I've found ways to avoid working on it. These are just a few examples, from the past several months, of activities that I convinced myself were conveniently more urgent than working on this presentation. Somehow, Jerry neglected to mention all the fringe benefits of speaking here today. |
|
|
As if these weren't enough, not too long ago, I got "Writer's block", literally! I thought this cubical book entitled "Writer's block" was a nifty, physical example of how the lack of context or mark-up can lead to inaccurate communication and/or information retrieval. |
|
|
Today, I will explore four areas, hoping that you may:
|
|
|
Have you noticed that X is showing up everywhere? There is a lot of hype associated with the introduction of any new technology. It almost seems that X is today's mantra. I wonder if Roentgen's discovery of X-rays in 1895 was a seminal event in X's rise to its extreme prominence nowadays? |
|
|
There is the shimmering blue X of Mac OS X [ten]. | |
|
A video game system that "blurs the lines between fantasy and reality." | |
|
A veteran television series we would be advised to keep undercover. | |
|
And even an X-prize for potential space travel. | |
|
I thought Fractal Extreme was a beautiful example of the "extreme" genre. |
|
|
Then, there's Alt-X, "where the digerati meet the literati." |
|
|
Their ebook Press page would make a great segue to E, but let's not go there. |
|
|
I was just a bit surprised the other day when I walked out on the patio and looked up to see this X in the sky. |
|
|
Not immune to hype, XML is showing up nearly as profusely as X itself. However, I am convinced that when you strip the hype away, XML has a powerful, elegant simplicity that will be of long-term value to libraries (and to the Web in general). Those of you who know me realize that I am not prone to endorse fads. |
|
|
In 1999, I characterized my frustrations in coping with proliferating digital resources. These included:
I believe that "library information", in MARC and in proprietary ILS formats, has been segregated too long from mainstream Web resources. Having a Web catalog isn't good enough any more, as users are likely to search other, more comprehensive resources first, and reluctantly turn to separate catalogs. How does XML address such limitations? |
|
|
XML is essentially a standard for creating standards, and thus a universal format for data and document exchange. It has been called the lingua franca of the Information Age. It offers the power, precision, control, and flexibility that should appeal to librarians at the gut level. [Next]
|
|
|
NLM is the preeminent example of this approach, providing an XML server for the millions of PubMed records. [Next] Using this pull-down menu, you can select XML and display the record in XML, [Next] now NLM's principal format. NLM has also made MeSH available in XML, and the PubMed Central repository is using XML for markup of full-text articles. |
|
|
To underscore how XML's content markup has strategic potential, consider that even the best Web search engines are hampered by HTML's focus on appearance. A search in Google does retrieve Richard W. Price, but with a lot of false drops. If XML were more broadly deployed, it would be possible to distinguish between prices in dollars and people named Price. We've lowered a lot of standards for Web convenience, and libraries should strive toward more sophisticated retrieval, [Next] as the simple example of surnames first in library records illustrates. |
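The distinction between a price and a person named Price can be sketched in a few lines of Python. This is only an illustration: the element names (`Results`, `Person`, `Offer`, `Surname`, `Price`) and the sample values are hypothetical, not taken from any real search service or schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one person named Price and one price in dollars.
SAMPLE = """
<Results>
  <Person><Surname>Price</Surname><Forename>Richard W.</Forename></Person>
  <Offer><Product>Stethoscope</Product><Price currency="USD">49.95</Price></Offer>
</Results>
"""

root = ET.fromstring(SAMPLE)

# A search by surname looks only inside Person elements.
people = [p for p in root.iter("Person") if p.findtext("Surname") == "Price"]

# A search for prices looks only at Price elements and reads them as numbers.
prices = [float(e.text) for e in root.iter("Price")]

print([p.findtext("Forename") for p in people])
print(prices)
```

A plain keyword search for "price" would match both entries; with content markup, each query touches only the kind of element it is about.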
|
|
Another example is that of Seaboard Air Line. When Lindbergh flew the Atlantic in 1927, airline stocks rose dramatically in price, including Seaboard's-- [Next] despite the fact that Seaboard Air Line was a railroad company. |
|
|
Markup indicating the category an organization belongs to could help eliminate such confusion. Much later, in 1946, an airfreight company, Seaboard & Western Airlines, was founded. It changed names to Seaboard World Airlines in 1961. Organization of information to account for time factors and name changes would also help improve search precision. Is it really acceptable to not be able to limit a search by date? |
|
|
Names and information can be misleading in many ways, on the Web or not. Does this look too staged? Libraries, at least, have a reputation for trust. By combining selected new Web technologies, such as XML, with our good reputation, we may be able to contribute to solving such problems, and thus be thought of as even more reliable information resources. |
|
|
Let's review the key aspects of XML. (This provides inspiration, when I look up from my desk, when coping with flawed information storage and retrieval systems.) |
|
|
XML is simply a syntax for marking up documents or records to delineate different kinds of information, unlike HTML, which mostly marks up what a document should look like. [Next] Each XML document begins with a line, or declaration, that says which version of XML and which encoding standard have been used. Other than that, there are three main building blocks. [Next]
|
|
|
XML's power comes from the simplicity of its object-oriented, hierarchical structure. Each element in this hierarchy can function as an object. Thus, you can reference a Topic, without worry that it consists of two other objects. By applying basic concepts, you can build easily understood, yet complex structures. This simplified diagram shows how related elements are grouped in a hierarchy, or an inverted tree, with one root, branches, and leaves representing actual data values. XML requires a "root" element and that all subordinate elements nest properly under it. Now, let's look at an example to see how basic XML looks. |
|
|
Here I've chosen a root element called BibliographicRecord. Note that an element name cannot contain spaces; sometimes an underscore is used to represent a space. Pairs of start and stop tags delineate each element. The angle bracket introduces the element.
|
|
|
In the same manner, we have a topic, which is MeSH and primary.
That's basically it. It is clear enough that people can read it, and precise enough that computers can read it. Documents following this syntax are said to be well-formed. But what makes XML really powerful is the suite of tools that can manipulate the marked-up content. |
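As a rough illustration of the points above, here is a small Python sketch of a well-formedness check. The record is invented for the purpose: only the root name BibliographicRecord and the idea of a Topic marked as MeSH and primary come from the talk, while the child element names, the attribute names `scheme` and `rank`, and all values are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: declaration, one root element, nested child elements,
# and attributes on Topic. Names other than BibliographicRecord and Topic are invented.
RECORD = b"""<?xml version="1.0" encoding="UTF-8"?>
<BibliographicRecord>
  <Title>Hypothetical Title</Title>
  <Topic scheme="MeSH" rank="primary">
    <Heading>Bloodletting</Heading>
    <Subheading>history</Subheading>
  </Topic>
</BibliographicRecord>"""


def is_well_formed(data):
    """Return True if the parser accepts the data as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(RECORD)
topic = root.find("Topic")

print(root.tag)                                 # name of the root element
print(topic.get("scheme"), topic.get("rank"))   # attributes of Topic
print([child.tag for child in topic])           # the two objects inside Topic

# Tags that overlap instead of nesting are rejected by the parser.
print(is_well_formed(b"<Topic><Heading></Topic></Heading>"))
```

The same Topic can be handled as a single object or opened up to reach the elements nested inside it, which is the hierarchy the diagram describes.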
|
These tools complicate the picture, but illustrate how separately addressing the different features allows each to be optimized. You don't need to know all of these to take advantage of XML. |
|
|
One of the best features of XML is its flexibility. [Next] Unfortunately, this is a double-edged sword. [Next] It becomes increasingly difficult to share when each of us defines the same information differently. For XML to achieve its potential in any particular field, the stakeholders in that field need to agree upon standards. |
|
|
Such standards for XML are known as [Next] document models, DTDs (document type definitions), and schemas, which add data typing, indicating for example, that a number is a price rather than a quantity. Essentially, these define a hierarchy of elements, including allowable elements, attributes, and whether they are optional, repeatable, etc. They are used to validate documents that claim to adhere to the standard. Such compliant documents are said to be valid. Standards have been developed for biosequence data, mathematics, music, and there's even VoiceXML. This is easier said than done. XML itself is fairly simple, as we've seen. Analyzing data and reaching agreement are the difficult part, where knowing the content is as important as knowing XML. As part of the Medlane Project at Lane, we are developing a schema for bibliographic and authority data. We plan to offer this as a framework for discussion, enhancement, and refinement in hopes that libraries will recognize the value of a Web-oriented format, and adopt such a schema. We anticipate that there would be designated elements to accommodate both vendor-specific and library-specific extensions to an agreed core of standard elements. While many issues are yet to be explored, I hope that the following sneak preview will help illustrate the scope and complexity of such data-- sometimes underestimated by those outside the library community. In contrast, the current standard, MARC, has become excessively complex in its 40 years of evolution. Its shortcomings are increasingly evident as Web delivery of information burgeons; it is difficult to integrate MARC records with other Web-oriented documents. |
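The difference between well-formed and valid can be sketched with a toy validator in Python. This is not a DTD or XML Schema processor; the rule table, the element names, and the `validate` function are all hypothetical, and serve only to show the kind of questions a document model answers: which elements are allowed, which are required, and which may repeat.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: parent name -> {child name: (minimum, maximum)};
# a maximum of None means the child is repeatable without limit.
RULES = {
    "BibliographicRecord": {"Title": (1, 1), "Topic": (0, None)},
}


def validate(element, rules):
    """Return a list of problems; an empty list means the element conforms."""
    problems = []
    allowed = rules.get(element.tag, {})
    counts = {}
    for child in element:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        if child.tag not in allowed:
            problems.append(f"unexpected element {child.tag}")
    for name, (low, high) in allowed.items():
        found = counts.get(name, 0)
        if found < low:
            problems.append(f"missing required element {name}")
        if high is not None and found > high:
            problems.append(f"too many {name} elements")
    return problems


good = ET.fromstring(
    "<BibliographicRecord><Title>T</Title><Topic>A</Topic><Topic>B</Topic>"
    "</BibliographicRecord>"
)
bad = ET.fromstring(
    "<BibliographicRecord><Topic>A</Topic><Price>5</Price></BibliographicRecord>"
)

print(validate(good, RULES))
print(validate(bad, RULES))
```

Both sample documents are well-formed, but only the first follows the agreed rules, so only the first would count as valid in this toy model.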
|
|
I do not intend to diminish MARC's stature. Indeed, the years of debate and honing are crucial to the review that we're undertaking. In order to narrow our focus, let's ignore sets of records. For any given RECORD, we have identified 10 Principal Elements, which appear to accommodate all bibliographic and authority data. A significant problem is deciding how to handle each kind of data. For example, is the academic course, Biochemistry 101, a WORK or an EVENT? Basically, WORK represents bibliographic titles and the other nine represent types of authority data. Each RECORD would contain one of the Principal Elements. (This is the schema, or rules, for constructing a record that I'm referring to.) In addition, each RECORD would have a CONTROL element, and up to 10 RELATIONS elements, which parallel each of the Principal Elements. I will briefly review each of these 12 to illustrate their scope. |
|
|
The CONTROL element would contain all elements that relate to the RECORD itself, rather than to the content the RECORD represents. This would include IDs and which organization created it; the date a record is created, maintained, etc.; what kind of record it is, perhaps Original or Derived from a resource file; what language the record is in; etc. The date of publication, for example, unlike in MARC fixed fields, would not go here as it refers to the WORK, not the RECORD. |
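A rough Python sketch of this record shape follows. The names RECORD, CONTROL, and the ten Principal Elements come from the talk; the child elements inside CONTROL and WORK (`ID`, `Created`, `Kind`, `Title`, `Published`), the sample values, and the `describe_record` helper are assumptions, since the schema is still a work in progress.

```python
import xml.etree.ElementTree as ET

PRINCIPAL_ELEMENTS = ("WORK", "NOM", "ORG", "EVENT", "TOPIC",
                      "TYPE", "LANG", "CHRON", "LOC", "LEX")

# Hypothetical record: the creation date belongs to the RECORD, so it sits in
# CONTROL; the publication date belongs to the WORK, so it sits in WORK.
SAMPLE = """
<RECORD>
  <CONTROL>
    <ID>example-0001</ID>
    <Created>2001-02-03</Created>
    <Kind>Original</Kind>
  </CONTROL>
  <WORK>
    <Title>Hypothetical Title</Title>
    <Published>1999</Published>
  </WORK>
</RECORD>
"""


def describe_record(xml_text):
    """Check the basic shape of a RECORD; return (principal element, creation date)."""
    record = ET.fromstring(xml_text)
    if record.tag != "RECORD":
        raise ValueError("root element must be RECORD")
    controls = record.findall("CONTROL")
    principals = [child for child in record if child.tag in PRINCIPAL_ELEMENTS]
    if len(controls) != 1 or len(principals) != 1:
        raise ValueError("expected exactly one CONTROL and one Principal Element")
    return principals[0].tag, controls[0].findtext("Created")


print(describe_record(SAMPLE))
```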
|
|
The WORK element is likely what you think of as the result of cataloging. It is more narrowly defined here. It encompasses books, serials, and collections. It includes text, audio, video, software, maps, etc.-- approximately the MARC formats. The emphasis, however, is on the title of the WORK. A formal, or umbrella title, including edition and date serves as an anchor for the WORK. Other variant titles would be represented separately. We are also considering a Versions element, so that one WORK element could clearly delineate more than one version. This would address the problem of print and digital serials, which are identical in many regards and don't seem to merit separate records. Holdings, and usage or licensing restrictions would be associated with particular versions. These, and physical items, could be defined separately via XML entities, which I mentioned earlier. |
|
|
The other nine Principal Elements are roughly equivalent to Authority records. Each supports synonyms or variants to handle equivalency. NOM, as in nom de plume, is broadly interpreted to include names of real and fictional people, and even named animals. It excludes scientific names. The emphasis is on the name itself, referring to a specific being. Mickey, for example, could refer to a well-known mouse, a famous baseball player, or to our Pomeranian-- who thinks he's a person. Similar to Versions for WORK, pseudonyms may be designated under a single NOM, yet referenced individually to identify which works are associated with each pseudonym. Unlike MARC, forename and surname would be separate elements; I was very pleased that Medline has just introduced this distinction. |
|
|
ORG refers to named organizations, governmental jurisdictions, etc. It excludes EVENTs, which are treated separately. The distinction between a government and the geographic area that it governs is problematic in that ORG encompasses the government, but another of our Principal Elements, LOC, covers geography. If they are defined separately, is the resulting partial redundancy feasible? |
|
|
EVENT expands upon MARC's meeting name to also include all sorts of named happenings, which are usually treated as subjects. Each of the Principal Elements has its own set of issues. The most obvious one with EVENT relates to meetings of organizations. We are considering proposing a qualifier element, which would permit the 'value' of one of the Principal Elements to modify another. For example, a Symposium of the Medical Library Association could be marked up as an EVENT, with the EVENT name 'Symposium' qualified by the ORG name, 'Medical Library Association'. A separate ORG for the Medical Library Association alone, as sponsor, would be in order. The intent is to define a crisp listing of EVENTs, which often have varying and thus confusing names. |
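One possible shape for such a qualifier, sketched in Python. The element names `Name` and `Qualifier` and the attribute `element` are hypothetical; the talk proposes the idea of a qualifier but does not fix its syntax.

```python
import xml.etree.ElementTree as ET

# Build an EVENT whose name is qualified by the value of an ORG.
event = ET.Element("EVENT")
ET.SubElement(event, "Name").text = "Symposium"
qualifier = ET.SubElement(event, "Qualifier", {"element": "ORG"})
qualifier.text = "Medical Library Association"

# Serialize the element tree to a string of markup.
markup = ET.tostring(event, encoding="unicode")
print(markup)
```

Keeping the ORG name inside a qualifier, rather than folding it into the event name, leaves 'Symposium' and 'Medical Library Association' separately searchable.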
|
|
TOPIC includes the usual topical or conceptual headings and subheadings. There are many issues here as well. To keep things as simple as possible without losing information, a person as subject would not go here. An XML attribute of 'subject' could be defined for NOM, ORG, EVENT, and potentially other Principal Elements, to indicate that they have precedence over TOPIC. Some medical libraries, including NLM and Lane, have eliminated use of form, language, and geographic subheadings internally, preferring to rely on coordinate indexing for retrieval. In looking at LC subjects, the qualifier element I mentioned before may help resolve such issues. For example, History is clearly a topic, but one of limited value in a library with large historical collections. |
|
|
History implies time and place. It does not imply form and genre, or TYPE. Separating TYPE for the name of categories is extremely useful in cataloging. In retrieval, it is important to distinguish whether something is an X-ray or is about X-rays. Lane is using this concept extensively to indicate, for example in an authority record, that Stanford School of Medicine is a Medical School, and in a bibliographic record, that Dorland's is a Dictionary. The distinction between TYPE and TOPIC is simple, yet powerful, and deserves wider application. |
|
|
LANG covers languages, including artificial ones, but excludes computer languages, which are considered WORKs. Incidentally, access to languages should not be obscured by cryptic codes. |
|
|
CHRON governs dates and time. MARC has multiple formats for dates in various places. Lane's approach has been to define the same element only once, and then reference this in other parts of the schema. Thus, the date of publication, dates associated with a personal name, and date a record was created, would all share an identical structure. Paired CHRON values would indicate ranges. Attributes would indicate which kind of date (copyright, publication, birth, etc.) is involved. The resultant hierarchy of chronological access points would provide improved retrieval in this neglected area. Our definition of CHRON follows the ISO standard, so that an XML Style Sheet can display dates and times consistently. |
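A minimal Python sketch of why a single ISO-style date structure helps retrieval. It assumes full calendar dates in the ISO 8601 form YYYY-MM-DD; the function names and the sample dates are hypothetical, and partial dates (a year alone, for instance) are not handled here.

```python
from datetime import date


def chron_range(start, end):
    """Turn a pair of ISO 8601 date strings into a (first, last) range."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    if last < first:
        raise ValueError("range ends before it starts")
    return first, last


def in_range(value, chron):
    """Return True if an ISO 8601 date string falls within the range."""
    first, last = chron
    return first <= date.fromisoformat(value) <= last


# Hypothetical paired CHRON values delimiting a range.
span = chron_range("1995-01-01", "1999-12-31")

print(in_range("1997-06-15", span))
print(in_range("2001-03-01", span))
```

Because every date shares one structure, the same two functions serve publication dates, dates attached to a name, and record creation dates alike; an attribute would say which kind of date is meant.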
|
|
Only three more to go! LOC encompasses geographic locations and topographical features. It may also be the appropriate place for buildings or structures as they are in fixed positions. We still have the issue of government versus geographic location to resolve! Addresses fit here, but merit their own schema. Others may have already produced an adequate one, which could be incorporated by entity reference. |
|
|
LEX is a concept that I've been exploring for some time. Its premise is that keyword searching should be supported more effectively, and encompasses both words and phrases. Linked clusters of variants could provide a significant improvement in this popular type of search. Lane selectively records variants, such as British spellings and slang, and cross-references between words that may be difficult to recall at a busy reference desk. |
|
|
In this example, bloodletting, with and without the hyphen, would be a useful automatic retrieval, while a searcher may appreciate being apprised of the other related words. Bleeding could reference hemorrhage and so forth, Roget-style.
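The behavior described here can be sketched in Python as a small lookup table. The structure of the table, the `expand` function, and the choice of "bleeding" as a related word for bloodletting are assumptions made for illustration; they are not Lane's actual data.

```python
# Hypothetical LEX clusters: variants are searched automatically,
# related words are only suggested to the searcher.
LEX = {
    "bloodletting": {"variants": ["blood-letting"], "related": ["bleeding"]},
    "bleeding": {"variants": [], "related": ["hemorrhage"]},
}


def expand(term):
    """Return the forms to search automatically and the words to suggest."""
    key = term.lower()
    for head, entry in LEX.items():
        forms = [head] + entry["variants"]
        if key in forms:
            return {"search": forms, "suggest": entry["related"]}
    # Unknown terms are searched as typed, with nothing to suggest.
    return {"search": [key], "suggest": []}


print(expand("blood-letting"))
print(expand("Bleeding"))
```

Either spelling of bloodletting retrieves the same cluster, while the related words are offered rather than silently added to the search.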
|
|
|
Lastly, the RELATIONS element. Our current thinking is that any Principal Element may be related to any other Principal Element and to other instances of itself as well. A WORK to WORK relation would be a bibliographic one, such as one serial title continuing another; a good example of a TOPIC to TOPIC one, or topical relation, is one MeSH term being broader than another. Then, there are the RELATIONS between the different Principal Elements. The most familiar of these are traditional access points, for example, a particular MeSH heading being a Topic of a particular WORK. |
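These three kinds of relation can be sketched in Python as typed links between Principal Elements. The relation names, the sample titles and headings, and the `relate` helper are all hypothetical; restricting the relation type to a fixed list is one assumed way of keeping the relations orderly.

```python
from collections import namedtuple

# Each end of a relation is (Principal Element, value).
Relation = namedtuple("Relation", "source kind target")

# Hypothetical controlled list of relation types.
RELATION_TYPES = {"continues", "broader_than", "topic_of"}


def relate(source, kind, target):
    """Create a relation, rejecting types outside the controlled list."""
    if kind not in RELATION_TYPES:
        raise ValueError(f"unknown relation type: {kind}")
    return Relation(source, kind, target)


relations = [
    # WORK to WORK: one serial title continuing another.
    relate(("WORK", "Journal B"), "continues", ("WORK", "Journal A")),
    # TOPIC to TOPIC: one heading broader than another.
    relate(("TOPIC", "Hemorrhage"), "broader_than", ("TOPIC", "Epistaxis")),
    # TOPIC to WORK: a traditional subject access point.
    relate(("TOPIC", "Hemorrhage"), "topic_of", ("WORK", "Journal B")),
]

# Find every relation that points at a given WORK.
about_journal_b = [r for r in relations if r.target == ("WORK", "Journal B")]
print(about_journal_b)
```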
|
|
Here are our 10 Principal Elements. [Next] Each would have a parallel RELATIONS element. Details are subject to change as this is very much a work in progress. |
|
|
This slide represents all the potential relations that could occur in this schema. Some of these may seem unnecessary; however, more are useful than we may immediately recognize. [more]
Marking up records to emphasize relations consistently could enhance future information retrieval in unforeseen ways. Lane has begun recording such relationships. To keep them from becoming chaotic, we plan to extend authority control to a 'type of relation' element. Our nascent schema appears to form a neat package, with an orderly web of the various fundamental kinds of information. It has the potential to open the bibliographic apparatus to all concerned, rather than just catalogers, but could also reinvigorate cataloging. It is incomplete, and needs further testing and scrutiny in the context of related library information. |
|
|
This slide depicts our schema in the center and sketches a few relationships between potential schemas in other areas. Addresses is a good example of an element that ideally would only be defined once and shared by other schemas via entity references. A vendor schema might incorporate our bibliographic ORG element, adding elements needed for Acquisitions, such as discount, shipping options, etc. [Incidentally, vendors have been experimenting with XML for electronic communications or EDI.] Likewise, library users' names could follow the same syntax as our bibliographic NOM. For ILL, why don't we add those difficult-to-verify transactions not in Medline to our collective bibliographic trove? All library information is connected, and should be integrated. The beauty and efficiency of XML is that we only need learn a single syntax, and define a particular element only once. |
|
|
Lane is planning to develop a more user-friendly interface for our XMLMARC software and will include our new schema as the default for converting MARC to XML to encourage experimentation. We believe the work up front in carefully defining the schema will pay off repeatedly in the future. |
|
|
This screenshot shows MARC on the left and the resultant, automatically created XML on the right. Our software is only one of many tools available for working with XML. Many are free, and they are getting easier to use all the time. |
|
|
[Slide 50 Libraries' Advantage] Libraries are in a unique position to take advantage of XML. This stems from a tradition of purveying information selected with quality in mind, but not limited by source or producer, and dovetails with our users’ known reluctance to search multiple interfaces. Today’s digital fragmentation, especially in a domain such as medicine, can best be addressed at the library level, especially as libraries generally lack a profit motive. While much work remains, libraries are inherently well-positioned to integrate disparate resources. XML and open source software can serve as enabling technologies that allow us to participate more effectively in defining our future. The recent dot.com excesses provide us a unique opportunity to catch up technologically in a more reasoned fashion. |
|
|
Libraries can adopt the unifying technical infrastructure of XML far more easily than our users can be convinced by would-be competitors that they too share our high standards, and values of impartiality, trust, confidentiality, thoroughness, and a service orientation. Sharing an infrastructure is the first step toward a distributed, integrated international resource, the sum of which would be far more valuable than its parts. |
|
|
Non-librarians are using XML and the Web to develop resources that would serve our users better if they were integrated with other library resources rather than being standalone resources. Thus, our prospects are excellent in these exciting times.
|
|
|
It really can be invigorating. And, if you are more uncomfortable out on a limb than Shirley [MacLaine], [Next] you can always move the tree, so you're not so far out. (By the way, that's one of Jerry's stumps!) So, we of the Medlane Project ask-- |
|
|
Got XML? |
|
|
I would like to acknowledge these folks who have been influential in my career. |
|
|
And, these too, without whom I would not have been here today. |
|
|
All you need is a little elbow grease to bring out the shine! |
|
|
If you can't find what you need on our website, try the XML4Lib discussion list, or email me. Thank you.
|