An Introduction to Using XML for the Management of Laboratory Data

Abstract

Extensible Markup Language (XML) and the XML Path Language (XPath) are introduced with software examples demonstrating how one can use them to write laboratory data management programs. Topics explored include XML document creation, manipulation, and searching. Programming examples make use of the Microsoft® XML Parser library and the Visual Basic programming language. The problem of managing microplate screening data is used as an illustration. Source code for all examples can be downloaded from http://www.labprogrammer.net.

Keywords

Extensible Markup Language, XML, XML Path Language, XPath, Visual Basic, Data Management

INTRODUCTION

One of the main functions of many automated laboratory systems is the management and exchange of data. It is common for automation-related data to be exchanged with corporate data systems, with archival storage devices, and among automated instruments. A certain amount of local data management is almost always necessary. For example, it is well known that measured data destined for a database server should be cached locally before being uploaded. This allows an automated system to continue to run if the network or database management system stops operating.

How should these data be formatted? Many ad hoc formats have been created, and this continues to occur. The lack of standardization has made the exchange of automation data difficult and error prone. With the introduction of the Extensible Markup Language (XML)¹ open standard, ad hoc data exchange formats are no longer necessary. In fact, in many cases XML can be used not only for data exchange, but for data storage and manipulation as well. Furthermore, the openness of the XML standard translates into many benefits, including the ability to leverage an ever-growing list of third-party tools.

XML formats allow a much richer representation of data than is possible with a simple comma-separated tabular file. An XML document organizes data as a hierarchy - an inverted tree structure. It is no longer necessary to repeat long columns of plate barcodes or compound identifiers in large tabular files. A plate barcode or compound identifier can be listed once with any amount of structured data existing below it in the hierarchy. Moreover, XML documents are represented in plain text. An XML document stored in a file can be created or modified using a simple text editor.

By itself, XML is not enough to initiate the storage and exchange of data. XML comprises only the rules to be used to generate a representation. A structure must be built upon the XML standard foundation.

Many efforts are underway to build standard scientific XML-based data formats; for several examples, see references 2, 3, and 4. For automation systems of limited scope, a suitable custom XML-based format can often be devised.

Once an XML-based format has been adopted, there is no need to write a custom laboratory data parser. One or more freely available XML processing libraries exist for every popular programming language. These libraries can parse XML documents into a structure that can be manipulated programmatically, and they can search and transform XML data. The XML processor used for the examples in this paper is the Microsoft® XML Parser (MSXML) version 3.0.⁵ If Internet Explorer 5.x or later is installed on a computer, chances are good that a version of the Microsoft® XML Parser is also present.

Beyond parsing and searching, the XML standard also provides a way to validate the structure of an XML document against what is called a Document Type Definition (DTD).¹ A DTD expresses a grammar that defines the structure allowed in an XML-formatted document. Corrupted or structurally invalid data documents can be detected automatically, potentially saving many hours of manual database repair. More recently, the XML community has been moving toward the use of XML Schemas as the standard for defining XML document classes. The XML Schema specification is now approved as a World Wide Web Consortium recommendation.⁶ Among the advantages of XML Schemas over DTDs is that they are themselves written in XML and, thus, can be parsed and manipulated using the same tools as any other XML document. The current version of MSXML (version 4.0) supports the World Wide Web Consortium's final recommendation for XML Schemas.

A BRIEF ANATOMY OF AN XML DOCUMENT

An XML document is a hierarchical collection of data units called elements. An example of a complete element is as follows.

This format is reminiscent of Hypertext Markup Language (HTML).⁷ Each element has a start tag (in this case <measurement sequence="25">) and an end tag (</measurement>), with data or other markup between.

Elements can have attributes, which are name-value pairs stored within the start tag that delineates the element. For example, our “measurement” element has a “sequence” attribute, a number indicating where the measurement belongs in a sequence of data points.
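As a rough illustration of the start tag, end tag, and attribute described above, the following sketch parses a single measurement element and reads its parts back. It uses Python's standard xml.dom.minidom module as a stand-in for the Visual Basic and MSXML combination used in this paper, and the value 0.482 between the tags is invented for illustration.

```python
from xml.dom import minidom

# Hypothetical element: only the tag name and the sequence="25" attribute
# come from the text; the value 0.482 is an assumption.
doc = minidom.parseString('<measurement sequence="25">0.482</measurement>')
element = doc.documentElement

tag_name = element.tagName                  # name shared by the start and end tags
sequence = element.getAttribute("sequence") # attribute values are always strings
value = element.firstChild.data             # character data between the tags

print(tag_name, sequence, value)
```

Note that the attribute value comes back as text; converting it to a number is left to the application.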

XML documents should begin with an XML declaration, which specifies the version of XML being used. At the time of writing, only the first version of XML has been released. Therefore, a proper declaration should read as follows.

<?xml version="1.0"?>

Comments can be inserted anywhere in an XML document except within attribute values or tags. Text contained in a comment is ignored by an XML processor (i.e., not parsed). In XML, comments are formatted using the same syntax as comments in Hypertext Markup Language: they are placed between "<!--" and "-->".

By definition, every XML document must have a single document element at its root. This is the element that contains all other elements in the document. The example XML document given in Listing 1 has at its root the “experiment” element. For a complete definition of XML documents, see reference 1.
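The rules above (a leading declaration, comments outside of tags, and exactly one document element) can be checked with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree module rather than MSXML; the helper name is_well_formed and both sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# One declaration, one comment, and a single root element: well formed.
good = """<?xml version="1.0"?>
<!-- a comment placed before the document element -->
<experiment>
  <plate barcode="10001"/>
</experiment>"""

# Two top-level elements: violates the single-document-element rule.
bad = """<?xml version="1.0"?>
<experiment/>
<experiment/>"""

def is_well_formed(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(good), is_well_formed(bad))
```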

THE DOCUMENT OBJECT MODEL

As mentioned, an XML document can be parsed into a set of objects that are used to manipulate the document. These objects belong to something called a Document Object Model (DOM).⁸ The DOM defines an application programmer's interface (API) for well-formed XML documents. This API is also an open standard that can be implemented in an XML parser library. Once you are familiar with the API, you can move easily from library to library and language to language and still have a common method for programmatically manipulating XML documents.

Several of the more popular DOM objects are listed in Table 1. Included in this table along with each object type is the associated variable type used by the MSXML library and a brief description.

Table 1.

Commonly used DOM objects.

Table 2 and Table 3 describe the properties and methods of the Document, Node, and Element DOM objects that are used in the examples included in this paper. For a complete list with all details, refer to reference 8.

Table 2.

Members of the Document object.

A SIMPLE LABORATORY DATA MANAGEMENT EXAMPLE

Consider the problem of storing information about compounds tested in one or more 96-well plates. We want to be able to store data identifying the compounds that are tested in each well, as well as measurement values. We would like this representation to be sufficiently flexible so as to allow zero or more compounds in a single well (mixtures) and zero or more single point test results for a well. Also, we would like to label each well positively (as opposed to indicating a well by its position in a document) and allow empty wells to be excluded from the document. Listing 1 is an example of a small XML document that satisfies these requirements.

Listing 1.

A sample XML file.
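A document that meets the requirements above, and that is consistent with the program output and XPath patterns shown later in this paper, might look like the following sketch. The element names, the row, column, barcode, and sequence attributes, and the measurement values are taken from the text; the compound identifiers and the sequence numbers are invented. The document is embedded in a short Python script (a stand-in for the paper's Visual Basic) so that its structure can be checked.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: compound ids and sequence numbers are assumptions.
EXPERIMENT_XML = """<?xml version="1.0"?>
<experiment>
  <plate barcode="10001">
    <well row="1" column="1">
      <compound id="C-0001"/>
      <compound id="C-0002"/>
      <measurement sequence="1">45.3</measurement>
    </well>
    <well row="2" column="1">
      <compound id="C-0003"/>
      <measurement sequence="1">45.9</measurement>
      <measurement sequence="2">61.4</measurement>
    </well>
  </plate>
</experiment>"""

root = ET.fromstring(EXPERIMENT_XML)
wells = root.findall("plate/well")

# Each well is labeled by its own attributes, and may hold any number
# of compounds (a mixture) and any number of measurements.
compounds_per_well = [len(w.findall("compound")) for w in wells]
measurements_per_well = [len(w.findall("measurement")) for w in wells]

print(root.tag, compounds_per_well, measurements_per_well)
```

Empty wells simply do not appear, so a 96-well plate with two occupied wells needs only two well elements.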

Since XML documents are represented using plain text, they can be created and modified easily with a simple text editor. There are several special-purpose editors created specifically for XML that offer many additional features. Online resources such as xml.org⁹ provide lists of tools currently available. A nice tool that one can use to get started is XML Notepad, which is available free of charge from the Microsoft® web site.¹⁰

LOADING AND DISPLAYING A DOM

In this paper all of our examples will make use of version 3.0 of the Microsoft® XML parser. To access that parser from the Visual Basic Integrated Development Environment (IDE), or from an application that includes Visual Basic as its scripting engine, one must first check off the “Microsoft XML, v3.0” item in the References dialog. In the Visual Basic IDE, the References dialog can be found under the Project menu. In Excel, look under the Tools menu in the Visual Basic Editor. Earlier versions of the XML Parser will also run the examples as is, or with minor changes.

An XML document can be parsed from a file into a DOM using the load method of the Document object. After creating a Document object, called DOMDocument in the Microsoft® XML Parser, call the Document object's load method with an XML file name as the argument (see Listing 2). The load method will return False if an error occurs while parsing the XML file. In the event that loading a document fails, the parseError property of the DOMDocument contains comprehensive information about the origin of the failure. This feature is indispensable for pinpointing problems when debugging or troubleshooting XML applications. In Listing 2, the parseError property is used to build and display a detailed error string on the occurrence of a parse failure.

Listing 2.

Loading an XML document from a file called “experiment.xml.”
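The load-then-check-for-errors pattern described above can be sketched in Python as well. This is not the MSXML API: the standard xml.dom.minidom module raises an ExpatError instead of returning False, and the exception's lineno, offset, and code attributes play a role comparable to the parseError property. The function name load_document is an assumption, and the text is parsed from a string here so the example is self-contained (minidom.parse accepts a file name instead).

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError, ErrorString

def load_document(text):
    """Return (document, None) on success or (None, message) on a parse failure."""
    try:
        return minidom.parseString(text), None
    except ExpatError as err:
        # Build a detailed error string from the location and error code.
        message = "line %d, column %d: %s" % (
            err.lineno, err.offset, ErrorString(err.code))
        return None, message

# A well-formed document loads; a document with a missing end tag does not.
good_doc, good_error = load_document("<experiment><plate/></experiment>")
bad_doc, bad_error = load_document("<experiment><plate></experiment>")

print(good_doc.documentElement.tagName, good_error)
print(bad_doc, bad_error)
```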

Since a DOM's structure is hierarchical, a convenient way to write a routine that displays all the nodes is to use a recursive procedure. The DisplayElement subroutine in Listing 3 will print all element tag names (the nodeName property value) indented to reflect the level of the element in the DOM. Refer back to Tables 1, 2, and 3 for a further explanation of each object's properties and methods.

Listing 3.

A procedure that displays all elements in a DOM hierarchy.
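The recursive traversal can be sketched with the same DOM properties (nodeName, childNodes, nodeType) in Python's xml.dom.minidom, used here in place of Visual Basic and MSXML. The function name element_lines, the two-space indent, and the sample document (including its compound identifiers) are assumptions made for illustration.

```python
from xml.dom import minidom, Node

# Hypothetical sample document; compound ids are invented.
SAMPLE = """<experiment>
  <plate barcode="10001">
    <well row="1" column="1">
      <compound id="C-0001"/>
      <compound id="C-0002"/>
      <measurement>45.3</measurement>
    </well>
    <well row="2" column="1">
      <compound id="C-0003"/>
      <measurement>45.9</measurement>
      <measurement>61.4</measurement>
    </well>
  </plate>
</experiment>"""

def element_lines(node, depth=0):
    """Return the element names below node, indented by nesting level."""
    lines = ["  " * depth + node.nodeName]
    for child in node.childNodes:
        # Skip text nodes (including whitespace); recurse into elements only.
        if child.nodeType == Node.ELEMENT_NODE:
            lines.extend(element_lines(child, depth + 1))
    return lines

doc = minidom.parseString(SAMPLE)
lines = element_lines(doc.documentElement)
print("\n".join(lines))
```

Passing the document element as the starting node visits every element in the document exactly once.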

Listing 4 gives a complete example illustrating how to load an XML document and display it using the DisplayElement procedure. Since the root element of the DOM (the documentElement property value of the DOMDocument object) is passed to the display procedure, all elements in the document are displayed in the Immediate window of the Visual Basic Integrated Development Environment.

Listing 4.

A complete example showing how to load and display an XML document.

The following output is generated by running the program in Listing 4 on the XML document in Listing 1. The node names of all element objects are printed with an indentation that reflects the level of nesting in the document.

experiment
  plate
    well
      compound
      compound
      measurement
    well
      compound
      measurement
      measurement

MODIFYING THE STRUCTURE OF A DOM

The DOM API includes methods to create new objects and append them to a document, as well as to remove existing objects from the document hierarchy. Once an XML document structure is devised, it is convenient to create functions that encapsulate the details of creating new elements. Listing 5 gives two such functions. The first is used to create a new plate element and the second to create a new well element. The Document object's createElement method is used to create the new element and the setAttribute method to add attributes to the element.

Creating a new element does not add it to the document hierarchy. Use the appendChild method to add one element as a child of another. In Listing 6 the new well element is added to the new plate element, and then the plate element, which now contains the new well element, is added to the documentElement of the DOM.

Listing 6.

Create new plate and well elements, add them to the document element, and display.
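The create-then-append sequence can be sketched with the same DOM method names (createElement, setAttribute, appendChild) in Python's xml.dom.minidom, which stands in here for Visual Basic and MSXML. The helper names new_plate and new_well, the barcode 10002, and the starting document are assumptions for illustration.

```python
from xml.dom import minidom

def new_plate(doc, barcode):
    """Create a plate element; it is not yet part of the document tree."""
    plate = doc.createElement("plate")
    plate.setAttribute("barcode", barcode)
    return plate

def new_well(doc, row, column):
    """Create a well element labeled by its row and column."""
    well = doc.createElement("well")
    well.setAttribute("row", str(row))
    well.setAttribute("column", str(column))
    return well

doc = minidom.parseString('<experiment><plate barcode="10001"/></experiment>')

plate = new_plate(doc, "10002")         # hypothetical barcode
well = new_well(doc, 1, 1)
plate.appendChild(well)                 # the well becomes a child of the new plate
doc.documentElement.appendChild(plate)  # the plate joins the document hierarchy

plate_count = len(doc.getElementsByTagName("plate"))
print(doc.documentElement.toxml())
```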

Following is the output resulting from running the Example2 subroutine given in Listing 6. Note the addition of the new plate and well elements.

experiment
  plate
    well
      compound
      compound
      measurement
    well
      compound
      measurement
      measurement
  plate
    well

USING XPATH TO SEARCH AND SELECT FROM A DOM

Beyond storage and manipulation, XML documents can be searched using another standard called the XML Path Language, or XPath.¹¹ Like Structured Query Language (SQL), a popular language used to select from and manipulate relational databases, XPath defines a non-procedural query language in which queries can be formulated and executed on a DOM to retrieve nodes.

When using XPath, it is helpful to think of the DOM as a Unix-style directory/file hierarchy. In this analogy, elements are the files or directories that contain files or other directories. For example, the pattern

well/measurement

will match all measurement elements that are children of well elements, starting from the current context.

Nodes selected at each level of the hierarchy can be limited further using filters. Filters are placed between square brackets right after the element name at any level of the hierarchy. For example, the pattern

well[measurement]

will find all well elements that have a child measurement element. The pattern

well[measurement>100]

will match only well elements that have a child measurement element with a value greater than 100.
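The path and filter patterns above can be tried out in Python's standard xml.etree.ElementTree module, used here instead of MSXML. ElementTree implements only a subset of XPath: the path well/measurement and the child filter well[measurement] work as written, but numeric comparisons such as measurement>100 are not supported, so that filter is approximated in ordinary Python. The sample plate, its third (measurement-free) well, and the helper name wells_above are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical plate; the third well has a compound but no measurement.
PLATE_XML = """<plate barcode="10001">
  <well row="1" column="1"><measurement>45.3</measurement></well>
  <well row="2" column="1"><measurement>45.9</measurement><measurement>61.4</measurement></well>
  <well row="3" column="1"><compound id="C-0004"/></well>
</plate>"""

plate = ET.fromstring(PLATE_XML)  # the plate element is the current context

all_measurements = plate.findall("well/measurement")  # path pattern
wells_with_data = plate.findall("well[measurement]")  # child-existence filter

def wells_above(context, threshold):
    """Approximate well[measurement>threshold]: any child measurement exceeds it."""
    return [well for well in context.findall("well")
            if any(float(m.text) > threshold for m in well.findall("measurement"))]

rows_with_data = [w.get("row") for w in wells_with_data]
rows_above_60 = [w.get("row") for w in wells_above(plate, 60)]
rows_above_100 = [w.get("row") for w in wells_above(plate, 100)]

print(len(all_measurements), rows_with_data, rows_above_60, rows_above_100)
```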

Attributes can be named in a filter by preceding the attribute name with an “@” symbol. For example, the pattern

/experiment/plate[@barcode='10001']/well[@row='1']/measurement

will match any measurement element in any well with a row attribute value of ‘1’, in any plate with a barcode attribute value of ‘10001’, in the top-level experiment element.
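A comparable attribute-filtered query can be run with Python's xml.etree.ElementTree, again as a stand-in for MSXML. ElementTree evaluates paths relative to the element they are called on, so the leading /experiment step is dropped when the query starts from the experiment element itself. The second plate (barcode 10002) and its value are invented to show that the barcode filter excludes it.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two plates.
XML = """<experiment>
  <plate barcode="10001">
    <well row="1" column="1"><measurement>45.3</measurement></well>
    <well row="2" column="1"><measurement>45.9</measurement></well>
  </plate>
  <plate barcode="10002">
    <well row="1" column="1"><measurement>12.0</measurement></well>
  </plate>
</experiment>"""

root = ET.fromstring(XML)  # root is the experiment element

# Attribute names are preceded by "@" inside the square-bracket filters.
query = "plate[@barcode='10001']/well[@row='1']/measurement"
values = [m.text for m in root.findall(query)]

print(values)
```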

To run an XPath query, the Microsoft® XML library has extended the World Wide Web Consortium's (W3C) DOM API to include a function called selectNodes. This function takes an XPath query string as an argument and returns a NodeList object. The MSXML parser also provides a function called selectSingleNode, which returns a Node object containing the first node matched by the query. Refer to reference 5 for other useful extensions of the DOM API.

The subroutine in Listing 7 loads the XML document stored in the file experiment.xml, selects all measurements from any well in column 1, and displays each one along with the row and column of its well. The parentNode property is used to get the parent well node of each measurement so that the row and column attributes can be displayed.

Listing 7.

Select all measurements from column 1 wells and display.
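The same selection can be sketched in Python with xml.etree.ElementTree in place of MSXML's selectNodes. ElementTree elements do not carry a parentNode reference, so this sketch selects the column 1 wells first and then walks down to their measurements, which yields the same row, column, and value triples. The sample document is a hypothetical one built from the values quoted in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document using the values quoted in the text.
EXPERIMENT_XML = """<experiment>
  <plate barcode="10001">
    <well row="1" column="1">
      <measurement>45.3</measurement>
    </well>
    <well row="2" column="1">
      <measurement>45.9</measurement>
      <measurement>61.4</measurement>
    </well>
  </plate>
</experiment>"""

root = ET.fromstring(EXPERIMENT_XML)

lines = []
for well in root.findall("plate/well[@column='1']"):
    for measurement in well.findall("measurement"):
        # The enclosing well supplies the row and column labels.
        lines.append("row=%s col=%s measurement=%s" % (
            well.get("row"), well.get("column"), measurement.text))

print("\n".join(lines))
```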

Running the Example3 subroutine in Listing 7 produces the following output.

row=1 col=1 measurement=45.3
row=2 col=1 measurement=45.9
row=2 col=1 measurement=61.4

The XPath query in the previous example can be extended further to match only measurements in the first column that have a value greater than 60.

/experiment/plate/well[@column='1']/measurement[.>60]

The dot (.) in this query represents the current node in the hierarchy.
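For readers experimenting outside MSXML, note that Python's xml.etree.ElementTree does not evaluate numeric filters such as [.>60]. A rough equivalent, sketched below, runs the structural part of the query with findall and applies the value test in Python. The sample document is hypothetical and reuses the values quoted in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document using the values quoted in the text.
EXPERIMENT_XML = """<experiment>
  <plate barcode="10001">
    <well row="1" column="1"><measurement>45.3</measurement></well>
    <well row="2" column="1">
      <measurement>45.9</measurement>
      <measurement>61.4</measurement>
    </well>
  </plate>
</experiment>"""

root = ET.fromstring(EXPERIMENT_XML)

# Structural part of the query: measurements in column 1 wells.
candidates = root.findall("plate/well[@column='1']/measurement")

# Value part of the query: the text of the current node, read as a number.
values = [m.text for m in candidates if float(m.text) > 60]

print(values)
```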

CONCLUSION

Because XML is an open standard, there are many benefits to be gained by adopting it for the storage and exchange of laboratory data. To begin, it is not necessary to write a laboratory data parser, since XML parsing libraries are available for all major programming languages and operating systems. Once the Document Object Model is understood, XML documents can be manipulated programmatically using a standard application programmer's interface.

Furthermore, the ever-growing body of related software tools can be leveraged automatically, and data can be exchanged readily with others who have adopted the standard. The plain text nature of XML makes it possible to create and modify XML documents using a simple text editor. It is no longer necessary to have special database drivers or other proprietary software to access data.

Beyond storing and exchanging laboratory data, the XPath standard allows those data to be searched and selected. While XML will never replace relational database systems, XPath provides a simple but powerful mechanism for processing data on a local computer. This capability is not usually available to automated laboratory systems, and it offers a way to enhance the interfaces of such systems. The many benefits of adopting an open and widely accepted standard are available with the use of XML to manage laboratory data.

INSTALLING THE MSXML PARSER LIBRARY

If you have installed Internet Explorer 5.x or later, it is likely that you have a version of the MSXML parser installed (MSXML 3.0 SP2 ships with Windows XP and Internet Explorer 6.0). The parser is constantly being updated by Microsoft, however, and new versions are released regularly. The MSXML parser is available in two forms: technology previews and release versions. Technology previews usually implement the latest recommendations for the XML standard put forth by the W3C and are made public to solicit feedback from the developer community. The release version of the parser is considered stable and ready for use in distributable software. Unlike technology previews, release versions are supported by the Microsoft development team.

The examples in this paper were prepared using MSXML version 3.0, which is currently available as service pack 2. To install it, direct your browser to http://msdn.microsoft.com/downloads/default.asp and expand the document map tree on the left to expose the Web Development/XML node's children. Select the MSXML Parser 3.0 Service Pack 2 Release leaf to bring up the MSXML 3.0 download page. Click on Download U.S. English Setup to begin the download of msxml3sp2setup.exe. You will need the Windows Installer version 1.1 or later. The latest version of the Windows Installer is downloadable from http://www.microsoft.com/downloads/release.asp?releaseid=32832 (Windows NT/2000 users) or http://www.microsoft.com/downloads/release.asp?releaseid=32831 (Windows 95, 98, and ME users). Installing the parser in this manner will not install any help files. To obtain the help file, which contains a detailed description of all of the objects and members of the parser, download and install the MSXML software developer's kit (xmlsdk.exe) by selecting the MSXML SDK 3.0 Release leaf from the document map tree (in the downloads page as above) and following the instructions.

It is important to note that msxmlxx.exe (where xx is the MSXML version) is the only way you can distribute the MSXML parser when deploying your programs. This means that you cannot rely on Visual Studio's current package and deployment wizard to collect the appropriate files for a project that uses MSXML. You must either install the parser separately using msxmlxx.exe or customize your installer to launch msxmlxx.exe. Also, recall that your client will need to have the latest version of the Windows Installer. Microsoft also provides a cab file for Internet-based installations of the MSXML parser. See http://msdn.microsoft.com/xml for more information.

Download source code for examples presented in this paper and find other resources at http://www.labprogrammer.net.

References

1. “Extensible Markup Language (XML) 1.0”, October 6, 2000, http://www.w3.org/TR/REC-xml.

2. “XML, bioinformatics and data integration”, Achard, Vaysseix, Barillot, Bioinformatics Review, Vol. 17 No. 2, pp. 115–125.

3. “XML: A lingua franca for science?”, Barillot, Achard, Tibtech, August 2000, Vol. 18, pp. 331–333.

4. “Consortium Works to Standardize Bio Data Formats”, Studt, Research and Development, September 2001, Vol. 43, No. 9, pp. 18–19.

5. “The Microsoft® XML Parser”, http://msdn.microsoft.com/xml.

6. “XML Schema”, http://www.w3c.org/XML/Schema.

7. “W3C HyperText Markup Language Home Page”, http://www.w3.org/MarkUp/.

8. “Document Object Model (DOM) Level 2 Core Specification Version 1.0”, November 13, 2000, http://www.w3.org/TR/DOM-Level-2-Core/.

9. XML.org, http://www.xml.org/xml/resources_focus_programming.shtml.

10. “XML Notepad”, http://msdn.microsoft.com/library/enus/dnxml/html/xmlpaddownload.asp.

11. “XML Path Language (XPath) Version 1.0”, http://www.w3.org/TR/xpath.