SpectroML: A Markup Language for Molecular Spectrometry Data

Abstract

SpectroML is a markup language for molecular spectrometry data that can be used as a “web-aware” mechanism for instrument-to-instrument, instrument-to-application, and application-to-application data interchange and archiving. SpectroML was developed using XML (Extensible Markup Language), and its vocabulary was gleaned from the terminology, data dictionaries, and concepts embodied in existing standards, instrument software, and data interchange formats. So far, we have created a SpectroML vocabulary, document type definition, schema, stylesheets, and some demonstration applications for UV-visible spectroscopy data; however, the structure and flexible data model embodied in SpectroML should make it easily adaptable to other spectroscopy techniques.

Keywords

XML; markup language; molecular spectrometry data; metadata; data interchange; data format; SpectroML

INTRODUCTION

Ever since spectrometric instruments have been coupled to computers, there has been a need to “deal with” the result data produced by the instrument. If only the applications provided by the instrument manufacturer are used, one need not care about the way these data are stored or represented. The instrument's application usually maintains its data in a proprietary native format. However, when the necessity arises to process or exchange such data using other applications, some mechanism must be found to represent the result data in a way that they can be transferred and “understood” by other applications. Since the number of possible instrument-to-application combinations is nearly infinite, direct instrument-to-application translators have given way to approaches using common interchange formats. Instead of having to support myriad applications and/or instruments, the instrument software or application now needs only to be able to read and write data in the interchange format. Several interchange formats have been developed,¹⁻³ but all have had some disadvantages, often occasioned by the ways the data and their corresponding metadata are represented.

Recently at NIST, we have seen an increasing need to interchange molecular spectrometry data with our Standard Reference Materials (SRMs) customers, with vendors in our NIST Traceable Reference Materials (NTRM) Program, with other National Metrology Institutes (NMIs), and with our optical filters database. We needed an interchange format that could handle data from a variety of instruments and applications and would exchange and maintain not only the result data, but also their accompanying metadata and critical information about the measurement and the sample. We also needed a mechanism compliant with the standards for data interchange and processing used in today's computer networks, because the actual interchange will be for the most part network-based. We examined existing data interchange tools and techniques, but were unable to find a suitable exchange mechanism, largely because these were all developed before the advent of modern computer network technologies. Accordingly, we began to develop the web-aware data interchange format based on the eXtensible Markup Language (XML) that we now call SpectroML. At present SpectroML has been developed only for UV/Vis spectroscopy to keep our project manageable and to permit timely development of useful applications; however, SpectroML was designed to be extensible to other fields of spectroscopy. In this article we will discuss the development of SpectroML, describe its structure and elements, and illustrate its use.

NATIVE AND INTERCHANGE DATA FORMATS

In the most rudimentary scenario, result data are provided by the instrument hardware in a stream of bits or bytes. The instrument application collects this data stream and stores it in its own internal format, which is most commonly binary, but sometimes data are stored as text. In either case, the data are usually not intended for use by other applications. The instrument's software alone knows how to read, write, and modify the data files. Binary and compressed binary formats were commonly utilized in the past to reduce the storage requirements for result data. Fortunately, in the current era of desktop computers with gigabyte disk drives and memories and gigahertz processors and networks, the need for austere data representations has passed.

As mentioned, to use result data in other applications, it is often necessary to modify their representation, usually by using an interchange format such as JCAMP-DX,¹ GRAMS SPC,² or ANDI/NetCDF.³ These formats play a central part in the interchange process, so that the instrument's software or other software tools only have to support an import and export function for the interchange format. However, today's interchange mechanisms have some disadvantages:

Immutability: additional elements cannot be easily added.

Fixed structure: the elements must have a precise order and syntax.

Formats differ: some data elements can be lost during the interconversion.

Wide coverage: many elements called for to achieve a broad scope may not be needed in all cases.

Application restriction: many applications do not support all formats.

Content limitations: result metadata and information about the sample and the measurement process are often omitted.

Inconsistent implementations: discrepancies between vendor implementations and incompatibilities between versions.

Disagreement: different organizations and groups have often failed to achieve consensus, so vendors may have to implement several interchange technologies.

Network awareness: current interchange mechanisms are not compatible with modern computer network technologies.

COMBINING XML AND EXISTING FORMATS

At present, there is no standard, “web-aware” way to exchange, process, store, or visualize molecular spectrometric data. We saw the need for creating a data format that eliminates the disadvantages of existing interchange formats and fulfills the following requirements:

Extensibility: new elements must be easy to add, and adding new elements must not break existing applications.

Flexibility: the structure must satisfy manifold needs.

Usability: the format must be easy to use in applications, and the tools for creating and applying the format must be readily available.

Acceptability: the format must use standard mechanisms for easy integration, and the format must permit the use of standard security mechanisms, e.g., digital signatures, to ensure the integrity of the data.

As a standard designed to handle data in today's network applications, XML can satisfy these requirements. XML is promulgated and maintained by the World Wide Web Consortium.⁴ XML tools and applications are readily available and many can be downloaded free of charge from the world-wide web.

It is important to realize that our decision to create SpectroML did not mean starting over. There was simply no need to do this. We believed that the terminology, data dictionaries, and concepts embodied in existing standards, instrument software, and data interchange formats could be leveraged to facilitate the development of SpectroML. We wanted to take advantage of the large body of work that has been done in the field of spectrometric data interchange rather than re-inventing it. With this concept of reuse firmly in hand, we studied terminology definitions in normative standards,⁵ spectrometer operation and software manuals, and existing native and interchange formats in hopes of extracting the most useful parts of each.⁶ After collecting this information, we organized it into five different categories:

File: header information.

Instrument: information about the instruments used.

Sample: information about the processed samples.

Measurement: information about the measurement process.

Data: the result data values and information about their structure.

The information in each category consists of two main parts: the “data,” which are the result data values from the experiment, and the “metadata,” which are descriptive data about the “data.” It is essential to keep both components together, because separately they both become useless. To structure information, XML is able to store data in a tree-like topology. The data and metadata elements in this structure have descriptive tag names. This makes it both human-readable and easy to process by a computer.

THE STRUCTURE OF SPECTROML

Based on our analysis of existing data interchange formats for molecular spectrometry data, we developed an initial vocabulary, organized it to get a logical and regular structure, and extended it to provide a linking mechanism. Being based on XML, the structure is arranged hierarchically like a tree, starting from a root and with increasingly detailed sub-elements, finally ending in the leaves as shown in Figure 1. The taxonomy of SpectroML contains the following components:

The root element (the ground in Figure 1) contains one or more experiments. The individual experiments are implicitly related by being grouped into one document; however, they can be explicitly related via linking references.

Each experiment (the tree trunk in Figure 1) contains five groups. The file group is a header group that describes all the datasets within an experiment. Together, the other four groups (instrument, sample, measurement, and data) describe the different aspects of the dataset and contain the data values themselves.

Each of these four groups (a main branch in Figure 1) contains two different blocks. Generally speaking, the blocks divide the group data into a fixed part and a variable part. Each of these two blocks can appear several times. Its ID (identification string) affords the possibility of reusing one block for different datasets within an experiment. For example, one instrument can be used with several samples, without repeating it for each dataset.

Each block (a smaller branch in Figure 1), except for the core data, contains sections (the twigs in the figure). A section divides a block into different sub-blocks. In this specification, each of these blocks has two sections; however, this is not mandatory and can be expanded in future versions.

Each section (a twig in Figure 1) contains data elements to hold the data and metadata.

Each element (a leaf in Figure 1) may contain sub-elements. This allows storing structured data in an element. Each element can also have an attribute, such as a format description for the data contained.

The spectroscopy method is an attribute of an experiment, which means several methods can be combined within one SpectroML file. The current elements focus on UV/Vis, but the required metadata for other methods can be added in the future since the structure that holds the data values was designed to accommodate a broad range of data structures. It is important to realize that even though XML files are human-readable, they are created to be processed by a computer. The hierarchy, its structure, its depth, and complexity are designed to make the XML file “parsable” by a computer and to foster flexibility and extensibility.

PATH: A DATASET WITHIN AN EXPERIMENT

A dataset is a path or linkage through the experiment blocks. Datasets are stored in the file group and connect all eight blocks (two of each remaining group) together. If a given block is needed in a number of datasets, it can be reused multiple times with different collections of other blocks without the need for maintaining copies of it. Figure 2 demonstrates the concept of experiment paths within SpectroML:

Figure 2. Experiment paths in SpectroML.

The eight different colors (or patterns) represent the eight different block types.

Each block type can appear multiple times as a discrete block (an apple in the figure) in the experiment and must have a unique ID (each apple in the figure would need to have a unique ID, such as “id1,” “id2,” “id3” for the three “instrument description” blocks).

A collection of exactly one block of each block type is a dataset (the basket holding eight apples in the figure). A path is the list of the elements of this set consisting of the eight different IDs of the blocks.

Taking Figure 2 as a universe of possible UV/Vis experiments would mean that there were three available instruments, each having the same properties; one sample with a single set of properties; three possible measurements, all with the same properties; and three result data packages, all with the same properties.
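The path concept can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the block type labels (`instrument:fixed`, `instrument:variable`, and so on), the IDs, and the block contents are hypothetical placeholders, not names from the SpectroML vocabulary. The sketch shows a dataset as a mapping from each of the eight block types to exactly one block ID, with blocks reused across datasets.

```python
GROUPS = ("instrument", "sample", "measurement", "data")
PARTS = ("fixed", "variable")  # placeholder labels for the two blocks of a group
BLOCK_TYPES = [f"{g}:{p}" for g in GROUPS for p in PARTS]  # eight block types


def resolve(blocks, path):
    """Return the blocks named by a path (block type -> block ID).

    A path is valid only if it names exactly one existing block
    for each of the eight block types.
    """
    if set(path) != set(BLOCK_TYPES):
        raise ValueError("a path must name one block of every block type")
    for block_type, block_id in path.items():
        if block_id not in blocks[block_type]:
            raise KeyError(f"unknown block {block_id!r} for {block_type}")
    return {t: blocks[t][path[t]] for t in BLOCK_TYPES}


# Hypothetical experiment: every block has an ID that is unique in the document.
blocks = {t: {} for t in BLOCK_TYPES}
blocks["instrument:fixed"]["id1"] = "spectrophotometer A"
blocks["instrument:variable"]["id2"] = "slit width 1 nm"
blocks["sample:fixed"]["id3"] = "filter 1"
blocks["sample:variable"]["id4"] = "22 C"
blocks["measurement:fixed"]["id5"] = "transmittance scan"
blocks["measurement:variable"]["id6"] = "run 1"
blocks["measurement:variable"]["id7"] = "run 2"
blocks["data:fixed"]["id8"] = "x in nm, y in transmittance"
blocks["data:variable"]["id9"] = "values of run 1"
blocks["data:variable"]["id10"] = "values of run 2"

# Two datasets that share the instrument and sample blocks without copying them.
path1 = dict(zip(BLOCK_TYPES, ["id1", "id2", "id3", "id4", "id5", "id6", "id8", "id9"]))
path2 = dict(zip(BLOCK_TYPES, ["id1", "id2", "id3", "id4", "id5", "id7", "id8", "id10"]))
dataset1 = resolve(blocks, path1)
dataset2 = resolve(blocks, path2)
```

Both datasets refer to the same instrument block by ID, so the instrument description is stored once in the experiment.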

RESULT DATA HANDLING IN SPECTROML

SpectroML is capable of storing multiple types of experimental result data:

single data points

a single spectrum

multiple spectra

multi-dimensional data.

Using the typical XML mechanism, data values can be stored in a structure as illustrated in the following example showing three two-dimensional data points:
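A minimal sketch of such a structure, read with Python's standard library. The tag names (`values`, `point`, `x`, `y`) and the numbers are hypothetical and are not taken from the SpectroML vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; one element per coordinate of every point.
VERBOSE_XML = """<values>
  <point><x>400.0</x><y>0.112</y></point>
  <point><x>401.0</x><y>0.118</y></point>
  <point><x>402.0</x><y>0.125</y></point>
</values>"""


def read_points(xml_text):
    """Return a list of (x, y) float tuples from the verbose layout."""
    root = ET.fromstring(xml_text)
    return [(float(p.findtext("x")), float(p.findtext("y")))
            for p in root.findall("point")]


points = read_points(VERBOSE_XML)
```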

But since spectra often contain numerous data points, this simple approach, while functional, would be unwieldy because of its huge amount of overhead. To minimize the overhead, SpectroML can store values in a more compact form by using one tag for the values of one dimension while incorporating the data as a list of values separated by a whitespace character (e.g., space or tab):
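A sketch of the compact form, again with hypothetical names (a `values` element carrying a `dimension` attribute). Python's `str.split()` with no argument splits on any run of whitespace, so spaces, tabs, and line breaks are all accepted as separators.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag and attribute names; one element per dimension.
COMPACT_XML = """<data>
  <values dimension="x">400.0 401.0 402.0</values>
  <values dimension="y">0.112 0.118
                        0.125</values>
</data>"""


def read_columns(xml_text):
    """Return {dimension name: list of floats} from the compact layout."""
    root = ET.fromstring(xml_text)
    return {el.get("dimension"): [float(v) for v in el.text.split()]
            for el in root.findall("values")}


columns = read_columns(COMPACT_XML)
pairs = list(zip(columns["x"], columns["y"]))
```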

The name of the dimension is not fixed in a tag, but is variable in an attribute; this allows having as many dimensions as needed. The dimension attribute provides the link between the data and the related metadata elements (e.g., a minimum value or a start value):
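A sketch of this linkage, assuming hypothetical element names (`metadata`, `minimum`, `values`) and hypothetical unit strings. The value of the `dimension` attribute is the only thing that ties a value list to its metadata element.

```python
import xml.etree.ElementTree as ET

# Hypothetical names; the "dimension" attribute links data and metadata.
DOC = """<data>
  <metadata>
    <minimum dimension="x" unit="nm">400.0</minimum>
    <minimum dimension="y" unit="AU">0.112</minimum>
  </metadata>
  <values dimension="x">400.0 401.0 402.0</values>
  <values dimension="y">0.112 0.118 0.125</values>
</data>"""


def link_by_dimension(xml_text):
    """Pair every value list with the metadata of the same dimension."""
    root = ET.fromstring(xml_text)
    linked = {}
    for el in root.findall("values"):
        dim = el.get("dimension")
        meta = root.find(f"metadata/minimum[@dimension='{dim}']")
        linked[dim] = {
            "values": [float(v) for v in el.text.split()],
            "minimum": float(meta.text),
            "unit": meta.get("unit"),
        }
    return linked


linked = link_by_dimension(DOC)
```

Because the dimension name is data rather than part of a tag name, a third or fourth dimension needs no change to the vocabulary.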

In cases where data values are mathematically related (such as evenly spaced x values), only a single value is needed:

However, when this approach is used, one has to provide the information necessary for calculating the actual values in the corresponding metadata block.
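As an illustration of the calculation, the sketch below rebuilds evenly spaced x values as x_i = start + i * increment. The element names (`startValue`, `increment`, `numberOfPoints`) are assumptions made for this example and may differ from the actual SpectroML metadata elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata block holding what is needed to compute the x values.
DOC = """<data>
  <metadata dimension="x">
    <startValue unit="nm">400.0</startValue>
    <increment unit="nm">0.5</increment>
    <numberOfPoints>5</numberOfPoints>
  </metadata>
</data>"""


def expand_axis(xml_text, dimension):
    """Compute the evenly spaced values of one dimension from its metadata."""
    root = ET.fromstring(xml_text)
    meta = root.find(f"metadata[@dimension='{dimension}']")
    start = float(meta.findtext("startValue"))
    step = float(meta.findtext("increment"))
    count = int(meta.findtext("numberOfPoints"))
    return [start + i * step for i in range(count)]


x_values = expand_axis(DOC, "x")
```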

METADATA ELEMENTS IN SPECTROML

XML tags are case sensitive. Tags in SpectroML are formed according to the following rules:

Tags contain only letters from the English alphabet (ASCII characters 65–90 and 97–122).

Tags within the root tag <SpectroML> begin with a lower case letter.

Each new word in a tag starts with an upper case letter for better readability.

Abbreviations are avoided in tag names as much as possible.

Wherever a physical value occurs as element content, there must be an attribute for its unit.

Wherever a data value or calculating property occurs, there must be an attribute for its dimension.
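The first two rules lend themselves to a mechanical check. The sketch below is an assumption about how one might test tag names, not part of SpectroML itself; the example tag names are invented. The rule that each new word starts with an upper case letter cannot be verified by pattern matching alone, so it is not checked here.

```python
import re

# ASCII letters only, starting with a lower case letter.
# The root tag <SpectroML> itself is exempt from the lower case rule.
TAG_RULE = re.compile(r"[a-z][A-Za-z]*")


def is_valid_tag(name):
    """Check a tag name against the letter and first-character rules."""
    return TAG_RULE.fullmatch(name) is not None


checks = {name: is_valid_tag(name)
          for name in ("startValue", "StartValue", "start_value", "value2")}
```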

To group related elements within a section, elements may have sub-elements (Figure 3). In such cases, the parent element has no attribute and cannot contain data. Each element that holds data and each attribute must have a datatype. An element always contains only character data, but it can represent a different datatype, e.g., a floating point value. The following types are used:

Figure 3. SpectroML elements.

String: for character data

Unsigned Integer: for positive integer values

Double: for floating point values

Language: for language setting of elements

Date, Time: for dates and times

ID, IDREF, IDREFS: for identifier and references

SPECTROML APPLICATIONS

SpectroML was designed as an interchange language between applications such as:

An instrument's controlling software.

A program to display or edit spectrometric data locally or over a network.

A database that stores and provides spectrometric data for queries.

Standard office software such as a word processor or spread sheet application.

Figure 4 shows screen shots from some SpectroML programs.

Figure 4. Some SpectroML applications. 1: visualization stylesheet, 2: editor application, 3: visualization applet.

In all cases, the same data format is used: SpectroML. XML is in widespread use and is well supported by today's programming languages and standard software. A plethora of XML tools are available to work with SpectroML. Many of these tools can be downloaded free of charge from XML websites on the Internet.⁴,⁹

In many laboratory environments, it is essential to demonstrate the integrity of experimental data to ensure that any manipulation or tampering can be detected. SpectroML files are regular ASCII text, and therefore it is easy to alter the data, either intentionally or not. Even when there is no need for completely secure data, there often is a need to establish the origin of the data and their subsequent history. In a laboratory notebook, one certifies a dataset with a written signature. In a similar fashion, computerized datasets can be “signed” by enclosing them with a digital signature. There are several mechanisms to do this, but basically, they all have an algorithm that calculates a unique byte sequence (a signature) based on the content of the file itself. This sequence is delivered together with the data file, and a recipient can validate it, as long as he/she knows the algorithm. If an element in a signed file were changed after applying the signature, the subsequent validation would fail. SpectroML has no built-in tags for signatures, but XML documents can be signed via the mechanism provided by the XML Signature routine.
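The principle of detecting changes can be sketched with a keyed hash from Python's standard library. This is a deliberately simplified stand-in: it uses HMAC-SHA256 with a shared secret key, which is not the W3C XML Signature mechanism and does not prove who produced the file to anyone who lacks the key. The file content and key below are hypothetical.

```python
import hashlib
import hmac


def sign(content: bytes, key: bytes) -> str:
    """Compute a byte sequence (hex encoded) that depends on the whole content."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify(content: bytes, key: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    return hmac.compare_digest(sign(content, key), signature)


key = b"shared-secret"  # hypothetical key known to sender and recipient
original = b'<values dimension="y">0.112 0.118 0.125</values>'
signature = sign(original, key)

tampered = b'<values dimension="y">0.112 0.118 0.120</values>'
original_ok = verify(original, key, signature)
tampered_ok = verify(tampered, key, signature)
```

Changing a single digit in the signed content makes the validation fail, which is the behavior described above.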

At NIST, we are currently developing software tools to make using SpectroML easier. For example, we have created a SpectroML application program interface (API). This API enables the programmer of an application utilizing SpectroML to apply it on a higher level without having to deal with the intimate XML details. Screen shots from some of our applications and applets using SpectroML are shown in Figure 4.

Often a spectrophotometer's software does not provide enough information about the measurement and/or the sample to completely populate a SpectroML file. We are developing an environment that provides a generic way of generating complete SpectroML files while minimizing the amount of custom software necessary to do this. This environment will allow us to generate SpectroML files from instrument result data, database information (containing, for example, the instrument's calibration data), and manual entries.

EXTENSION OF SPECTROML

Since SpectroML currently focuses only on UV/Vis spectroscopy, its current elements constitute only an initial vocabulary; but since SpectroML is written in XML it can be extended to accommodate other situations:

SpectroML can be extended to other spectroscopy types by adding new sections and elements. This requires updating its schema (or DTD) and adding provisions for distinguishing different versions.

SpectroML can be extended to fit special needs of an organization by using the core SpectroML and adding elements to the vocabulary. This may require having different name space prefixes for the core and the extension.

When extending the language, it is important to do it in a way that does not break existing applications. Ultimately, it is the responsibility of the application programmer not to make too many assumptions about details in the structure and to use the language as generically as possible. Version checking can provide some assistance; however, there is no general solution to this problem. If URLs change or disappear or if schemas change drastically, applications may break. The XML development community is working on this general problem, and there are some possible solutions, for example, repositories to store schemas in a central place or architectures to ensure application stability when changing vocabulary.¹⁰

PROJECT OUTLOOK

SpectroML, as depicted herein, is our initial attempt to create a markup language for molecular spectrometry in general and UV/Vis spectrophotometry in particular. In gleaning the existing normative standards definitions, instrument manuals, and data interchange applications, we found much common material that we incorporated into SpectroML. However, we found instances where different mechanisms were applied to solve the same or related problems. In these cases, we chose what we considered to be the best approach based on its generality, current acceptance, and consistency with the rest of SpectroML. Others might have chosen differently. The current versions of the SpectroML schema, DTD, stylesheet, and a sample file can be found on the xml.org website (http://www.xml.org); from the home page, click on the “XML Registry” button, and then select “Chemistry” on the following page. Although this version of SpectroML is now being applied at NIST to solve the problems for which it was created, its main use is to serve as a starting point for discussions that will hopefully lead to a standard markup language for molecular spectrometry. To that end, the formation of a working group under the ASTM Molecular Spectrometry E13 committee has been proposed to develop a markup language for molecular spectroscopy. Anyone interested in participating in this activity is urged to contact us at NIST.

For automated analytical systems, SpectroML is not an end in itself, but rather a component of a larger construct known as a Device Capability Dataset (DCD).⁷ A DCD describes the idiosyncratic characteristics of laboratory equipment and provides a means for standardizing the interfacing of laboratory automation devices in a descriptive rather than a prescriptive manner. For automated systems, the DCDs for each device become part of the System Capability Dataset (SCD)⁸ that describes not only the devices making up the system, but also the relationships and interactions between the devices.

CONCLUSION

Outside NIST, SpectroML is still largely a proposal. Even though SpectroML is ready for use and some limited applications exist, it has yet to be applied to a major task. We are only now beginning to write the real-world applications to utilize SpectroML in our own work.

Now that the SpectroML structure and its elements together with its DTD/Schema are in hand, anyone can use SpectroML. All that is needed to work with XML is a text editor and some of the many free tools available on the Internet. To utilize SpectroML in an application, one can use an XML API or the soon-to-be-available SpectroML API.

At present, SpectroML is focused on UV/Vis spectroscopy. But its structure and its flexible data model should make it easily adaptable to other fields of spectroscopy. Our ultimate goal is to build SpectroML into a standard that will benefit everyone who deals with spectrometric data.

DISCLAIMER

Certain commercial equipment, instruments, or materials are identified in this paper to foster understanding. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment are necessarily the best available for the purpose.

Acknowledgement

This project is funded [in part] by NIST's Systems Integration for Manufacturing Applications (SIMA) Program. Initiated in 1994 under the federal government's High Performance Computing and Communications effort, SIMA is addressing manufacturing systems integration problems through applications of information technologies and development of standards-based solutions. With technical activities in all of NIST's laboratories covering a broad spectrum of engineering and manufacturing domains, SIMA is making information interpretable among systems and people within and across networked enterprises.

We thank J.C. Travis and P.C. DeRose for helpful discussions during the development of SpectroML and for reviewing this manuscript.

A Short Introduction to XML

Since some familiarity with XML (Extensible Markup Language) and its related constructs is essential to understanding SpectroML, we provide this brief, basic background. The weblinks listed in reference 4 point to a huge pool of articles, tutorials, tools, and other resources on XML and related topics.

Markup

The basic principle of markup is tagging: enclosing parts of a document within a start and an end tag: <title>This is a title.</title>

Tags can be structured hierarchically to encapsulate or structure related data: <sample> <id>1001</id> <name>water</name> </sample>

Tags can contain attributes that contain data: <sample id="1001">water</sample>

IDs are special attributes that permit unique identification and reference to them from other elements.

An XML file is a fully tagged text file; it starts and ends with one root tag, it may contain an arbitrary number of subtags, and all content is enclosed in tags.

An XML file is human-readable, but designed to be processed by computers.
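To illustrate the last point, the following sketch reads a small document with the XML parser in Python's standard library. The document itself is a made-up example built from the tags shown above; the root tag `samples` is an assumption.

```python
import xml.etree.ElementTree as ET

# One root tag, nested subtags, and an attribute used as an identifier.
DOC = """<samples>
  <sample id="1001"><name>water</name></sample>
  <sample id="1002"><name>ethanol</name></sample>
</samples>"""

root = ET.fromstring(DOC)

# Walk the tree: attributes come from get(), element content from findtext().
names_by_id = {s.get("id"): s.findtext("name") for s in root.iter("sample")}
```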

DTD/Schema/Namespaces

To check that an XML document is valid, and not merely well-formed, its document type must be defined. The standard way to do that is to write a DTD (Document Type Definition) and refer to the DTD in the header of the XML file.

The DTD specifies the names of the elements and attributes and their order and number of appearance. This allows a parser to check a document and initiate further processing, for example to extract or to change data.

XML schema is a newer way to define document types. It uses XML tags themselves instead of having a unique syntax such as that of a DTD.

A schema is much more powerful than a DTD; for example, it provides a variety of datatypes and allows an arbitrary ordering of elements. XML Schemas might replace DTDs in the future.

Defining tag vocabularies in document types raises the problem of name collision (multiple usage of the same tag name for different entities). The concept of namespaces introduces a unique prefix for each tag, so that multiply defined tags can be distinguished or even used within the same document: <person1:name>... <person2:name>...

To declare a namespace, a URL (Uniform Resource Locator) is assigned to each prefix. This requires that valid locations for namespace definitions be maintained; otherwise, applications that use the namespace may be broken.
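A short sketch of how prefixes keep equally named tags apart, using Python's standard library parser. The namespace URLs under `example.org` and the element content are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

# Two different "name" tags, distinguished by their namespace prefixes.
DOC = """<people xmlns:person1="http://example.org/ns/person1"
        xmlns:person2="http://example.org/ns/person2">
  <person1:name>Ada</person1:name>
  <person2:name>Grace</person2:name>
</people>"""

# The prefixes used for searching are local to this code; only the URLs
# have to match the declarations in the document.
NS = {"p1": "http://example.org/ns/person1",
      "p2": "http://example.org/ns/person2"}

root = ET.fromstring(DOC)
first_name = root.findtext("p1:name", namespaces=NS)
second_name = root.findtext("p2:name", namespaces=NS)
qualified_tag = root.find("p1:name", NS).tag
```

The parser stores each tag together with its namespace URL, which is why the URL, not the prefix, identifies the vocabulary.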

Transformation/Stylesheets

A transformation language, XSLT, is used with XML to transform one class of XML documents into another.

A common case is transforming an XML document into an HTML (HyperText Markup Language) document to display its data with a network browser. The mapping information for such transformations is contained in a stylesheet.

Stylesheets contain rules that define patterns in the XML document and linkages to corresponding output elements.

References

1. McDonald, R.S.; Wilks, P.A. Applied Spectroscopy 1988, 42, 151–162; Davies, A.N.; Lampen, P. Applied Spectroscopy 1993, 47, 1093–1099; Lampen, P.; Hillig, H.; Davies, A.N.; Linscheid, M. Applied Spectroscopy 1994, 48, 1545–1552; http://jcamp.isas.dortmund.de.

2. http://www.galactic.com/instruments/spc.htm.

3. Standard Specification for Analytical Data Interchange Protocol for Chromatographic Data, ASTM E 1947–98; Standard Guide for Analytical Data Interchange Protocol for Chromatographic Data, ASTM E 1948–98; Standard Specification for Analytical Data Interchange Protocol for Mass Spectrometry Data, ASTM E 2077–00; Standard Guide for Analytical Data Interchange Protocol for Mass Spectrometry Data, ASTM E 2078–00; http://www.astm.org.

4. http://www.w3c.org/xml; http://www.xml.org/; http://www.xml.com/; http://www.xml101.com/.

5. Compilation of ASTM Standard Definitions, 8th Edition, ASTM, Philadelphia, PA, 1994; http://www.astm.org.

6. Rühl, M.A.; Schäfer; Kramer, G.W. SpectroML: An Extensible Markup Language for the Interchange of Molecular Spectrometry Data; NIST Interagency Report 6821, Gaithersburg, MD, 2001, Appendix A.

7. Staab, T.A.; Kramer, G.W. J. Assoc. Lab. Auto. 1998, 3(5), 46–50; Staab, T.A.; Kramer, G.W. Initial CAALS Device Capability Dataset V1.0.7; NIST Interagency Report 6294, Gaithersburg, MD, 1998, 46 pp; http://www.lecis.org/downloads.htm.

8. Piotrowski; Richter; Schäfer; Kramer, G.W. J. Assoc. Lab. Auto. 1998, 3(5), 51–55.

9. http://msdn.microsoft.com/xml; http://www.webdevelopersjournal.com/; http://www.webreference.com/xml; http://dndxml.sourceforge.net/; http://new.xmlspy.com/; http://www.xmlpitstop.com/.

10. http://www.oasis-open.org/; http://www.ebXML.org/; http://www.uddi.org/; http://www.xml.gov/.