Science With the Virtual Observatory
2005 Summer School

Introduction to eXtensible Markup Language

For this to work you need the software listed on the software page.
Specifically, we will be looking at an example XML file in $NVOSS/java/dev/XMLparse.

Overview

This lesson provides a brief introduction to XML. We will explore the components that make up an XML document as well as how to form correct XML documents. Finally, we will examine an example XML document. The student exercise involves construction of a correct XML document.

We shall look at:

What is XML?

Anatomy of an XML Document

The xml directive

XML documents should start with the xml directive, formally called the XML declaration. This line states which version of XML is used in the remainder of the document. The first line in Listing 1 is the xml directive.

Elements

XML, like HTML, is constructed using tags which define the elements of the document. Each portion of the document is set apart by beginning and ending tags. Listing 1 shows a short XML document with 5 separate elements. Elements can have element content, mixed content, simple content, or empty content.

  • Element content elements contain only other elements. Example: <FAMILYTREE>
  • Mixed content elements contain both text and other elements. Example: <FAMILY>
  • Simple content elements contain only text. Example: <MOTHER>
  • Empty content elements contain no information within the tag block. Example: <CHILDREN>

    <?xml version="1.0"?>
    <FAMILYTREE>
    <FAMILY> Krughoff Family
    <MOTHER> Noell </MOTHER>
    <FATHER> Tom </FATHER>
    <CHILDREN progeny="yes"></CHILDREN>
    </FAMILY>
    </FAMILYTREE>

    Listing 1
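The four content types can be checked mechanically. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is part of the original lesson); the helper name content_kind is hypothetical. It parses Listing 1 and reports which kind of content each element holds.

```python
import xml.etree.ElementTree as ET

# Listing 1, copied verbatim from the lesson.
LISTING_1 = """<?xml version="1.0"?>
<FAMILYTREE>
<FAMILY> Krughoff Family
<MOTHER> Noell </MOTHER>
<FATHER> Tom </FATHER>
<CHILDREN progeny="yes"></CHILDREN>
</FAMILY>
</FAMILYTREE>
"""


def content_kind(element):
    """Classify an element as element, mixed, simple, or empty content.

    content_kind is a hypothetical helper written for this sketch.
    """
    has_children = len(element) > 0
    # Text can sit before the first child (element.text) or after any
    # child (child.tail); whitespace-only text is ignored here.
    pieces = [element.text] + [child.tail for child in element]
    has_text = any(piece and piece.strip() for piece in pieces)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "simple"
    return "empty"


root = ET.fromstring(LISTING_1)
kinds = {element.tag: content_kind(element) for element in root.iter()}
for tag, kind in kinds.items():
    print(f"{tag}: {kind} content")
```

Note that the sketch treats whitespace-only text as no text at all, which is a simplification made for classification; the parser itself keeps that whitespace.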

    Attributes

    XML elements can also have attributes associated with them. Attributes are name-value pairs associated with the element but not contained within the tag block. In Listing 1 the <CHILDREN> element has an attribute called 'progeny' with value 'yes' associated with it. Each element may have multiple attributes.

    Attributes are frequently used in HTML. In XML, however, it is a good idea to avoid overusing attributes. A rule of thumb is: If the attribute contains data, use a child element instead. Attributes in XML can be useful for storing identifier numbers in documents where many instances of the same element occur.
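As a rough illustration of the rule of thumb, the sketch below reads attributes with Python's standard xml.etree.ElementTree module (an assumption; the lesson does not prescribe a parser). The count and id attributes are hypothetical and appear only to show multiple attributes and an identifier attribute; the names themselves are stored as child elements.

```python
import xml.etree.ElementTree as ET

# An element with two attributes; "count" is a hypothetical addition.
children = ET.fromstring('<CHILDREN progeny="yes" count="2"></CHILDREN>')
print(children.attrib)                  # all attributes as a dictionary
print(children.get("progeny"))          # a single attribute value
print(children.get("missing", "n/a"))   # default for an absent attribute

# Rule of thumb: data goes in child elements, identifiers in attributes.
# The "id" attribute is hypothetical.
family = ET.fromstring(
    '<FAMILY id="f1">'
    "<MOTHER>Noell</MOTHER>"
    "<FATHER>Tom</FATHER>"
    "</FAMILY>"
)
family_id = family.get("id")
mother = family.findtext("MOTHER")
print(family_id, mother)
```

Attribute values always come back as strings, so a number such as the hypothetical count has to be converted by the caller.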

    Comments

    Comments in XML documents are the same as in HTML documents: <!-- begins the comment and --> ends the comment.

    Special Characters

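    Similar to HTML, XML treats some characters as special. The characters <, >, &, ", and ' have meaning in the markup itself, so they cannot always appear literally in the text of elements or in attribute values. Where they would be misread as markup, they must be replaced by the appropriate entity reference: &lt;, &gt;, &amp;, &quot;, and &apos; respectively. For example, & must be written as &amp;.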
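Escaping by hand is error prone, so it is usually delegated to a library. The sketch below assumes Python's standard xml.sax.saxutils.escape function and xml.etree.ElementTree parser; the sample strings and the note attribute are hypothetical. It escapes element text and an attribute value, then confirms that the parser turns the entity references back into the original characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical element text containing special characters.
raw_text = "Tom & Noell <Krughoff>"
escaped_text = escape(raw_text)  # replaces &, <, and >
print(escaped_text)

# Attribute values delimited by double quotes also need " escaped;
# escape() accepts extra replacements as a dictionary.
raw_attr = 'the "first" family'
escaped_attr = escape(raw_attr, {'"': "&quot;"})
print(escaped_attr)

# The parser reverses the entity references when reading the document.
document = f'<FAMILY note="{escaped_attr}">{escaped_text}</FAMILY>'
family = ET.fromstring(document)
round_trip_ok = family.text == raw_text and family.get("note") == raw_attr
print("round trip ok:", round_trip_ok)

# A bare & in element text is rejected by the parser.
try:
    ET.fromstring("<FAMILY>Tom & Noell</FAMILY>")
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False
print("unescaped text parses:", unescaped_parses)
```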

    Namespaces and Grammars

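    To modularize and categorize XML documents, a document may declare one or more namespaces. A namespace is an identifier, usually written as a URI, that qualifies the element and attribute names in the document so that names drawn from different vocabularies do not collide. A technical description of namespaces can be found here.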
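To make the idea concrete, here is a small sketch of how a namespace shows up when a document is parsed, assuming Python's standard xml.etree.ElementTree module. The namespace URI http://example.org/familytree and the ft prefix are hypothetical; they are not defined anywhere in this lesson.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for illustration.
NS = "http://example.org/familytree"

document = f"""<ft:FAMILYTREE xmlns:ft="{NS}">
<ft:FAMILY>
<ft:MOTHER> Noell </ft:MOTHER>
</ft:FAMILY>
</ft:FAMILYTREE>"""

root = ET.fromstring(document)

# ElementTree expands the prefix: the tag becomes {namespace-URI}local-name.
print(root.tag)

# Searches take a prefix-to-URI mapping; the prefix used here does not
# have to match the one used inside the document.
mother = root.find("ft:FAMILY/ft:MOTHER", {"ft": NS})
mother_name = mother.text.strip()
print(mother_name)
```

The prefix is only shorthand. Two documents that bind different prefixes to the same URI are using the same namespace.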

    Just as XML documents may be, but do not have to be, associated with a namespace, they may also have an associated document which describes the acceptable blocks within the document. There are two options for document description.

  • Document Type Definition (DTD): This is a document which is not strict XML but describes an XML document.
  • XML Schema: The XML Schema document is written in XML and serves the same purpose as the DTD, but is richer and extensible.

    Both the DTD and XML Schema describe the form that a valid XML document should take. For example, they indicate which elements are valid children, which element is the root node, and even the datatype associated with an element. More discussion of Schema and DTDs will be carried out in the next session.

    Features of XML

    Well Formed, Parseable

    Unlike HTML, XML must be strictly well formed. For example, most HTML parsers tolerate ending tags that are left off of one-line elements. XML parsers do not: every beginning tag must have an associated ending tag. Following is a list of several of the most important aspects of well formedness.

  • Every start-tag must have a matching end-tag
  • Tags can't overlap
  • There can be only one root element
  • Element names must obey the XML naming conventions
  • XML is case sensitive
  • Whitespace is preserved

    Listing 2 shows a few examples of common gotchas associated with XML well formedness.

    Case sensitive:
    <TAG> This is incorrect </tag>
    <TAG> This is correct </TAG>

    Overlapping tags:
    <TAG1>
    <TAG2>
    This is incorrect
    </TAG1>
    </TAG2>

    <TAG1>
    <TAG2>
    This is correct
    </TAG2>
    </TAG1>

    Single root node:
    Incorrect:
    <?xml version="1.0"?>
    <FAMILY> Krughoff Family
    <MOTHER> Noell </MOTHER>
    <FATHER> Tom </FATHER>
    <CHILDREN progeny="yes"></CHILDREN>
    </FAMILY>
    <FAMILY> Worland Family
    <MOTHER> Wilhelmina </MOTHER>
    <FATHER> Vincent </FATHER>
    <CHILDREN progeny="yes"></CHILDREN>
    </FAMILY>

    Correct:
    <?xml version="1.0"?>
    <FAMILYTREE>
    <FAMILY> Krughoff Family
    <MOTHER> Noell </MOTHER>
    <FATHER> Tom </FATHER>
    <CHILDREN progeny="yes"></CHILDREN>
    </FAMILY>
    <FAMILY> Worland Family
    <MOTHER> Wilhelmina </MOTHER>
    <FATHER> Vincent </FATHER>
    <CHILDREN progeny="yes"></CHILDREN>
    </FAMILY>
    </FAMILYTREE>

    Listing 2
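The gotchas in Listing 2 can be confirmed with any XML parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name well_formedness_error is hypothetical, and the family examples are shortened to a single line each. The exact error wording comes from the underlying parser and may vary between versions.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(text):
    """Return None if text is well-formed XML, else the parser's message.

    well_formedness_error is a hypothetical helper written for this sketch.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        return str(error)
    return None


# Shortened versions of the cases in Listing 2.
samples = {
    "case mismatch": "<TAG> This is incorrect </tag>",
    "matching case": "<TAG> This is correct </TAG>",
    "overlapping tags": "<TAG1><TAG2> This is incorrect </TAG1></TAG2>",
    "nested tags": "<TAG1><TAG2> This is correct </TAG2></TAG1>",
    "two root elements": (
        "<FAMILY> Krughoff Family </FAMILY>"
        "<FAMILY> Worland Family </FAMILY>"
    ),
    "single root element": (
        "<FAMILYTREE>"
        "<FAMILY> Krughoff Family </FAMILY>"
        "<FAMILY> Worland Family </FAMILY>"
        "</FAMILYTREE>"
    ),
}

results = {name: well_formedness_error(text) for name, text in samples.items()}
for name, error in results.items():
    status = "well formed" if error is None else f"not well formed: {error}"
    print(f"{name}: {status}")
```

A check like this only tests well formedness. It says nothing about whether the document is valid against a DTD or XML Schema.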

    Well formedness has the primary benefit of making XML easily human readable and easy to parse by machine.

    Extensible

    There are no predefined elements in XML. This allows the user to define all the elements in use. It also makes it easy to extend XML documents to handle complex datatypes as they come into use.
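As a small illustration of extensibility, the sketch below builds a document and then adds an element that no earlier listing uses. It assumes Python's standard xml.etree.ElementTree module; the PET element, its species attribute, and the value Rex are hypothetical and exist only to show that a new element needs no prior registration.

```python
import xml.etree.ElementTree as ET

# Start from the elements used in Listing 1.
family = ET.Element("FAMILY")
ET.SubElement(family, "MOTHER").text = "Noell"
ET.SubElement(family, "FATHER").text = "Tom"

# Extend the vocabulary with a new element (hypothetical example).
pet = ET.SubElement(family, "PET", species="dog")
pet.text = "Rex"

# Serialize to a string; encoding="unicode" returns str rather than bytes.
xml_text = ET.tostring(family, encoding="unicode")
print(xml_text)

# The result parses back like any other XML document.
reparsed = ET.fromstring(xml_text)
print(reparsed.find("PET").get("species"))
```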

    Hierarchical

    XML is inherently hierarchical. Every element except the root element has exactly one parent, and any element may in turn be the parent of other elements. The root element is the only element that is never a child. The hierarchical nature of XML makes it directly applicable to hierarchical datatypes like trees or tables.

    Plain Text

    XML is a plain text format. Thus, by nature, XML is human readable, mailable, and easily editable.

    An Example XML Document: The Family Tree

    The XML Hierarchy

    <TREE>
      <FAMILY>
        <MOTHER> Billy </MOTHER>
        <FATHER> Vincent </FATHER>
        <CHILDREN>
          <SON> Peter </SON>
          <DAUGHTER> Sue </DAUGHTER>
          <FAMILY>
            <MOTHER progeny="true"> Noell </MOTHER>
            <FATHER progeny="false"> Tom </FATHER>
            <CHILDREN>
              <SON> Simon </SON>
              <DAUGHTER> Laura </DAUGHTER>
              <SON> Stephen </SON>
              <FAMILY>
                <MOTHER progeny="true"> Emily </MOTHER>
                <FATHER progeny="false"> Jarrod </FATHER>
                <CHILDREN>
                  <SON> Henry </SON>
                </CHILDREN>
              </FAMILY>
            </CHILDREN>
          </FAMILY>
        </CHILDREN>
      </FAMILY>
    </TREE>
    Listing 3

    We will use the family tree to exemplify the hierarchy inherent in XML. Listing 3 shows an excerpt from an XML document describing my immediate family. Figure 1 shows a tree representation of the same XML document.

    Figure 1

    The family tree is a good metaphor for the XML hierarchy, partly because of similar terminology. Nodes closer to the root node are referred to as parent nodes to those directly below them. The lower nodes are known as child or progeny nodes. Nodes with the same parent node are siblings.
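The parent, child, and sibling relationships described above can be explored in code. The sketch below assumes Python's standard xml.etree.ElementTree module and uses an abridged copy of Listing 3 (only the outermost family); the helper name outline is hypothetical. It prints one indented line per element and then looks up the parent and siblings of one node.

```python
import xml.etree.ElementTree as ET

# Abridged copy of Listing 3: only the outermost family is kept.
TREE = """<TREE>
<FAMILY>
<MOTHER> Billy </MOTHER>
<FATHER> Vincent </FATHER>
<CHILDREN>
<SON> Peter </SON>
<DAUGHTER> Sue </DAUGHTER>
</CHILDREN>
</FAMILY>
</TREE>"""


def outline(element, depth=0):
    """Yield one line per element, depth-first, indented by nesting depth.

    outline is a hypothetical helper written for this sketch.
    """
    text = (element.text or "").strip()
    yield "  " * depth + element.tag + (f": {text}" if text else "")
    for child in element:
        yield from outline(child, depth + 1)


root = ET.fromstring(TREE)
lines = list(outline(root))
print("\n".join(lines))

# ElementTree elements do not store a link to their parent,
# so build a child-to-parent map by visiting every element once.
parent_of = {child: parent for parent in root.iter() for child in parent}

son = root.find(".//SON")
parent_tag = parent_of[son].tag
siblings = [other.tag for other in parent_of[son] if other is not son]
print(f"parent of SON: {parent_tag}, siblings: {siblings}")
```

The root element is the only element missing from the keys of parent_of, which matches the description of the root as the one node without a parent.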

    Example File

    A more complete version of my family, with additional tags and attributes, is available here. You may use it as a starting point for completing the student exercise.

    Student Exercise

    Write an XML description of your own family tree. You do not have to use the same hierarchy that I use in the example. Feel free to use the extensibility of XML to create a unique set of elements.

    Useful Links:

    XML Checker
    XML tutorial
    xml.org


    The NVO Summer School is made possible through the support of the National Science Foundation and the National Aeronautics and Space Administration.