Welcome to the first of what I hope will be a series of irregular guides aimed at providing an introduction to a given technology. This article focuses on XML and how it can be used to store structured data. XML is used extensively on the web as a means of passing data between databases, websites and applications. It enables applications that might not otherwise be compatible to share data by passing it in a format that both applications can decipher and use.
XML stands for Extensible Markup Language. It is just a way of structuring data and doesn't do anything else. It relies on a software application to make use of the data it holds.
XML generally consists of a series of 'elements' (sometimes called 'nodes'), each of which is marked out by tags in angled brackets. Elements can contain other elements or actual data.
Below is an example of some XML, in this case, a catalogue of some of my guitars.
<model>Les Paul Standard</model>
<model>Les Paul Custom</model>
In this example, there is an element called 'instruments' that contains several (child) elements called 'guitar', each of which has other child elements nested inside it (such as 'make' and 'model'). Several of those child elements have textual or numerical data in them.
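As a rough sketch of how an application might read a catalogue like this, the Python snippet below parses a small document with the standard library's xml.etree.ElementTree module. The catalogue in the snippet is an assumption made for illustration: it uses only the 'instruments', 'guitar', 'make' and 'model' elements named above, and the make given for the second guitar is a guess.

```python
import xml.etree.ElementTree as ET

# Illustrative catalogue (an assumption): only the element names mentioned
# in the article are used, and the second 'make' value is a guess.
CATALOGUE = """
<instruments>
  <guitar>
    <make>Gibson</make>
    <model>Les Paul Standard</model>
  </guitar>
  <guitar>
    <make>Gibson</make>
    <model>Les Paul Custom</model>
  </guitar>
</instruments>
"""

def list_guitars(xml_text):
    """Return a (make, model) pair for every 'guitar' child of the root element."""
    root = ET.fromstring(xml_text.strip())
    return [(g.findtext("make"), g.findtext("model")) for g in root.findall("guitar")]

for make, model in list_guitars(CATALOGUE):
    print(make, "-", model)
```

The XML itself does nothing here. It is the parsing code that walks the structure and pulls out the data.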
XML needs to be in a certain format so applications can understand it.
All elements must have an opening tag and a closing tag. These sit either side of the data the element contains. Both the opening and closing tags take the same basic format. Both use angled brackets either side of the name of the element. The only difference is that the closing tag includes a forward slash before the name.
The name of an element in an XML document doesn't matter. You can call your elements anything you like. There are, however, a few restrictions.
- The letter case must be the same in both the opening and closing tags. Upper and lower case characters can be used, but both tags must use the same format. For example, <guitar> does not match with </Guitar> because the case of the G is different.
- The name must start with a letter (A-Z, upper or lower case) or an underscore. It can contain numbers, hyphens and full stops, but they cannot fall at the start of the name.
- The name cannot contain any spaces.
- The name cannot start with the letters 'xml' (in any case).
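One way to see these restrictions in action is to hand a few snippets to a parser and check which ones it accepts. The sketch below uses Python's xml.etree.ElementTree for this. The helper name is_well_formed is made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Opening and closing tags use the same case.
print(is_well_formed("<guitar>Les Paul</guitar>"))
# The case of the G differs between the two tags.
print(is_well_formed("<guitar>Les Paul</Guitar>"))
# The name starts with a number.
print(is_well_formed("<1guitar>Les Paul</1guitar>"))
# The name contains a space.
print(is_well_formed("<my guitar>Les Paul</my guitar>"))
```

The restriction on names starting with 'xml' is left out of the sketch because it is a reservation rather than a syntax rule, so a parser may not reject such names.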
The data relating to an element is stored in between its opening and closing tags. Data can include text, numbers, and/or other elements, as we saw above.
When you don't have any data in an element, you can either have the opening and closing tags together…
…or you can close the element by finishing the opening tag with a forward slash. This is the only situation where you wouldn't also require a separate closing tag.
Both of these methods signify that the element is empty.
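As a small illustration, the Python sketch below parses both forms and checks that an application sees the same thing in each case. The element name 'colour' is an assumption chosen for the example and is not taken from the catalogue above.

```python
import xml.etree.ElementTree as ET

# 'colour' is a made-up element name used only for this illustration.
paired = ET.fromstring("<colour></colour>")   # opening and closing tags together
self_closed = ET.fromstring("<colour/>")      # opening tag finished with a forward slash

def is_empty(element):
    """An element is empty if it has no text and no child elements."""
    return not element.text and len(element) == 0

print(is_empty(paired), is_empty(self_closed))
```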
An alternative is to leave the empty element out altogether. This reduces the size of the XML document, making it easier to transfer, as it means it isn't including elements that don't have any data in them. Sometimes, however, it's useful to include the empty element just so that the structure of the XML is preserved. This is particularly useful if there are multiple elements of the same type (such as the 'guitar' elements in the example above) that all have the same child elements.
Data can also be stored in attributes. Attributes are appended to the opening tag before the closing bracket (>) and consist of an attribute name, followed by an equals sign (=) and the data it contains in quotation marks.
The attribute name has the same restrictions as for element names above, but can otherwise be called anything you like. (Names starting with 'xml' are reserved for attributes defined by the XML standards themselves, such as xml:lang.) An element can have as many attributes as you like but each attribute for a given element must have a different name.
The closing tag of an element will never have any attributes.
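To make this concrete, here is a minimal Python sketch that reads attributes from an element and shows that a parser rejects two attributes with the same name. The attribute names 'colour' and 'strings' and their values are hypothetical.

```python
import xml.etree.ElementTree as ET

# The 'colour' and 'strings' attributes are hypothetical examples.
guitar = ET.fromstring('<guitar colour="black" strings="6"><make>Gibson</make></guitar>')

# Attribute values always come back as text, even when they look like numbers.
print(guitar.attrib)
print(guitar.get("colour"))

# Two attributes with the same name on one element are not allowed.
try:
    ET.fromstring('<guitar colour="red" colour="blue"/>')
    duplicate_ok = True
except ET.ParseError:
    duplicate_ok = False
print(duplicate_ok)
```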
XML data is structured so you can see that any data stored within an element, including that of other elements inside it, relates to that element.
For example, if we take one of the elements from the XML above...
<guitar>
<make>Gibson</make>
<model>Les Paul Standard</model>
</guitar>
...we can see that the 'guitar' has both a 'make' and a 'model'. The 'make' of the guitar is Gibson and the 'model' is Les Paul Standard. All of this data relates to the same guitar.
This is a very basic example. XML documents can be much more complex and contain whole catalogues of data. However, the principles are always the same.
Data, encoding and special characters
Elements can store any type of data of any length, including text, numbers and special characters. Attributes are best for small items of data, but there aren't really any limits and you don't have to use them at all if you don't want to.
Some characters serve a specific purpose in XML. For this reason, they cannot be included in items of data without first being encoded. Encoding is a way of replacing a character with something else so that it doesn't break the XML.
Encoding normally takes the form of some text or numbers between an ampersand (&) and a semicolon (;). This is called an entity reference as it refers to the character it replaces. It can take the form of an 'entity name' (where the characters between the ampersand and semicolon are text, as in &lt;) or an 'entity number' (where they are a hash sign followed by numbers, as in &#60;).
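There are five characters that have a special meaning in XML. Two of them, the opening angled bracket (<) and the ampersand (&), can never appear in data without being encoded. The other three only cause problems in certain places (a quotation mark inside an attribute value that is wrapped in the same kind of quotation mark, for example), but it is safest to encode all five. XML includes some predefined entity references for these five characters so that they can be easily encoded.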
The characters are as follows...
|Character description||Character||Entity reference|
|Opening angled bracket or 'less than' symbol||<||&lt;|
|Closing angled bracket or 'greater than' symbol||>||&gt;|
|Ampersand||&||&amp;|
|Apostrophe or single quotation mark||'||&apos;|
|Double quotation mark||"||&quot;|
XML uses these characters for other things. Applications reading an XML document that has an unencoded opening angled bracket or ampersand within the data will have problems. The opening angled bracket (<) is a good example of this. Because XML uses it to mark the start of a tag, any software trying to parse an XML document will fail if it finds one that isn't part of a tag. A maths expression such as 2 < 10, for example, would cause problems. Encoding it as 2 &lt; 10 gets around this.
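The effect of encoding can be sketched in Python with the standard library's xml.sax.saxutils.escape function, which replaces the ampersand and both angled brackets with their entity references. The element name 'sum' and the helper wrap are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def wrap(text, encode=True):
    """Put text inside a hypothetical 'sum' element, encoding it first if asked."""
    body = escape(text) if encode else text
    return "<sum>" + body + "</sum>"

expression = "2 < 10 & 10 > 2"

# Encoded data parses cleanly and comes back as the original text.
encoded = wrap(expression)
print(encoded)
print(ET.fromstring(encoded).text)

# The same data without encoding breaks the parser.
try:
    ET.fromstring(wrap(expression, encode=False))
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False
print(parsed_raw)
```

Note that the application reading the data gets the original characters back. The entity references only exist in the XML text itself.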
The XML standard itself only predefines entity references for the five characters above. Any other character (including letters and numbers) can be encoded with a numeric reference if you know its character number. This is only necessary, however, for characters that would otherwise cause problems parsing the XML or that the document's character encoding cannot represent. The pound sign (£), for example, can be written as &#163; or &#xA3;. Named references borrowed from HTML, such as &pound;, are not recognised in XML unless the document declares them.
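As a quick check of how a parser treats the pound sign, the Python sketch below reads it as a decimal reference, as a hexadecimal reference and as a literal character, and then tries the HTML-style named reference. The 'price' element and its value are made up for the example.

```python
import xml.etree.ElementTree as ET

# 'price' and the amount are hypothetical; only the pound sign matters here.
decimal_ref = ET.fromstring("<price>&#163;1500</price>").text
hex_ref = ET.fromstring("<price>&#xA3;1500</price>").text
literal = ET.fromstring("<price>£1500</price>").text

# All three forms give the application the same text.
print(decimal_ref == hex_ref == literal)

# &pound; is an HTML entity name that XML does not predefine.
try:
    ET.fromstring("<price>&pound;1500</price>")
    named_ok = True
except ET.ParseError:
    named_ok = False
print(named_ok)
```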
XML is a great way of storing data and sharing it between applications (including websites). It's also a common standard, meaning that any developer familiar with XML will be able to retrieve structured data from it one way or another, regardless of what the elements are called.
But XML doesn't do anything else. In order to actually use it, you first need to do something with it. This could mean importing it into a database or a spreadsheet, or parsing it with another kind of application.
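As one possible example of "doing something with it", the Python sketch below turns a guitar catalogue into comma-separated text that a spreadsheet could open. The function name guitars_to_csv and the sample document are assumptions, and the sketch only handles the 'make' and 'model' elements.

```python
import csv
import io
import xml.etree.ElementTree as ET

def guitars_to_csv(xml_text):
    """Return CSV text with one row per 'guitar' element (make and model only)."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["make", "model"])
    for guitar in root.findall("guitar"):
        # A missing child element becomes an empty cell rather than an error.
        writer.writerow([
            guitar.findtext("make", default=""),
            guitar.findtext("model", default=""),
        ])
    return out.getvalue()

# Sample document written for this sketch.
sample = "<instruments><guitar><make>Gibson</make><model>Les Paul Standard</model></guitar></instruments>"
print(guitars_to_csv(sample))
```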
If you want to use XML data on a web page, you first need to transform it into another format. That's where XSL comes in. But that's for another day...