XML schema
An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntactical constraints imposed by XML itself. These constraints are generally expressed using some combination of grammatical rules governing the order of elements, Boolean predicates that the content must satisfy, data types governing the content of elements and attributes, and more specialized rules such as uniqueness and referential integrity constraints.
There are languages developed specifically to express XML schemas. The document type definition language, which is native to the XML specification, is a schema language that is of relatively limited capability, but that also has other uses in XML aside from the expression of schemas. Two more expressive XML schema languages in widespread use are XML Schema and RELAX NG.
The mechanism for associating an XML document with a schema varies according to the schema language. The association may be achieved via markup within the XML document itself, or via some external means.
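As an illustration of association via markup, the sketch below shows an instance document that names both a DTD and a W3C XML Schema. The file names `note.dtd` and `note.xsd` and the element names are hypothetical, and a real document would typically use only one of the two mechanisms.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- DTD association: the DOCTYPE declaration names an external DTD
     (hypothetical file note.dtd). -->
<!DOCTYPE note SYSTEM "note.dtd">
<!-- W3C XML Schema association: the xsi:noNamespaceSchemaLocation attribute
     gives the validator a hint about where the schema lives
     (hypothetical file note.xsd). -->
<note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="note.xsd">
  <to>Alice</to>
  <body>Hello</body>
</note>
```

RELAX NG defines no comparable in-document mechanism, so a RELAX NG schema is normally supplied to the validator by external means.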
The XML Schema Definition is commonly referred to as XSD.
Validation
The process of checking whether an XML document conforms to a schema is called validation, which is separate from XML's core concept of syntactic well-formedness. All XML documents must be well-formed, but a document need not be valid unless the XML parser is "validating", in which case the document is also checked for conformance with its associated schema. DTD-validating parsers are most common, but some support XML Schema or RELAX NG as well.
Validation of an instance document against a schema can be regarded as a conceptually separate operation from XML parsing. In practice, however, many schema validators are integrated with an XML parser.
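The difference between well-formedness and validity can be seen in a small sketch. The document below, whose element names are hypothetical, is well-formed, but it is not valid against the DTD in its own internal subset.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!-- The content model requires a <to> element followed by a <body> element. -->
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<!-- Well-formed: every tag is properly nested and closed.
     Not valid: the required <body> element is missing. -->
<note>
  <to>Alice</to>
</note>
```

A non-validating parser would accept this document, while a DTD-validating parser would be expected to report the missing `body` element.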
Languages
There are several different languages available for specifying an XML schema. Each language has its strengths and weaknesses.
The primary purpose of a schema language is to specify what the structure of an XML document can be. This means which elements can reside in which other elements, which attributes are and are not legal to have on a particular element, and so forth. A schema is analogous to a grammar for a language; a schema defines what the vocabulary for the language may be and what a valid "sentence" is.
There are historic and current XML schema languages:
| Language | Abbrev. | Versions | Authority |
| Constraint Language in XML | CLiX | 2005 | Independent |
| Document Content Description facility for XML, an RDF framework | DCD | v1.0 | W3C |
| Document Definition Markup Language | DDML | v0 | W3C |
| Document Structure Description | DSD | 2002, 2005 | BRICS |
| Document Type Definition | DTD | 1986 | ISO |
| Document Type Definition | DTD | 2008 | ISO/IEC |
| Namespace Routing Language | NRL | 2003 | Independent |
| Namespace-based Validation Dispatching Language | NVDL | 2006 | ISO/IEC |
| Content Assembly Mechanism | CAM | 2007 | OASIS |
| REgular LAnguage for XML Next Generation | RELAX NG, RelaxNG | 2001, Compact Syntax | OASIS |
| REgular LAnguage for XML Next Generation | RELAX NG, RelaxNG | v1, v1 Compact Syntax, v2 | ISO/IEC |
| Schema for Object-Oriented XML | SOX | | |
| Schematron | | 2006, 2010, 2016, 2020 | ISO/IEC |
| XML-Data Reduced | XDR | | |
| ASN.1 XML Encoding Rules | XER | | |
| XML Schema | WXS, XSD | 1.0, 1.1 | W3C |
Though there are a number of schema languages available, the three primary languages are Document Type Definitions, W3C XML Schema, and RELAX NG. These are described below.
Document Type Definitions
Tool support
DTDs are perhaps the most widely supported schema language for XML, because they are one of the earliest: they were defined before XML even had namespace support. Internal DTDs are often supported in XML processors; external DTDs are supported only slightly less often. Most large XML parsers, those that support multiple XML technologies, provide support for DTDs as well.
W3C XML Schema
Advantages over DTDs
Features available in XSD that are missing from DTDs include:
- Names of elements and attributes are namespace-aware
- Constraints can be defined for the textual content of elements and attributes, for example to specify that they are numeric or contain dates. A wide repertoire of simple types is provided as standard, and additional user-defined types can be derived from these, for example by specifying ranges of values, regular expressions, or by enumerating the permitted values.
- Facilities for defining uniqueness constraints and referential integrity are more powerful: unlike the ID and IDREF constraints in DTDs, they can be scoped to any part of a document, can be of any data type, can apply to element as well as attribute content, and can be multi-part.
- Many requirements that are traditionally handled using parameter entities in DTDs have explicit support in XSD: examples include substitution groups, which allow a single name to refer to a whole class of elements; complex types, which allow the same content model to be shared by multiple elements; and model groups and attribute groups, which allow common parts of component models to be defined in one place and reused.
- XSD 1.1 adds the ability to define arbitrary assertions as constraints on element content.
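Several of the features listed above can be sketched in a small XSD 1.0 schema. The namespace `http://example.com/orders` and all element and type names below are hypothetical, chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:o="http://example.com/orders"
           targetNamespace="http://example.com/orders"
           elementFormDefault="qualified">

  <!-- User-defined simple type derived by restricting a range of values. -->
  <xs:simpleType name="Quantity">
    <xs:restriction base="xs:positiveInteger">
      <xs:maxInclusive value="100"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- User-defined simple type derived with a regular expression. -->
  <xs:simpleType name="Sku">
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z]{3}-[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="sku" type="o:Sku"/>
              <xs:element name="quantity" type="o:Quantity"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <!-- Attribute content constrained to a built-in date type. -->
      <xs:attribute name="date" type="xs:date" use="required"/>
    </xs:complexType>
    <!-- Uniqueness constraint scoped to a single order and applied to
         element content: no two items may share the same sku. -->
    <xs:unique name="uniqueSku">
      <xs:selector xpath="o:item"/>
      <xs:field xpath="o:sku"/>
    </xs:unique>
  </xs:element>
</xs:schema>
```

Because the schema declares a target namespace, the `order`, `item`, `sku`, and `quantity` elements are matched by namespace as well as by local name, which a DTD cannot express.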
As well as validation, XSD allows XML instances to be annotated with type information which is designed to make manipulation of the XML instance easier in application programs. This may be by mapping the XSD-defined types to types in a programming language such as Java or by enriching the type system of XML processing languages such as XSLT and XQuery.
Commonality with RELAX NG
RELAX NG and W3C XML Schema allow for similar mechanisms of specificity. Both allow for a degree of modularity in their languages, including, for example, splitting the schema into multiple files. Both are, or can be, defined in an XML language.
Advantages over RELAX NG
RELAX NG does not have any analog to the post-schema-validation infoset (PSVI). Unlike W3C XML Schema, RELAX NG was designed so that validation and augmentation are separate.
W3C XML Schema has a formal mechanism for attaching a schema to an XML document, while RELAX NG intentionally avoids such mechanisms for security and interoperability reasons.
RELAX NG has no ability to apply default attribute data to an element's list of attributes, while W3C XML Schema does. Again, this design is intentional and is to separate validation and augmentation.
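A minimal sketch of a default attribute in W3C XML Schema follows; the element name `price` and the attribute name `currency` are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="price">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:decimal">
          <!-- If an instance omits the attribute, a schema-aware processor
               supplies currency="USD" as part of validation. -->
          <xs:attribute name="currency" type="xs:string" default="USD"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

With this schema, an instance such as `<price>9.99</price>` carries no `currency` attribute in its source text; the value only appears once the schema has been applied.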
W3C XML Schema has a rich "simple type" system built-in, while RELAX NG has an extremely simplistic one because it is meant to use type libraries developed independently of RELAX NG, rather than grow its own. This is seen by some as a disadvantage. In practice it is common for a RELAX NG schema to use the predefined "simple types" and "restrictions" of W3C XML Schema.
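As a sketch of that practice, a RELAX NG pattern can name the W3C XML Schema datatype library and then use its types and facets. The element name `price` is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<element name="price"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- "decimal" comes from the XML Schema datatype library;
       the param element applies the minInclusive facet as a restriction. -->
  <data type="decimal">
    <param name="minInclusive">0</param>
  </data>
</element>
```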
In W3C XML Schema, a specific number or range of repetitions of a pattern can be expressed, whereas expressing this in RELAX NG is practically impossible.
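In W3C XML Schema, such a range is written with the `minOccurs` and `maxOccurs` attributes. The sketch below uses hypothetical element names and requires between two and five `phone` elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="contact">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <!-- At least two and at most five phone elements are allowed. -->
        <xs:element name="phone" type="xs:string" minOccurs="2" maxOccurs="5"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

RELAX NG offers only patterns such as `optional`, `zeroOrMore`, and `oneOrMore`, so an equivalent schema would have to spell out each required and optional occurrence by hand.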
Disadvantages
W3C XML Schema is complex and hard to learn, although that is partially because it tries to do more than mere validation.
Although being written in XML is an advantage, it is also a disadvantage in some ways. The W3C XML Schema language, in particular, can be quite verbose, while a DTD can be terse and relatively easily editable.
Likewise, WXS's formal mechanism for associating a document with a schema can pose a potential security problem. For WXS validators that will follow a URI to an arbitrary online location, there is the potential for reading something malicious from the other side of the stream.
W3C XML Schema does not implement most of the DTD facilities for supplying data to a document, such as entity declarations.
Although W3C XML Schema's ability to add default attributes to elements is an advantage, it is a disadvantage in some ways as well. It means that an XML file may not be usable in the absence of its schema, even if the document would validate against that schema. In effect, all users of such an XML document must also implement the W3C XML Schema specification, thus ruling out minimalist or older XML parsers. It can also slow down the processing of the document, as the processor must potentially download and process a second XML file; however, a schema would normally then be cached, so the cost comes only on the first use.
Tool support
WXS support exists in a number of large XML parsing packages. Xerces and the .NET Framework's Base Class Library both provide support for WXS validation.
RELAX NG
RELAX NG provides most of the advantages over DTDs that W3C XML Schema does.
Advantages over W3C XML Schema
While the language of RELAX NG can be written in XML, it also has an equivalent form that is much more like a DTD, but with greater specifying power. This form is known as the compact syntax. Tools can easily convert between these forms with no loss of features or even commenting. Even arbitrary elements specified between RELAX NG XML elements can be converted into the compact form.
RELAX NG provides very strong support for unordered content. That is, it allows the schema to state that a sequence of patterns may appear in any order.
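Unordered content is expressed with the `interleave` pattern. In the following sketch, written in the XML syntax with hypothetical element names, the children of `person` may appear in any order.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<element name="person" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- The patterns inside interleave may occur in any order. -->
  <interleave>
    <element name="name"><text/></element>
    <element name="email"><text/></element>
    <!-- phone may be omitted, and may appear anywhere among the others. -->
    <optional>
      <element name="phone"><text/></element>
    </optional>
  </interleave>
</element>
```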
RELAX NG also allows for non-deterministic content models. What this means is that RELAX NG allows the specification of a sequence like the following:
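A grammar of this kind might be sketched as follows. The `numbers` wrapper element and the definitions of `odd` and `even` are assumptions, added only so that the sketch is a complete grammar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="numbers">
      <!-- Any number of odd/even pairs... -->
      <zeroOrMore>
        <ref name="odd"/>
        <ref name="even"/>
      </zeroOrMore>
      <!-- ...optionally followed by one final "odd". -->
      <optional>
        <ref name="odd"/>
      </optional>
    </element>
  </start>
  <!-- Hypothetical definitions of the two referenced patterns. -->
  <define name="odd">
    <element name="odd"><text/></element>
  </define>
  <define name="even">
    <element name="even"><text/></element>
  </define>
</grammar>
```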
When the validator encounters something that matches the "odd" pattern, it is unknown whether this is the optional last "odd" reference or simply one in the zeroOrMore sequence without looking ahead at the data. RELAX NG allows this kind of specification. W3C XML Schema requires all of its sequences to be fully deterministic, so mechanisms like the above must be either specified in a different way or omitted altogether.
RELAX NG allows attributes to be treated as elements in content models. In particular, this means that one can provide the following:
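A pattern consistent with the description that follows might look like this sketch in the RELAX NG XML syntax.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<element name="some_element" xmlns="http://relaxng.org/ns/structure/1.0">
  <choice>
    <!-- Either has_name="false", with no name child required... -->
    <attribute name="has_name">
      <value>false</value>
    </attribute>
    <!-- ...or has_name="true", in which case a name element must follow. -->
    <group>
      <attribute name="has_name">
        <value>true</value>
      </attribute>
      <element name="name">
        <text/>
      </element>
    </group>
  </choice>
</element>
```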
This block states that the element "some_element" must have an attribute named "has_name". This attribute can only take true or false as values, and if it is true, the first child element of the element must be "name", which stores text. If "name" did not need to be the first element, then the choice could be wrapped in an "interleave" element along with other elements. The order of the specification of attributes in RELAX NG has no meaning, so this block need not be the first block in the element definition.
W3C XML Schema cannot specify such a dependency between the content of an attribute and child elements.
RELAX NG's specification only lists two built-in types, but it allows for the definition of many more. In theory, the lack of a specific list allows a processor to support data types that are very problem-domain specific.
Most RELAX NG schemas can be algorithmically converted into W3C XML Schemas and even DTDs. The reverse is not true. As such, RELAX NG can be used as a normative version of the schema, and the user can convert it to other forms for tools that do not support RELAX NG.