Tuesday, May 02, 2006

Understanding W3C Schema Complex Types
By Donald Smith

Are W3C XML Schema complex types so difficult to understand that you shouldn't even bother trying? Kohsuke Kawaguchi thinks so; or so he claimed in his recent XML.com article, in which he offered assurances that you can write complex types without understanding them.

My response to that assertion is to ask why you would want to write complex types without understanding them, especially when they are easily understandable. There are four things you need to know in order to understand complex types in W3C Schemas, and all four are easy to grasp. See what you think about Kawaguchi's argument after learning them.

One of the most important, but least emphasized, aspects of W3C schemas is the type hierarchy. The importance of the type hierarchy can hardly be overstated. Why? Because the syntax for expressing types in schemas follows precisely from the type hierarchy.

XML Schema Part 2: Datatypes, Section 3 contains a helpful graphic that explains the schema type hierarchy.

Schema types form a hierarchy because they all derive, directly or indirectly, from the root type. The root type is anyType. (You can actually use anyType in an element declaration; it allows any content whatsoever.) The type hierarchy first branches into two groups: simple types and complex types. Here we encounter the first two of the four things you need to know in order to understand complex types: first, derivation is the basis of connection between types in the type hierarchy; and, second, the initial branching of the hierarchy is into simple and complex types.

Kinds of Derivation
To derive a type means to take an existing type (called the "base") and modify it in some way so as to produce a new type. There are four kinds of derivation: restriction, extension, list, and union. This discussion looks at derivation by restriction and extension since they are the most commonly used.

Derivation by restriction takes an existing type as the base and creates a new type by limiting its allowed content to a subset of that allowed by the base type. Derivation by extension takes an existing type as the base and creates a new type by adding to its allowed content.

Simple Types versus Complex Types
Simple types and complex types differ in this way: simple types cannot have element children or attributes; complex types may have element children and attributes.

Tracing the type hierarchy down the branch of simple types, we see that the first simple type is anySimpleType, which is also a type that you could actually use. W3C XML Schema has 44 built-in simple types, each of which derives from anySimpleType, and all but three of which derive by restriction. Not a single one derives by extension. To extend a simple type would mean to add element children or an attribute. This contradicts the definition of simple type in W3C Schemas and is thus prohibited.

Deriving a New Simple Type
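Thinking in terms of the type hierarchy, it ought to be relatively straightforward to derive a new simple type, "myNameType", that restricts its base type, "string", to a specific, fixed subset, "Don Smith". The W3C XML Schema fragment for expressing myNameType is a simpleType element containing a restriction element whose base is string, which in turn contains a single enumeration element with the value "Don Smith".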
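
A minimal sketch of such a definition might look like the following. The "xsd" prefix and the target namespace URI are assumptions made for illustration; only the type name, the base type, and the enumerated value come from the discussion here.

```xml
<?xml version="1.0"?>
<!-- Assumptions: the "xsd" prefix and the example.com namespace URI
     are placeholders chosen for this sketch. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:dc="http://example.com/dc"
            targetNamespace="http://example.com/dc">

  <!-- A new simple type derived by restriction from the built-in string type -->
  <xsd:simpleType name="myNameType">
    <xsd:restriction base="xsd:string">
      <!-- The enumeration facet limits the allowed values to one literal -->
      <xsd:enumeration value="Don Smith"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
```

Reading from the outside in, the nesting mirrors the hierarchy: a simple type, derived by restriction, from string.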







As you can see, the XML Schema for this type definition follows the type hierarchy exactly -- except for the enumeration element, one of the twelve facets that can be used to qualify types. We won't look at facets here since our concern is with the relation between the type hierarchy and the Schema syntax for expressing types. I simply used this one to complete the example. (See Table B1.a, "Simple Types and Applicable Facets", in Schema Part 0: Primer for a convenient list of facets.)

Now I just have to associate myNameType with an element, and then I can use the type in an XML document. So an element declaration whose type is myNameType lets me use an element containing Don Smith in an XML document instance.
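
As a rough sketch, the declaration could sit alongside the type definition in the same schema. The element name "employee" is borrowed from the later examples, and the "xsd" prefix and namespace URI are placeholders assumed for illustration.

```xml
<?xml version="1.0"?>
<!-- Assumptions: the "xsd" prefix, the example.com namespace URI, and
     the element name "employee" are chosen for this sketch. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:dc="http://example.com/dc"
            targetNamespace="http://example.com/dc">

  <xsd:simpleType name="myNameType">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="Don Smith"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Associates the simple type with an element -->
  <xsd:element name="employee" type="dc:myNameType"/>

</xsd:schema>
```

Under those assumptions, a matching instance document would be:

```xml
<?xml version="1.0"?>
<!-- The element is namespace-qualified because it is declared globally
     in a schema that has a target namespace. -->
<dc:employee xmlns:dc="http://example.com/dc">Don Smith</dc:employee>
```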

What about Complex Types?
Does the syntax for complex types also follow the logic of the type hierarchy? Yes. But the type hierarchy diagram doesn't help us at this point because it doesn't provide two crucial pieces of information about complex types. However, once you understand these two points, complex types lose their indecipherable complexity and become quite intelligible.

The Two Forms of Complex Types
Complex types are divided into two groups: those with simple content and those with complex content. And that leads us to the third thing you need to know in order to understand complex types: while both forms of complex type allow attributes, only those with complex content allow child elements; those with simple content only allow character content.

In other words, the difference between complex types with simple content and complex types with complex content is that the former do not allow element children while the latter do. That's it. The two forms of complex type are represented in what I call the Schema Type Decision Tree (PDF) under the complex type branch.

Let's suppose I want to add an attribute to myNameType. Adding an attribute to a simple type always moves it into the complex type branch of the type hierarchy. Once on the complex type branch, I must ask a second question. Do I want the new type to allow element content? If I don't, then my new type must be a complex type with simple content. After that, I simply take dc:myNameType as the base type and extend it by adding an attribute.
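
One way this could be written is sketched below. The attribute name "position" and its string type are assumptions borrowed from the later employee example, and the "xsd" prefix and namespace URI are placeholders; the type names and the dc: prefix follow the text.

```xml
<?xml version="1.0"?>
<!-- Assumptions: the "xsd" prefix, the example.com namespace URI, and
     the attribute name and type are chosen for this sketch. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:dc="http://example.com/dc"
            targetNamespace="http://example.com/dc">

  <xsd:simpleType name="myNameType">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="Don Smith"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Complex type with simple content: character data plus an attribute,
       but no child elements -->
  <xsd:complexType name="myNewNameType">
    <xsd:simpleContent>
      <xsd:extension base="dc:myNameType">
        <xsd:attribute name="position" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

  <xsd:element name="employee" type="dc:myNewNameType"/>

</xsd:schema>
```

The derivation is by extension because something (the attribute) is being added to what the base type allows.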









Now, after declaring my element "employee" to be of myNewNameType, I can have an employee element that carries the new attribute and contains Don Smith in my XML document instance.

It may seem odd that adding an attribute to a simple type requires the creation of a new complex type, one that has simple content to boot. But that's the logic of the type hierarchy: a type that has attributes must be a complex type, and that type can either allow element children or not. Perhaps an odd logic, but it is intelligible.

Let's suppose now that I want my complex type to have child elements. That requires a complex type with complex content. So I simply add my content model and attributes (if any). That's easy. But maybe too easy. We must be careful not to skip over a crucial fact that makes a big difference.

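Adding a content model is still a derivation of a new type from some base type. If I do not take an existing complex type as the base for the new derivation, what will I use for a base type? I'll use anyType. The vast majority of types that allow element content are restrictions of anyType. For example, a type for "employee" can be written as a complex type with complex content that restricts anyType to a sequence of two child elements plus an attribute.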
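
Spelled out in full, such a definition might look like the sketch below. The "xsd" prefix, the use of an anonymous type, the string types of the children and the attribute, and the absence of a target namespace are all assumptions made to keep the example short; the element and attribute names follow the surrounding text.

```xml
<?xml version="1.0"?>
<!-- Assumptions: no target namespace, an anonymous type, and
     xsd:string for the children and the attribute. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="employee">
    <xsd:complexType>
      <!-- Complex content: child elements are allowed -->
      <xsd:complexContent>
        <!-- Derived by restricting the root of the type hierarchy -->
        <xsd:restriction base="xsd:anyType">
          <xsd:sequence>
            <xsd:element name="name" type="xsd:string"/>
            <xsd:element name="location" type="xsd:string"/>
          </xsd:sequence>
          <xsd:attribute name="position" type="xsd:string"/>
        </xsd:restriction>
      </xsd:complexContent>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```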















The type associated with "employee" now has an element named "name" followed by an element named "location". Further, personnel can have an attribute named "position":


<employee position="...">
  <name>Don Smith</name>
  <location>Dallas, TX</location>
</employee>


The logic behind the syntax is straightforward. I want a type that allows child elements. That requires a complex type with complex content, while still deriving a new type from a base type. In this case I'm restricting anyType; I could as easily extend another type. I add my content model and an attribute declaration. I'm done, and it was all pretty easy.

Ambushed by Abbreviation

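But can't this be expressed more concisely? Yes, it can. There is an abbreviated form for all complex type definitions that have complex content and restrict anyType. You simply leave out the <complexContent> and <restriction> elements and place the content model and attribute declarations directly inside <complexType>.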
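
For the employee type, the abbreviated form could be sketched as follows, under the same assumptions as before (an "xsd" prefix, an anonymous type, string-typed children and attribute, no target namespace):

```xml
<?xml version="1.0"?>
<!-- Assumptions: same placeholders as the full-syntax sketch. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="employee">
    <!-- No complexContent or restriction wrapper: complex content that
         restricts anyType is implied -->
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="name" type="xsd:string"/>
        <xsd:element name="location" type="xsd:string"/>
      </xsd:sequence>
      <xsd:attribute name="position" type="xsd:string"/>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```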









This type definition is equivalent to the previous one. And that leads us to the fourth thing you need to know in order to understand complex types: the default syntax for complex types is complex content that restricts anyType.

Why didn't I show you the abbreviated syntax first? Because the abbreviation obscures the logic behind the default syntax. If all you see is <complexType> followed by a content model, it's totally confusing as to why complex types sometimes have <simpleContent> or <complexContent> child elements or, often, neither.

Now that you know the logic behind the two forms of complex type, you won't be confused when you see a complex type that has neither <simpleContent> nor <complexContent>. You know what the default is.

Those Tricky Empty Elements
Writing type definitions for empty elements turns out to be counter-intuitive, but, fortunately, the logic behind the complex type syntax still holds. Remember that an empty element is one that has neither data content nor child elements. It may have an attribute. Let's take the case of an empty element that doesn't have an attribute.

Your first inclination might be to associate the empty element with a simple type. But that won't work since simple types allow data content. So it must be a complex type. Then, ask yourself the next question. Will it allow element children? No. We need a <complexType> with <simpleContent>, right?

Wrong. Complex types with simple content also allow data content, and we want an empty element. That leaves us with a <complexType> with <complexContent>, which ensures that there will not be any data content in the element. But we don't want child elements, either, and a complex type with complex content allows child elements. The key is that it doesn't require them. What do we do? Simply leave the content model out of the type definition.
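
A sketch of the full syntax for such an empty type is shown below. The "xsd" prefix, the anonymous type, and the absence of a target namespace are assumptions; the element name "callMyApp" follows the text.

```xml
<?xml version="1.0"?>
<!-- Assumptions: no target namespace and an anonymous type. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="callMyApp">
    <xsd:complexType>
      <xsd:complexContent>
        <!-- Restricting anyType with no content model and no attributes
             leaves nothing that the element is allowed to contain -->
        <xsd:restriction base="xsd:anyType"/>
      </xsd:complexContent>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```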










Our type definition, now associated with the element "callMyApp", allows the markup <callMyApp/> to occur in my XML document instance.

Now apply the default syntax for complex types to this type definition. An equivalent definition is simply a <complexType> element with no children at all.
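
Under the same assumptions as the full form (an "xsd" prefix, an anonymous type, no target namespace), the abbreviated definition might be written as:

```xml
<?xml version="1.0"?>
<!-- Assumptions: no target namespace and an anonymous type. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="callMyApp">
    <!-- An empty complexType: complex content restricting anyType is
         implied, and no content model means no content -->
    <xsd:complexType/>
  </xsd:element>

</xsd:schema>
```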




It's no wonder that people get confused about complex types. They generally don't realize that all complex types are divisible into two kinds: those with simple content and those with complex content. The reason people don't generally realize this is that they normally learn the abbreviated syntax first. But, as we've seen, if you learn the full syntax and the logic behind it first, then the abbreviated syntax, and complex types in general, cease to be a befuddling conundrum.

If all of this is now as clear to you as it is to me, you don't have to trust anyone's assurances that you should use complex types without understanding them. You can now use and understand them.

(http://www.xml.com/lpt/a/2001/08/22/easyschema.html)
