Introduction to the Darwin Information Typing Architecture

Executive summary

The Darwin Information Typing Architecture (DITA) is an XML-based, end-to-end architecture for authoring, producing, and delivering technical information. This architecture consists of a set of design principles for creating "information-typed" modules at a topic level and for using that content in delivery modes such as online help and product support portals on the Web.

At the heart of DITA (Darwin Information Typing Architecture), representing the generic building block of a topic-oriented information architecture, is an XML document type definition (DTD) called "the topic DTD." The extensible architecture, however, is the defining part of this design for technical information; the topic DTD, or any schema based on it, is just an instantiation of the design principles of the architecture.

Background

This architecture and DTD were designed by a cross-company workgroup representing user assistance teams from across IBM. After an initial investigation in late 1999, the workgroup developed the architecture collaboratively during 2000 through postings to a database and weekly teleconferences. The architecture has been placed on IBM's developerWorks Web site as an alternative XML-based documentation system, designed to exploit XML as its encoding format. With the delivery of these signficant updates contains enhancements for consistency and flexibility, we consider the DITA design to be past its prototype stage.

Information interchange, tools management, and extensibility

IBM, with millions of pages of documentation for its products, has its own very complex SGML DTD, IBMIDDoc, which has supported this documentation since the early 1990s. The workgroup had to consider from the outset, "Why not just convert IBMIDDoc or use an existing XML DTD such as DocBook, or TEI, or XHTML?" The answer requires some reflection about the nature of technical information.

First, both SGML and XML are recognized as meta languages that allow communities of data owners to describe their information assets in ways that reflect how they develop, store, and process that information. Because knowledge representation is so strongly related to corporate cultures and community jargon, most attempts to define a universal DTD have ended up either unused or unfinished. The ideal for information interchange is to share the semantics and the transformational rules for this information with other data-owning communities.

Second, most companies rely on many delivery systems, or process their information in ways that differ widely from company to company. Therefore any attempt at a universal tool set also proves futile. The ideal for tools management is to base a processing architecture on standards, to leverage the contributed experience of many others, and to solve common problems in a broad community.

Third, most attempts to formalize a document description vocabulary (DTD or schema) have been done as information modelling exercises to capture the current business practices of data owners. This approach tends to encode legacy practices into the resulting DTDs or vocabularies. The ideal for future extensibility in DTDs for technical information (or any information that is continually exploited at the leading edge of technology) is to build the fewest presumptions about the "top-down" processing system into the design of the DTD.

In the beginning, the workgroup tried to understand the role of XML in this leading edge of information technology. As the work progressed, the team became aware that any DTD design effort would have to account for a plurality of vocabularies, a tools-agnostic processing paradigm, and a legacy-free view of information structures. Many current DTDs incorporate ways to deal with some of these issues, but the breadth of the issues lead to more than just a DTD. To support many products, brands, companies, styles, and delivery methods, the entire authoring-to-delivery process had to be considered. What resulted was a range of recommendations that required us to represent our design, not just as a DTD, but as an information architecture.

Main features of the DITA architecture

As the "Architecture" part of DITA's name suggests, DITA has unifying features that serve to organize and integrate information:

Topic orientation. The highest standard structure in DITA is the topic. Any higher structure than a topic is usually part of the processing context for a topic, such as a print-organizing structure or the helpset-like navigation for a set of topics. Also, topics have no internal hierarchical nesting; for internal organization, they rely on sections that define or directly support the topic.
Reuse. A principal goal for DITA has been to reduce the practice of copying content from one place to another as a way of reusing content. Reuse within DITA occurs on two levels:
- Topic reuse. Because of the non-nesting structure of topics, a topic can be reused in any topic-like context. Information designers know that when they reuse a topic in a new information model, the architecture will process it consistently in its new context.
- Content reuse. The SGML method of declaring reusable external entities is available for XML users, but this has several practical limitations in XML. DITA instead leans toward a different SGML reuse technique and provides each element with a conref attribute that can point to any other equivalent element in the same or any other topic. This referencing mechanism starts with a base element, thus assuring that a fail-safe structure is always part of the calling topic (the topic that contains the element with the conref attribute). The new content is always functionally equivalent to the element that it replaces.
Specialization. The class mechanism in CSS indicates a common formatting semantic for any element that has a matching class value. In the same way, any DITA element can be extended into a new element whose identifier gets added to the class attribute through its DTD. Therefore, a new element is always associated to its base, or to any element in its specialization sequence.
- Topic specialization. Applied to topic structures, specialization is a natural way to extend the generic topic into new information types (or infotypes), which in turn can be extended into more specific instantiations of information structures. For example, a recipe, a material safety data sheet, and an encyclopedia article are all potential derivations from a common reference topic.
- Domain specialization. Using the same specialization principle, the element vocabulary within a generic topic (or set of infotyped topics) can be extended by introducing elements that reflect a particular information domain served by those topics. For example, a keyword can be extended as a unit of weight in a recipe, as a part name in a hardware reference or as a variable in a programming reference. A specialized domain, such as programming phrases, can be introduced by substitution anywhere that the root elements are allowed. This makes the entire vocabulary available throughout all the infotyped topics used within a discipline. Also, a domain can be replaced within existing infotyped topics, in effect hiding the jargon of one discipline from writers dealing with the content of another. Yet both sets of topics can be appropriate for the same user roles of performing tasks or getting reference information.
Property-based processing. The DITA model provides metadata and attributes that can be used to associate or filter the content of DITA topics with applications such as content management systems, search engines, processing filters, and so on.
- Extensive metadata to make topics easier to find. The DITA model for metadata supports the standard categories for the Dublin Core Metadata Initiative. In addition, the DITA metadata enables many different content management approaches to be applied to its content.
- Universal properties. Most elements in the topic DTD contain a set of universal attributes that enable the elements to be used as selectors, filters, content referencing infrastructure, and multi-language support. In addition, some elements, whose attributes can serve a range of specialized roles, have been analyzed to make sure that their enumerated values provide a rich basis for specialization (which usually constrains values and never adds to them).
Taking advantage of existing tags and tools. Rather than being a radical departure from the familiar, DITA builds on well-accepted sets of tags and can be used with standard XML tools.
- Leveraging popular language subsets. The core elements in DITA's topic DTD borrow from HTML and XHTML, using familiar element names like p, ol, ul, dl within an HTML-like topic structure. In fact, DITA topics can be written, like HTML, for rendering directly in a browser. In more ambitious designs, DITA topics can be written, like SGML, to be normalized through processing into a deliverable, say XHTML or a well-formed XML format targeted for a particular browser's ability to handle XML. Also, DITA makes use of the popular OASIS (formerly CALS) table model.
- Leveraging popular and well-supported tools. The XML processing model is widely supported by a number of vendors. The class-based extension mechanism in DITA translates well to the design features of the XSLT and CSS stylesheet languages defined by the World Wide Web Consortium and supported in many transformation tools, editors and browsers. DITA topics can be processed by a spectrum of tools ranging from shareware to custom tailored products, on almost any operating platform.

Topic as the basic architectural unit

The various information architectures for online deliverables all tend to focus on the idea of topics as the main design point for such information. A topic is a unit of information that describes a single task, concept, or reference item. The information category (concept, task, or reference) is its information type (or infotype). A new information type can be introduced by specialization from the structures in the base topic DTD. Typed topics are easily managed within content management systems as reusable, stand-alone units of information. For example, selected topics can be gathered, arranged, and processed within a delivery context to provide a variety of deliverables. These deliverables might be groups of recently updated topics for review, helpsets for building into a user assistance application, or even chapters or sections in a booklet that are printed from user-selected search results or "shopping lists."

Benefits of the DITA architecture

Through topic granularity and topic type specialization, DITA brings the following benefits of the object-oriented model to information sets:

Encapsulation. The designer of the topic type only needs to address a specific, manageable problem domain. The author only needs to learn the elements that are specific to the topic type. The implementer of the processing for the topic type only needs to process elements that are special.
Polymorphism. Special topic types can be treated as more generic topic types for common processing.
Message passing. The class attribute preserves at all times the derivation hierarchy of an element. At any time, a topic may be generalized back to any earlier form, and if the class attributes are preserved, these topics may be re-specialized. One use of this capability would be to allow two separate disciplines to merge data at an earlier common part of the specialization hierarchy, after which they can be transformed into one, the other, or a brand new domain and set of infotyped topics.

DITA can be considered object-oriented in that:

Data and processors are separated from their environment and can be chunked to provide behaviors similar to object-orientation (such as override transforms that modify or redefine earlier behaviors).
Classification of elements through a sequence of derivations that are progressively more specific, possibly more constrained, and always rigidly tied to a consistent processing or rendering model.
Inheritance of behaviors, to the extent that new elements either fall through to behaviors for ancestors in their derivation hierarchy, or can be mapped to modified processors that extend previous behaviors.

With discipline and ingenuity, some of the benefits of topic information sets can be provided through a book DTD. In particular, techniques for chunking can generate topics out of a book DTD. In DITA, the converse approach is possible: a book can be assembled from a set of DITA topics. In both cases, however, the adaptation is secondary to the primary purpose of the DTD. That is, if you are primarily authoring books, it makes the most sense to use a DTD that is designed for books. If you are primarily authoring topics, it makes sense to use a DTD that is designed for topics and can scale to large, processable collections of topics.

DITA overview

The Darwin Information Typing Architecture defines a set of relationships between the document parts, processors, and communities of users of the information.

The Darwin Information Typing Architecture has the following layers that relate to specific design points expressed in its core DTD, topic.

Layers in the Darwin Information Typing Architecture

Delivery contexts
helpset	aggregate printing	Web site; information portal

Typed topic structures
topic	concept	task	reference

Specialized vocabularies (domains) across information types
Typed topic:	concept	task	reference
Included domains:	highlighting software programming user interface

Common structures
metadata	OASIS (CALS) table

A typed topic, whether concept, task, or reference, is a stand-alone unit of ready-to-be-published information. Above it are any processing applications that may be driven by a superset DTD; below it are the two types of content models that form the basis of all specialized DTDs within the architecture. We will look at each of these layers in more detail.

DITA delivery contexts

This domain represents the processing layer for topical information. Topics can be processed singly or within a delivery context that relates multiple topics to a defined deliverable. Delivery contexts also include document management systems, authoring units, packages for translation, and more.

delivery contexts
helpset	aggregate printing	Web site; information portal

DITA typed topic specializations (infotyped topics)

The typed topics represent the fundamental structuring layer for DITA topic-oriented content. The basis of the architecture is the topic structure, from which the concept, task, and reference structures are specialized. Extensibility to other typed topics is possible by further specialization.

typed topic structures
topic	concept	task	reference

The four information types (topic, concept, task, and reference) represent the primary content categories used in the technical documentation community. Moreover, specialized, information types, based on the original four, can be defined as required.

As a notable feature of this architecture, communities can define or extend additional information types that represent their own data. Examples of such content include product support information, programming message descriptions, and GUI definitions. Besides the ability to type topics and define specific content models therein, DITA also provides the ability to extend tag vocabularies that pertain to a domain. Domain specialization takes the place of what had been called "shared structures" in DITA's original design.

DITA vocabulary specialization (domains)

Commonly, when a set of infotyped topics are used within a domain of knowledge, such as computer software or hardware, a common vocabulary is shared across the infotyped topics. However, the same infotyped topic can be used across domains that have different vocabularies and semantics. For example, a hardware reference topic might refer to diagnostic codes while a software reference topic might refer to error message numbers, with neither domain necessarily needing to expose the other domain's unique vocabulary to its own writers.

Using the same technique as specialization for topics, DITA allows the definition of domains of special vocabulary that can be shared among infotyped topics. Domains can even be elided entirely, to produce typed topics that have only the core elements. The vocabulary of a domain can take the form of phrases, special paragraphs, and lists--basically anything allowed within a section, the smallest organizing part of a topic.

specialized vocabularies (domains) across information types
Typed topic:	concept	task	reference
Included domains:	highlighting software programming user interface

The basic domains defined as examples for DITA include:

Domain	Elements
highlighting	b, u, i, tt, sup, sub
software	msgph, msgblock, msgnum, cmdname, varname, filepath, userinput, systemoutput
programming	codeph, codeblock, option, var, parmname, synph, oper, delim, sep, apiname, parml, plentry, pt, pd, syntaxdiagram, synblk, groupseq, groupchoice, groupcomp, fragment, fragref, synnote, synnoteref, repsep, kwd
user interface	uicontrol, wintitle, menucascade, shortcut

By following the rules for specializing a new domain of content, you can extend, replace, or remove these domains. Moreover, content specialization enables you to name and extend any content element in the scope of DITA infotyped topics for a more semantically significant role in a new domain.

To enable specialized vocabulary, you declare a parameter entity equivalent for every element used in a DTD (such as topic or one of its specializations), and then use the parameter entities instead of literal element tokens within the content models of that DTD. Later, after entity substitution, because an element's parameter entity is redefined to include both the original element and the domain elements derived from that element, anywhere the original element is allowed, the other derived domain elements are also allowed. In effect, a domain-agnostic topic can be easily extended for different domains by simply changing the scope of entity set inclusions in a front-end DTD "shell" that formalizes the vocabulary extensions within that typed topic or family of typed topics

DITA common structures

One of the design points of DITA has been to exploit the reuse of common substructures within the world of XML. Accordingly, the topic DTD incorporates the OASIS table model (known originally as the CALS table model). It also has a defined set of metadata that might be shared directly with the metadata models of quite different DTDs or schemas.

common structures
metadata	OASIS (CALS) table

The metadata structure defines document control information for individual topics, higher-level processing DTDs, or HTML documents that are associated to the metadata as side files or as database records.

The table structure provides presentational semantics for body-level content. The OASIS/CALS table display model is supported in many popular XML editors.

Elements designed for specialization

DITA provides a rich base for specialization because of the general design of elements used in its archetype-like topic DTD.

For example, a section in the base topic DTD can contain both text and element data. However, a section can be specialized to eliminate PCDATA, yielding an element-only content model similar to the body level of most DTDs. Specialized another way, a section can eliminate most block-like elements and thus be characterized as a description for definitions, field labels, parts, and so forth.

In DITA, an effort has been made to select element names that are popular or that are common with HTML. Some semantic names have been borrowed from industry DTDs that support large SGML libraries, such as IBMIDDoc and DocBook.

The attribute lists within the topic DTD reflect this design philosophy. For example, one of the "universal attributes" (they appear on most elements) is importance, which defines values for weightings or appraisals that are often used as properties in specialized elements. This attribute shows up in several elements of the task topic specialization with only two allowed values out of the original set, "optional' and "required." In other domains, the elements are more appropriately ranked as "high" or "low," again values that are provided at the topic level.

The values of specialization

A company that has specific information needs can define specialized topic types. For example, a product group might identify three main types of reference topic: messages, utilities, and APIs. By creating a specialized topic type for each type of content, the product architect can be assured that each type of topic has the appropriate content. In addition, the specialized topics make XML-aware search more useful because users can make fine-grained distinctions. For example, a user could search for "xyz" only in messages or only in APIs, as well as search for "xyz" across reference topics in general.

There are rules for how to specialize safely: each new information type must map to an existing one and must be more restrictive in the content that it allows. With such specialization, new information types can use generic processing streams for translation, print, and Web publishing. Although a product group can override or extend these processes, they get the full range of existing processes by default without any extra work or maintenance.

A corporation can have a series of DTDs that represent a consistent set of information descriptions, each of which emphasizes the value of specialization for those new information types.

Role of content communities in the Darwin Information Typing Architecture

The technical documentation community that designed this architecture defined the basic architecture and shared resources. The content owned by specified communities (within or outside of the defining community) can reuse processors, styles, and other features already defined. But, those communities are responsible for their unique business processes based on the data that they manage. They can manage data by creating a further specialization from one of the base types.

The following figure represents how communities, as "content owners at the topic level," can specialize their content based on the core architecture.

Relationship of specialized communities to the base architecture

In this figure, the overlap represents the common architecture and tools shared between content-owning communities that use this information architecture. New communities that define typed documents according to the architecture can then use the same tools at the outset, and refine their content-specific tools as needed.

Notices

The information provided in this document has not been submitted to any formal IBM test and is distributed "AS IS," without warranty of any kind, either express or implied. The use of this information or the implementation of any of these techniques described in this document is the reader's responsibility and depends on the reader's ability to evaluate and integrate them into their operating environment. Readers attempting to adapt these techniques to their own environments do so at their own risk.

DITA Open Toolkit, version 1.8.M2