Publishing with XML - Ligaran - ebook

XML is now at the heart of book publishing techniques: it provides the industry with a robust, flexible format which is relatively easy to manipulate. Above all, it preserves the future: the XML text becomes a genuine tactical asset enabling publishers to respond quickly to market demands. When new publishing media appear, you will be able to make your editorial content available very quickly and at a lower cost. On the downside, XML can become a bottomless pit for publishers attracted by its possibilities. There is a strong temptation to switch to audiovisual production and to add video and animation to what we currently call a book, i.e. a written, relatively linear discourse representing a series of ideas. Publishers cannot ignore technology, however. It is better to recognize the threats of innovation and to maintain your business and your convictions by boarding the e-publishing ship. But make sure you carry a life preserver, XML, to ride above the waves of modern times.

ABOUT ÉDITIONS LIGARAN

Éditions Ligaran offers high-quality digital versions of great works of classic literature as well as rare books, in partnership with the BNF. Great care goes into these ebook versions to avoid the errors too often found in digital versions of these texts. Ligaran offers great classics in the following fields:

• Rare books
• Libertine books
• History books
• Poetry
• First World War
• Children's books
• Crime fiction


Number of pages: 236

Bernard Prost
Publishing with XML
<Structure - Enter - Publish>
Ligaran Publishing 2015

EAN : 9782335086522

Copyright Ligaran 2015


71100 Chalon-sur-Saône



Summarizing the relation between XML and publishing in a short book is a difficult task, and I could never have carried it out on my own. First I wish to thank some key people at Editions Eyrolles (the publisher of the French edition of this book): my editor Stéphanie Poisson and her team, as well as Véronique Dürr who helped her with the proofreading. They have the art of giving meaning to my thoughts which occasionally get overwhelmed by technology.

I also wish to thank all those who worked with me on XML:

– the shareholders of Ligaran: Alain Pierrot, a remarkable designer of advanced taxonomies, connoisseur of the Open Office suite, XSLT author, and an expert in book scanning; Xavier Maurin, the code and graphic wiz at, who has a brilliant view of the consumer digital world; Olivier Desnoux, a software developer with impeccable methodology, author of elegant (and legible!) code, co-designer of the MyBookForge transformation engine; Adrien Vieilleribière, talented researcher, major XSLT artist able to put just about anything online and make XML transformation to any format accessible to all, who also co-designed the MyBookForge transformation engine; Patrick Pierre, a talented engineer and one of the most advanced minds in publication technology—his mastery of IDML (barely discussed in this book) is remarkable; and Hugues Cochard, serial-creator of high-tech companies, currently in Tahiti but very present via the Web.

– all those who trusted me with their professional or scientific projects, notably Mai Nguyen and Lionel Ridoux who know everything about medication and XML.

– two friends met along the way: Christian Brugeron for his clever scripts designed to work around the limitations of just about any page layout software—starting with InDesign; and Benoît Leprince who provided various examples of InDesign layouts used to illustrate this book.

Thanks to all those in the brand new e-book ecosystem which should take off at an astounding rate worldwide and perhaps in France as well: notably to Houriah Ghebalou (PREMICE, the regional business incubator in Burgundy) who financed the preliminary research for the Ligaran/Mybookforge project; the Burgundy region, which supported the project and its local set-up; and Nicéphore Cité, our home away from home in Chalon-sur-Saône which assists image and audio start-ups.

Finally I would like to thank Ray Charles, who understood that the medium influences the message: without the need to flip the 45 RPM record to listen to the other half, the famous break in What’d I Say would not exist!


Wait a minute, wait a minute, oh hold it! Hold it!


Hey (hey) ho (ho) hey (hey) ho (ho) hey (hey) ho (ho) hey

Ray Charles (What’d I Say)

If everything is under control, you are going too slow

Mario Andretti

The world of publishing is going through a sea change. Paper books are facing competition from an ever-expanding range of virtual devices: the Web (obviously); the compact, powerful, and aptly-named netbooks; and especially mobile phones and other nomadic devices like e-book readers and notepads which make complete professional and literary libraries available to all, urbi et orbi. Now publishers need to deliver content for these media, making use of their specific features while minimizing both costs and production lead times. At first publishers had to make several revisions of the same content for different target media. But today publishers are adopting a more industrial—yet also more standardized and restrictive—approach based on XML.

The flexible and universal nature of XML has attracted publishers—first and foremost those specializing in legal publications, who are used to working with SGML—as well as programmers, who can use the language to exchange data between a wide range of computer systems.


DEFINITION E-reader

A portable device for reading electronic books (e-books). An e-reader is a hardware device using display technology called e-paper, the marketing term for a non-backlit screen requiring minimal energy and reputedly less tiring for the eyes. Along with e-paper, marketers have coined the term e-ink to describe a pixel...

The Extensible Markup Language (XML), standardized in 1998, has reached maturity. An XML ecosystem has emerged, populated by specialized software (XML editors); on-shore, near-shore and off-shore service providers specializing in the language; application developers able to use the Document Object Model (DOM) to create innovative electronic products; and industry-specific document models for various types of publications.


DEFINITION DOM

The Document Object Model is a tree-based IT model for XML or HTML documents. The DOM is independent of all other taxonomies. It enables programs to manipulate document components.
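As a minimal sketch of what "manipulating document components" can look like, the following Python fragment uses the standard library's xml.dom.minidom module. The article fragment and its tag names (article, title, para) are illustrative assumptions, not the structure used later in this book.

```python
from xml.dom import minidom

# Hypothetical article fragment; the tag names are illustrative only.
SOURCE = "<article><title>Violin</title><para>A bowed string instrument.</para></article>"

doc = minidom.parseString(SOURCE)
root = doc.documentElement

# Read a component: the text held by the first <title> element.
title = root.getElementsByTagName("title")[0]
title_text = title.firstChild.data

# Manipulate the tree: append a second paragraph to the article.
para = doc.createElement("para")
para.appendChild(doc.createTextNode("It has four strings."))
root.appendChild(para)

paragraph_count = len(root.getElementsByTagName("para"))
print(title_text, paragraph_count)
```

The program never searches the text for angle brackets: it works on a tree of nodes, which is the point of the model.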

Nevertheless, XML usage has not yet stabilized and practices vary among publishers. The purpose of this book is to provide a practical overview of how publishers can use XML, based on concrete, tested methods which, by nature, are limited to specific cases. Publishing with XML is neither a bible nor a dogmatic treatise on the subject, and readers can adapt the examples provided to suit their needs.

How the book is organized

This book includes three parts—Structure, Enter, and Publish—covering the entire XML cycle for publishing an “e-book”. Publishing with XML is mainly intended for publishers, editors/proofreaders, and production managers. But it also addresses managers wishing to understand the underlying techniques, and to comprehend how the medium influences the design and format of digital publications. Authors curious to learn more about XML's possibilities can also discover new ways to design their composition.

The book frequently refers to a sample encyclopedia article, similar to those found in Wikipedia. The example is based on a structure developed specifically for this publication (article_v1.2.dtd). The example meets simple editorial requirements:

– be able to publish the article in paper format, on the Web, or on a smartphone.

– include interactive publication objects regarding authors, bibliographies, filmographies and discographies. The interactive features must be independent of target databases.

For simplicity's sake, this book does not contain tables or mathematical formulas (except for a few included as images).

Structuring with XML

The first chapter focuses on document modeling and the XML markup method. The following chapter describes the main structures found in a publication, or more generally in a document. Chapter Three shows how to write a DTD, i.e. the simplest way of representing a taxonomy.


DEFINITION Taxonomy

A set of tags used for encoding a document in XML. The taxonomy is usually written in a specialized language (such as DTD, XML Schema, or Relax NG).

Entering XML markup

Chapter Four concerns the actual entry of XML tags. In most cases, this job is outsourced, but publishers increasingly need to be able to modify a document using an XML editor in-house in order to correct minor errors or to make last-minute changes. This chapter focuses on configuring a commercial XML editor and using it with a specific DTD.

Chapter Five examines the relation with subcontractors: how to prepare the text to minimize errors when interpreting the structure, and how to create effective instructions.

Chapter Six discusses a step rarely described in the production process: proofing XML. It shows how to make sure the XML provided by the subcontractor meets the publisher's needs. This chapter also covers the various XML production models used for XML entry either before, during, or after the paper page layout.


Chapter Seven provides an overview of the techniques for transforming an XML document into a target format, including XML itself (e.g. input for InDesign), XHTML, or any other text format. Although highly technical, there is nothing mysterious about the XSLT transformation language. It is important for those involved in publishing to understand the mechanism in order to appreciate the impact of editorial decisions.

Chapter Eight briefly describes publishing on electronic media, but limits the discussion to the Web, e-readers, and the iPhone (currently the most advanced phone-based e-reader).

Finally, Chapter Nine investigates two approaches to paper-based publishing using an XML document:

– directly transforming XML into PDF using XSL-FO, a page layout language written in XML

– directly importing XML using a DTP tool (such as InDesign)

This book provides the keys to using XML in the editing process, but presents only the bare essentials of this modern publishing method. Interested readers can find books dedicated to each of these techniques.


XML terminology is relatively opaque. Many terms include references to SGML, style sheets, etc. but have lost their original meaning and the terms no longer reflect their actual role. You will need to apply them regardless of their usual meaning in English.

Chapter 1 Separating content from format

The crucial challenge for publishers is how to build a methodology for publishing across a wide range of current or future media, with a single markup process performed either before or after publication, and at the lowest possible cost. The first step in this process is to separate content from format, far beyond the techniques of word processor style sheets.

Modeling a document

A book, or more generally any document in XML format, requires a sufficiently general model adapted to all likely publishing scenarios. You create an abstract model for a set or a “class” of documents and then submit them to a common computer process.

Identifying the three aspects of a document

Once you become familiar with XML, you will never look at a document the same way. Traditionally, a document combines content, created by juxtaposing words (without any typographical enrichment), with form (which partially highlights the author's thoughts). Structure is a new document component, providing features which depend specifically on the planned use of the paper and electronic editions.

The content

The content is the text, i.e. what you read; it is independent of the format. The version with the least amount of format is an audio recording: each word only has its semantic value and is not supported by any typographical variations, although a few audio variants can give a word more meaning.

The format

The format enhances the information. It is based on a highly cultural and linguistic graphical translation providing an implicit manner of interpreting the text.

In our society, putting a character in bold highlights it, both for titles and within the body of the text. The character font and the position on the page reflect the level of importance: text which is “bigger and farther to the left” is usually the highest-level title.

The structure

Actually I should say “structures”: there is not just one structure, but an infinite number of structures depending on what you wish to identify for future use.

– For a novel to be published in both paper and electronic editions, you simply identify the chapters, chapter titles, paragraphs, and the text to be highlighted within each paragraph.

– For a journal article to be published on the Web with automatic search functions in Google Scholar or Google Books (or any other bibliographic database), you mark entries in the bibliography, the authors' names, and the titles of publications or journals cited.

Figure 1-1 Content, Format, Structure

The content (on the left) is made of the raw text—what can be read out loud (audio book).

The format (on the right) provides additional information which is heavily influenced by culture and practices. A title appears larger and in bold. It acts both as a “marker” and a “summary” to help readers as they discover the text.

The structure—shown here via callouts—is an abstract representation (in many cases guided by a pre-existing form) intended for multimedia use, without making any choices in principle regarding the final appearance.
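The three aspects can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names (chapter, title, para, emph) and the style values are hypothetical, and the standard xml.etree.ElementTree module stands in for whatever tool a production chain would use.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up fragment; tag names are illustrative.
SOURCE = (
    "<chapter>"
    "<title>Separating content from format</title>"
    "<para>XML keeps the <emph>structure</emph> explicit.</para>"
    "</chapter>"
)
root = ET.fromstring(SOURCE)

# Content: the raw words, as they could be read aloud.
content = " ".join(" ".join(root.itertext()).split())

# Structure: the element names, in document order.
structure = [element.tag for element in root.iter()]

# Format: one possible rendering per structural element, decided separately
# (these values are assumptions; another medium would use another table).
PRINT_STYLE = {"title": "18pt bold", "para": "10pt regular", "emph": "10pt italic"}

print(content)
print(structure)
```

Changing PRINT_STYLE alters the appearance without touching the content or the structure, which is the separation described above.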

Identifying document classes

There is no such thing as a generic document model able to represent any type of document. If one did exist, it would be so complex that it would be impossible to use. Therefore we try to define document “classes” that correspond to various ways of organizing information—such as a dictionary—or to natural groups such as the collections of a given publisher.

The process of defining document classes, called “document analysis”, involves extracting the structural elements for future use from a set of similar documents. You usually start from a limited number of available and representative publications, and then gradually build a model meeting your multimedia editorial requirements.

Structured documents

The most basic structured document is a novel or a dissertation. This model is the simplest, the most widely used, the most intuitive, but also the most complex for there is an infinite number of structural variations to manage (even if it means ignoring or simplifying them for electronic editions).


A graphical, textual, or numerical navigational indicator: numbering in a list, chapter numbers, etc.

A dissertation is often (but not always) divided into parts, which in turn are divided into chapters. Each chapter has an (optional) title, preceded by an (optional) number or label for positioning it in the book's organization. When the composition has neither chapter numbers nor a title, it is difficult to mark the chapters in an electronic edition. There are solutions, of course...

The most common structural component within a chapter is a paragraph: a semantic unit defined by the author and represented typographically by both an indentation on the first line—making it easy to see even if it appears at the start of a page—and a carriage return at the end. Within a paragraph, the author can highlight certain words or phrases using bold or italic font, for instance.

Finally, typographical variants related to a paragraph (such as flush right) express various concepts such as a quotation, an excerpt, an epigraph, etc. The number of possible variations is unlimited.


Each dictionary has its own structure; hence it is not realistic to speak of THE "dictionary" class structure. You will find a structure specific to each dictionary, the goal being to publish different paper versions (for example a paperback dictionary) or electronic versions with advanced features (hypertext links, lookup functions, etc.).

A dictionary is closer to a database structure than to a book structure. It has entries, often sorted alphabetically, organized in semantic units more or less like a database.

Usually, entries are structured in XML and look more or less like micro-documents which are stored within a database. A database offers powerful entry management features (for example assigning an entry to an author, locking an entry while it is being edited, version management...).

In the case of a dictionary like the Merriam-Webster dictionary, data would be organized the following way:

   Entry
      Header
      Definition
      Variant
      Illustration
         Figure
      Etymology
   etc.

Within each entry, the data would be split into as many blocks as the word has meanings.
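To make the idea of an entry as a micro-document more concrete, here is a hedged Python sketch. The tag names (entry, header, sense, definition, etymology) and the placeholder texts are assumptions made for this illustration; they are not Merriam-Webster's actual markup or wording.

```python
import xml.etree.ElementTree as ET

# Hypothetical micro-document for one entry, with one <sense> block per meaning.
ENTRY = """
<entry>
  <header>violin</header>
  <sense n="1">
    <definition>Placeholder text for the first meaning.</definition>
  </sense>
  <sense n="2">
    <definition>Placeholder text for the second meaning.</definition>
  </sense>
  <etymology>Placeholder etymology.</etymology>
</entry>
"""

entry = ET.fromstring(ENTRY)
headword = entry.findtext("header")

# Each meaning is an independent block that a database could store or lock.
senses = [(sense.get("n"), sense.findtext("definition")) for sense in entry.findall("sense")]

print(headword, len(senses))
```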

Figure 1-2 Organization of the entry "Violin" within the Merriam-Webster dictionary.
(source: Merriam-Webster)

Journals and articles

The structure of journal articles is fairly easy to share, and they represent a class of their own. Publishers of journals in the social and human sciences rely on fairly general models which are used for almost any journal, with only a few minor structural changes.

A very good example of this approach is which enables users to easily produce their own content, based on a document class whose taxonomy is published under license.

The book you are reading uses a simple article from an encyclopedia (i.e. without any tables or mathematical formulas) which could be used in production, with a few minor adjustments.

Other document classes

In cases where existing models are not suitable, a specific document analysis is required. This approach is applied by publishers of legal documents who have developed custom models. Companies publishing language-learning methods also use custom models, as they need both to publish for a range of media and to facilitate the localization (i.e. translation) of their publications while preserving everything concerning the language in question.

Other document classes serve multiple purposes. A publisher of schoolbooks, for instance, may create a set of interactive exercises based on its numerous manuals developed for education. In this case, the publisher defines a common class called “exercises” including the various types of exercises by structuring them for interactive use.

Table 1-1 A few elements in the “exercises” class for a publisher of educational books

Class | Meaning
MCQ | Multiple-choice question (several correct answers). Users must tick check boxes, with several possible choices.
SCQ | Single-choice question (only one correct answer). Users must choose a single radio button.
Fill-in-the-blanks | Text with missing terms. Users must enter the appropriate text.
Fill-in-the-blanks – pick list | Text with missing terms. Users must select the correct text from a list, which may contain false replies.
Matching – drag&drop | Two lists with several graphical items. Users must drag and drop the items to the appropriate position in order to make both lists match.
Matching – labels | Two lists with text items identified by a letter or a number. Users must match the items by entering the corresponding letter or number.

Identifying structures to mark according to the target media and prospective uses

Each medium has its own specific characteristics to consider when using XML. Clearly the screen on a smartphone is unsuitable for long titles (covering four or five lines on the screen of the phone, but just one or two in a paper edition). You must either shorten the title to make it multimedia-ready from the outset, or provide a title for each target medium. In the worst case, you can leave the full title if you are only targeting media (like the iPhone) able to automatically truncate titles.


No other medium can match paper from a graphical standpoint. The publisher has total control and can decide everything: graphic identity, fonts, line spacing, hyphenation, page size, type of paper, ink, etc.

You can of course use XML to produce the information intended for paper. But this poses a problem: how can XML reflect all the typographical richness available in print? XML markup identifies the semantic aspects of the text more than its typographical ones, and the choice of the latter is highly subjective.

The first thing that comes to mind is to “force” the typography in the XML code. In fact, XML representations proposed by word processors or DTP editors use this method. There is a downside: it is very difficult (if not impossible) to produce electronic editions well suited to the target media. In order to obtain an electronic edition with the same aesthetic result as the paper edition, you can simply publish the printer's PDF used for the paper edition. Unfortunately it may be very hard to read due to the font size, for instance. Furthermore, trying to horizontally and vertically scroll an 8.5" x 11" PDF file on a smartphone screen may exasperate users.

Figure 1-3 PDF pages on a smartphone screen: the reader can enlarge the text and then scroll it, if necessary, but this is similar to using a magnifying glass to read a document: not very user-friendly...

The use of XML to produce paper will depend on a wide range of parameters, such as the structural complexity of the composition (although there are workarounds), the graphical rendering (the structure of this book, although very simple, is in marked contrast to the graphical richness of the layout), and especially the increased productivity of the DTP operator. For schoolbooks, which contain lots of graphics and often have strict seasonal constraints, XML structuring prior to page layout is not a realistic option.

As concerns paper publishing, publishers face a strategic choice as to when to introduce XML in their production cycle. Should this be done upstream, before any page layout work? During page layout via cogeneration? Or after the page layout, as if a paper edition of the document already existed? See Chapter 9 for further discussion.

In any case, your choice of an XML structure to prepare the paper edition will depend on the required level of automation. For a dictionary, automation is total, and human editing is very limited, mainly to optimize page breaks. For a novel or a dissertation, productivity will be improved by 50 to 80% in an InDesign environment, according to the complexity of notes and the index. For a schoolbook, productivity is reduced (at least with the complexity level of today's models): using XML to produce the page layout for such books would be counterproductive. This does not mean you should not use XML in this case, but rather produce the XML document after the paper publication, by simplifying and optimizing the structure for the electronic edition.


The term “XML stream” is used frequently to refer to an XML document. This expression comes from the Internet, where it is possible to start interpreting an HTML page before the entire page is transferred to the browser. In the remainder of this book, consider XML stream and XML document as synonyms.

Electronic editions

The two leading constraints for an electronic edition are the size of the screen and the synchronous connectivity of the Internet. This version also depends on whether the published text will be flowable, i.e. whether it can be adapted to the display screen and adjusted on the fly. Text flow is specific to a given medium (with default values), but it also depends on the user's choices, such as the text size or the use of a particular font.

If there is only one target device and it has a large screen, you can simply publish the PDF used for the paper page layout. In this case, no electronic publishing per se is required: you simply provide the text in a format different from the source paper document.

If the screen is small (such as a smartphone or a Nintendo DS) or if the target includes several screens of different sizes, publishers need to take this into account: titles in particular may not fit on the screen. You will need to create shorter titles (i.e. additional editing) and anticipate their location in the XML structure.
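One way to anticipate the location of shorter titles in the XML structure is to store them next to the full title. The Python sketch below assumes a hypothetical medium attribute on the title element and a hypothetical helper named title_for; neither comes from the DTD used in this book.

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter carrying a full title and a shorter one for small screens.
SOURCE = """
<chapter>
  <title>Identifying structures to mark according to the target media and prospective uses</title>
  <title medium="small-screen">Structures to mark</title>
</chapter>
"""

def title_for(chapter, medium):
    """Return the title written for `medium`, falling back to the full title."""
    for title in chapter.findall("title"):
        if title.get("medium") == medium:
            return title.text
    # No medium-specific title: use the default one (no medium attribute).
    for title in chapter.findall("title"):
        if title.get("medium") is None:
            return title.text
    return None

chapter = ET.fromstring(SOURCE)
print(title_for(chapter, "small-screen"))
print(title_for(chapter, "paper"))
```

The fallback means the extra editing can be done gradually: a chapter without a short title still publishes with its full title.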

You may also wish to add browsing information specific to the electronic environment. You will need to revise the concept of the running head for its display in the tiny navigation system at the start of a chapter or section.

DEFINITION Tiny navigation

A navigation system displayed on a very small screen such as toc>part>chapter>section, where each level is accessible in hypertext.
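Such a trail can be derived from the structure itself. The following Python sketch assumes a hypothetical book with part, chapter and section elements carrying id attributes; the standard ElementTree module does not record parents, so the code builds that map first.

```python
import xml.etree.ElementTree as ET

# Hypothetical book structure; element names and ids are illustrative.
SOURCE = """
<book>
  <part id="p1"><title>Structure</title>
    <chapter id="c1"><title>Separating content from format</title>
      <section id="s1"><title>Modeling a document</title></section>
    </chapter>
  </part>
</book>
"""
book = ET.fromstring(SOURCE)

# Map every element to its parent so that we can walk upwards.
parent_of = {child: parent for parent in book.iter() for child in parent}

def tiny_navigation(element):
    """Return (level, id) pairs from the table of contents down to `element`."""
    trail = []
    node = element
    while node is not None and node is not book:
        trail.append((node.tag, node.get("id")))
        node = parent_of.get(node)
    trail.append(("toc", None))  # the book itself is shown as the table of contents
    return list(reversed(trail))

section = book.find(".//section")
trail = tiny_navigation(section)
breadcrumb = ">".join(level for level, _ in trail)
print(breadcrumb)
```

Each pair keeps the element's id, which is what a hypertext link for that level would point to.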

Furthermore, the text itself must take into account the characteristics of a multimedia publication. Cross-references by page number in a paper edition are fairly disconcerting in an electronic format, unless a PDF copies the exact page layout of the paper edition and the PDF page flow is identical to the paper edition—which is not usually the case.

Organizing structural elements
Naming structural elements

You can define your own vocabulary for XML tags: you can choose tit, title, or Title to mark a title.

Figure 1-4 Structural elements of a simple book, identified using common terms.

During the document analysis, you simply apply common English terms to designate the structural elements, using the publisher's terminology. For instance, you could define “chapter title”, “level 1 section title”, “level 2 section title”, etc. These working names are totally independent of the tag names you finally choose; they do not constrain that choice.
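Whatever vocabulary you choose, it must then be applied consistently, because XML names are case-sensitive: tit, title, and Title are three different tags. A short Python check with the standard ElementTree parser illustrates this; the doc wrapper element is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Three spellings, three distinct elements.
root = ET.fromstring("<doc><tit>A</tit><title>B</title><Title>C</Title></doc>")
found = {name: [e.text for e in root.findall(name)] for name in ("tit", "title", "Title")}

# An opening tag closed with a different spelling is not well-formed XML.
try:
    ET.fromstring("<title>Broken</Title>")
    mismatched_ok = True
except ET.ParseError:
    mismatched_ok = False

print(found, mismatched_ok)
```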

Defining the hierarchy of structural elements

You must define a hierarchy of structural elements, as in a table of contents. Structural elements are arranged in tree format in order to create an acceptable representation of the document class.
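A tree view of this kind can be produced directly from a sample document. The sketch below is only an illustration: the element names (article, title, lead, section, para) are assumed for the example and are not taken from article_v1.2.dtd.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton of an encyclopedia article (empty elements only).
SOURCE = "<article><title/><lead/><section><title/><para/><para/></section></article>"

def tree_view(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(tree_view(child, depth + 1))
    return lines

outline = tree_view(ET.fromstring(SOURCE))
print("\n".join(outline))
```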

Figure 1-5 Tree view of a document class (encyclopedia article)

Defining the behavior of structural elements

You can use XML to define the basic behavior of each structural element. A component may be “mandatory” or “optional”: the title of a newspaper article is usually defined as mandatory (if it is not included, the document is not valid), but the lead-in is optional. Components can also be either “unique” or “repeatable”: the article title is unique, while a paragraph is repeatable.

During the document analysis, you can use plain English by combining the adjectives “unique”, “repeatable”, “mandatory” and “optional”, or you can include an operator (?, *, +) appended to the component name (often used in DTDs—see Chapter 3).


You can combine several modeling aspects: one component may be unique and mandatory (e.g. the title of an article) while another may be repeatable and optional (e.g. a paragraph).
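The four combinations map onto the operators as follows: no operator means mandatory and unique, ? means optional and unique, + means mandatory and repeatable, * means optional and repeatable. The Python sketch below checks a document against such a description. It is a simplified stand-in for real DTD validation: the model dictionary and the check function are hypothetical, and the order of components is ignored.

```python
import xml.etree.ElementTree as ET

# Operator -> (minimum, maximum) occurrences; None means "no upper limit".
CARDINALITY = {
    "": (1, 1),      # mandatory and unique
    "?": (0, 1),     # optional and unique
    "+": (1, None),  # mandatory and repeatable
    "*": (0, None),  # optional and repeatable
}

# Hypothetical model of a newspaper article, following the examples in the text.
ARTICLE_MODEL = {"title": "", "lead": "?", "para": "*"}

def check(element, model):
    """Return a list of cardinality problems (empty if none were found)."""
    problems = []
    for name, operator in model.items():
        minimum, maximum = CARDINALITY[operator]
        count = len(element.findall(name))
        if count < minimum:
            problems.append(f"<{name}> is mandatory but missing")
        if maximum is not None and count > maximum:
            problems.append(f"<{name}> is unique but appears {count} times")
    return problems

valid = ET.fromstring("<article><title>T</title><para>a</para><para>b</para></article>")
invalid = ET.fromstring("<article><lead>x</lead><lead>y</lead></article>")

print(check(valid, ARTICLE_MODEL))
print(check(invalid, ARTICLE_MODEL))
```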

The resulting representation provides a good idea of the document model. It is later fine-tuned when you write the DTD (or via another system to represent the taxonomy), where you can define very accurate constraints concerning the structural elements.