<?xml version="1.0" encoding="UTF-8" ?>
<title>Minimalistic Data Format – Open Standards</title>
<body class="article">
<p id="category">
<a href="/freesoftware/standards/standards.html">Open Standards</a></p>
<h1>The minimal principle: because being an open standard is not enough.</h1>
<p>A tool is useless without something to work on. So what do we shape with
our computing tools? Data, information, knowledge, opinions, art – in
short: Content. Content is created, processed and transmitted, nowadays
far more often directly in some electronic format. The number of people
who own devices and connect to the internet is constantly rising, and they
use it to evolve their ways of working together.</p>
<p>Content is sent from one user to another and back. To do this, the
content must take on some form: The data-format. This defines how content
and its wrapping are to be handled, what is allowed and how each part looks
within a file or stream. Anyone who wants to participate in the data
exchange must use a software application that understands the data-format
in question. Otherwise the content would appear like an unknown foreign
language to your computer. If a data-format does not allow for the
inclusion of pictures, for example, then there is no way to include
pictures with it. The choice of data-format dictates the number of years
for which I may access the content (backwards compatibility) and what I am
able to do with it.</p>
<p>A single user will probably not feel any effect of her decision when
saving a file in a particular data-format. When an IT-department or a
public administration decides upon a data-format the impact is far greater:
It will dominate their choice of software for several years, possibly
decades. The more an organisation saves its precious writings, recordings
or pictures electronically, the more important it becomes to secure
continued access to the documents. These decisions, directly or indirectly,
lead to the funding of the initial development and maintenance of
data-formats, whether they be "good or bad" formats. The choices taken at
one time naturally affect the available choices in the future: Many
software producers intentionally try to influence users to use a
data-format that they (the producer) control. For example, when technical
schematics of vehicles, buildings or machinery are all held in a format
controlled by the producer of the CAD application, that producer
can in essence hold the data for ransom when it is time to renew the
contracts. From the vendor’s point of view this is a strong position to be
in for the next pricing negotiation. Occasionally, whole countries have
managed to maneuver themselves into the losing end of this situation. </p>
<p>As you can see, a good data-format can only be an <a
href="/freesoftware/standards/def.html">Open Standard</a>. This requirement
alone, however, is not enough. The data-format needs to solve a problem
adequately: It should be a good fit from a functional point of view, as
well as on a technical level. In order to judge this, there are a number
of things to consider. The <a
href="">essay by Bert
Bos</a> explains the design principles of the W3C, the organisation which
develops the formats of the World Wide Web. He mentions efficiency,
maintainability, accessibility, extensibility, learnability, simplicity,
longevity and a few more.</p>
<p>Two central questions here are:</p><ul>
<li>How well does the data-format solve the problem? </li>
<li>Is there a simpler format that could solve the problem just as well?</li></ul>
<p>The first question is self-explanatory: Whoever wants to save, transmit
and search within a text would not want a format for pixel-based images,
although such a format is unavoidable during the first step of
scanning papers or incoming faxes.</p>
<p>The second question is much more interesting: Is the format as simple as
possible and as complicated as necessary? It is very hard to design or
choose a data-format which follows this principle of minimalism.</p>
<p>Firstly, there is the anti-pattern of <a
href="">“Design by Committee”</a>, where several decision makers participate in each decision.
Decisions about which software product to use within an organisation, especially a public one, are also often made by large committees.
It then easily happens that too many cooks spoil the broth and put more into the standard than is
actually necessary. The W3C at least <a
href="">is aware of this pattern</a>. Many groups are not.</p>
<p>A second problem is the common use of checklists when evaluating
software solutions. Typically it goes like this: Every stakeholder can add
something to the list; the wishes are often specific solution ideas
and get condensed into the checklist for the procurement department; the
software product promising to fulfil most of the items on the checklist
wins; most of the time this means buying a single data-format which has
many rarely used and unneeded features. It would be better to add
items with a focus on the problem (rather than the solution) to begin
with. The evaluation process should award higher grades to solutions
which consist of a number of simple, easily extensible and complementary
data-formats which can be combined for the more complex needs.</p>
<p>But software vendors know their customers: The more features on a
checklist are ticked off, the more valuable a piece of software will appear,
because at first glance it seems to serve many needs, except for the
need for simple elegance. And so this is what the software and the
data-format will look like: Bloated with many features, to reflect as many
specific solution ideas as possible. This gives the software producer
another advantage: Any competitor will have a hard time supporting the full
feature list of the format, or providing a superior solution to just a few
elements. The customer is forced to buy all or nothing. Why bother with
another data-format when there is already one that claims to do everything?</p>
<p>Every additional feature or guideline complicates the description of the
data-format exponentially. The disadvantages of bloated formats are
enormous. The developers of software which needs to handle a data-format
must understand the description in total: this includes the complete text
of the specification and then all possible combinations of its elements.
Having to read and understand less means the resulting software
implementation will be simpler and more accurate. This leads to more
software which can handle the data-format at a high level of quality. What follows is
more competition, more choice and therefore more users of this format.</p>
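<p>How quickly the combinations mentioned above grow can be sketched in a few
lines of Python. The sketch assumes, purely for illustration, that every
feature is optional and may be combined freely with every other one; real
formats are usually more constrained. The function name interaction_counts is
hypothetical.</p>

```python
from math import comb


def interaction_counts(features: int) -> dict:
    """Count what an implementer has to consider for a format with the
    given number of optional, freely combinable features (an assumption)."""
    return {
        "features": features,
        # every unordered pair of features may interact
        "pairs": comb(features, 2),
        # every subset of features is a distinct kind of document
        "subsets": 2 ** features,
    }


for n in (5, 10, 20):
    print(interaction_counts(n))
```

<p>Under these assumptions, going from 10 to 20 features doubles the length of
the feature list but multiplies the number of possible feature subsets by
1024.</p>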
<p>The more complex a data-format is, the greater the chance that it has
rarely needed features. So the format and the implementation are comparable
to a huge and sprawling mansion: Some rooms are very popular and
well-frequented, while other places are hardly ever visited by people. Of
course such a house is harder to secure. Burglars could push open a lonely
forgotten window to the basement or hide tools in a cobwebbed corner during
an official visit to the premises. </p>
<p>Experts see complexity as the biggest threat to software security. This is why
many of them are critical or even hostile towards standards.
<a class="fn" id="ref-complexity" href="#fn-complexity">1</a></p>
<p>To get an understanding of the risks, let us take a look at how a
computer deals with written characters. A commonly used standard is Latin-9
(ISO/IEC 8859-15). It enables a computer to handle text in more than 20
languages, mostly western European ones. A single electronic
character encoded in Latin-9 can have 256 different possible
values. A newer standard called Unicode (ISO/IEC 10646) is supposed to encode
all languages of the world. Therefore it comes with more than a million
possible values per character. To make things worse, a single character
can be encoded in several different ways, for example in "UTF-8" or
"UCS-2". On the one hand Unicode is a blessing: Once implemented correctly, an
application is prepared to handle hundreds of languages. On the other hand
a programmer looking at the source code of a piece of software cannot fully
work out in her head all the effects a character might have. With the 256
cases of Latin-9 she could. With Unicode this overview is
lost. A clever attacker might find combinations the developer did not
think of. This happens on a regular basis. Here are two examples: 1. <a
href="">The IDN homograph
attack</a> plays tricks on users with similar-looking Internet addresses.
Cyrillic characters from Unicode are well suited to this. 2. The developers of a
well-known webserver fell prey to <a
href="" >the
possibilities of Unicode in URLs</a>.</p>
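<p>Both effects described above can be observed with the Python standard
library. The following is a minimal sketch, not a defence against homograph
attacks: the example strings and the helper scripts_used are hypothetical, and
guessing the script from the character name is a rough shortcut chosen only to
keep the example short.</p>

```python
import unicodedata

# One character, several byte representations depending on the encoding.
euro = "\u20ac"  # EURO SIGN
print(euro.encode("iso-8859-15"))  # a single byte in Latin-9
print(euro.encode("utf-8"))        # three bytes in UTF-8
print(euro.encode("utf-16-be"))    # two bytes in UTF-16

# Two strings that usually look the same on screen but are different data.
latin = "paypal"
mixed = "p\u0430yp\u0430l"  # U+0430 is CYRILLIC SMALL LETTER A
print(latin == mixed)


def scripts_used(text: str) -> set:
    """Rough script guess: the first word of each Unicode character name."""
    return {unicodedata.name(ch).split()[0] for ch in text}


print(scripts_used(latin))
print(scripts_used(mixed))
```

<p>A program that only ever expected the 256 values of Latin-9 never has to ask
which script a letter belongs to; with Unicode, that question becomes part of
handling something as ordinary as an Internet address.</p>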
<p>Unsurprisingly, more applications handle Latin-9 well than handle
Unicode well. It is the same problem with every "fat"
data-format: There are applications that do not understand the more exotic
features, if only because it has become impossible to test the myriad
of features. The software will advertise that it can read data-format “X”
- but whether this works in practice is questionable.</p>
<p>Some data-formats create this problem on purpose: They come in different
revisions. To be sure that software packages are compatible, the user has
to define the precise version of the data-format used. For example there
are three variants (1.0, 1.1 and 1.2) of the Open Document Format (ODF).
It is likely that the complexity grows with the version number. Certainly there are
many cases where using the features of version 1.0 would be completely
sufficient. But the default probably is to save files in the newest version the
software supports. For PDF this problem is even more significant. Some <a
href="">versions or parts of PDFs</a> do
not even qualify as an open standard.</p>
<p>Anyone who wants to understand how computers work is told early on
that there are two different kinds of things: data and programs
(aka "applications"). While data is merely processed, programs
contain the instructions that command the computer. Imagine a note on a
piece of paper: “Jump off the bridge!” I can read the data and process it by
writing it down or handing it to someone else without problems. But if I
consider it to be instructions, I may easily get hurt following them. It is
the same for computers. Data-formats like ODF, DOC and PDF may, besides
data, also contain instructions for automatic execution ("macros") or
interactive elements (e.g. JavaScript). This turns a regular file into a
potential application controlling your computer. Naturally attackers try to
take advantage of this, as with the <a
href="" >Melissa Macro
Virus from 1999</a>.</p>
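<p>The difference between reading content and obeying it can be sketched in
Python. This is an illustration only and does not show how any of the formats
named above handle macros; the function name read_as_data and the example note
are hypothetical. The sketch relies on ast.literal_eval from the standard
library, which accepts plain literals and refuses anything that would have to
be executed.</p>

```python
import ast

# Text that happens to be a valid Python instruction.
note = "__import__('os').getcwd()"

# Treated as data, the note is only stored, copied or measured.
print(len(note))


def read_as_data(text: str):
    """Return the literal value described by text, or None if the text
    is not a plain literal (for example because it contains a call)."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


print(read_as_data("[1, 2, 3]"))  # a harmless list of numbers
print(read_as_data(note))         # None: the call is rejected, not run
```

<p>A reader built this way cannot be talked into running anything, whatever the
file contains. A format that allows embedded instructions gives up that
guarantee by design.</p>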
<p>Most texts that are being exchanged only need a small fraction of
what common data-formats have to offer in terms of formatting, mark-up or
layout. A simple file composed of Latin-9 characters has been editable for
decades on every computer by means of a simple text editor or any word
processor. A small subset of HTML 2 could cater for advanced needs like
headlines, bullet-lists and hyperlinks. Alternatively any <a
href="">simple text-based
markup language</a> like those used by wikis would work for many tasks. The
Wikipedia pages and web-logs ("blogs") of the world are proof that a lot of
content can be expressed by simple means.</p>
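<p>How little code a minimal format needs can be shown with a sketch of a
reader for a made-up wiki-like markup. The syntax is an assumption chosen for
this example and does not follow any particular wiki: a line starting with
"= " is a headline, a line starting with "* " is a bullet, and everything else
is plain text. The function names are hypothetical.</p>

```python
def classify(line: str) -> tuple:
    """Map one line of the made-up markup to a (kind, text) pair."""
    stripped = line.strip()
    if stripped == "":
        return ("blank", "")
    if stripped.startswith("= "):
        return ("headline", stripped[2:].strip())
    if stripped.startswith("* "):
        return ("bullet", stripped[2:].strip())
    return ("text", stripped)


def parse(document: str) -> list:
    """Parse a whole document and drop the blank lines."""
    parsed = [classify(line) for line in document.splitlines()]
    return [item for item in parsed if item[0] != "blank"]


sample = "= Minutes\n* Budget approved\n* Next meeting in May\nSee the wiki for details."
for kind, text in parse(sample):
    print(kind, text)
```

<p>The whole format fits into three rules, so a developer can read and test the
complete reader in one sitting.</p>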
<p>Everyone – except vendors of proprietary software – profits from
different software products competing with each other, while being secure
and interoperable. The minimal principle for data-formats promotes all
this. It just has one rule: Remove everything that is not absolutely
necessary. Aim for your design to be <a
href="">simple and elegant</a>. A good
solution resembles a set of building blocks from which an infinite number of
buildings can be made, just by combining a few types of elements.</p>
<p>Even though there may be good reasons to choose a data-format which
covers several requirements we should ask ourselves each time: “Can this be
done more simply?”</p>
<h2 id="fn">Footnotes</h2>
<li id="fn-complexity">"Complexity is the main enemy of security",
Ferguson, Niels, and Schneier, Bruce - Practical Cryptography, Wiley, 2003,
ISBN 0-471-22357-3. p146 "9.4.1 Simplicity", pp365- "23 Standards"
<a href=""></a> [<a href="#ref-complexity">&#8626;</a>]</li>
Thanks for suggestions, proofreading and translation work to Peter Bubestinger, Philipp Kammerer, the folks from the FSFE DE mailing list and
Anna F J Morris.
<legal type="cc-license">
<license></license><notice>In addition to the website's standard licence, this article is available under the Creative Commons Attribution-ShareAlike 3.0 Unported licence (CC BY-SA 3.0)</notice>
<author id="reiter" />
<original content="2014-02-27" />
<translator>Philipp Kammerer</translator>