fsfe-website/freesoftware/standards/minimalisticstandards.en.xhtml

<?xml version="1.0" encoding="UTF-8" ?>

<html>
  <version>1</version>

  <head>
    <title>Minimalistic Data Format – Open Standards</title>
  </head>
  <body class="article">
    <p id="category">
      <a href="/freesoftware/standards/standards.html">Open Standards</a>
    </p>
    <h1>The minimal principle: because being an open standard is not enough.</h1>

    <p>A tool is useless without something to work on. So what do we shape with
    our computing tools?  Data, information, knowledge, opinions, art – in
    short: Content. Content is created, processed and transmitted, nowadays
    ever more often directly in some electronic format.  The number of people
    who own devices and connect to the Internet is constantly rising, and they
    use it to evolve their ways of working together.  </p>

    <p>Content is sent from one user to another and back. To do this, the
    content must take on some form: The data-format. This defines how content
    and its wrapping are to be handled, what is allowed and how each part looks
    within a file or stream. Anyone who wants to participate in the data
    exchange must use a software application that understands the data-format
    in question. Otherwise the content would appear like an unknown foreign
    language to your computer. If a data-format does not allow for the
    inclusion of pictures, for example, then there is no way to include
    pictures with it. The choice of data-format dictates the number of years
    for which I may access the content (backwards compatibility) and what I am
    able to do with it.  </p>

    <p>A single user will probably not feel any effect of her decision when
    saving a file in a particular data-format.  When an IT department or a
    public administration decides upon a data-format the impact is far greater:
    It will dominate their choice of software for several years, possibly
    decades. The more an organisation saves its precious writings, recordings
    or pictures electronically, the more important it becomes to secure
    continued access to the documents.  These decisions, directly or indirectly,
    lead to the funding of the initial development and maintenance of
    data-formats, whether they be "good or bad" formats. The choices taken at
    one time naturally affect the available choices in the future: Many
    software producers intentionally try to influence users to use a
    data-format that they (the producer) control. For example, when technical
    schematics of vehicles, buildings or machinery are all held in a format
    controlled by the software producer, the producer of the CAD application
    can in essence hold the data for ransom when it's time to renew the
    contracts. From the vendor’s point of view this is a strong position to be
    in for the next pricing negotiation.  Occasionally, whole countries have
    managed to maneuver themselves onto the losing end of this situation.  </p>

    <p>As you can see, a good data-format can only be an <a
      href="/freesoftware/standards/def.html">Open Standard</a>.  This requirement
    alone, however, is not enough. The data-format needs to solve a problem
    adequately: It should be a good fit from a functional point of view, as
    well as on a technical level.  In order to judge this, there are a number
    of things to consider. The <a
    href="http://www.w3.org/People/Bos/DesignGuide/introduction">essay by Bert
    Bos</a> explains the design principles of the W3C – the organisation which
    develops the formats of the World Wide Web. He mentions efficiency,
    maintainability, accessibility, extensibility, learnability, simplicity,
    longevity and a few more.</p>

    <p>Two central questions here are:</p><ul>
      <li>How well does the data-format solve the problem? </li>
      <li>Is there a simpler format that could solve the problem just as well?</li></ul>

    <p>The first question is self-explanatory: Whoever wants to save, transmit
    and search within a text would not want a format for pixel-based images –
    although using such a format is unavoidable during the first step of
    scanning papers or incoming faxes.</p>

    <p>The second question is much more interesting: Is the format as simple as
    possible and as complicated as necessary? It is very hard to design or
    choose a data-format which follows this principle of minimalism.</p>

    <p>Firstly, there is the anti-pattern of <a
    href="http://sourcemaking.com/antipatterns/design-by-committee">“Design
    by Committee”</a>, where several decision makers participate in each decision.
    Decisions about which software product to use within an organisation – especially a public one – are also often made by large committees.
    Too many cooks then easily spoil the broth and put more into the standard than is
    actually necessary. The W3C at least <a
    href="http://www.w3.org/People/Bos/DesignGuide/committee.html">
	    is aware of this pattern</a>. Many groups are not.</p>


    <p>A second problem is the common use of checklists when evaluating
    software solutions.  Typically it goes like this: Every stakeholder can add
    something to the list. The wishes are often specific solution ideas
    and get condensed into the checklist for the procurement department. The
    software product that promises to fulfil most of the items on the checklist
    wins. Most of the time this means buying into a single data-format which has
    many rarely used and unneeded features. It would be better to add
    items with a focus on the problem (rather than the solution) to begin
    with. The evaluation process should award higher grades to solutions
    which consist of a number of simple, easily extensible and complementary
    data-formats which can be combined for the more complex needs.</p>

    <p>But software vendors know their customers: The more features on a
    checklist are ticked off, the more valuable a piece of software will
    appear. That is because it seems – at first glance – to serve many needs,
    except for the need for simple elegance. And so this is what the software
    and the data-format will look like: Bloated with many features, to reflect
    as many specific solution ideas as possible. This gives the software
    producer another advantage: Any competitor will have a hard time supporting
    the full feature list of the format, or providing a superior solution to
    just a few elements. The customer is forced to buy all or nothing. Why
    bother with another data-format when there is already one that claims to do
    everything?
    </p>

    <p>Every additional feature or guideline complicates the description of the
    data-format, and the number of possible feature combinations grows
    exponentially. The disadvantages of bloated formats are enormous. The
    developers of software which needs to handle a data-format must understand
    the description in total: This includes the complete text of the
    specification and then all possible combinations of its elements.
    Having to read and understand less means the resulting software
    implementation will be simpler and more accurate. This leads to more
    software which can handle the data-format at a high level of quality. What
    follows is more competition, more choice and therefore more users of this
    format.</p>
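
    <p>The growth in combinations can be made concrete with a small Python
    sketch. It rests on a simplifying assumption that is not part of any real
    specification: every feature is optional and independent of the others, so
    each subset of features is a distinct case an implementation has to handle.
    The function names and the feature names are hypothetical.</p>

```python
from itertools import combinations


def count_feature_combinations(n_features):
    """Number of subsets of n optional, independent features (assumption)."""
    return 2 ** n_features


def enumerate_combinations(features):
    """List every subset of the given features; each one is a case to test."""
    subsets = []
    for size in range(len(features) + 1):
        subsets.extend(combinations(features, size))
    return subsets


if __name__ == "__main__":
    # Hypothetical feature names, chosen only for illustration.
    small_format = ["headlines", "lists", "links"]
    print(len(enumerate_combinations(small_format)))
    for n in (3, 10, 20, 40):
        print(n, count_feature_combinations(n))
```

    <p>Under this assumption, each additional optional feature doubles the
    number of combinations, which is why exhaustive testing of a feature-rich
    format quickly becomes impractical.</p>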

    <p>The more complex a data-format is, the greater the chance that it has
    rarely needed features. So the format and the implementation are comparable
    to a huge and sprawling mansion: Some rooms are very popular and
    well-frequented, while other places are hardly ever visited by people. Of
    course such a house is harder to secure.  Burglars could push open a lonely
    forgotten window to the basement or hide tools in a cobwebbed corner during
    an official visit to the premises.  </p>

    <p>Experts see complexity as the biggest threat to software security. This is why
    many of them are critical or even hostile towards standards.
    <a class="fn" id="ref-complexity" href="#fn-complexity">1</a></p>

    <p>To get an understanding of the risks, let us take a look at how a
    computer deals with written characters. A commonly used standard is Latin-9
    (ISO/IEC 8859-15).  It enables a computer to handle text in more than 20
    languages – mostly western European ones.  A single character encoded in
    Latin-9 can take one of 256 different values.  A newer standard called
    Unicode (ISO/IEC 10646) is supposed to encode all languages of the world.
    It therefore comes with more than a million possible values per character.
    To make things worse, a single character can be encoded in several
    different ways, for example in "UTF-8" or "UCS-2". On the one hand, Unicode
    is a blessing: Once it is implemented correctly, an application is prepared
    to handle hundreds of languages. On the other hand, a programmer looking at
    the source code of a program cannot work out in her head all the effects a
    character might have.  With the 256 cases of Latin-9 she could. With
    Unicode the overview gets lost.  A clever attacker might find combinations
    the developer did not think of.  This happens on a regular basis. Here are
    two examples: 1. <a
    href="http://en.wikipedia.org/wiki/IDN_homograph_attack">The IDN homograph
    attack</a> tricks users with similar-looking Internet addresses.
    Cyrillic letters from the Unicode character set are well suited to this.
    2. The developers of a well-known web server fell prey to <a
    href="http://web.nvd.nist.gov/view/vuln/detail?vulnId=CAN-2000-0884">the
    possibilities of Unicode in URLs</a>.</p>
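
    <p>Both effects can be observed with a short Python sketch. It is an
    illustration under stated assumptions, not a security tool: the helper
    names and the word "example" are hypothetical, and Python's "utf-16-be"
    codec stands in for UCS-2, with which it coincides for characters in the
    Basic Multilingual Plane.</p>

```python
import unicodedata

LATIN_A = "\u0061"     # LATIN SMALL LETTER A
CYRILLIC_A = "\u0430"  # CYRILLIC SMALL LETTER A, usually rendered identically


def byte_forms(char):
    """Return the bytes of one character under three encodings."""
    return {
        "latin-9": char.encode("iso8859_15"),
        "utf-8": char.encode("utf-8"),
        # Stand-in for UCS-2 (assumption: character is in the BMP).
        "utf-16-be": char.encode("utf-16-be"),
    }


def spoof(word):
    """Swap every Latin 'a' for its Cyrillic look-alike."""
    return word.replace(LATIN_A, CYRILLIC_A)


if __name__ == "__main__":
    # The euro sign is one byte in Latin-9 but two or three bytes elsewhere.
    print(byte_forms("\u20ac"))
    genuine = "example"
    fake = spoof(genuine)
    # The two strings look the same on screen but compare as different.
    print(genuine == fake)
    print([unicodedata.name(c) for c in fake])
```

    <p>A program that compares such strings byte by byte treats them as
    different, while a human reader sees the same word. This is the gap the
    homograph attack exploits.</p>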

    <p>Unsurprisingly, more applications handle Latin-9 well than handle
    Unicode well. It is the same problem with every "fat"
    data-format: There are applications that do not understand the more exotic
    features, if only because it has become impossible to test the myriad
    of features.  The software will advertise that it can read data-format “X”
    – but whether this works in practice is questionable.</p>

    <p>Some data-formats create this problem on purpose: They come in different
    revisions. To be sure that software packages are compatible, the user has
    to define the precise version of the data-format used.  For example there
    are three variants (1.0, 1.1 and 1.2) of the Open Document Format (ODF).
    It is likely that the complexity grows with each version.  Certainly there
    are many cases where using the features of version 1.0 would be completely
    adequate.  But the default probably is to save files in the newest version
    the software supports.  For PDF this problem is even more significant. Some <a
    href="http://pdfreaders.org/os.en.html">versions or parts of PDF</a> do
    not even qualify as an open standard.</p>

    <p>One of the first things told to anyone who wants to understand how
    computers work is that there are two different kinds of things: data and
    programs (aka "applications").  While data is merely processed, programs
    contain the instructions that command the computer. Imagine a note on a
    piece of paper saying: Jump off the bridge!  I can read this as data and
    process it by copying it or handing it to someone else without problems.
    But if I consider it to be an instruction, I may easily get hurt following
    it. It is the same for computers.  Data-formats like ODF, DOC and PDF may,
    besides data, also contain instructions for automatic execution ("macros")
    or interactive elements (e.g. JavaScript). This turns a regular file into a
    potential application controlling your computer. Naturally, attackers try
    to take advantage of this, as with the <a
    href="http://www.cert.org/tech_tips/Melissa_FAQ.html">Melissa macro
    virus from 1999</a>.</p>
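
    <p>The difference between handling content as data and handling it as
    instructions can be sketched in a few lines of Python. This is a toy
    illustration only: the function names and the note are hypothetical, and
    real macro systems are far more elaborate than a call to Python's built-in
    exec.</p>

```python
def handle_as_data(note):
    """Process the note as data: measure and copy it, never run it."""
    return {"length": len(note), "copy": str(note)}


def handle_as_instructions(note, log):
    """Treat the note as a program: whatever it says gets executed."""
    exec(note, {"log": log})


if __name__ == "__main__":
    # A hypothetical note whose text happens to be a valid instruction.
    note = "log.append('jumped off the bridge')"
    log = []
    print(handle_as_data(note))   # nothing happens to the log
    handle_as_instructions(note, log)
    print(log)                    # the instruction has been carried out
```

    <p>The first function can never cause harm, whatever the note says. The
    second hands control to whoever wrote the note, which is what a document
    format with embedded macros does.</p>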

    <p>Most texts that are exchanged need only a small fraction of what
    common data-formats have to offer in terms of formatting, mark-up or
    layout.  A simple file composed of Latin-9 characters has been editable for
    decades on every computer with a simple text editor or any word
    processor.  A small subset of HTML 2 could cater for advanced needs like
    headlines, bullet lists and hyperlinks. Alternatively, any <a
    href="http://en.wikipedia.org/wiki/Creole_%28markup%29">simple text-based
    markup language</a> like those used by wikis would work for many tasks.  The
    Wikipedia pages and web logs ("blogs") of the world are proof that a lot of
    content can be expressed by simple means.</p>

    <p>Everyone – except vendors of proprietary software – profits from
    different software products competing with each other, while being secure
    and interoperable.  The minimal principle for data-formats promotes all
    this.  It just has one rule: Remove everything that is not absolutely
    necessary.  Aim for your design to be <a
    href="http://www.paulgraham.com/taste.html">simple and elegant</a>.  A good
    solution resembles a set of building blocks where an infinite number of
    buildings can be made, just by combining a few types of elements.</p>

    <p>Even though there may be good reasons to choose a data-format which
    covers several requirements we should ask ourselves each time: “Can this be
    done more simply?”</p>

    <h2 id="fn">Footnotes</h2>
    <ol>
      <li id="fn-complexity">"Complexity is the main enemy of security",
    Niels Ferguson and Bruce Schneier, Practical Cryptography, Wiley, 2003,
    ISBN 0-471-22357-3, p. 146 ("9.4.1 Simplicity") and pp. 365 ff. ("23 Standards"),
    <a href="https://www.schneier.com/book-practical.html">https://www.schneier.com/book-practical.html</a>  [<a href="#ref-complexity">&#8626;</a>]</li>
  </ol>

    <p>Thanks for suggestions, proofreading and translation work to Peter
    Bubestinger, Philipp Kammerer, the folks from the FSFE DE mailing list and
    Anna F J Morris.</p>
  </body>

<legal type="cc-license">
 <license>https://creativecommons.org/licenses/by-sa/3.0/</license><notice>In addition to the website's default license, this article is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license (CC BY-SA 3.0)</notice>
</legal>
<author id="reiter" />
<date>
  <original content="2014-02-27" />
</date>
<sidebar/>
<translator>Philipp Kammerer</translator>
</html>