Thursday, June 07, 2007

Comparison of OpenDocument and Office Open XML formats

Overview
The OpenDocument format was originally defined by StarDivision (later acquired by Sun Microsystems) for their StarOffice product and was brought to OASIS by Sun and IBM who wanted it ratified as a standard. OpenDocument was approved as an ISO and IEC International Standard in May 2006, designated ISO/IEC 26300. It has been a published ISO/IEC standard since November 2006.

Office Open XML is defined by Microsoft and was approved as a standard by Ecma International in December 2006[1], designated ECMA 376.[2] Control of the Ecma standard will rest with Ecma International. It has been submitted to ISO/IEC for adoption under the ISO/IEC JTC 1 process.

The OpenDocument format is the native format of both OpenOffice.org 2.0 and KDE KOffice 1.5, and is targeted as a native format for multiple applications. Office Open XML is the native format for Microsoft Office 2007. A compatible plug-in has been released for some earlier editions of the Microsoft Office suite as well. At least three different OSS plug-ins for Microsoft Office [1] [2] [3] are being developed that will add support for opening and saving files in the OpenDocument format.


Advantages of OpenDocument over Office Open XML formats
Alex Hudson, J. David Eisenberg, Bruce D'Arcus and Daniel Carrera of the OpenDocument Fellowship wrote an article, published by the online journal Groklaw, which argues that OpenDocument has several technical advantages over Office Open XML (Hudson, 2005[3]). The article examines some problems based on the original draft of the Office Open XML standard (which has since been superseded), and claims the following differences:

OpenDocument uses a mixed content model,[4] whereas the Office Open XML format does not. "Non-mixed documents usually represent structured data; mixed documents are usually used to represent narrative. MS XML uses the non-mixed model to represent narrative (word processing). This sort of mismatch leads to awkward markup [...] The mixed-content model makes more sense, and is closer to what a developer will be familiar to."
OpenDocument is similar to XHTML, while MS XML is not. OpenDocument uses mixed content and marks styles in a similar way. "This makes it easier to transform data accurately between OpenDocument and XHTML, and also simplifies the reuse of existing skills."
OpenDocument gives better separation of style and content. "Both formats give you some separation, and neither format gives you perfect separation. But OpenDocument goes much further in that direction."
OpenDocument hyperlink URLs are embedded in the main file, whereas in Office Open XML the URL is placed in a separate file.
OpenDocument reuses existing standards whenever possible. It uses parts of SVG for drawings, MathML for equations, XLink for linking, Dublin Core for metadata, etc. "This makes the format infinitely more transparent to someone familiar with XML technologies. It also allows you to reuse existing tools that understand these standards", whereas "MS XML re-invents the wheel" (date format incompatible with ISO 8601, language specification incompatible with ISO 639). ODF's use of SVG is limited to some attributes and SVG content as such is not supported, even though this is widely touted.
Since the OpenDocument Fellowship article was written, the Office Open XML specification has incorporated Dublin Core metadata as well. However, OpenDocument still claims the advantage of using a mixed model similar to XHTML, as well as separation of content from presentation. Perhaps most importantly, OpenDocument continues to reuse existing standards wherever possible (such as SVG, MathML, and so on) instead of creating its own unique formats, which simplifies implementation and interoperability and builds on the significant work already invested in each of those standards.


Advantages of Office Open XML formats over OpenDocument
Proponents of the Office Open XML format have addressed some of the criticism arising from comparisons with OpenDocument, and have offered their own criticisms of the OpenDocument format. Much of this criticism has come from Brian Jones, a Microsoft program manager for Microsoft Office who works on the XML functionality and file formats in the Office product.

Microsoft has stated that a design goal for its formats was 100% compatibility with the existing base of documents and formatting used by its customers. In particular, Jones states that OpenDocument is not able to capture all the information potentially held by a binary Office file, whereas Office Open XML should be able to do that.[5]
Microsoft also states that the OpenDocument format lacks support for the complete set of functionality in Microsoft Office applications (such as VBA and OLE support, support for highlighting,[6] international numbering,[7] tables in presentations,[8] and other features), so any converter that saved information from an Office file (either binary or OpenXML) into an OpenDocument format would potentially be lossy. Counter-arguments raised on GrokDoc claim that such features are allowed in the OpenDocument format as namespace extensions, which would negate this argument.
Microsoft Excel has a well-known formula language that has been defined in its entirety in the new XML formats, whereas the OpenDocument TC has at this stage published the ODF OpenFormula specification as a draft which is expected to be incorporated in ODF v1.2.
It has been suggested that Office Open XML supports several non-western languages better than ODF - specifically, that it has better Arabicization and Internationalization.[9][10] These issues have probably arisen due to the comments of the Egypt ISO member as part of the OpenDocument ISO standardisation process. The OpenDocument TC has addressed these comments, though, and states "OpenDocument v1.0 has BiDi support, as well as support for text orientations, directions, numeric digits presentations and calendars [...] The TC intents to add a non-normative appendix which explains these features to a future version of the OpenDocument specification". Text for this appendix is available [11] and explains where the TC suggests this support is derived from.
The OpenXML spreadsheet format appears to load and save much faster than the ODF spreadsheet format. George Ou at ZDNet has tested both XML formats in their native applications OpenOffice.org 2.0 and Microsoft Office 2003. Office Open XML takes a distinctly different approach from OpenDocument to the storage of spreadsheet data, and implements several optimizations (such as sparsely populating worksheets, and sharing strings). Microsoft's Brian Jones has added some information[12] on this subject as well. Note also that the native proprietary Excel XLS binary format appears to be much faster than both XML implementations.[13]
In Office Open XML, all external references, such as hyperlinks or linked files, reside in a single relationships XML file contained in the document archive. This allows easy access to all external references in the document, and makes it much easier to fix up links when moving files from one server to another. To remove all external references for security reasons, it is enough to edit the relationships.[14]
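The link fix-up described in the last item can be sketched in Python. This is a minimal sketch and not part of the original comparison: the sample relationships part, its rId values and the helper names external_targets and retarget are hypothetical. Only the namespace URI and the TargetMode="External" attribute are taken from the way relationship parts are written in Office Open XML packages.

```python
import xml.etree.ElementTree as ET

# Namespace used by relationship parts inside an Office Open XML package.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Hypothetical relationships part, for illustration only.
SAMPLE_RELS = (
    f'<Relationships xmlns="{RELS_NS}">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" '
    'Target="styles.xml"/>'
    '<Relationship Id="rId4" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" '
    'Target="http://example.com" TargetMode="External"/>'
    '</Relationships>'
)


def external_targets(rels_xml):
    """Map relationship Id -> Target for every external relationship."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall(f"{{{RELS_NS}}}Relationship")
        if rel.get("TargetMode") == "External"
    }


def retarget(rels_xml, old_prefix, new_prefix):
    """Return the external targets with old_prefix swapped for new_prefix."""
    fixed = {}
    for rel_id, target in external_targets(rels_xml).items():
        if target.startswith(old_prefix):
            target = new_prefix + target[len(old_prefix):]
        fixed[rel_id] = target
    return fixed


if __name__ == "__main__":
    print(external_targets(SAMPLE_RELS))
    print(retarget(SAMPLE_RELS, "http://example.com", "http://example.org"))
```

Because every external target sits in one small part, neither helper has to walk the document body. In OpenDocument, the equivalent task means scanning the content for xlink:href attributes.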

Shortcomings of OpenDocument
OpenDocument has no macro language specification. Java applets are described in the specification, but a macro language is not compulsory for compliance with the ODF standard.
The specification is incomplete: there is no syntax description of spreadsheet formulas (this is due in ODF v1.2), and no description of password hashing yet (password hashes in an XML format would seem to be pointless, since the password can be bypassed by using a text editor to read or alter the content).
No native support for tables in presentations (this is due in ODF v1.2).
ODF 1.1 has no digital signature support (this is due in ODF v1.2).

Shortcomings of Office Open XML
The specification is incomplete: some parts reference the (not publicly specified) behaviour of other software, such as "autoSpaceLikeWord95", without further explanation. This may be an issue for applications like archiving, where some content may become inaccessible or lose clarity in the future because it is locked up in undocumented proprietary formats or behaviours. It may also be an issue for cross-vendor interoperability and interchange of data, since only Microsoft or Microsoft-licensed applications can access and process the content in these undocumented formats with full fidelity. These tags are deprecated, so implementations should not create new documents containing them. They apply only to documents that were created or modified in the original application the tag refers to and were then converted to OOXML. There are no proposals to remedy the incomplete parts of the OOXML specification.
In SpreadsheetML, the markup language for spreadsheets used in Office Open XML, one of the two numeric formats used for storing dates interprets the number 60 as 1900-02-29, as if the year 1900 were a leap year. Any implementation of this date1900 format needs to skip the number 60 when interpreting the numeric date value. This issue originated in Lotus 1-2-3 and was preserved by Microsoft Excel for backwards compatibility.
The specification for OOXML comes with a covenant not to sue, which states that all Microsoft patent claims needed to implement the specification can be used freely. However, the patent grant does not cover future updates or versions. This means Microsoft can influence the development of future revisions or fixes of this standard by Ecma, similar to what Sun can do for ODF, so that future versions remain compatible with prior Microsoft versions and support the Microsoft Office feature set. Withholding the patent grant for a future version would make adoption of that version unlikely, since the current version would remain free to use.
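The 1900 leap-year behaviour mentioned in the list above can be illustrated with a short Python sketch. It assumes whole-day serial numbers in which 1 corresponds to 1900-01-01. The function name serial_to_date is hypothetical, and rejecting serial 60 outright is one possible design choice, not something the specification prescribes.

```python
from datetime import date, timedelta


def serial_to_date(serial):
    """Convert a whole-day date1900 serial number to a calendar date.

    Serial 60 stands for the non-existent 1900-02-29, so every serial
    above it is one day "late" relative to a true day count.
    """
    if serial < 1:
        raise ValueError("serial numbers start at 1 (1900-01-01)")
    if serial == 60:
        raise ValueError("serial 60 is the fictitious 1900-02-29")
    # Before the phantom day the count is exact; after it, shift by one.
    epoch = date(1899, 12, 31) if serial < 60 else date(1899, 12, 30)
    return epoch + timedelta(days=serial)


if __name__ == "__main__":
    for n in (1, 59, 61):
        print(n, serial_to_date(n))
```

An implementation that must round-trip existing files might prefer to map serial 60 to a sentinel value instead of raising.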

Cross-platform interoperability
Microsoft Office 2007 for Windows uses Office Open XML as its native file format. Microsoft Office 2008 for Mac OS X, scheduled for release in late summer 2007, will also use Office Open XML as its native file format.[15] An ODF converter plugin for Microsoft Office XP/2003/2007 for Windows allows one to open and save OpenDocument word processing (.odt) files.
Corel has indicated that the WordPerfect Office X3 suite will include support for OpenDocument Format as well as Office Open XML by mid-2007.[16]
Gnumeric has included support for OpenDocument spreadsheet and preliminary support for Microsoft Office Open XML spreadsheet format since version 1.7.
IBM announced that Lotus Notes will use OpenDocument as the native format for its office productivity editors in the next release, due in 2007. IBM Workplace 2.6 already supports OpenDocument format.
Google Docs and Spreadsheets supports OpenDocument word processing and spreadsheet formats.
AbiWord 2.4 supports OpenDocument word processing format.
Scribus 1.3.3, a multi-platform, open source, page layout application, supports import of OpenDocument word processing files.
OpenDocument Format is currently supported in several office suites and individual applications[17], including as the native file format for KOffice 1.5, OpenOffice.org 2.0 and StarOffice 8. Support for OpenDocument was implemented independently, first in the KOffice 1.4 suite[18] and later in OpenOffice.org 2.0. Office suites which natively support OpenDocument Format are available on Windows, Mac OS X, Linux, BSD, Solaris, and Symbian OS.

Interoperability testing
The ODF Test Suite is a publicly available interoperability test suite developed by Intel and the University of Central Florida. Automated results are available for interoperability testing of KOffice and OpenOffice.org.

As of January 2007, no publicly available interoperability test suite exists for Office Open XML format. Since no currently released office suites provide native support for the format, it is not known to what extent documents saved in the Office Open XML format will be properly formatted in other office suites.


Example XML comparisons
First, an example of mixed versus non-mixed content, as provided in the Groklaw comparison of the two formats. Non-mixed documents usually represent structured data; mixed documents are usually used to represent narrative. MS XML uses the non-mixed model to represent narrative (word processing).

Non-Mixed (Open XML)



<w:p>
<w:r><w:t>This is a </w:t></w:r>
<w:r><w:rPr><w:b /></w:rPr><w:t>very basic</w:t></w:r>
<w:r><w:t> document </w:t></w:r>
<w:r><w:rPr><w:i /></w:rPr><w:t>with some</w:t></w:r>
<w:r><w:t> formatting, and a </w:t></w:r>
<w:hyperlink w:rel="rId4" w:history="1">
<w:r><w:rPr><w:rStyle w:val="Hyperlink" /></w:rPr><w:t>hyperlink</w:t></w:r>
</w:hyperlink>
</w:p>

Mixed (ODF):




<text:p text:style-name="Standard">
This is a
<text:span text:style-name="T1">very basic</text:span>
document
<text:span text:style-name="T2"> with some </text:span>
formatting, and a
<text:a xlink:type="simple" xlink:href="http://example.com">hyperlink</text:a>
</text:p>
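To make the difference between the two content models concrete, the following Python sketch extracts the plain text from a shortened version of each paragraph. It is a minimal sketch under stated assumptions: the fragments are abbreviated from the examples above, with namespace declarations added so that they parse on their own, and the helper names ooxml_text and odf_text are hypothetical. The ODF helper is deliberately naive and ignores special elements such as tabs and repeated spaces.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for WordprocessingML and for ODF text content.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Non-mixed: text only ever appears inside <w:t> leaves.
OOXML_P = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t>This is a </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>very basic</w:t></w:r>'
    '<w:r><w:t> document</w:t></w:r>'
    '</w:p>'
)

# Mixed: text and child elements are interleaved inside <text:p>.
ODF_P = (
    f'<text:p xmlns:text="{TEXT}">'
    'This is a <text:span text:style-name="T1">very basic</text:span> document'
    '</text:p>'
)


def ooxml_text(xml):
    """Join the contents of every <w:t> element in document order."""
    root = ET.fromstring(xml)
    return "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))


def odf_text(xml):
    """Join all text nodes, wherever they sit in the element tree."""
    root = ET.fromstring(xml)
    return "".join(root.itertext())


if __name__ == "__main__":
    print(ooxml_text(OOXML_P))
    print(odf_text(ODF_P))
```

Both helpers return the same sentence. The difference lies in where the text lives: in a fixed kind of leaf element for Open XML, and between and inside child elements for ODF.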


Secondly, an example (provided on Brian Jones's weblog) to support Microsoft's choice of shorter tag names. The first example uses SpreadsheetML from the Office Open XML format. The second example uses the OpenDocument format.

Short tag example (Open XML):


<row><c><v>1</v></c><c><v>2</v></c><c><v>3</v></c></row>
<row><c><v>4</v></c><c><v>5</v></c><c><v>6</v></c></row>


Long tag example (ODF):



<table:table-row table:style-name="ro1">
<table:table-cell office:value-type="float" office:value="1">
<text:p>1</text:p>
</table:table-cell>
<table:table-cell office:value-type="float" office:value="2">
<text:p>2</text:p>
</table:table-cell>
<table:table-cell office:value-type="float" office:value="3">
<text:p>3</text:p>
</table:table-cell>
</table:table-row>
<table:table-row table:style-name="ro1">
<table:table-cell office:value-type="float" office:value="4">
<text:p>4</text:p>
</table:table-cell>
<table:table-cell office:value-type="float" office:value="5">
<text:p>5</text:p>
</table:table-cell>
<table:table-cell office:value-type="float" office:value="6">
<text:p>6</text:p>
</table:table-cell>
</table:table-row>


In the second example, it is important to note that the size of the document is only marginally affected by the length of its tags, because OpenDocument files are usually compressed. However, according to Brian Jones, the length of tags does affect compression and parse time when manipulating big documents. Non-mixed content (such as in OOXML) is likely to be more compact than mixed content.[citation needed]
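The effect of compression on tag length can be checked with a small Python experiment. This is a rough sketch and not a benchmark: the helper names are hypothetical, zlib's DEFLATE stands in for the ZIP compression used by both formats, and repeating one row many times gives unrealistically compressible input, so the numbers only indicate the direction of the effect.

```python
import zlib


def ooxml_row(values):
    """Build a short-tag row like the Open XML example above."""
    cells = "".join(f"<c><v>{v}</v></c>" for v in values)
    return f"<row>{cells}</row>".encode()


def odf_row(values):
    """Build a long-tag row like the ODF example above."""
    cells = "".join(
        f'<table:table-cell office:value-type="float" office:value="{v}">'
        f"<text:p>{v}</text:p></table:table-cell>"
        for v in values
    )
    return f'<table:table-row table:style-name="ro1">{cells}</table:table-row>'.encode()


def sizes(row, repeat=1000):
    """Return (raw size, DEFLATE-compressed size) for `repeat` copies of row."""
    raw = row * repeat
    return len(raw), len(zlib.compress(raw))


if __name__ == "__main__":
    for name, build in (("Open XML", ooxml_row), ("ODF", odf_row)):
        raw, packed = sizes(build([1, 2, 3]))
        print(f"{name}: {raw} bytes raw, {packed} bytes compressed")
```

The raw sizes differ by a large factor, while the compressed sizes are far closer together. Parse time is a separate question that this sketch does not measure.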

Note also that, in the second example, ODF holds two extra attributes on each cell, office:value-type and office:value, giving the cell's type and the cell's value. The cell's type can be one of "float", "currency", "percentage", "date", or "time".[19] These attributes explicitly describe the value behind the textual representation kept in the element. This information is not captured by the OOXML tags shown in the example.

Example of an ODF spreadsheet value versus its textual representation, for a cell storing "45.6%":
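A minimal Python sketch of such a cell follows. The fragment is an illustrative assumption built from the office:value-type and office:value attributes described above: the numeric value 0.456 is stored with type "percentage", and the displayed text 45.6% sits in the text:p child. The namespace declarations are added so that the fragment parses on its own, and the helper name read_cell is hypothetical.

```python
import xml.etree.ElementTree as ET

# ODF namespace URIs needed by the fragment below.
OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TABLE = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Illustrative cell: typed value in attributes, display text in <text:p>.
CELL = (
    f'<table:table-cell xmlns:office="{OFFICE}" xmlns:table="{TABLE}" '
    f'xmlns:text="{TEXT}" '
    'office:value-type="percentage" office:value="0.456">'
    '<text:p>45.6%</text:p>'
    '</table:table-cell>'
)


def read_cell(xml):
    """Return (value type, numeric value, displayed text) for one cell."""
    cell = ET.fromstring(xml)
    value_type = cell.get(f"{{{OFFICE}}}value-type")
    value = float(cell.get(f"{{{OFFICE}}}value"))
    shown = cell.findtext(f"{{{TEXT}}}p")
    return value_type, value, shown


if __name__ == "__main__":
    print(read_cell(CELL))
```

A consumer that needs the number can read office:value directly without parsing the formatted string, while a consumer that only displays the sheet can use the text:p content as is.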

