I looked at doing something like this last year... fortunately the need went
away before I had to implement it, but not before I'd spent some time
looking at the problem in detail.

The most promising approach looked like using Word 2000 itself to export to
HTML (Word 97 does not do as good a conversion). This produces a huge
amount of markup for a typical Word document, and the structure of things
like lists is lost (although you can recreate it from the style information).
However, all the style information from the Word text markup is preserved
and can be extracted from the HTML. Footnotes are stored as div
elements at the end of the document, each marked up as
<div style='mso-element:footnote' id=ftnnn>
(where nnn is the footnote number).
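
For anyone wanting to try this, here is a minimal sketch of pulling those
footnote divs out of the exported HTML. It uses Python's standard html.parser
rather than the MSHTML DOM we were using, and the class name FootnoteExtractor
and the sample markup are made up for illustration; real Word output is far
noisier, so treat this as a starting point only.

```python
from html.parser import HTMLParser


class FootnoteExtractor(HTMLParser):
    """Collect the text of each <div style='mso-element:footnote' id=...>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.footnotes = {}     # footnote id -> whitespace-normalised text
        self._inside = False    # True while inside a footnote div
        self._depth = 0         # nesting depth of <div> within that footnote
        self._current_id = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._inside:
            self._depth += 1    # nested div inside a footnote
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        # Exact match on one declaration, so that a wrapper div with a
        # different mso-element value is not mistaken for a footnote.
        if "mso-element:footnote" in style.split(";"):
            self._inside = True
            self._depth = 1
            # Fallback id is an assumption for divs that carry no id.
            self._current_id = attrs.get("id") or "footnote-%d" % (len(self.footnotes) + 1)
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._inside:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._chunks).split())
                self.footnotes[self._current_id] = text
                self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)


# Hand-written sample in the style of Word's HTML export (not real output).
sample = """
<p>Body text<a href="#_ftn1" name="_ftnref1">[1]</a></p>
<div style='mso-element:footnote-list'>
<div style='mso-element:footnote' id=ftn1>
<p class=MsoFootnoteText><a href="#_ftnref1" name="_ftn1">[1]</a> First note.</p>
</div>
</div>
"""

parser = FootnoteExtractor()
parser.feed(sample)
parser.close()
print(parser.footnotes)
```

The extracted text still includes the footnote marker ("[1]"); you would
probably strip that and keep the id to link each note back to its reference
in the body.
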
Our plan was to export Word documents to HTML from Word 2000, and then parse
the resulting HTML and convert it to the XML grammar we wanted. We were
using MSHTML and MSXML to manipulate the DOM of the input document tree and
the output document tree respectively. By driving Word through its COM
interface we would have ended up with a single executable that could be run
as a batch process.
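
Roughly, the pipeline could be sketched as below. This is Python rather than
the MSHTML/MSXML combination we had in mind, so it is an illustration of the
idea and not what we built. The function export_to_html assumes Windows with
Word and the third-party pywin32 package installed and is not called here;
the function names and the output element names (document, body, para, notes,
note) are hypothetical stand-ins for whatever grammar you are targeting.

```python
import xml.etree.ElementTree as ET

WD_FORMAT_HTML = 8  # numeric value of Word's wdFormatHTML constant


def export_to_html(doc_path, html_path):
    """Drive Word over COM to save doc_path as HTML.

    Needs Windows, Word and pywin32; use absolute paths.
    """
    import win32com.client  # imported lazily so the rest runs anywhere

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(doc_path)
        doc.SaveAs(html_path, FileFormat=WD_FORMAT_HTML)
        doc.Close(False)  # close without saving changes to the original
    finally:
        word.Quit()


def build_xml(paragraphs, footnotes):
    """Build the output tree from already-extracted text.

    paragraphs: list of body paragraph strings
    footnotes:  dict mapping footnote id -> footnote text
    """
    root = ET.Element("document")
    body = ET.SubElement(root, "body")
    for text in paragraphs:
        ET.SubElement(body, "para").text = text
    notes = ET.SubElement(root, "notes")
    for note_id, text in footnotes.items():
        ET.SubElement(notes, "note", id=note_id).text = text
    return root


# Stand-in data; in practice this would come from parsing the exported HTML.
root = build_xml(["Body text."], {"ftn1": "First note."})
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Wrapping export_to_html and build_xml in a loop over a directory of files
would give the batch process described above.
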
This is not an _easy_ approach - although it seemed preferable to parsing
RTF! But at the time a bespoke solution looked like the only way to do the
conversion we wanted. Maybe somebody now has an easier solution...

From: Torsten Reimer [mailto:[log in to unmask]]
Sent: 10 October 2000 08:48
To: [log in to unmask]
Subject: importing footnotes in xml

For a project concerned with publishing documents of 100-200 pages each, we
are building an XML DTD. Most of our text will come in as MS-Word or RTF
files with minimal formatting (bold / italic and headlines) but with many
hundreds of footnotes. Is there an easy way of importing such documents into
XML? XML Spy, for instance, has no problem doing an import from MS-Word, but I
found no way of dealing with footnotes. FrameMaker seems to be able to do
so, but if you use it just for this task it's a little bit expensive...