Save-To-XML

Marcelo Cantos marcelo at mds.rmit.edu.au
Sun Mar 21 06:26:31 GMT 1999


On Fri, Mar 19, 1999 at 05:47:36PM -0600, Paul Prescod wrote:
> "Simon St.Laurent" wrote:
> > 
> > Ah, but if MS Word had a simple "Save-To-XML" option that let users save
> > their documents using markup based on the styles they've built.  
> 
> I was thinking about this last week. Someone could build this relatively
> easily on top of the Office 2000 save as XML and the MSHTML DLL. 
> 
> > Three
> > times now, I've seen organizations that had done a lot of very good
> > informal work with Word styles, and no easy path for those structures or
> > the documents that use them to move to XML.  I guess the incentive just
> > isn't there for MS to make life easy.  There are tools to do it, but it's
> > still not much fun.  (Another painful case of asymmetry.)
> 
> Even if the tool to do it was a "Save-To-XML" option it would still be not
> much fun. 
> 
> After all, the goal is not to get it into any-old-XML (that's easy) but to
> get it into "our vocabulary". That's the harder part. There are tricky
> problems about setting up division structure, converting tables to a
> particular table model, cross-references to a particular linking model and
> so forth. In the end it is a transformation job no matter how you slice
> it. And even then you will likely have to do many manual fix-ups unless
> the writers are Zen monks.

The real problem lies with the flat nature of RTF.  Paragraphs are not
children of sections.  They are siblings of level 1 headings.  The
first pass always involves inferring structure from the sequence of
styles.  This stage cannot be avoided, though of course it can be
rolled together with other passes.
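
As a rough illustration of that first pass, here is a minimal Python sketch that turns a flat sequence of (style, text) pairs into a nested tree. It assumes heading styles are named "Heading N", as in Word's default template. The Node class and the function names are hypothetical, invented for this example, and not taken from any real converter.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "document", "section", or "para"
    text: str = ""
    level: int = 0                 # heading level; 0 for the root and for paragraphs
    children: list = field(default_factory=list)


def heading_level(style):
    """Return N for a style named 'Heading N', otherwise None (assumed naming)."""
    prefix = "Heading "
    if style.startswith(prefix) and style[len(prefix):].isdigit():
        return int(style[len(prefix):])
    return None


def infer_structure(paragraphs):
    """Nest a flat list of (style, text) pairs under the headings that precede them."""
    root = Node("document")
    stack = [root]                 # currently open containers, outermost first
    for style, text in paragraphs:
        level = heading_level(style)
        if level is None:
            # Body text belongs to the innermost open section.
            stack[-1].children.append(Node("para", text))
            continue
        # A heading closes every open section at the same or a deeper level.
        while len(stack) > 1 and stack[-1].level >= level:
            stack.pop()
        section = Node("section", text, level)
        stack[-1].children.append(section)
        stack.append(section)
    return root


flat = [
    ("Heading 1", "Intro"),
    ("Normal", "p1"),
    ("Heading 2", "Detail"),
    ("Normal", "p2"),
    ("Heading 1", "Next"),
    ("Normal", "p3"),
]
tree = infer_structure(flat)
```

In the resulting tree, "p2" ends up inside the "Detail" subsection of "Intro" rather than as a sibling of the headings, which is exactly the information the flat sequence does not carry explicitly.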

For EnAct (a legislation management system built on top of SIM) we
developed a configuration file approach that defines how to perform
such inference.  It works vaguely like a yacc grammar, though I know
almost nothing more about that particular aspect of EnAct.  We are
currently looking at a completely generic import filter to do
essentially the same thing (EnAct itself is targeted at legislation).
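
To make the idea of configuration-driven inference concrete, here is a small Python sketch in which a rule table, rather than code, decides how each style nests. This is not EnAct's actual configuration format, which is not described here. The style names, element names, and the RULES table are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: style name -> (element to create, nesting depth).
# Depth 0 marks leaf content that goes into the innermost open container.
RULES = {
    "Part Heading":    ("part", 1),
    "Section Heading": ("section", 2),
    "Body Text":       ("para", 0),
}


def import_flat(paragraphs, rules=RULES):
    """Build an XML tree from (style, text) pairs according to the rule table."""
    root = ET.Element("document")
    open_containers = [(0, root)]          # (depth, element), outermost first
    for style, text in paragraphs:
        if style not in rules:
            raise ValueError(f"no rule for style {style!r}")
        tag, depth = rules[style]
        if depth == 0:
            ET.SubElement(open_containers[-1][1], tag).text = text
            continue
        # Close containers at the same or a deeper level before opening a new one.
        while open_containers[-1][0] >= depth:
            open_containers.pop()
        container = ET.SubElement(open_containers[-1][1], tag)
        ET.SubElement(container, "title").text = text
        open_containers.append((depth, container))
    return root


doc = import_flat([
    ("Part Heading", "Part 1"),
    ("Section Heading", "Short title"),
    ("Body Text", "First paragraph."),
    ("Section Heading", "Commencement"),
    ("Body Text", "Second paragraph."),
])
xml_text = ET.tostring(doc, encoding="unicode")
```

Retargeting such a filter at a different document type would then mean swapping the rule table rather than rewriting the traversal, which is presumably the appeal of a generic import filter.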


Cheers,
Marcelo

-- 
http://www.simdb.com/~marcelo/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



