XML: Still Just HypedWebStuff

Sat Nov 27 13:22:46 GMT 1999

At the risk of sounding pretentious, contentious, or even both....

> To take one example, imagine that you're managing a custom publishing
> project.

Not hard to imagine from here; I publish my own zine, and have had to assemble my 
own funky way of managing an ever-growing site singlehandedly with minimal effort.

> The standard non-SGML/XML way to do this is to assemble fairly large components
> (sections, chapters, or whatever) using relational databases (where the components are
> blobs) and/or document-management systems (where the components are smaller
> documents), and you've probably bought an off-the-shelf package that does this.

I'm using document management, if I understand your categories properly.  Specifically, 
each episode of a given column is its own file, and I use a template processor to marry 
these data files to common templates.  This produces a file set that is ready for upload, 
and when a given file is requested, the server does some more processing at that time.  
The template processor (htp) is a small piece of freeware; the server processor is 
available through my ISP's hosting service.  Still, I'm pretty much with you.
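
For anyone who hasn't seen this kind of setup, here is a minimal sketch of the idea in Python. It is not htp, whose actual syntax is different; the template text, the field names, and the render_page and build_site helpers are all invented for illustration.

```python
from string import Template

# Hypothetical shared page template; htp's real template syntax differs.
PAGE_TEMPLATE = Template(
    "<html><head><title>$title</title></head>\n"
    "<body><h1>$title</h1>\n$body\n</body></html>\n"
)


def render_page(fields):
    """Marry one data file's fields to the common template."""
    return PAGE_TEMPLATE.substitute(fields)


def build_site(episodes):
    """Map each output filename to its rendered page, ready for upload."""
    return {name + ".html": render_page(fields)
            for name, fields in episodes.items()}
```

Changing PAGE_TEMPLATE and re-running build_site regenerates every page without touching the per-episode data, which is the whole point of the arrangement.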

> If (say) you decide to use XML for the assembly templates to make those templates
> system-independent (and reuse them elsewhere in the system), you have only a
> relatively small integration problem on your hands.

I might question that "relatively small" bit.  I would find it a large challenge to migrate 
from htp to an XML-based system without losing anything; the big advantage to what 
I've got is that I can set up the htp->XHTML "transformation" <g> to transform valid 
HTML tags into optimized XHTML constructs.  For instance, <center> can become 
<div align="center"> or even <div style="text-align: center;"> almost as quickly as I can 
envision the transformation.  OTOH, trying to recast the htp data files into full-fledged 
XML fragments would be a much more daunting task.  In fact, it'd be a lot like the 
complexity you go on to describe, except more so:

> But ... and here's the catch ... you realize that with XML it is possible to customize
> much further than you already are.  You read books and articles and find out that you
> can customize at the phrase or even the word level, like this (to take a silly example):
> 
>   <p>What <choice-group><choice loc="CA">colour</choice>
>      <choice loc="US">color</choice></choice-group> is that?</p>

Ultimately, that's a trivial example, and there are better ways (or at least easier ones) to 
do it.  For instance, define your own "color" entity which gets shifted on the fly 
according to location, and this could become:

<p>What &color; is that?</p>

Yes, that has the disadvantage of being something of a hack and requiring a growing 
library of international terms, but IMO, this is more than balanced by the increased 
legibility and decreased footprint of the document.
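
A rough sketch of what "shifted on the fly" could look like, again in Python. The TERMS table and the localize function are assumptions of mine for illustration, not part of any existing tool, and a real XML parser would resolve entities from a DTD rather than by regex.

```python
import re

# Hypothetical term library: entity name -> spelling per location code.
TERMS = {
    "color": {"CA": "colour", "US": "color"},
}

# Matches simple entity references such as &color;
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def localize(text, loc):
    """Replace known &name; references with the spelling for loc.

    Entities that are not in TERMS (for example &amp;) are left untouched.
    """
    def swap(match):
        variants = TERMS.get(match.group(1))
        if variants is None or loc not in variants:
            return match.group(0)
        return variants[loc]

    return ENTITY.sub(swap, text)
```

With this table, localize('<p>What &color; is that?</p>', 'CA') yields the "colour" spelling, and adding a new term means adding one entry to TERMS rather than touching the documents.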

> Wow!  That's a lot better than the blunt instrument that you have now, 
> so you run ahead and invest a few $100K in an XML-aware DMS that may or 
> may not scale to your user base and another $500K for a phase-one
> implementation from an XML consulting house, etc., etc., and suddenly
> you decide that XML is very expensive.

This is where you make your key logical error.  The database doesn't need to know 
about what's in an element; I currently use Access 97 to handle iHTML script fragments, 
and Access is blissfully ignorant of the uses I'm putting it to.  In fact, if Access was 
bright enough to try making sense of the data I'm storing in there, it could seriously 
screw me over by munging that data.  The hard part of making a library of text granular 
is in going through that library and converting that data accordingly - making the entity 
(or definition) library itself is a snap by comparison.  For instance, if you remember my 
rants about the Q element, setting up the htp definition was only hard because I had to 
manually scout out the support levels for that tag in various browsers.  Implementing the 
definition, on the other hand, requires a lot of tedious work - specifically, poring through 
the various articles and re-marking the quotes accordingly.  There's no shortcut for that, 
no matter what system I use - but since the average user won't notice a big difference, 
there's no real time pressure.  I can make sure that new articles are marked properly as 
they come in, spread the archive re-marking task out over time (since the archives don't 
get a lot of hits), and the load becomes easier.

> But you're wrong -- it's not XML that's expensive; it's the kind of
> custom publishing that you're trying to do.

And I say that you're still half-wrong - it's not that kind of custom publishing that's 
expensive, but the way you're going about it.  Your complication requires no change of 
software, but it does incur a new burden of tedious re-marking of old data.  Spread that 
out a bit (because after all, unchanged documents are not unusable - they're just not 
optimal yet, and they're still the same documents you had up pre-change) and that 
burden becomes more of an inconvenience.  If I were to discipline myself to read 
through twenty source documents a day, I'd be done with my re-marking in a couple of 
weeks.  Five documents a day, and it's two months - which still isn't that bad.

> It's easy enough to demo this sort of thing on my notebook with ten sample
> documents, a browser and a couple of Perl scripts, but it turns out to be very hard
> and expensive to implement in a high-volume, high-performance environment.

And I say, not necessarily.  Take the lexical transformation above - wire the list of terms 
up to a grep bot, and it doesn't much matter whether you're handling ten documents or 
ten thousand.  (Especially with modern CPU speeds!)  After a template change, I can 
generate my 570-page site in a couple of minutes at most.  Sure, the upload will take a 
half-hour to 45 minutes, but I can set that to run while I'm asleep or at work, and that's 
not really a function of my development environment.  (Case in point - the other day, I 
changed the XHTML definition set to conform to the new Working Draft's !DOCTYPE 
and namespace revisions.  The generation took a couple of minutes, and the files went 
up last night as a weekly update.  The transfer broke down near the tail end, so I had to 
manually upload the remaining files this morning - but still, I saved about 35 minutes of 
upload tedium.)  Overall, my site is no harder to administer now than it was two years 
ago - the only rough spot is when I decide to do a transformation that affects the actual 
data files (a retrofit), and that's just tedious...not difficult, and not expensive.
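
As a sketch of the "grep bot" idea, assuming the documents are already loaded as a mapping of filename to text: the find_terms name and the sample filenames below are hypothetical, and this only reports where the terms occur so a human (or a later script) can do the re-marking.

```python
import re


def find_terms(documents, terms):
    """Return {filename: [(line_number, matched_term), ...]} for every hit."""
    if not terms:
        return {}
    # One alternation over the whole term list, matched on word boundaries.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    hits = {}
    for name, text in documents.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in pattern.finditer(line):
                hits.setdefault(name, []).append((lineno, match.group(1)))
    return hits
```

The work is one pass over each line of each document, so the cost grows with the size of the archive rather than with anything clever in the tooling.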

 Rev. Robert L. Hood  | http://rev-bob.gotc.com/
  Get Off The Cross!  | http://www.gotc.com/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1