Industrial Strength XML Serving

John Robert Gardner jrgardn at
Thu Oct 7 19:38:47 BST 1999

I'm venturing this question as a general call for input--and pitches--with
regard to the following project we're undertaking:

	750,000 pages of journals, in both text form and gif images for 
		"canonical preservation" and cross-check

	Typed text version, in XML (largely using TEI), yielding
		~400,000,000 words (our initial estimates suggest
		something in the range of 30-50 gigs of total content
		including gifs), averaging out to ~60,000,000 tag nodes,
		searchable based on content of tags (word strings),
		element hierarchy, and attribute values, with final form
		changing infrequently (archival/institutional memory)

	Primary access point being MARC records we're rendering into
		highly granular XML, for crosswalking to DC/RDF/GILS
		(we're starting with some 200 megs of MARC records alone)
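
To make the three search dimensions above concrete (tag content, element
hierarchy, attribute values), here is a minimal sketch in Python using only
the standard library. The sample markup is a made-up, TEI-like fragment and
the search() helper is hypothetical; it is a linear in-memory scan meant only
to illustrate the kinds of query involved, not something that would scale to
~60,000,000 nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment; element and attribute names are assumptions.
SAMPLE = """
<TEI>
  <text>
    <body>
      <div type="article" n="1">
        <head>On Preservation</head>
        <p>Canonical preservation of journals matters.</p>
      </div>
      <div type="review" n="2">
        <head>A Short Review</head>
        <p>Nothing about journals here.</p>
      </div>
    </body>
  </text>
</TEI>
"""


def search(root, path, word=None, attrs=None):
    """Return elements matching an ElementTree path (element hierarchy),
    optionally filtered by attribute values and by a word string found
    anywhere in the element's text content (case-insensitive)."""
    hits = []
    for elem in root.findall(path):
        # Attribute filter: every requested attribute must match exactly.
        if attrs and any(elem.get(k) != v for k, v in attrs.items()):
            continue
        # Content filter: look at the text of the element and its descendants.
        text = "".join(elem.itertext())
        if word and word.lower() not in text.lower():
            continue
        hits.append(elem)
    return hits


root = ET.fromstring(SAMPLE)
articles = search(root, ".//body/div", word="preservation",
                  attrs={"type": "article"})
print([d.get("n") for d in articles])
```

A real deployment would presumably push these three filters into an index or
database rather than walking the tree on every query.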

I've been asking offlist for possible consultants as our systems staff has
a strong inclination to Oracle 8i and I'm hardly fluent enough on such
software to argue based upon what I know.  Based on Oracle's white paper,
it sounds viable . . . however:

In some of my offlist correspondence, I've detected a dichotomy.  On one
side is the view that "it doesn't matter if it's XML, pizzas, or washing
machines you're storing, it's the size that counts (no pun intended)" -- so
Oracle's great.  On the other side is a sense that 8i's newness is a
potential unknown for XML at this scale (we'll also likely be
subcontracting the serving of the gifs, likely out-of-state).  The
implication was that there are more SGML/XML-native packages out there if
we have the budget (we do, within limits: commissioning a whole new
software package, say, is out of the question). :)

Our project is perhaps one of the best funded efforts in the humanities in
markup for some time, and surely in a class by itself viz. XML.  As it's
likely to be a model in various senses/case study, I really want to be
sure we commit down the "right" road on this, and be sure of our options
along that road.  The vision I'm implementing from the XML side is meant
to go beyond another research resource to a full-scale research
environment which exploits XSLT for having our stuff accessible--e.g., the
MARC--in multiple tag vocabularies (DC, RDF, GILS, etc.), as well as very
sophisticated construction of the resources found through the search (e.g.,
with DOM, etc.).
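
As a rough illustration of the crosswalking idea, the sketch below maps a
MARC record rendered in XML onto Dublin Core elements. It is written in
Python rather than XSLT purely for brevity. The record layout (datafield
and subfield elements with tag and code attributes), the "metadata" wrapper
element, and the three-entry mapping table are all assumptions made for the
example; our actual MARC-in-XML vocabulary and the full crosswalk would
differ.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical MARC-in-XML record; the layout is an assumption.
MARC_SAMPLE = """
<record>
  <datafield tag="100"><subfield code="a">Doe, Jane</subfield></datafield>
  <datafield tag="245"><subfield code="a">An Example Journal</subfield></datafield>
  <datafield tag="260"><subfield code="c">1899</subfield></datafield>
</record>
"""

# Illustrative (MARC tag, subfield code) -> DC element mapping, not a
# complete crosswalk.
CROSSWALK = {
    ("100", "a"): "creator",
    ("245", "a"): "title",
    ("260", "c"): "date",
}


def marc_to_dc(record):
    """Build a flat Dublin Core description from one MARC-in-XML record."""
    dc = ET.Element("metadata")  # wrapper element name is an assumption
    for field in record.findall("datafield"):
        for sub in field.findall("subfield"):
            name = CROSSWALK.get((field.get("tag"), sub.get("code")))
            if name is None:
                continue  # subfield not covered by this sketch
            out = ET.SubElement(dc, f"{{{DC_NS}}}{name}")
            out.text = (sub.text or "").strip()
    return dc


dc = marc_to_dc(ET.fromstring(MARC_SAMPLE))
print(ET.tostring(dc, encoding="unicode"))
```

The same table-driven shape carries over to other target vocabularies: only
the mapping and the output namespace change, which is the property I'm
hoping to get from XSLT stylesheets per vocabulary.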

At any rate, this question is in no way meant to replace my offlist
inquiries for a consultant, or the input received thus far.  Instead, since
the vichyssoise is not yet ready to be stirred, nor even on the stove,
all chefs are needed -- if there is a better mousetrap to be made without
reinventing the wheel, now's the time to know.



John Robert Gardner, Ph.D.
XML Engineer
