Repositories: From Argument to Architecture

Sun Jan 31 17:34:13 GMT 1999

Given that the subject is now 'architecture', I hope the following
comments are not deemed 'off topic'.

Simon St.Laurent wrote on 30 January 1999 18:10:
> Basically, it would mean that I could retrieve XML documents from it
> using HTTP, using familiar structures like URLs.  I'd love to see
> support for XPointer queries on that same server, allowing me to pull
> out fragments, and another standardized query language (XQL or
> whatever) that would let me do more general searches.

I think this must move from 'nice to have' to 'must have'. If we are to
implement the next generation of web applications, as opposed to just
document management, then there must be hooks into the data at all
levels.

For example, at the level of quoting from a magazine article - by
including it inline in your own article, with your own formatting - you
should be able to do:

	http://www.mag.com/issue[num=65]/article[num=22]/para[id=7]

or whatever syntax becomes standardised. (We have implemented this
already, using an XSL-style syntax for now, because it just looks neater
(!) than some of the other syntaxes. I don't like the apparent
procedural appearance of some of the other proposals - but we'll use
whatever everyone else does, of course.)
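
A minimal sketch of how an address like the one above could be resolved
against an XML tree. It assumes each [key=value] test is matched against
an attribute of the element; the post does not say whether 'num' and
'id' are attributes or child elements, and the function names and sample
document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# One step looks like name or name[key=value].
STEP = re.compile(r"^(\w+)(?:\[(\w+)=([^\]]*)\])?$")

def parse_path(path):
    """Split 'issue[num=65]/article[num=22]' into (name, key, value) steps."""
    steps = []
    for part in path.strip("/").split("/"):
        match = STEP.match(part)
        if match is None:
            raise ValueError("cannot parse step: %r" % part)
        steps.append(match.groups())
    return steps

def resolve(root, path):
    """Return every element under root that the path selects."""
    current = [root]
    for name, key, value in parse_path(path):
        matched = []
        for node in current:
            for child in node.findall(name):
                # Assumption: the predicate tests an attribute value.
                if key is None or child.get(key) == value:
                    matched.append(child)
        current = matched
    return current

# Hypothetical sample data, only to exercise the functions.
SAMPLE = """<site>
  <issue num="65">
    <article num="22" type="promo">
      <para id="7">Quoted paragraph.</para>
      <para id="8">Another paragraph.</para>
    </article>
    <article num="23"><para id="1">Other.</para></article>
  </issue>
</site>"""

root = ET.fromstring(SAMPLE)
hits = resolve(root, "issue[num=65]/article[num=22]/para[id=7]")
print([p.text for p in hits])
```

The same resolve() call handles the shorter addresses too, since a step
without a predicate simply selects all children of that name.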

Likewise a portal-type site should be able to pull article information
from us, allowing it to create links to the latest articles on our site,
without re-coding every week or month, e.g.:

	http://www.mag.com/issue[num=65]/article[type=promo]

Equally, a program should be able to pull figures from our company
database, so that it can average them, chart them, or do whatever it
wants:

	http://www.mag.com/company[ticker=MSFT]/economic[year=1998]/turnover

And finally, a subscription fulfilment house should be able to retrieve
any address changes made by subscribers via the magazine site, and
synchronise them with their own databases. No more trying to make two
different databases talk directly to each other.

However, although we HAVE implemented all of this already, the only way
we could get the data out fast enough was to use an indexing server on
snapshots of the database. Not ideal, but OK for document-based projects
where the output does not have to change the moment the database is
updated.

> ... but at its foundation I'd like it to look like a vanilla Web
> server, whatever magic it's doing internally.

Definitely. We've done this as described above, but have an interesting
issue in relation to pulling out information that requires formatting.
Previously we used a dot extension:

	http://www.mag.com/issue/65/article/22.htm
	http://www.mag.com/issue/65/article/22.xml

But this looks 'wrong' in our new syntax:

	http://www.mag.com/issue[num=65]/article[num=22].htm
	http://www.mag.com/issue[num=65]/article[num=22].xml

One possibility is to say that the server has a number of roots:

	http://www.mag.com/xml/issue[num=65]/article[num=22]
	http://www.mag.com/html/issue[num=65]/article[num=22]

and perhaps others (XSL, and so on). I like this myself because it
starts to say that the server is some sort of data repository, rather
than just a 'web server'. However, it's not really 'correct', because
the article is at the same position in the tree, regardless of how you
output it. This is an important issue at the moment for us, because we
obviously cannot assume that everyone is using XML-aware browsers to
view the site, so we have to merge XML and XSL on the server for older
browsers. Maybe we should really have:

	xttp://www.mag.com/issue[num=65]/article[num=22]
	http://www.mag.com/issue[num=65]/article[num=22]

Who knows! Anyway, once all browsers are XML-aware, then we will just
export XML - all we then have to do is work out how we tell the browser
in what way to display it, without embedding that information in the XML
document in the database through an explicit link to an XSL stylesheet.
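
The 'number of roots' option discussed above can be sketched as a small
routing step that peels the output format off the front of the path and
leaves the tree address untouched. This is only an illustration: the
FORMATS table, the function name and the choice of HTML as the default
are assumptions, not a description of our server.

```python
# Hypothetical table of output roots and the media type each one serves.
FORMATS = {"xml": "text/xml", "html": "text/html"}

def split_format(url_path, default="html"):
    """Return (format, address) for a path like '/xml/issue[num=65]/article[num=22]'.

    The address is the same whichever root was used, which keeps the
    position in the tree separate from how the node is rendered.
    """
    trimmed = url_path.lstrip("/")
    head, _, rest = trimmed.partition("/")
    if head in FORMATS:
        return head, rest
    # No recognised root: fall back to the default format (an assumption).
    return default, trimmed

print(split_format("/xml/issue[num=65]/article[num=22]"))
print(split_format("/html/issue[num=65]/article[num=22]"))
```

Both calls yield the same address, so the lookup code never needs to
know which output format was requested.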

> The ability to modify and store document fragments would be a
> significant advance, making management and editing a heck of a lot
> simpler than it is now.

Exactly right. We actually do use a web interface on an object-like
database which allows you to drill down to any node in the tree. There's
no uploading or downloading, you just edit the node (through a web
browser).

This brings with it its own problems though, as I will try to explain.
To spell out the issues first, say we have something like:

	You live in <country id="USA">North America</country> and eat
	<animal>turkey</animal> at Thanksgiving.

and

	I live in <country id="UK">Blighty</country> and have a friend
	in <country id="TKY">Turkey</country>.

This gives us great search potential:

- you could just search for the word Turkey, and get both entries -
animal and country
- you could search for the COUNTRY Turkey and get only the second entry
- you could search for "Great Britain" and also find the second entry

To achieve the latter, you simply say things like:

	<country id="UK">Great Britain</country>
	<country id="UK">UK</country>
	<country id="UK">United Kingdom</country>
	<country id="UK">U.K.</country>
	<country id="UK">perfidious Albion</country>

and so on. Then a search for any of the strings inside the tag is
converted to a search for id="UK".

(This is all 'pseudo-XML'. We actually use a more generalised link
syntax.)
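
As a rough sketch of the string-to-id conversion described above, the
following builds an alias table from the text inside each country tag
and then searches by id. It uses the pseudo-XML from this message rather
than our real link syntax, and the wrapping <p> element is an assumption
added only so that each fragment parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

def build_alias_index(documents, tag="country"):
    """Map each lower-cased string found inside <tag> to the ids it was tagged with."""
    aliases = {}
    for doc in documents:
        for elem in ET.fromstring(doc).iter(tag):
            text = (elem.text or "").strip().lower()
            if text and elem.get("id"):
                aliases.setdefault(text, set()).add(elem.get("id"))
    return aliases

def search(documents, phrase, tag="country"):
    """Return the documents that mention any id for which phrase is an alias."""
    ids = build_alias_index(documents, tag).get(phrase.strip().lower(), set())
    hits = []
    for doc in documents:
        found = {elem.get("id") for elem in ET.fromstring(doc).iter(tag)}
        if found & ids:
            hits.append(doc)
    return hits

DOCS = [
    '<p>You live in <country id="USA">North America</country> and eat '
    '<animal>turkey</animal> at Thanksgiving.</p>',
    '<p>I live in <country id="UK">Blighty</country> and have a friend '
    'in <country id="TKY">Turkey</country>.</p>',
    '<p><country id="UK">Great Britain</country></p>',
]

# 'Great Britain' resolves to id UK, so the 'Blighty' entry is found as well.
print(len(search(DOCS, "Great Britain")))
```

Searching with tag="country" for 'Turkey' only returns the second entry,
because the animal tag is never consulted.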

So, to return to the problem, we can only achieve this at the moment by
the user actually typing these tags into the database. It's not a bad
solution - and is a lot better than manipulating 350K files in a text
editor - but what we really want is to be able to highlight a word or
expression and then apply a tag from a list of available ones. In other
words, to achieve what we really want, the user-interface is going to be
a major project in itself. For example, we also want to be able to
automate tagging of certain obvious connections, especially useful for
converting large quantities of legacy data.

> (I love making changes in 350K HTML files and FTPing them to their
> home again and again.)

As said, thankfully we don't do that.

> Versioning and security would be great as well.

I don't think this is all that difficult. As far as security goes, our
system has that on every node already. It's quite cute really, because
two people can request the same document, and certain nodes can be
denied to one and granted to the other, appearing to present two
different documents.
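
A minimal sketch of node-level access control of this kind, assuming a
hypothetical 'read' attribute that names the group allowed to see a
node. Our own system does not necessarily store permissions this way;
the point is only that one stored tree can yield different documents.

```python
import copy
import xml.etree.ElementTree as ET

def view_for(root, user_groups):
    """Return a copy of the tree with every node the user may not read removed."""
    allowed = set(user_groups)
    result = copy.deepcopy(root)  # never modify the stored document

    def prune(node):
        for child in list(node):
            required = child.get("read")  # hypothetical permission attribute
            if required is not None and required not in allowed:
                node.remove(child)        # drops the whole subtree
            else:
                prune(child)

    prune(result)
    return result

DOC = ET.fromstring(
    '<article><title>Results</title>'
    '<para>Public summary.</para>'
    '<para read="staff">Internal notes.</para></article>'
)

# Two people request the same document and receive different ones.
print(ET.tostring(view_for(DOC, ["staff"]), encoding="unicode"))
print(ET.tostring(view_for(DOC, []), encoding="unicode"))
```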

As to versioning, these issues are not new, and the technology is out
there. Even with our relatively crude system, we could easily retain all
historical versions of a node, and even apply labelling and commenting,
like SourceSafe and PVCS do. Since we create our documents on the fly
from the database, you could re-create any document as it stood at any
point in time, and even search them. It would be more of a step for us to store these
as deltas, but the expertise is around.
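
To illustrate, here is a sketch of per-node history that keeps every
version whole (not as deltas) and can re-create a document as it stood
at a given time. The class, the integer timestamps and the idea of a
document as an ordered list of node histories are all assumptions made
for the example.

```python
import bisect

class NodeHistory:
    """Keeps every version of one node's content, ordered by timestamp."""

    def __init__(self):
        self._times = []     # sorted timestamps
        self._versions = []  # (content, label, comment), parallel to _times

    def save(self, timestamp, content, label=None, comment=None):
        pos = bisect.bisect_right(self._times, timestamp)
        self._times.insert(pos, timestamp)
        self._versions.insert(pos, (content, label, comment))

    def at(self, timestamp):
        """Content as it stood at timestamp, or None if the node did not exist yet."""
        pos = bisect.bisect_right(self._times, timestamp)
        return self._versions[pos - 1][0] if pos else None

    def labelled(self, label):
        """Content of the first version carrying the given label, if any."""
        for content, version_label, _ in self._versions:
            if version_label == label:
                return content
        return None

def document_at(histories, timestamp):
    """Re-create a document as the list of node contents at a point in time."""
    contents = (history.at(timestamp) for history in histories)
    return [content for content in contents if content is not None]

title = NodeHistory()
title.save(1, "Title v1")
title.save(5, "Title v2", label="launch", comment="reworded")
para = NodeHistory()
para.save(3, "Para v1")

print(document_at([title, para], 4))
```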

> The management layer is a whole other set of things to consider, and I
> think I'll let vendors ponder that, but again, I'd love to see it
> managed via the Web.

I agree. Our current interface is all in JavaScript, and doesn't need
the DOM. It has a tree structure that allows you to navigate through the
nodes in the database. All data is edited by opening a node, and new
nodes can be added at certain points dependent on whether they are
allowed. An important next step is being able to work offline and then
batch submit changes, whether just a few nodes or Tim's gigabytes of
documents. For that we will need to work out some tracking mechanism to
see if a node has been removed, altered, or whatever, but that isn't
that difficult really, and may well just be a simple use of a syntax
like XML-RPC. (I'm not trying to trivialise this stage; I know the
software will have to, for example, respond in a reasonable way when
someone tries to add data that might contain a node that they have no
rights to, conflicts must be resolved, and so on, but it isn't really
the most baffling of tasks.)
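
One possible tracking mechanism, sketched under assumptions: each node
carries a version number, each offline edit records the version it
started from, and a mismatch at submit time is reported as a conflict
(an optimistic check, which is a suggestion here and not something we
have built). Rights checks and the transport, XML-RPC or otherwise, are
left out.

```python
def apply_batch(store, changes):
    """Apply offline edits to store and return the ids that conflicted.

    store maps node id -> (version, content).
    Each change is a dict with 'id', 'base' (the version the editor
    started from, None for a new node) and 'content' (None to delete).
    """
    conflicts = []
    for change in changes:
        node_id = change["id"]
        current = store.get(node_id)
        current_version = current[0] if current else None
        if change["base"] != current_version:
            # Someone else altered, removed or created this node meanwhile.
            conflicts.append(node_id)
            continue
        if change["content"] is None:
            store.pop(node_id, None)
        else:
            store[node_id] = ((current_version or 0) + 1, change["content"])
    return conflicts

store = {"a": (1, "old"), "b": (2, "keep")}
batch = [
    {"id": "a", "base": 1, "content": "new"},     # clean update
    {"id": "b", "base": 1, "content": "stale"},   # edited from an old version
    {"id": "c", "base": None, "content": "added"},
]
print(apply_batch(store, batch))
```

Conflicting changes are skipped and not merged, so the caller still has
to decide how to resolve them.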

The structure of the objects is also defined through this tool, but here
I think is where we will need to do the most work. The ideal scenario is
for there to be a very close relationship between the DTD and the
storage structure. At the moment we can do it one way round - use the
database structure to 'create' a DTD, which is handy, but what if we
don't control the definition of the DTD? Just as you can 'import' your
XML files, we want to 'import' other people's DTDs and presto, have our
database structure. And more excitingly, there are certain types of
changes that could happen to that DTD which could be immediately
reflected by changes in the database. A dynamic database like that would
be very useful.
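
As a very rough sketch of the 'import a DTD' direction, the following
reads element declarations and records which child elements each one may
contain, which is the minimum a storage structure would need. It is an
assumption-laden illustration: it ignores attribute lists, parameter
entities and occurrence indicators, and the sample DTD is hypothetical.

```python
import re

# Matches <!ELEMENT name content-model>; a simplification of real DTD syntax.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.:-]+)\s+(.*?)>", re.S)

def structure_from_dtd(dtd_text):
    """Map each declared element to the child element names in its content model."""
    structure = {}
    for name, model in ELEMENT_DECL.findall(dtd_text):
        tokens = re.findall(r"#?[\w.:-]+", model)
        structure[name] = [
            token for token in tokens
            if not token.startswith("#") and token not in ("EMPTY", "ANY")
        ]
    return structure

SAMPLE_DTD = """
<!ELEMENT issue (article+)>
<!ELEMENT article (title, para*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT para (#PCDATA | country | animal)*>
<!ELEMENT country (#PCDATA)>
<!ELEMENT animal (#PCDATA)>
"""

for element, children in structure_from_dtd(SAMPLE_DTD).items():
    print(element, children)
```

Re-running this on a changed DTD and comparing the two results would be
one way to spot the kinds of change that could be applied automatically.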

> 'Repository-in-a-box' is what I'd call this ...

Mmm - snappy :-)

> A lot more standards have to settle before there's much chance of
> implementing such boxes

I don't know - I think we can already go a long way. We've already
managed to alter our stuff easily to keep up with the changes in XSL,
for example, and can't see much looking forward that will throw us out
provided we plan carefully (and pay attention to this discussion forum,
of course).

From our side the issues are more to do with performance and resilience,
the same old issues we've always faced when building large distributed
applications. In the short-term we need to build on something like
Microsoft Transaction Server, for example, to ensure that everything is
industrial-strength. But that is an implementation - not a theoretical -
question.

<aside>
(This is perhaps really for another strand ...)
As I've intimated, many of the problems we are addressing are not that
new in software terms. There are, however, some interesting conceptual
issues that do need resolving, which I feel genuinely are new (even
these may be old-hat in the SGML world - I know nothing of that, I'm
afraid, so I apologise). For example, the search issues I referred to
above present the need for a different type of search engine. Most of
the XML search examples I have read would find the country Turkey by:

	"find Turkey within a country tag"

In my example above though, I would want to search for Turkey, and then
see a list that says Country and Animal. I then choose country, and see
all articles that are about Turkey, the country. We have taken a simple
step towards this by having 'search for country', 'search for person',
and 'search for industry' pages. They all cross-reference to each other,
so searching for 'Bill Gates' will find articles that mention him, his
individual profile, and Microsoft, because the latter contains an
entry for him as CEO. But longer term we want a user interface model
that allows the user to start right at the top, not knowing what
'objects' we have available. (Imagine searching for Gates in a normal
search engine and the first one hundred entries are about gardening and
fence suppliers. Our way round, the first search results would be the
categories available, not the actual web pages, and so Gates the person
would be clearly visible.)
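
A minimal sketch of this category-first search, using the pseudo-XML
from earlier in this message. The function name, the document ids and
the wrapping <p> element are assumptions; the idea is just that the
first result is the list of tags in which the term occurs, not a list of
pages.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def categories_for(documents, term):
    """Group document ids by the tag whose text equals term (case-insensitive)."""
    wanted = term.strip().lower()
    groups = defaultdict(list)
    for doc_id, xml_text in documents.items():
        for elem in ET.fromstring(xml_text).iter():
            if (elem.text or "").strip().lower() != wanted:
                continue
            if doc_id not in groups[elem.tag]:
                groups[elem.tag].append(doc_id)
    return dict(groups)

DOCS = {
    "thanksgiving": '<p>You live in <country id="USA">North America</country> '
                    'and eat <animal>turkey</animal> at Thanksgiving.</p>',
    "friends": '<p>I live in <country id="UK">Blighty</country> and have a '
               'friend in <country id="TKY">Turkey</country>.</p>',
}

# First show the categories; the user then picks one to see its articles.
print(categories_for(DOCS, "Turkey"))
```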

But to make this user interface more usable, I think that DTDs or
XSchema, or whatever, might need extending, to make the search results
more meaningful. For example, say we had:

	<country><name>Turkey</name></country>

We don't necessarily want a search for Turkey to show:

    Turkey
    + NAME
      + COUNTRY
    + ANIMAL

when the following is far more meaningful:

    Turkey
    + COUNTRY
    + ANIMAL

The results of these tags would be even less clear to a user:

	<ctry><nm>Turkey</nm></ctry>

and if the user of the search engine was French, wouldn't we want the
available objects to be shown in French? Anyway, you get the point; I
think DTDs themselves might need to have some more information in them,
or there may need to be some XSchema-type standard to handle this.
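
Until some such standard exists, the extra information could live in a
side table. The sketch below is purely hypothetical: a set of 'wrapper'
tags to skip, plus per-language labels for the rest, so that both
<country><name> and <ctry><nm> are shown to the user as one readable
category.

```python
# Hypothetical display metadata that a DTD or schema does not carry today.
LABELS = {
    "country": {"en": "Country", "fr": "Pays"},
    "ctry": {"en": "Country", "fr": "Pays"},
    "animal": {"en": "Animal", "fr": "Animal"},
}
TRANSPARENT = {"name", "nm"}  # wrappers that should not appear as categories

def display_category(ancestry, lang="en"):
    """Pick a label for a hit.

    ancestry lists tag names from the matched text outwards,
    e.g. ['name', 'country'] for <country><name>Turkey</name></country>.
    """
    for tag in ancestry:
        if tag in TRANSPARENT:
            continue
        # Fall back to the raw tag name when no label is known.
        return LABELS.get(tag, {}).get(lang, tag.upper())
    return None

print(display_category(["name", "country"]))
print(display_category(["nm", "ctry"], lang="fr"))
```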
</aside>

> what it would
> take to create such a beast and make it a commodity product

Less, I think, than everyone expects. To summarise, I think there is a lot
of mileage in merging the right existing technologies together, rather
than completely starting from scratch. There are a lot of developments
out there that when put together create far more of what you are after
than may at first sight be obvious.

Regards,

Mark Birbeck
Managing Director
Intra Extra Digital Ltd.
39 Whitfield Street
London
W1P 5RE
w: http://www.iedigital.net/
t: 0171 681 4135
e: Mark.Birbeck at iedigital.net

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/