Web Resource Identity

Paul Prescod paul at prescod.net
Fri May 28 17:37:18 BST 1999


There is a new document on the W3C site which is both encouraging and
disturbing:

"""In characterizing the structure and content of the Web, it is necessary
to establish precise semantics for Web concepts. The Web has proceeded for
a surprisingly long time without consistent definitions for concepts which
have become part of the common vernacular, such as "Web site" or "Web
page"."""

"""This document represents an effort on the part of the W3C Web
Characterization Activity to establish a shared understanding of key Web
concepts."""

http://www.w3.org/1999/05/WCA-terms/

It is encouraging because it is long needed. It is disturbing because I
believe it identifies a key problem with the Web (or with my understanding
of the Web). 

This document refers to the URI specification in its definition of
"resource": "...anything that has identity." This is troubling because
the term "identity" is itself never defined. In the HyTime and
object-oriented worlds, I believe that the defining characteristic of
things with identity is that you can take two references and determine
whether they refer to the same object.

I do not see how to do this on the Web. Consider the following URLs:

http://www.mitre.org/index.html
http://www.mitre.org/
http://www.mitre.org

Do they refer to the same resource? Let's try the answer both ways:

YES:

How do we know, other than by common sense? What if the URLs were more
radically different -- if the mitre site were also accessible as miter
because French and English authors always swap their r's and e's? I would
love to hear that there is such a thing as a "canonical URL" that I can
retrieve through HTTP or WebDAV. If there is, it should be referred to in
WCA-terms.

Because the Web distinguishes between Web resources and resource
manifestations, it is even possible that accessing the same logical
resource through different URLs returns different byte sequences
("entities" in HTTP terminology), so that even a byte-for-byte comparison
will not reveal that the URLs refer to the same _logical resource_.
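
To make the limits of "common sense" concrete, here is a minimal sketch
of purely syntactic normalization applied to the three URLs above. The
function name normalize() and the particular rules chosen (lowercase the
scheme and host, drop the default http port, treat an empty path as "/")
are assumptions made for illustration, not something WCA-terms defines.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Apply only syntactic normalization; assumes an http URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the http default.
    if parts.port is not None and parts.port != 80:
        host = f"{host}:{parts.port}"
    # An empty path and "/" name the same thing for http URLs.
    path = parts.path or "/"
    # The fragment is dropped: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


urls = [
    "http://www.mitre.org/index.html",
    "http://www.mitre.org/",
    "http://www.mitre.org",
]
distinct = sorted({normalize(u) for u in urls})
print(distinct)
```

Under these assumptions the last two URLs collapse into one, but
index.html stays distinct: no rule that looks only at the string can know
that a server maps "/" to "index.html".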

NO:

This is more disturbing. It makes robust, scalable hypertext linking
essentially impossible. Consider it from an RDF point of view. If I use
RDF to attach a hundred properties to one URL and someone else uses it to
attach a hundred properties to another URL for the same site, then our
property groupings cannot be merged. This also affects XLink. If one
group of externally imposed XLinks refers to the site under one name and
another group refers to the site under another, then those groups cannot
be merged to create a single view.
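
The merging problem can be sketched with plain dictionaries standing in
for RDF property sets. Everything here is hypothetical: the property
names, the values, and the same_as table that says which URL is an alias
of which. The point is only that the merge needs that table, and nothing
on the Web supplies it.

```python
def merge(*property_sets, same_as=None):
    """Merge property sets keyed by URL.

    same_as maps an alias URL to the URL it should be filed under.
    """
    same_as = same_as or {}
    merged = {}
    for props in property_sets:
        for url, values in props.items():
            key = same_as.get(url, url)
            merged.setdefault(key, {}).update(values)
    return merged


# Two authors describe what they believe is the same site (made-up data).
mine = {"http://www.mitre.org/": {"rating": "useful"}}
theirs = {"http://www.mitre.org/index.html": {"language": "en"}}

# Without alias knowledge the descriptions stay apart.
naive = merge(mine, theirs)

# With an explicit alias table they collapse into a single view.
aliases = {"http://www.mitre.org/index.html": "http://www.mitre.org/"}
informed = merge(mine, theirs, same_as=aliases)

print(len(naive), len(informed))
```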

The only solution, if we assume a one-to-one correspondence between URLs
and objects, is to have EVERY NON-CANONICAL name for the object explicitly
redirect to the canonical name. This is not common practice on the Web,
and as long as URLs are human-typable it is not likely to become common
practice. If you move an object from the bowels of your Web site (a
hundred-character URL) closer to the "top" (a 20-character URL), you
aren't going to use an HTTP redirect to send people from the nice new
name to the older, canonical name. But if you change the canonical name,
then anything currently attached to the document through out-of-line
links will break.
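
If every non-canonical name did redirect, identity could be tested by
following redirects until they stop. The sketch below simulates that with
an in-memory table instead of real HTTP requests; the example.org URLs,
the table and the function names are all invented for illustration.

```python
# Hypothetical redirect table: non-canonical name -> redirect target.
REDIRECTS = {
    "http://example.org/short": "http://example.org/deep/path/original",
    "http://example.org/old-alias": "http://example.org/short",
}


def canonical(url, redirects, max_hops=10):
    """Follow redirects until a name with no redirect is reached."""
    seen = set()
    while url in redirects:
        if url in seen or len(seen) >= max_hops:
            raise ValueError(f"redirect loop or chain too long at {url}")
        seen.add(url)
        url = redirects[url]
    return url


def same_object(a, b, redirects):
    """Two names identify the same object if they share a canonical name."""
    return canonical(a, redirects) == canonical(b, redirects)


print(same_object("http://example.org/short",
                  "http://example.org/old-alias", REDIRECTS))
```

Note that the table has to point the short, friendly name at the long
original one, which is exactly the direction a site maintainer is
unlikely to choose.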

---
Summary:

I believe that the Web needs a concept of a canonical URL, if it doesn't
already have one. Retrieving a document, or the HEAD for the document,
should reveal the canonical URL. I wouldn't mind if the canonical URL
were a totally unreadable UUID, as long as I can take two URLs and figure
out whether they refer to two things that happen to have the same content
or actually refer to the SAME THING.
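
As a rough sketch of the test I am asking for: assume, purely
hypothetically, that every response carried a canonical identifier
alongside its body. The Response type, its canonical_id field and the
sample values below are made up; no existing header is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    canonical_id: str  # hypothetical identifier reported by the server
    body: bytes        # the bytes actually returned


def compare(a, b):
    """Separate identity from mere equality of content."""
    if a.canonical_id == b.canonical_id:
        # Same thing, even if the returned bytes differ.
        return "same thing"
    if a.body == b.body:
        return "same content, different things"
    return "different"


home = Response("id-1", b"<html>welcome</html>")
home_fr = Response("id-1", b"<html>bienvenue</html>")
mirror = Response("id-2", b"<html>welcome</html>")

print(compare(home, home_fr))
print(compare(home, mirror))
```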

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for only himself
 http://itrc.uwaterloo.ca/~papresco

Alabama's constitution is 100 years old, 300 pages long and has more than
600 amendments. Highlights include "Amendment 393: Amendment of Amendment
No.  351", "Validation of Laws Regulating Court Costs in Randolph County",
"Miscegenation laws", "Bingo Games in Russell County", "Suppression
of dueling".  - http://www.legislature.state.al.us/ALISHome.html

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)




