How to create URIs out of system ids

Paul Grosso paul at arbortext.com
Tue Jan 27 17:55:38 GMT 1998


The XML 1.0 spec indicates that:
  "[t]he SystemLiteral that follows the keyword SYSTEM [which]
  is called the entity's system identifier is a URI, 
  which may be used to retrieve the entity." [1]

The questions I have are less issues with the spec than
issues of practical implementation, hence this posting.

I am trying to grapple with the following question(s):

  Given an SGML external identifier with a possibly omitted 
  system identifier, how would an application most appropriately
  generate the system id part for a valid XML ExternalID?

The basic scenario is that I'm starting with something that is
not necessarily XML (perhaps it's SGML), and I'm trying to automate
the process of producing XML, specifically in this case, valid
XML ExternalIDs.

Here's my cut of the issues.

Consider the following cases all allowed by SGML:
1.  <!ENTITY foo PUBLIC "public id">
2.  <!ENTITY foo PUBLIC "public id" "sysid">
3.  <!ENTITY foo SYSTEM "sysid">
4.  <!ENTITY foo SYSTEM "">
5.  <!ENTITY foo SYSTEM>

In cases 1 and 5, we can assume that the application has some
way to determine an implicit system id at least some of the time.

Note that the relevant section in the XML PR [1] goes on to say:

	Unless otherwise provided by information outside the scope 
	of this specification..., relative URIs are relative to the
	location of the resource within which the entity declaration 
	occurs.

A system identifier in general could be:

a.  a file pathname relative to the location of the resource within 
    which the entity declaration occurs;
b.  a file pathname relative to something else (e.g., the catalog in 
    which the sysid was found as a result of the public id lookup);
c.  an absolute file pathname on the local computer's file system;
d.  a URL relative to the encapsulating entity;
e.  a URL relative to some other base URL somehow specified;
f.  an absolute URL;
g.  empty;
h.  something else (e.g., "this is garbage").

The basic question is what SystemLiteral to generate to create the
most appropriate valid XML ExternalID in each case. 

Below I'm using the term "sysid" to refer to the system id as 
specified in the external id or in the catalog, and "URL" to refer 
to the SystemLiteral that will get put into the XML ExternalID.

For a, the relative file pathname would get converted to the
equivalent relative URL just by converting the syntax.  On Unix
and NT, this would consist just of escaping characters not allowed
in URLs; on DOS-based machines, Macs, etc., the path separator
character (\ or :, etc.) would also get converted to /.
Alternatively, the application could make the sysid absolute and
then handle it as case c, which would make the document more likely
to work if it were moved elsewhere.  Thoughts?
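
A minimal sketch of the case a syntax conversion, in Python purely for
illustration.  The function name and the "separator" parameter are
hypothetical, and the sketch assumes a plain separator-delimited relative
path (it does not try to handle Mac-style leading colons or drive letters):

```python
from urllib.parse import quote


def relative_path_to_url(sysid: str, separator: str = "/") -> str:
    """Convert a relative file pathname to a relative URL (hypothetical helper)."""
    # Split on the platform's path separator so it can be replaced by "/".
    segments = sysid.split(separator)
    # Percent-escape each segment on its own; safe="" means that any
    # reserved character left inside a segment is escaped too.
    return "/".join(quote(segment, safe="") for segment in segments)
```

With a DOS-style path, relative_path_to_url("figs\\my chart.gif", "\\")
gives "figs/my%20chart.gif"; with a Unix-style path only the escaping
step has any effect.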

For b, either the application could try to get fancy and translate
the sysid that is relative to something else into one that is relative
to the containing document and then handle as case a; otherwise, it
could make the sysid absolute and then handle it as case c.
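
The "get fancy" option for case b might look roughly like the sketch
below.  It assumes Unix-style paths and that the application knows both
the directory the sysid is relative to (here, the catalog's directory)
and the directory of the containing document; the function and parameter
names are made up for illustration:

```python
import posixpath


def rebase_sysid(sysid: str, catalog_dir: str, document_dir: str) -> str:
    """Re-express a catalog-relative sysid relative to the document (hypothetical)."""
    # First make the sysid absolute by resolving it against the catalog's directory.
    absolute = posixpath.normpath(posixpath.join(catalog_dir, sysid))
    # Then compute the path from the document's directory to that file.
    return posixpath.relpath(absolute, document_dir)
```

For example, rebase_sysid("ents/chars.ent", "/proj/catalogs", "/proj/docs")
gives "../catalogs/ents/chars.ent", which can then be handled as case a.
The intermediate absolute path is what would be handed to case c instead.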

For case d, there is nothing to do.  Alternatively, the application
could make the URL absolute and then handle it as case f.

For case e, either the application could try to get fancy and translate
the URL that is relative to something else into one that is relative
to the containing document and then handle as case d; otherwise, it
could make the URL absolute and then handle it as case f.
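
Making a relative URL absolute (cases d and e) is a matter of resolving
it against the appropriate base URL.  As a sketch, Python's standard
urljoin does this kind of resolution; the wrapper name and the
example.com URLs are placeholders, not anything from a real setup:

```python
from urllib.parse import urljoin


def make_absolute(sysid_url: str, base_url: str) -> str:
    """Resolve a possibly relative URL against a base URL (hypothetical wrapper)."""
    # urljoin takes the base first; a sysid_url that is already absolute
    # is returned unchanged, so case f passes straight through.
    return urljoin(base_url, sysid_url)
```

With a base of "http://www.example.com/docs/book.xml", the relative URL
"../dtd/book.dtd" resolves to "http://www.example.com/dtd/book.dtd".
For case d the base is the URL of the containing entity; for case e it
is whatever other base URL was specified.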

For case f, there is nothing to do.

For g, the application could leave it empty since that is a valid URL,
though probably not what's intended.  Or it could write some URL such
as "http://unknown.netloc/unknown.url".  Any other ideas?

For case h, the application could leave it alone and just pass on the
"garbage" or it could handle it as case g.  Thoughts?

For c, I'm not sure what makes the most sense.  Presumably, the application
could try to get fancy and, if the referenced file is in fact accessible
via some http-URL, make the conversion, but this seems tricky and questionable
and certainly can't work in all cases.  That leaves writing out the absolute
file name as a file-scheme URL.  (Am I missing some other alternative?)  My 
reading of RFC1738 seems to indicate that, for a file path name of 
c:\pbg\webpages\pbghome.htm on my local machine, the file-scheme URL could 
be either:
	file://localhost/c:/pbg/webpages/pbghome.htm
or
	file:///c:/pbg/webpages/pbghome.htm
The latter works in NS3.0 and IE3.0 on my W95 machine (the former
works in NS3.0 but not in IE3.0 per my experiments--I think I've
heard from others that "localhost" does work now in IE4.0).

So it sounds like what I'd do in case c is do the syntax conversion
as in case a (e.g., \ to / and escape characters as necessary), then
prepend "file:///" to the result.  Is that reasonable?

Another angle I've heard is that user-specified sysid's (cases 2-4 above)
should be left untouched since that's what the user said and only sysid's
that the application must intuit (cases 1 and 5) should be subject to
any of the massaging I've discussed in a-h above.  If you subscribe to
"my gun, my bullet, my foot, my health insurance", then I suppose I can
see that point.  If you subscribe to "do what I mean, not what I say,
I'd prefer you made my life smoother despite myself because all this
technical stuff shouldn't be so hard to figure out in the first place",
then I can see arguments for trying to turn all sysids that aren't
already absolute URIs into absolute URIs for maximal portability.

I'd be interested in hearing others' thoughts on this.

paul

[1] http://www.w3.org/TR/PR-xml#sec-external-ent
[2] http://www.w3.org/TR/PR-xml

Other sources include RFC1738 and RFC1808.




