A processing instruction for robots

rev-bob at gotc.com
Thu Dec 9 10:28:32 GMT 1999

I suppose this is where I'm supposed to come in....  ;)

> > Adding the robots info to every DTD in the world requires
> > unanimous agreement. Adding a PI requires non-interference
> > with other PIs, a vastly simpler task. Waiting for XML
> > to support mixin vocabularies and for those to be widely
> > used, could take a few years.
> Walter is right on both counts, but I'm having trouble getting comfortable
> with his PI idea.  Not violently against it, but two things make me
> uncomfortable.  First of all, PIs basically suck.  Having said that, if you
> gotta use them, this is the kind of thing to use them for.

Agreed.  This is an instruction to a specific class of processor, hence it's a good fit as a 
PI from that angle.

> But my big problem is with the idea that individual resources ought to embed
> robot-steering information. It just feels like the wrong level of
> granularity.  Either this ought to be done externally in something like
> robots.txt but smarter, at the webmaster/administrator level, or, with a
> namespaced vocabulary at the individual element level.

This has been tried.  The problem is that the current robots.txt idea just doesn't work for 
everybody - robots.txt is supposed to reside in the domain's root [1], and not everybody 
has that access.  (Big examples: Geocities, Angelfire, Tripod, AOL....)  Granted, a tweak 
to that specification that would allow local copies of robots.txt to affect their subdirectory 
tree would be *most* helpful in that regard, but that just doesn't exist.

[1] - See http://info.webcrawler.com/mak/projects/robots/norobots.html under "The 
Method" header.  The filename is "/robots.txt" - which forces the file into the root.

Because of this overwhelming gap, there's a hack in HTML that uses META to granularize this at a per-document level, and a few bots are good about obeying that directive.
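The META hack in question is a per-document tag in the HEAD, which conscientious bots check before indexing or following links:

```
<meta name="robots" content="noindex, nofollow">
```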

> The PI has the characteristic that it *has* to be in the document and can modify
> *only* the whole document.  Also I question the ability of authors to do the right
> thing with this kind of a macro-level control.

I do it all the time.  In fact, I have a default value I can specify in my templates.  (Yes, I 
could use a robots.txt file - the current method is a holdover from before I had a domain 
name for my site.)

> Also I question the ability of robot authors to do the right thing at the individual
> document level.

That's already a current issue.  Bot authors who are conscientious enough to obey the 
META hack will have no problem modifying their source to obey the XML PI as well; 
it's a trivial transformation.  (Especially if the PI uses syntax that's as close to the META 
version as possible!)
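To illustrate just how trivial that transformation could be, here's a sketch in Python. The `<?robots ...?>` PI syntax is my assumption, modeled as closely as possible on the META form (the actual PI syntax is exactly what's under discussion here); the point is that a bot already parsing the META hack needs only one extra pattern to handle the PI.

```python
import re

# The META form is the existing HTML hack; the PI form is a *hypothetical*
# syntax modeled on it, since the real PI has not been agreed on yet.
META_RE = re.compile(
    r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']+)["\']', re.I)
PI_RE = re.compile(r'<\?robots\s+([^?]+)\?>', re.I)

def robot_directives(document: str) -> set:
    """Return the set of robot directives (e.g. {'noindex', 'nofollow'})
    found in either the HTML META form or the sketched XML PI form."""
    match = META_RE.search(document) or PI_RE.search(document)
    if not match:
        return set()
    # Both forms carry the same comma-separated value list, so the
    # downstream handling is identical - that's the "trivial" part.
    return {token.strip().lower() for token in match.group(1).split(',')}

# Usage: both forms yield the same directive set.
robot_directives('<meta name="robots" content="noindex, nofollow">')
robot_directives('<?robots noindex, nofollow?>')
```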

> In any case, there really should be a namespace with a bunch of predeclared
> attributes for this purpose; then for those who want to do fancy things,
> they can do so in a clean way at the individual element level.

Fine - swipe the existing values and go from there.  The fewer changes made, the better - 
from all viewpoints.  Not only will there be fewer deltas for page authors to learn, but 
bot authors will be better able to just reuse existing META code to accommodate the PI.

Note that I'm not saying that a local robots file wouldn't be a wonderful idea - just that 
since you currently have only the choices of "global" and "per document" with HTML, 
you ought to have *at least* those same choices with XML.  A local robots.txt would be 
tasty gravy indeed.

> Anyhow, is there enough XML on the web to make this interesting?  Serious
> question, I don't know the answer. -T.

I have enough X(HT)ML up to be very interested in this matter - and there's only going 
to be more online as the spec progresses.  Why not address the issue *before* there's a 
huge amount of X(HT)ML online, instead of waiting until a few assorted hacks come along?

 Rev. Robert L. Hood  | http://rev-bob.gotc.com/
  Get Off The Cross!  | http://www.gotc.com/

Download NeoPlanet at http://www.neoplanet.com

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo at ic.ac.uk with the following message:
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk with the following message:
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
