A processing instruction for robots

Lars Marius Garshol larsga at garshol.priv.no
Mon Dec 6 09:25:51 GMT 1999


* Walter Underwood
|
| Comments are welcome.

First thought: this is fine for very simple uses, but for more complex
uses something along the lines of the robots.txt file would be very
nice. How about a variant PI that can point to a robots.rdf resource?


Second thought: "and the index attribute must be first". This is nice
for implementors, but it is likely to clash with the expectations of
users, and allowing any order would cost implementors very little.

Why not follow the <URL: http://www.w3.org/TR/xml-stylesheet/ > style
of specifying PI pseudo-attributes?
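
To show how little order-independence costs, here is a minimal Python
sketch of parsing pseudo-attributes in the xml-stylesheet style. The
function name is hypothetical, and "index" and "follow" are assumed to
be the robots PI's pseudo-attributes. Character and entity references
in values, which xml-stylesheet allows, are left out for brevity.

```python
import re

# One pseudo-attribute: name="value" or name='value', as in xml-stylesheet.
_PSEUDO_ATTR = re.compile(r"""\s*([A-Za-z_][\w.-]*)\s*=\s*("[^"]*"|'[^']*')""")


def parse_pseudo_attributes(data):
    """Return the pseudo-attributes found in PI data as a dict.

    Order does not matter; a duplicate name or leftover junk is an error.
    (Hypothetical helper, not taken from the robots PI spec.)
    """
    attrs = {}
    pos = 0
    while pos < len(data):
        match = _PSEUDO_ATTR.match(data, pos)
        if match is None:
            # Only trailing whitespace may remain once matching stops.
            if data[pos:].strip():
                raise ValueError("malformed PI data: %r" % data[pos:])
            break
        name, quoted = match.group(1), match.group(2)
        if name in attrs:
            raise ValueError("duplicate pseudo-attribute: %s" % name)
        attrs[name] = quoted[1:-1]  # strip the surrounding quotes
        pos = match.end()
    return attrs
```

With this, index="yes" follow="no" and follow='no' index='yes' yield
the same dictionary, so the "index first" rule buys very little.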


Also: The robot PI, says the spec, "should be in the internal subset
(not in an external DTD or parameter entity). Since robots may be
non-validating, a robots PI in the external subset might not be seen
by the robot."

I think this is misleading, since "the internal subset" is usually
short for "the internal DTD subset". A better way of putting it might
be "It should be in the document entity (not in an external entity,
including the external DTD subset and external parameter entities).
Since robots may skip external entities, PIs in external entities
might not be seen by the robot."

However, I don't think this will do either. Entities are the units
that make up the storage structure of SGML/XML documents, and I think
this spec needs to take some sort of stand as to how entities map to
WWW resources, and which entities the PI is really talking about.

One way is to say that every resource is an entity, and every
web-accessible entity is a resource. Then one might say that the
robots PI refers to

 a) the entity in which it is found

 b) the entity in which it is found and all entities included by this
 entity via entity references, regardless of any robots PIs in these
 included entities

 c) the entity in which it is found, and if "follow" is set to yes,
 all entities included by this entity via entity references,
 regardless of any robots PIs in these included entities

 d) the entity in which it is found, and if "sub-entities" is set to
 yes, all entities included by this entity via entity references,
 regardless of any robots PIs in these included entities

Once one agrees on a policy I think this is worth a subsection in the
spec, regardless of the choice made. b) is probably the easiest to
implement, since many APIs do not expose entity structure. It might
not be the best choice, though.
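
To illustrate the point about APIs, here is a minimal sketch using
Python's xml.sax ContentHandler interface, which reports processing
instructions without saying which entity they came from. The PI target
name "robots" and the names RobotsPIHandler and find_robots_pis are
assumptions made for the example.

```python
import xml.sax


class RobotsPIHandler(xml.sax.ContentHandler):
    """Collect the data of every robots PI the parser reports."""

    def __init__(self):
        super().__init__()
        self.robots_pis = []

    def processingInstruction(self, target, data):
        # The callback receives only target and data; nothing here
        # identifies the entity in which the PI was found.
        if target == "robots":  # assumed target name
            self.robots_pis.append(data)


def find_robots_pis(document_bytes):
    """Parse a document and return the data of its robots PIs."""
    handler = RobotsPIHandler()
    xml.sax.parseString(document_bytes, handler)
    return handler.robots_pis
```

A robot built on an interface like this can only apply a PI to
everything it is handed, which is why b) is the cheapest policy.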

--Lars M.

