XML Search Engine Holy War - Attributes vs. Elements

Sat Oct 16 01:24:05 BST 1999

Duane Nickull wrote (re GoXML Context-based Search Engine):
>
>1. Ignore Attributes all together and index Elements and Character Data
>only.
>
>The feeling is that the use of attributes should be restricted (by
>authors).. [snip]
>
>2. Index attributes as text only and place the resulting text within the
>character data portion of the index.
>[snip]
>
>3. We should index attribute values as ____________? names as _____?

Robert DuCharme wrote [re #1]:
>
>This shouldn't even be considered. Attributes are used for far more than
>what the above paragraph describes. [snip]

I heartily agree with Robert here.  We are a publisher of medical and
veterinary reference materials, and we often use XML attributes to qualify
their associated elements.  We also use attributes extensively in
configuration files that consist largely of empty-element types (i.e.,
no character data or child elements to index!) -- and we want our indexing
tools to handle any of our XML data (yet another reason we're looking long
and hard at replacing/supplementing our DTDs with XML Schemas [..not
intending to re-start the 'DTD vs. Schema' flame-war] ;-).

Neither option 1 nor 2 is acceptable.  If you are segregating indexed
content, then you need to add a section for attributes as well as character
data, without merging these two vitally different portions of the XML
structure.

To fill in the blanks in option 3, you could simply treat attribute names as
analogous to element type names and attribute values as the "character
data".  Or you could treat these as a separate searchable category.  It would
also be very useful to provide a mapping mechanism between element-chardata
and attribute name-value pairs to handle differences between websites.  For
example, site A might be pure elements, whilst site B uses elements with
attributes -- yet i'd want to be able to do a title search that would hit on
both, que no?

  A:
  <book_catalog>
    <book>
       <title>Professional XML</title>
       <pub>Wrox Press Ltd.</pub>
    </book>
  </book_catalog>

  B:
  <book_catalog>
    <book title="Professional XML" pub="Wrox Press Ltd."/>
  </book_catalog>
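
A rough sketch of how such a mapping might look, in Python with the standard
xml.etree.ElementTree module.  This is only an illustration under my own
assumptions, not how GoXML works: the names index_fields and search are
hypothetical, and the index is just a list of (name, value, kind) triples.
Element text and attribute values are indexed under their names but tagged
separately, so a search can hit both or be restricted to one kind.

```python
import xml.etree.ElementTree as ET

# The two sample catalogs from above (site A: pure elements, site B: attributes).
SITE_A = """<book_catalog>
  <book>
    <title>Professional XML</title>
    <pub>Wrox Press Ltd.</pub>
  </book>
</book_catalog>"""

SITE_B = """<book_catalog>
  <book title="Professional XML" pub="Wrox Press Ltd."/>
</book_catalog>"""


def index_fields(xml_text):
    """Return (name, value, kind) triples for element text and attributes.

    Hypothetical helper: only the leading text of each element is indexed,
    so text that follows a child element (mixed content) is ignored here.
    """
    entries = []
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        if text:
            entries.append((elem.tag, text, "element"))
        for name, value in elem.attrib.items():
            entries.append((name, value, "attribute"))
    return entries


def search(entries, field, term):
    """Case-insensitive substring match on entries whose name equals field."""
    term = term.lower()
    return [e for e in entries if e[0] == field and term in e[1].lower()]


if __name__ == "__main__":
    for label, doc in (("A", SITE_A), ("B", SITE_B)):
        print(label, search(index_fields(doc), "title", "xml"))
```

Because each entry keeps its kind, the same index could also answer an
"attributes only" query without merging the two portions of the structure.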

Also heed Robert's mention of ID/IDREF attributes -- these will be critical
for serious XML apps!

As for the remark "..the use of attributes should be restricted (by
authors)..", i hope that you're not serious about this!  IMHO, any XML
tool/product/whatever that attempts to narrow the use of XML features and/or
otherwise dictate structure to users of XML is doomed to a similarly narrow
market.

A related issue from GoXML's webpage "XML Meta Tags"
  @ http://www.goxml.com/about/xmeta.htm
>
>There is currently no standard for which we can index XML meta tags. We are
>working on a standard for XML meta tags which are actually comments:
>
><!--XMETA:KEYWORDS | keyword1 keyword2 keyword3-->
>[snip]
>
>Another proposed way of doing this is through the use of processor
>instructions, (PI's). [snip]
>
>This was a point recently brought up to us by Jacob Hammeken, and it looks
>like this approach would be a much cleaner way of placing meta markup in an
>XML document. Any comments?

It's not just cleaner -- you must use PIs for this purpose, since XML 1.0
specifically states in section "2.5 Comments" that "..an XML processor
[parser] may, but need not, make it possible for an application to retrieve
the text of comments".  And as Tim Bray states in his annotations: "This
means that if you're building an XML application, you should never rely on
anything that shows up in a comment (this sleazy trick is far too common in
HTML)."  The parser used by your indexer may provide you the comments, but
mine might not -- and i'm not necessarily going to be happy to change my
parser to use your indexer, eh?
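
To illustrate the PI route, here is a minimal sketch using Python's standard
xml.sax module, whose ContentHandler receives processing instructions through
its processingInstruction callback.  The PI target "xmeta-keywords" and the
space-separated keyword format are my own assumptions for the example; the
actual syntax GoXML proposes is not shown in the excerpt above.

```python
import xml.sax


class MetaPIHandler(xml.sax.ContentHandler):
    """Collect keywords from a hypothetical <?xmeta-keywords ...?> PI."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def processingInstruction(self, target, data):
        # SAX passes the PI target and its data to the application.
        if target == "xmeta-keywords":
            self.keywords.extend(data.split())


def extract_keywords(xml_text):
    """Parse xml_text and return the keywords found in matching PIs."""
    handler = MetaPIHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.keywords


DOC = (
    '<?xml version="1.0"?>\n'
    "<?xmeta-keywords keyword1 keyword2 keyword3?>\n"
    "<book_catalog/>"
)

if __name__ == "__main__":
    print(extract_keywords(DOC))
```

The point of the design is that the indexer depends only on something the
spec obliges a parser to pass along, rather than on comment text that a
parser is free to discard.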

Regards and best wishes,
-Nik O, Teton Data Systems, Jackson, Wyo.

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)