Re Whitespace

David G. Durand dgd at cs.bu.edu
Thu Aug 28 21:09:22 BST 1997


>From: Sean Mc Grath <digitome at iol.ie>
>
>>Just as it's not useful in processing HTML. Regexps that don't match across
>>line boundaries are the most common problem I've seen in HTML-processing
>>Perl scripts. Looks like that will continue until people figure out that
>>Perl's line "Feature" is just a bug when used with XML/HTML.
>>
>
>Bang goes the notion of a lightweight XML app. then! Thou shalt always
>parse!
>
>XML as a friendly format to, say, DPH needs some explaining. To use Perl to
>read/write XML you *must* use an XML parser. Indeed any tool intending to
>read/write XML needs to use a *fully blown parser* to get at the document.
>Bye bye the entire Unix family of line oriented text processing apps:-(

Come on, this is a crock. I've set that cryptic little variable
(funny that everything in Perl deserves that description) so that
line ends won't block regexp matches. Once that was done, I wrote a few
regexps and parsed HTML just fine (it takes 1 line for a simple tag
pattern match, and 10 for a loop to create a reasonably full parse
into elements, content, and attribute values). I'm sure a "real" Perl
programmer (unlike me) can shrink that down to 2-3 lines of
twisty little characters, all of them different.
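
A minimal sketch of the kind of quick and dirty parse described above,
written in Python rather than Perl so that it is easy to run. The names
TAG, ATTR and quick_parse are invented for this illustration and are not
the original script. It assumes the whole document is held in one string,
that attribute values are quoted, and that no ">" appears inside an
attribute value; comments and processing instructions are simply passed
through as text.

```python
import re

# One pattern for a start, end, or empty-element tag. The negated class
# [^<>] also matches newlines, so a tag broken across lines still matches.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)([^<>]*?)(/?)>")

# name="value" or name='value' inside a start tag.
ATTR = re.compile(r"""([A-Za-z_:][\w.:-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def quick_parse(text):
    """Yield ('start', name, attrs), ('end', name) and ('text', data)."""
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            # Everything between two tags is treated as content.
            yield ("text", text[pos:m.start()])
        closing, name, attr_src, empty = m.groups()
        if closing:
            yield ("end", name)
        else:
            attrs = {n: dq or sq for n, dq, sq in ATTR.findall(attr_src)}
            yield ("start", name, attrs)
            if empty:
                # <br/> style tag: report a matching end right away.
                yield ("end", name)
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])


doc = '<p class="note"\n   id="p1">Hello <b>world</b><br/></p>'
for event in quick_parse(doc):
    print(event)
```

This is "dirty" in exactly the sense used in this message: it is fine for
a document or corpus whose markup habits you know, and it makes no claim
to handle arbitrary input.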

XML should be no harder. My understanding of the goal for the DPH was
always that XML would be no worse than HTML -- i.e., for quick and dirty
transformations or operations, quick and dirty parsers would work. As
far as I can tell, "dirty" means that you know (or are pretty sure)
they will work with one document or corpus of documents, not
necessarily that they will work with any arbitrary document.

If you never break tags across lines in your documents, your Perl
desperation may work without worrying about this case; if you do, you
have to have smarter desperation. For _reliable_ parsing of
_arbitrary_ documents, you probably do need a full parser of the
instance language (10 productions in the standard, or so, wasn't it?).
There's no reason that that level of parsing can't be implemented
in no more than 20 lines of Perl. I can't remember (or abide) the syntax of
Perl enough to write it, but I'm sure there's a DPH on the list who
would love to volunteer.
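
To make the "tags broken across lines" case concrete, here is a small
sketch, again in Python as a stand-in for Perl. The function names and
the sample document are made up for the illustration. It applies the same
start-tag pattern once line by line and once to the whole text.

```python
import re

# Matches a start tag only; [^<>] also matches newlines.
TAG = re.compile(r"<[A-Za-z][^<>]*>")


def tags_line_by_line(text):
    """Line-oriented matching: each line is searched on its own."""
    found = []
    for line in text.splitlines():
        found.extend(TAG.findall(line))
    return found


def tags_whole_text(text):
    """Whole-document matching: line ends are just more whitespace."""
    return TAG.findall(text)


# The second <a> tag is split across two lines.
doc = '<a href="one.html">one</a>\n<a\n  href="two.html">two</a>\n'

print(len(tags_line_by_line(doc)))
print(len(tags_whole_text(doc)))
```

The line-by-line version only sees the first start tag, because neither
half of the split tag matches on its own line; the whole-text version
finds both. If your documents never split tags, the two agree.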

>>It sounds to me like what we really need is a small paper (about 5
>>paragraphs) explaining whitespace for developers:
>>
>I think this is an excellent idea!

Well, I gave the three sentence version. Feel free to expand it...
Actually I think the three sentences sum it up pretty well.

  --
David------------------------------------------+----------------------------
David Durand                 dgd at cs.bu.edu| david at dynamicDiagrams.com
Boston University Computer Science        | Dynamic Diagrams
http://www.cs.bu.edu/students/grads/dgd/  | http://dynamicDiagrams.com/



xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



