Whitespace

Andrew Greene agreene at bitstream.com
Thu Aug 28 18:56:14 BST 1997


But perl doesn't have to break $_ on newlines. Whenever I do SGML
"parsing" with perl, I start off with

    $/ = "<";

which says "the input record separator is '<' instead of newline."
Then, within my while (<>) loop, each $_ contains a single tag and
some content (roughly) matching the regexp:

    ($etagP, $gi, $attlist, $content) =
       /(\/?)(\w+)\s*([^>]*)\>(.*)/s;   # /s lets . match newlines in content

[For purists only: Yes, GIs can contain a different set of characters
than \w+, and attribute values can contain > if it's quoted,
and this doesn't chop off the '<' at the end of all tags except the
last one, and so on and so forth.... For SGML, it assumes that the
first character of ETAGO is the same as STAGO; for XML, it doesn't
handle the /> syntax... but it's simplified to make a point.]

The point is that perl doesn't care whether you have whitespace or
not, and if your perl script is splitting on newlines then you're
probably not going to correctly handle tags that contain newlines,
such as

    <book
        id=TWENTYKDOWN
        authorid=VERNEJ
        pubid=PENGUIN
    ><title>20,000 Leagues Under the Sea</title
    ></book
    >
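
To make that concrete, here is a minimal sketch of the loop described
above, wrapped in a sub so it can be run on a string. The sub name
parse_tags, the hash keys, and the in-memory filehandle are illustrative
assumptions of this sketch, not part of the script described in this
message. It keeps the simplified regexp (with the /s modifier so that
"." can match newlines) and uses chomp, which removes a trailing copy of
whatever $/ is set to, to drop the '<' at the end of each record.

```perl
use strict;
use warnings;

# Hypothetical helper: split markup into tag records, using '<' as the
# input record separator instead of newline.
sub parse_tags {
    my ($text) = @_;
    my @tags;
    local $/ = "<";    # records now end at '<', not at "\n"
    open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
    while (my $record = <$fh>) {
        chomp $record;    # chomp strips the trailing $/, i.e. the '<'
        # Same simplified pattern as in the message; records that do not
        # look like a tag (e.g. text before the first '<') are skipped.
        my ($etag, $gi, $attlist, $content) =
            $record =~ /(\/?)(\w+)\s*([^>]*)>(.*)/s
            or next;
        push @tags, {
            end     => $etag,       # "/" for an end tag, "" otherwise
            gi      => $gi,         # generic identifier (element name)
            attlist => $attlist,    # raw attribute text, newlines included
            content => $content,    # text up to the next '<'
        };
    }
    close $fh;
    return @tags;
}

# Usage: a tag split across lines is still seen as one record.
my $sample = "<book\n    id=TWENTYKDOWN\n><title>20,000 Leagues Under the Sea</title\n></book\n>\n";
printf "%s%s\n", $_->{end}, $_->{gi} for parse_tags($sample);
```

Run on that sample, the sketch should report four tags (book, title,
/title, /book), even though none of the line breaks fall between tags.
The caveats in the "purists" note still apply to it unchanged.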

- Andrew Greene

        
        
        

