A little wish for short end tags

Sun May 17 18:38:45 BST 1998

At 10:21 PM 5/16/98 -0400, Paul Prescod graced us with:
> So what you're saying is that this sort of thing can work, even with all
> of the minimization features of SGML, because you knew the general layout
> of your data and/or you had a tool that could normalize the weird stuff
> for you. You succeeded at your task because you approached it (perhaps
> unknowingly) with either:
> 
>  a) the right data set: more or less already normalized SGML or
>  b) the right tool -- a normalizer: AE.

Um, no. I'm sorry - you didn't read what I said. I didn't say that I had
been running my SGML through A/E; I said I had been running Perl scripts on
text files that were output from a conversion system whose input was
irregular at best - "this page intentionally left blank", typos, and so
forth were the order of the day. The previous sentence may be translated as
"the input was not normalized".

Or did I not make this abundantly clear? If so, my apologies. I should have
used terms more appropriate to your dialog, like "non-normalized", "poorly
ordered data sets" or some such nonsense - which, if I understand Jon Bosak
correctly, the average numbnut with Perl and a job to do wouldn't know.

I never thought I'd ever show anyone this crud, but in the interests of
keeping XML accessible, here's an extremely horrid script I wrote, based
on other horrid scripts I inherited. I knew nothing about Perl at the time
beyond a passing understanding of the while() construct below; my only
really useful knowledge was of regular expressions, which I'd taught myself
using A/E's search/replace function. The script below fixes a number of
problems with the output from another filter, whose input was the text
output from a proprietary conversion system. Illustrated:

 Garbage on Paper -> Garbage in Electronic Format -> ASCII Trash -> SGML

This script was the last in the pipeline. The references to 'paragrapgh'
and 'loping' below are jokes, based on a misspelling in the original 
script I rec'd from the programmer :)
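
For anyone who hasn't met it: the $/ = ""; line near the top of the script
puts Perl into paragraph mode, so each pass through while (<>) sees a whole
blank-line-delimited block rather than a single line. A minimal sketch of
that behaviour in current Perl; the read_paragraphs helper and the sample
text are made up for illustration and are not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into the records that `while (<>)`
# would see once $/ is set to the empty string (paragraph mode).
sub read_paragraphs {
    my ($text) = @_;
    local $/ = "";    # paragraph mode: runs of blank lines separate records
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Two blocks separated by blank lines come back as two records, and the
# hard return inside the first block is still part of that record.
my @records = read_paragraphs("<PARA>one\ntwo</PARA>\n\n\n<PARA>three</PARA>\n");
print scalar(@records), " records\n";
```

Because each record can span several lines, a substitution such as
s/\n+/ /g can rejoin a paragraph that the conversion system broke across
hard returns.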

#!/aie/newgateway/tools/perl

# enable paragrapgh mode, multiline mode

$/ = "";
# $* = 1;	# multiline mode; current Perls reject this, and no pattern below uses ^ or $

# get rid of hard returns, spaces before and after tags
# add an "a" to board no. Take out endpara-beginpara in
# emphasis tagset. gets deg&amp right. gets rid of empty tagsets.
# puts a space after end sup/subscrpt and end emphasis tags.
# joins seqlists that ought to be joined. fixes numstyle attrib.
# rejoins paras broken by page or hyphen.
# substitutes para0 title tags for emphasis tags found in paras.
# puts figures on the outside of notes, warnings, and cautions.
# Attempts to fix the seqlist nesting problem.
# gets nested emphasis tags to titles of para0s.

# start loping

while (<>) {

# remove para tags from within emphasis tags.

	s:(<EMPHASIS EMPH="[uiboz]">.*)</PARA>\n<PARA>:\1\n:g;

# join hyphenated lines, take out hardreturns, replace multispaces with one
# space.

#	s:-\n::g;
	s/\n+/ /g;
	s/( )+/\1/g;

# removes spaces after and before tags.

	s/> />/g;
	s/ </</g;

# puts "a" at beginning of board no.

#	s/(BOARDNO=")/\1a/g;

# remove common empty tags.

	s:<TITLE></TITLE>::g;
	s:<THEAD></THEAD>::g;
	s:<ROW></ROW>::g;
	s:<PARA></PARA>::g;
	s:<NOTE></NOTE>::g;
	s:<CAUTION></CAUTION>::g;
	s:<WARNING></WARNING>::g;

# adds space after emphasis and s-script tags.

	s:(</EMPHASIS>):\1 :g;
	s:(<EMPHASIS( EMPH="[biuoqz]+")?>): \1:g;
	s:(</SU[BP]SCRPT>):\1 :g;

# gets the entities right.

	s:&degree;:&deg;:g;
	s:&ampree;:&amp;:g;
	s:\@reg;:&reg;:g;

# joins seqlists broken by paras.

#	s:</SEQLIST></PARA><PARA><SEQLIST>::g;

# dehyphenates mistakenly broken paras.

	s:-?</PARA><PARA>([a-z]):\1:g;

# puts seqlists inside the previous para.

	s#(:|.)[\032]*</PARA><PARA>(<SEQLIST>)#\1\2#g;

# turns emphasis tags just inside paras into titles of the previous para0.
# works on nested emphasis tags as well, to two levels.

	s#<PARA0 LABEL="([0-9]+\-[0-9]+\.)"><PARA><EMPHASIS EMPH="[uibozq]">([0-z \-\(\)/]*)\.</EMPHASIS> #<PARA0 LABEL="\1"><TITLE>\2.</TITLE><PARA>#g;
	s#<PARA0 LABEL="([0-9]+\-[0-9]+\.)"><PARA><EMPHASIS EMPH="[uibozq]"><EMPHASIS EMPH="[uibozq]">([0-z \-\(\)/]*)\.</EMPHASIS> </EMPHASIS> #<PARA0 LABEL="\1"><TITLE>\2.</TITLE><PARA>#g;

# puts in numstyle attrib. in seqlists.

	if(/<SEQLIST><ITEM LABEL="[0-9]/) {
		s:<SEQLIST><ITEM LABEL="([0-9]):<SEQLIST NUMSTYLE="ARABIC"><ITEM LABEL="\1:g;
		}
	elsif(/<SEQLIST><ITEM LABEL="[a-z]*/) {
		s:<SEQLIST><ITEM LABEL="([a-z]*):<SEQLIST NUMSTYLE="ALPHALC"><ITEM LABEL="\1:g;
		}
	elsif(/<SEQLIST><ITEM LABEL="[A-Z]*/) {
		s:<SEQLIST><ITEM LABEL="([A-Z]*):<SEQLIST NUMSTYLE="ALPHAUC"><ITEM LABEL="\1:g;
		}
	if(/<\/PARA><\/ITEM><ITEM LABEL="a\./) {
		s:</PARA></ITEM><ITEM LABEL="a:<SEQLIST NUMSTYLE="ALPHALC"><ITEM LABEL="a:g;
		}
	elsif(/<\/PARA><\/ITEM><ITEM LABEL="A\./) {
		s:<\/PARA><\/ITEM><ITEM LABEL="A:<SEQLIST NUMSTYLE="ALPHAUC"><ITEM LABEL="A:g;
		}
# splits seqlists at unnesting point - (only for ARABIC after ALPHALC.)

	if(/<ITEM LABEL="[b-z]\."><PARA>[^<>\/]*<\/PARA><\/ITEM><ITEM LABEL="[0-9]+\."/) {
		s:(<ITEM LABEL="[b-z]\."><PARA>[^<>/]*</PARA></ITEM>)<ITEM LABEL="([0-9]+)\.":\1</SEQLIST></PARA></ITEM><ITEM LABEL="\2":g;
		}

# puts figure outside of note | caution | warning tags.

	if(/<\/FIGURE><\/NOTE>/) {
		s:(<NOTE><PARA>[^<>/]*</PARA>)(<FIGURE LABEL="[0-9 \.\-]*"><TITLE>[^<>/]*</TITLE><GRAPHIC BOARDNO="[0-z]*"></FIGURE>)+(</NOTE>):\1\3\2:g;
		}
	if(/<\/FIGURE><\/CAUTION>/) {
		s:(<CAUTION><PARA>[^<>/]*</PARA>)(<FIGURE LABEL="[0-9 \.\-]*"><TITLE>[^<>/]*</TITLE><GRAPHIC BOARDNO="[0-z]*"></FIGURE>)+(</CAUTION>):\1\3\2:g;
		}
	if(/<\/FIGURE><\/WARNING>/) {
		s:(<WARNING><PARA>[^<>/]*</PARA>)(<FIGURE LABEL="[0-9 \.\-]*"><TITLE>[^<>/]*</TITLE><GRAPHIC BOARDNO="[0-z]*"></FIGURE>)+(</WARNING>):\1\3\2:g;
		}

# puts untagged FIGURE inside tags.

	if(/<PARA>FIGURE [0-9 \-\.]* [^<>]*<\/PARA>/) {
		s:<PARA>FIGURE ([0-z\-\.]*\.) ?([^<>]*)</PARA>:<FIGURE LABEL="\1"><TITLE>\2</TITLE><GRAPHIC BOARDNO=""></FIGURE>:g;
		}

# check for FIGURES with part of the label inside the title.

	if(/<FIGURE LABEL="[^"]*"><TITLE>[0-9\.]+ /) {
		s:(<FIGURE LABEL=")([^"]*)("><TITLE>)([0-9\.]+) :\1\2\4\3:g;
		}

# puts cautions, warnings, notes inside the previous item tag.

	if(/<\/PARA><\/ITEM><\/SEQLIST><\/PARA><NOTE>/) {
		s:(</ITEM></SEQLIST></PARA>)(<NOTE><PARA>[^<>]*</PARA></NOTE>):\2\1:g;
		}
	if(/<\/PARA><\/ITEM><\/SEQLIST><\/PARA><WARNING>/) {
		s:(</ITEM></SEQLIST></PARA>)(<WARNING><PARA>[^<>]*</PARA></WARNING>):\2\1:g;
		}
	if(/<\/PARA><\/ITEM><\/SEQLIST><\/PARA><CAUTION>/) {
		s:(</ITEM></SEQLIST></PARA>)(<CAUTION><PARA>[^<>]*</PARA></CAUTION>):\2\1:g;
		}
	if(/"><PARA>[A-Z \-\/\(\)\.]*\. /) {
		s:(">)(<PARA>)([A-Z \-\/\(\)\.]*\.) :\1<TITLE>\3</TITLE>\2:g;
		}

# joins seqlists that were created by bad tagging of cautions, warnings,
# and notes.

	if(/<\/CAUTION><\/ITEM><\/SEQLIST><\/PARA><PARA><SEQLIST/) {
		s:(</CAUTION></ITEM>)</SEQLIST></PARA><PARA><SEQLIST( NUMSTYLE="[A-Z]*")?>:\1:g;
		}
	if(/<\/NOTE><\/ITEM><\/SEQLIST><\/PARA><PARA><SEQLIST/) {
		s:(</NOTE></ITEM>)</SEQLIST></PARA><PARA><SEQLIST( NUMSTYLE="[A-Z]*")?>:\1:g;
		}
	if(/<\/WARNING><\/ITEM><\/SEQLIST><\/PARA><PARA><SEQLIST/) {
		s:(</WARNING></ITEM>)</SEQLIST></PARA><PARA><SEQLIST( NUMSTYLE="[A-Z]*")?>:\1:g;
		}

# fix erring para0s (figures, lb-in.)

		s:</PARA></ITEM></SEQLIST></PARA></SUBPARA1></PARA0><PARA0 LABEL="([^<>"]*)"><PARA>(lb in.)</PARA><PARA><SEQLIST NUMSTYLE="[A-Z]*">:\1 \2:g;
		s:</PARA></SUBPARA1></PARA0><PARA0 LABEL="([^<>"]*)"><PARA>(lb in.)</PARA><PARA>:\1 \2:g;
		s:</PARA></ITEM></SEQLIST></PARA></PARA0><PARA0 LABEL="([^<>"]*)"><PARA>(lb in.)</PARA><PARA><SEQLIST NUMSTYLE="[A-Z]*">:\1 \2:g;
		s:</PARA></ITEM></SEQLIST></PARA><PARA0 LABEL="([^<>"]*)"><PARA>(lb in.)</PARA><PARA><SEQLIST NUMSTYLE="[A-Z]*">:\1 \2:g;
		s:figure</PARA></ITEM></SEQLIST></PARA><CHAPTER><SECTION><PARA0 LABEL="([0-9 \-\.]*)"><PARA>:figureS \1:g;
		s:figure</PARA><CHAPTER><SECTION><PARA0 LABEL="([0-9 \-\.]*)"><PARA>:figureS \1:g;
		s:figure</PARA><PARA0 LABEL="([0-9 \-\.]*)"><PARA>:figureS \1:g;
		s:figure</PARA></PARA0><PARA0 LABEL="([0-9 \-\.]*)"><PARA>:figureS \1:g;
		s:figure</PARA></ITEM></SEQLIST></PARA></SUBPARA1><SUBPARA1 LABEL="([0-9 \-\.]*)"><PARA>:figureS \1:g;

# change the dtd header from docgasturb to dcgastep -- for meter books

		s:<!DOCTYPE DOCGASTURB SYSTEM "docgasturb.dtd":<!DOCTYPE DCGASTEP SYSTEM "dcgastep.dtd":;
		s:\]><PARA>\[</PARA>:[]>:;
		s:</TBODY>([^YH])*<TBODY>::;

	print;
}

The bitch of it is, this script worked fairly well and saved us enormous
amounts of time.
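
To show how little is going on in the early part of the loop, here is a
rough sketch of the whitespace and empty-tagset steps as one small function
in current Perl. The tidy_record name and the sample input are invented for
illustration, and folding the seven empty-tagset substitutions into a single
backreferenced pattern is a substitution of technique, not what the script
above does:

```perl
use strict;
use warnings;

# Hypothetical helper: apply a few of the script's cleanups to one
# paragraph-mode record and return the result.
sub tidy_record {
    my ($rec) = @_;
    $rec =~ s/\n+/ /g;    # hard returns become spaces
    $rec =~ s/ +/ /g;     # runs of spaces collapse to a single space
    $rec =~ s/> />/g;     # drop the space after a tag
    $rec =~ s/ </</g;     # drop the space before a tag
    # remove the common empty tagsets; \1 must repeat the same element name
    $rec =~ s{<(TITLE|THEAD|ROW|PARA|NOTE|CAUTION|WARNING)></\1>}{}g;
    return $rec;
}

# prints <PARA>some text</PARA>
print tidy_record("<PARA> some\ntext </PARA>\n<NOTE></NOTE>\n"), "\n";
```

Nothing here needs a parser or a DTD, only start and end tags that can be
matched as plain text.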

I hope that this ugliness demonstrates that Perl and SGML-like text files
can allow a complete neophyte to do wonders, and that any hifalutin changes
to the relative simplicity of the current XML spec would be detrimental to
the average 'frustrated perl programmer'.

Steve
(I'm so ashamed)

--
"All the good geek things,                         schampeo at hesketh.com
 only without all the                         http://a.jaundicedeye.com
 bad geek things."                         http://hesketh.com/schampeo/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)