Bush Donor Lists in XML

Elliotte Rusty Harold elharo at metalab.unc.edu
Sun Sep 12 23:50:54 BST 1999


On Friday, Governor George W. Bush of Texas posted complete records
of his campaign contributions on his web site. However, he
deliberately posted them in PDF format so they couldn't be
imported into a database or a spreadsheet, and consequently
reporters and voters couldn't find out just how much of his
money was coming from whom. Or at least that's what he thought. :-)

I am pleased to announce that, after a few hours of intense
hacking, I have succeeded in extracting the crucial information
from the PDF files and have posted it online in XML and
tab-delimited formats for anybody who wants it. Accountants,
start your spreadsheets! You'll find the files at

http://metalab.unc.edu/javafaq/bush/
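For anyone curious how tab-delimited records map onto XML, here is a minimal Python sketch. It is not the converter used for these files. The column order (name, city, state, amount) and the element names `donations` and `donation` are assumptions made up for illustration; check the posted files and DTD for the real layout.

```python
import csv
import io
from xml.sax.saxutils import escape

# Hypothetical column order; the real tab-delimited files may differ.
FIELDS = ["name", "city", "state", "amount"]

def tsv_to_xml(tsv_text):
    """Convert tab-delimited donor rows into a small XML document."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    lines = ["<donations>"]
    for row in reader:
        if not row:
            continue  # skip blank lines
        lines.append("  <donation>")
        for field, value in zip(FIELDS, row):
            # escape() protects &, < and > so the output stays well-formed
            lines.append(f"    <{field}>{escape(value.strip())}</{field}>")
        lines.append("  </donation>")
    lines.append("</donations>")
    return "\n".join(lines)

sample = "Smith & Sons\tAustin\tTX\t1000\n"
print(tsv_to_xml(sample))
```

Escaping matters here because donor and employer names routinely contain ampersands, which would otherwise break well-formedness.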

I've written a very simple DTD for the XML version:
<http://metalab.unc.edu/javafaq/bush/donations.dtd>. Based on
this DTD, the results do appear to be well-formed and valid
(though I've been burned by misbehaving validators before). The
first two validators I tried gave up on trying to parse such a
large (more than eight megabytes) document. Interestingly, the
initial conversion to XML did turn up some bugs in my
PDF-to-text converter program, but the validation of the XML did
not find any additional problems. I can see where a schema
language would be very useful for this sort of
reverse-engineering work, though.
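Since a document this size can overwhelm tools that build the whole tree in memory, a streaming (SAX) parse is one way to work with it. The following Python sketch checks well-formedness only, not validity against the DTD, and tallies donations per donor as it goes. The element names `name` and `amount` are assumptions for illustration and may not match donations.dtd.

```python
import io
import xml.sax
from collections import defaultdict

class DonationTotals(xml.sax.ContentHandler):
    """Sum donation amounts per donor name while streaming through the file."""

    def __init__(self):
        super().__init__()
        self.totals = defaultdict(float)
        self._text = []
        self._name = None

    def startElement(self, tag, attrs):
        self._text = []  # start collecting text for this element

    def characters(self, content):
        self._text.append(content)  # text may arrive in several chunks

    def endElement(self, tag):
        value = "".join(self._text).strip()
        if tag == "name":        # hypothetical element name
            self._name = value
        elif tag == "amount":    # hypothetical element name
            self.totals[self._name] += float(value)
        self._text = []

def total_by_donor(source):
    """Parse a file name or binary stream; raises SAXParseException if malformed."""
    handler = DonationTotals()
    xml.sax.parse(source, handler)
    return dict(handler.totals)

sample = (b"<donations>"
          b"<donation><name>A. Donor</name><amount>500</amount></donation>"
          b"<donation><name>A. Donor</name><amount>250</amount></donation>"
          b"</donations>")
print(total_by_donor(io.BytesIO(sample)))
```

Because the handler keeps only the running totals, memory use depends on the number of distinct donors rather than on the size of the file.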

Eventually I may try to cook up a more serious DTD that more closely
matches the FEC's actual required format for filing electronic copies of
donor lists. I'm also going to try to add a simple XSL stylesheet to these
in the near future, but they're so large that they really challenge anyone
trying to browse them directly.




+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|                  The XML Bible (IDG Books, 1999)                   |
|              http://metalab.unc.edu/xml/books/bible/               |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://metalab.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/     |
+----------------------------------+---------------------------------+



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)




