Representing Large Tabular Data Blocks

Fri Nov 19 22:51:30 GMT 1999

Mike Mclinn asked:
>
>I am converting a number of existing proprietary file formats
>to a general XML format.  The nature of the data I am working
>with is very large, very well formatted blocks of data.
>To give a point of reference, these files are used for data
>transfer within the CAD industry.
>
>The natural XML solution would be of course to embed each
>data value within an element or element attribute.  Such as:
><Point XYZ="324.1241 121.1214 -12.4521" NORMAL="0.0 0.0 1.0"/>
>However, when you replicate this point element a hundred
>thousand times or so, you get an enormous increase in file
>size.  Thus raising the question of XML efficiency.
>

It looks like you want to parse out the XYZ data after you extract them
from the XML.  You have 40 characters of data (including the point's id,
"XYZ" in this example) and 22 mark-up-related characters.  That makes
the mark-up about 35% of the 62-character element.  It's not an
"enormous" overhead, but it's noticeable.  Since you are not using XML
to describe the substructure of points, you could abbreviate the names:
    <Pt XYZ="324.1241 121.1214 -12.4521" NRM="0.0 0.0 1.0"/>

This gives you 16 characters of non-data out of 56, or 29%.  If the
NORMAL vector used 8 characters for each component instead of 3 as in
the example (which I imagine is usually the case), you would have an
overhead of 16/71, or 22%.
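
The counting above can be checked mechanically.  The following is only
a sketch: the function name overhead() is made up for illustration, and
it assumes that "data" means the two attribute values plus the "XYZ"
label, as in the count above, with every other character treated as
mark-up.

```python
# Count data versus mark-up characters in a single element.
# Assumption: data = attribute values plus the "XYZ" label;
# everything else in the element string counts as mark-up.

def overhead(element, data_parts):
    """Return (data_chars, markup_chars, markup_fraction_of_total)."""
    data = sum(len(part) for part in data_parts)
    markup = len(element) - data
    return data, markup, markup / len(element)

xyz = "324.1241 121.1214 -12.4521"
normal = "0.0 0.0 1.0"
parts = ["XYZ", xyz, normal]

long_form = f'<Point XYZ="{xyz}" NORMAL="{normal}"/>'
short_form = f'<Pt XYZ="{xyz}" NRM="{normal}"/>'

for form in (long_form, short_form):
    data, markup, fraction = overhead(form, parts)
    print(f"{len(form)} chars: {data} data, {markup} mark-up, "
          f"{fraction:.1%} of total")
```

For the two elements shown in this message it reports 22 mark-up
characters out of 62 and 16 out of 56.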

Do you really need to worry about 22% in the file size?  If so, you
could express your numbers in base 26, say, to reduce the number of
digits.  Just as binary expression makes the numbers very long, a
larger base makes them shorter.  This would probably reduce every 3
characters to 2.  But with or without this encoding, you still need
another, non-XML parser for the actual data components.
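
As a rough illustration of the base-26 idea, here is one way it might
be done.  Everything specific in it is an assumption made for the
sketch: the letters a-z as the 26 digits, a fixed precision of four
decimal places (so the decimal point need not be stored), and the
function names encode() and decode().

```python
import string

ALPHABET = string.ascii_lowercase  # assumed digits: 'a' = 0 ... 'z' = 25
SCALE = 10_000                     # assumed precision: four decimal places

def encode(value):
    """Encode a coordinate as a signed base-26 string of letters."""
    n = round(value * SCALE)       # fixed-point integer, e.g. 3241241
    sign = "-" if n < 0 else ""
    n = abs(n)
    digits = ""
    while True:
        n, rem = divmod(n, 26)
        digits = ALPHABET[rem] + digits
        if n == 0:
            break
    return sign + digits

def decode(text):
    """Invert encode(): base-26 letters back to a float."""
    sign = -1 if text.startswith("-") else 1
    n = 0
    for ch in text.lstrip("-"):
        n = n * 26 + ALPHABET.index(ch)
    return sign * n / SCALE

for value in (324.1241, 121.1214, -12.4521):
    text = encode(value)
    print(value, "->", text, f"({len(text)} chars)", "->", decode(text))
```

Letters are safe inside an attribute value, so no escaping is needed,
but the result is no longer readable by eye, which is part of the
price of the shorter form.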

If you have to write the parser, you can do anything, but if you are
feeding an existing parser, you may think you have to live with this
format.
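
The second, non-XML parsing step is small if you do write it yourself.
A minimal sketch, assuming the abbreviated element and attribute names
suggested earlier in this message (Pt, XYZ, NRM) and using Python's
standard xml.etree.ElementTree module; parse_point() is a name invented
for the example:

```python
import xml.etree.ElementTree as ET

def parse_point(xml_text):
    """Parse one <Pt> element into two tuples of floats."""
    elem = ET.fromstring(xml_text)          # the XML parser's job
    # The data parser's job: split each attribute value on whitespace.
    xyz = tuple(float(v) for v in elem.get("XYZ").split())
    normal = tuple(float(v) for v in elem.get("NRM").split())
    return xyz, normal

xyz, normal = parse_point(
    '<Pt XYZ="324.1241 121.1214 -12.4521" NRM="0.0 0.0 1.0"/>'
)
print(xyz, normal)
```

The XML parser only delivers each attribute as one string; splitting
it into numbers is the extra step that the packed-attribute format
costs you.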

It all depends on your needs, of course.  I would suggest that 22%
overhead is acceptable.

>One possible solution is to compress the resulting files.
>However, this is a very undesirable option in my case for
>a variety of reasons.  Primarily compatibility and dependency
>issues.
>
>Another possible solution is to use a single element to bound
>all points, using some sort of delimiters to separate records,
>such as:
><Point TEMPLATE="XYZ,NORMAL,DT" DELIM="|">
>324.1241 121.1241 -12.4521, 0.0 0.0 1.0, 0.707 0.707 0.0|
>This works to some extent, though it seems a pretty drastic
>break from standard XML data schemes.  Unfortunately, to
>handle this nicely a customized parser is needed.
>Essentially, this method is providing a template or
>macro encoded in an element attribute that defines the
>contents of the element.
>
>Thus, if anyone could provide opinions / comments on the
>following issues, I'd be greatly appreciative:
>
>Is there a good way of representing bulk data embedded in
>an XML file, without relying on external compression for
>efficiency?
>
>Is the concept of using structured element contents
>a viable method in this case?
>

Regards,

Tom Passin

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1