Representing Large Tabular Data Blocks

Fri Nov 19 23:26:57 GMT 1999

> I am converting a number of existing proprietary file formats
> to a general XML format.  The nature of the data I am working
> with is very large, very well formatted blocks of data.

Is it formatted well enough that you can write a scrap of code to retrieve element N at 
random?

> The natural XML solution would be of course to embed each
> data value within an element or element attribute.  Such as:
> <Point XYZ="324.1241 121.1214 -12.4521" NORMAL="0.0 0.0 1.0"/>

Actually, I'd see that more intuitively as:
<point x="324.1241" y="121.1214" z="-12.4521" nx="0.0" ny="0.0" nz="1.0"/>

More granularity, y'know?

> However, when you replicate this point element a hundred 
> thousand times or so, you get an enormous increase in file
> size.  Thus raising the question of XML efficiency.  

The XML spec says as much in its design goals: terseness in XML markup is of minimal 
importance.  Its selling points are clarity and interchangeability.

One thing I could suggest: if your NORMAL value is the default that it appears to be, you 
could declare that default in the attribute declaration in your DTD, so that points with the 
default normal can leave it out.  I'd really have to see a block of elements to draw a solid 
conclusion there, though.
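
To make the default-value idea concrete, here is a minimal sketch in Python using the standard library's ElementTree (which sits on top of expat).  The document, the element name "points", and the attribute names are hypothetical, borrowed from my example above; the defaults are declared in an internal DTD subset, and the sketch assumes your parser reads that subset and fills in defaulted attributes, as expat does.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal DTD subset declares defaults for
# nx, ny and nz, so points with the default normal can omit them.
DOC = """<?xml version="1.0"?>
<!DOCTYPE points [
  <!ELEMENT points (point*)>
  <!ELEMENT point EMPTY>
  <!ATTLIST point
    x  CDATA #REQUIRED
    y  CDATA #REQUIRED
    z  CDATA #REQUIRED
    nx CDATA "0.0"
    ny CDATA "0.0"
    nz CDATA "1.0">
]>
<points>
  <point x="324.1241" y="121.1214" z="-12.4521"/>
  <point x="1.0" y="2.0" z="3.0" nx="1.0" ny="0.0" nz="0.0"/>
</points>
"""

root = ET.fromstring(DOC)

# Collect each point's attributes as the parser reports them,
# including any values supplied by the DTD defaults.
points = [dict(p.attrib) for p in root.findall("point")]
for attrs in points:
    print(attrs)
```

The first point never spells out its normal, yet the parser should hand back nx, ny and nz for it.  How much that saves depends entirely on how many of your points really do share the default.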

> Another possible solution is to use a single element to bound
> all points, using some sort of delimiters to separate records,
> such as:
> <Point TEMPLATE="XYZ,NORMAL,DT" DELIM="|">
> 324.1241 121.1241 -12.4521, 0.0 0.0 1.0, 0.707 0.707 0.0|

First up, defining your own delimiter character runs against the grain of XML; elements 
are supposed to be containers and/or atoms which delimit themselves.  That is the whole 
point of the "tag" structure: a tag is structural, so it delimits and/or contains the data.

> Is there a good way of representing bulk data embedded in
> an XML file, without relying on external compression for
> efficiency?

Not that I am aware of; XML was not designed for code brevity.

> Is the concept of using structured element contents 
> a viable method in this case?

You mean like your second example?  No, I wouldn't say so.

One thing I do note is that you seem to have two or three sub-elements with identical 
structure; each is a triplet of values that can be read as X, Y, and Z.  Hence, it may 
make more sense to use one element type to represent each triplet, with an attribute 
saying what the triplet means, and a wrapper element to group the triplets into a point.  
For instance:

<point>
<coords x="324.1241" y="121.1241" z="-12.4521" type="xyz"/>
<coords x="0.0" y="0.0" z="1.0" type="normal"/>
<coords x="0.707" y="0.707" z="0.0" type="dt"/>
</point>
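
If you end up generating that structure from your existing records, something along these lines might do.  It is only a sketch in Python with the standard library's ElementTree; the function name point_element and the (type, triplet) input shape are my own assumptions, not anything your format dictates.

```python
import xml.etree.ElementTree as ET

def point_element(triplets):
    """Build a <point> with one <coords> child per (type, (x, y, z)) pair."""
    point = ET.Element("point")
    for kind, (x, y, z) in triplets:
        # Values are kept as strings so the original precision is untouched.
        ET.SubElement(point, "coords", x=str(x), y=str(y), z=str(z), type=kind)
    return point

# One record, using the values from the example above.
record = [
    ("xyz", ("324.1241", "121.1241", "-12.4521")),
    ("normal", ("0.0", "0.0", "1.0")),
    ("dt", ("0.707", "0.707", "0.0")),
]

elem = point_element(record)
xml_text = ET.tostring(elem, encoding="unicode")
print(xml_text)
```

Adding a fourth kind of triplet later then means one more entry in the record, not a new attribute scheme.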

Another thing to consider: if you have the same original data file that you can pass 
around to other systems, and if you can write a platform-independent scrap of code (you 
know, the one I mentioned earlier?) to extract and parse a given element from that data 
file, you may be able to use that code as the low-level interface between the data and a 
virtual XML document.  Instead of reading an actual static document, the agent 
requests that the interface give it element N in XML format.  The interface scans the 
data file for the Nth line, reads it, internally converts it into a format like one of those 
above, and spits that back to the agent as a response to the request.  Apply a cache 
system, and this could work pretty well.

Since you'd still be using that original data format at the core (or an optimized 
conversion of that format), you shouldn't see any footprint growth outside that taken up 
by the interface and the agent module.  If you're looking at a one-time lump conversion 
from the original format to another for your future use, you can have the interface handle 
the XML and use the conversion to make the new data format something your interface 
can handle more easily (and quickly).

At that point, you can either store the core file and the interface in a central location 
(having the interface function as the ultimate administrator) or distribute copies and find 
a way to reconcile/distribute changes regularly.  I'd advise the former.  ;)
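
For what it's worth, here is roughly what I mean by the interface, as a small Python sketch.  Everything in it is assumed for illustration: the class name PointSource, a data file with one whitespace-separated record per line, and the column order x y z nx ny nz.  It also indexes line offsets up front instead of scanning for the Nth line on every request, which is one possible optimization, and it uses functools.lru_cache as the cache.

```python
import functools
import os
import tempfile
from xml.sax.saxutils import quoteattr

class PointSource:
    """Hypothetical interface: serves record N of a flat file as XML."""

    FIELDS = ("x", "y", "z", "nx", "ny", "nz")  # assumed column order

    def __init__(self, path, cache_size=1024):
        self.path = path
        # Record the byte offset of every line once, so that record N
        # can later be reached with a single seek.
        self.offsets = []
        with open(path, "rb") as f:
            pos = 0
            for line in f:
                self.offsets.append(pos)
                pos += len(line)
        # Per-instance cache of already converted elements.
        self.element = functools.lru_cache(maxsize=cache_size)(self._element)

    def _element(self, n):
        # Read only the Nth line and convert it to an XML element string.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[n])
            values = f.readline().decode("ascii").split()
        attrs = " ".join(
            f"{name}={quoteattr(value)}"
            for name, value in zip(self.FIELDS, values)
        )
        return f"<point {attrs}/>"

# Demo with a throwaway two-record data file.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("324.1241 121.1214 -12.4521 0.0 0.0 1.0\n")
    tmp.write("1.0 2.0 3.0 0.707 0.707 0.0\n")

source = PointSource(tmp.name)
second = source.element(1)   # the agent asks for element N = 1
print(second)
os.unlink(tmp.name)
```

The agent only ever sees the string that comes back from element(n); whether a full XML document exists behind it is the interface's business.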

In other words, you'd just be moving a key bit of the logic.  Instead of reading the full 
XML file and extracting an element, you move to the next higher level of abstraction 
(sorry, but I just had to work that in) and tell the extraction call to ask the interface for 
that element instead of doing the grunt work of pulling the element itself.  Since the 
interface is delivering one element at a time, it doesn't need to keep the data around in 
full-fledged XML format - it only needs to deliver the data in that format, and the 
extraction call need never know that there's NOT a full XML document.  Odds are, 
you'd wind up not only saving disk space, but processing time and disk wear as well.

 Rev. Robert L. Hood  | http://rev-bob.gotc.com/
  Get Off The Cross!  | http://www.gotc.com/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1