Representing Large Tabular Data Blocks
rev-bob at gotc.com
Fri Nov 19 23:26:57 GMT 1999
> I am converting a number of existing proprietary file formats
> to a general XML format. The nature of the data I am working
> with is very large, very well formatted blocks of data.
Is it regular enough that you can write a scrap of code to retrieve element N by random
access?
> The natural XML solution would be of course to embed each
> data value within an element or element attribute. Such as:
> <Point XYZ="324.1241 121.1214 -12.4521" NORMAL="0.0 0.0 1.0"/>
Actually, I'd see that more intuitively as:
<point x="324.1241" y="121.1214" z="-12.4521" nx="0.0" ny="0.0" nz="1.0"/>
More granularity, y'know?
> However, when you replicate this point element a hundred
> thousand times or so, you get an enormous increase in file
> size. Thus raising the question of XML efficiency.
I seem to remember the XML spec saying as much: its design goals state that terseness in
XML markup is of minimal importance. Its selling points are clarity and interchangeability.
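To put a rough number on that growth, here is a minimal sketch that compares one record in three encodings. The `raw` line is an assumption (the proprietary format isn't shown in this thread, so a plain space-separated text line stands in for it), and `overhead` and `projected_size` are hypothetical helper names.

```python
# One record in three encodings. `raw` is an assumed stand-in for the
# proprietary format; the two XML forms are the ones quoted in this thread.
raw = "324.1241 121.1214 -12.4521 0.0 0.0 1.0\n"
compact = '<Point XYZ="324.1241 121.1214 -12.4521" NORMAL="0.0 0.0 1.0"/>\n'
granular = '<point x="324.1241" y="121.1214" z="-12.4521" nx="0.0" ny="0.0" nz="1.0"/>\n'


def overhead(encoded, baseline=raw):
    """Return the size of `encoded` relative to `baseline`, in bytes."""
    return len(encoded.encode("utf-8")) / len(baseline.encode("utf-8"))


def projected_size(encoded, count=100_000):
    """Return the byte count for `count` copies of one encoded record."""
    return len(encoded.encode("utf-8")) * count


if __name__ == "__main__":
    for name, text in (("raw", raw), ("compact", compact), ("granular", granular)):
        print(f"{name:9s} x{overhead(text):.2f}  {projected_size(text)} bytes per 100,000 points")
```

The ratio depends entirely on what the real source format looks like; against a packed binary format it would be considerably worse than against a text line.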
One thing I could suggest: if your NORMAL value is the default that it appears to be, you
could declare that default in the DTD's attribute declaration for the element and omit it
wherever it applies. I'd really have to see a block of elements to draw a solid
conclusion there, though.
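A minimal sketch of the default-value idea, assuming the nx/ny/nz attribute names from the granular form above, a made-up `<points>` wrapper element, and an internal DTD subset. It uses Python's standard `xml.parsers.expat` module, which reports attributes defaulted by an ATTLIST declaration unless told otherwise; `parse_points` is a hypothetical helper.

```python
import xml.parsers.expat

# Assumed document: the DTD declares (0, 0, 1) as the default normal,
# so points with that normal can leave the three attributes out.
DOC = """<?xml version="1.0"?>
<!DOCTYPE points [
  <!ELEMENT points (point*)>
  <!ELEMENT point EMPTY>
  <!ATTLIST point
      x  CDATA #REQUIRED
      y  CDATA #REQUIRED
      z  CDATA #REQUIRED
      nx CDATA "0.0"
      ny CDATA "0.0"
      nz CDATA "1.0">
]>
<points>
  <point x="324.1241" y="121.1214" z="-12.4521"/>
  <point x="1.0" y="2.0" z="3.0" nx="1.0" nz="0.0"/>
</points>"""


def parse_points(text):
    """Return one dict of floats per <point>, with DTD defaults filled in."""
    points = []

    def start(name, attrs):
        if name == "point":
            points.append({key: float(value) for key, value in attrs.items()})

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(text, True)
    return points


if __name__ == "__main__":
    for point in parse_points(DOC):
        print(point)
```

This only pays off if most normals really do share one value, and a consumer that ignores the DTD will not see the defaults.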
> Another possible solution is to use a single element to bound
> all points, using some sort of delimiters to separate records,
> such as:
> <Point TEMPLATE="XYZ,NORMAL,DT" DELIM="|">
> 324.1241 121.1241 -12.4521, 0.0 0.0 1.0, 0.707 0.707 0.0|
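First up, the whole concept of defining a delimiter character runs against the grain of
XML; elements are supposed to be containers and/or atoms which delimit themselves.
That's the point of the whole "tag" structure: a tag is structural, so it delimits and/or
contains data. Put your own delimiters inside an element and a generic XML parser sees
one opaque string, which every consumer then has to pick apart with custom code.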
> Is there a good way of representing bulk data embedded in
> an XML file, without relying on external compression for
> efficiency?
Not that I am aware of; XML was not designed for code brevity.
> Is the concept of using structured element contents
> a viable method in this case?
You mean like your second example? No, I wouldn't say so.
One thing I do note is that you seem to have two or three sub-elements with identical
structure: each is a triplet of X, Y, and Z values. Hence, it may make more sense to use
one element type for each triplet, with an attribute naming the role it plays, and an
enclosing element to group the triplets that belong to one point. For instance:
<point>
  <coords x="324.1241" y="121.1241" z="-12.4521" type="xyz"/>
  <coords x="0.0" y="0.0" z="1.0" type="normal"/>
  <coords x="0.707" y="0.707" z="0.0" type="dt"/>
</point>
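If you already have records in the delimited form from your second example, turning them into this structure is mechanical. A minimal sketch, assuming comma-separated triplets with a trailing "|" as in your sample line; `record_to_point` and its `template` parameter are hypothetical names, built on Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def record_to_point(record, template=("xyz", "normal", "dt")):
    """Convert one delimited record into a <point> element of <coords> triplets."""
    point = ET.Element("point")
    # Drop surrounding whitespace and the record delimiter, then split triplets.
    groups = record.strip().rstrip("|").split(",")
    for kind, group in zip(template, groups):
        x, y, z = group.split()  # raises ValueError if a triplet is malformed
        ET.SubElement(point, "coords", x=x, y=y, z=z, type=kind)
    return point


if __name__ == "__main__":
    sample = "324.1241 121.1241 -12.4521, 0.0 0.0 1.0, 0.707 0.707 0.0|"
    print(ET.tostring(record_to_point(sample), encoding="unicode"))
```

The values stay as strings on purpose, so nothing is lost to floating-point round-tripping during conversion.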
Another thing to consider: if you can pass the original data file around to other systems,
and if you can write a platform-independent scrap of code (you know, the one I mentioned
earlier?) to extract and parse a given element from that file, you may be able to use that
code as the low-level interface between the data and a virtual XML document. Instead of
reading an actual static document, the agent asks the interface for element N in XML
format. The interface scans the data file for the Nth line, reads it, internally converts
it into a format like one of those above, and spits that back to the agent as the response.
Add a cache, and this could work pretty well.
Since you'd still be using the original data format at the core (or an optimized
conversion of it), you shouldn't see any footprint growth beyond what the interface and
the agent module take up. If you're looking at a one-time lump conversion from the
original format to another for your future use, you can have the interface handle the XML
and use the conversion to make the new data format something your interface can handle
more easily (and quickly).
At that point, you can either store the core file and the interface in a central location
(with the interface functioning as the ultimate administrator) or distribute copies and
find a way to reconcile and distribute changes regularly. I'd advise the former. ;)
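Here is a minimal sketch of such an interface, under loud assumptions: the core file is taken to be plain text with one record per line and six whitespace-separated numbers (x y z nx ny nz), which may not match your real format, and `PointSource` and its methods are hypothetical names. It indexes line offsets once so that element N costs a single seek, and caches recent answers with `functools.lru_cache`.

```python
import functools
import xml.etree.ElementTree as ET


class PointSource:
    """Hypothetical interface that serves one record at a time as XML."""

    def __init__(self, path, cache_size=1024):
        self.path = path
        # Record the byte offset of every non-blank line once, so that
        # element N can later be fetched with a single seek.
        self.offsets = []
        with open(path, "rb") as handle:
            position = 0
            for line in handle:
                if line.strip():
                    self.offsets.append(position)
                position += len(line)
        # Per-instance cache of the most recently requested elements.
        self.element = functools.lru_cache(maxsize=cache_size)(self._element)

    def __len__(self):
        return len(self.offsets)

    def _element(self, n):
        """Read record n from the core file and return it as an XML string."""
        with open(self.path, "rb") as handle:
            handle.seek(self.offsets[n])
            fields = handle.readline().decode("ascii").split()
        x, y, z, nx, ny, nz = fields  # assumed layout: x y z nx ny nz
        point = ET.Element("point", x=x, y=y, z=z, nx=nx, ny=ny, nz=nz)
        return ET.tostring(point, encoding="unicode")
```

An agent would call something like `PointSource("points.dat").element(41)` (the file name is made up) and get back a single `<point .../>` string, never knowing that no full XML document exists on disk.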
In other words, you'd just be moving a key bit of the logic. Instead of reading the full
XML file and extracting an element, you move to the next higher level of abstraction
(sorry, but I just had to work that in) and tell the extraction call to ask the interface for
that element instead of doing the grunt work of pulling the element itself. Since the
interface is delivering one element at a time, it doesn't need to keep the data around in
full-fledged XML format - it only needs to deliver the data in that format, and the
extraction call need never know that there's NOT a full XML document. Odds are,
you'd wind up not only saving disk space, but processing time and disk wear as well.
Rev. Robert L. Hood | http://rev-bob.gotc.com/
Get Off The Cross! | http://www.gotc.com/