XML as a programming tool

Mon Dec 22 15:05:13 GMT 1997

Hi,

I'd like to give some examples of the use of SGML/XML in software
development (sorry, I have never done any publishing with SGML/XML and
have used it for software development only).

Force:  flexible and adaptive software needs meta-information:

This kind of software tends to remove definitions from source code. They
are put into meta-layers, repositories or - more likely because of missing
software infrastructures  - into simple configuration files. These files
generate a big mess pretty soon: They are changed and code breaks. The
overall structure is more than unclear. Parameter definitions in
configuration files are complicated and have to be parsed by every client,
for example:

    Token = 15 somevalue 32 anothervalue

Team development gets very hard. What IS the authoritative structure and
content of configuration files?

The first approach is usually to come up with a class that maps ".ini"
style configuration files. Still, you would like to have more: tokens in
hierarchies, many tokens of the same name. Validation. And you would like
to split the information into
smaller, separate parts so you can avoid copying them. That means you want
entity management. Does all of this have to be programmed by your team?

Solution:

It takes about 2-3 weeks to integrate e.g. the SP parser/entity manager kit
into a framework. Most of the work goes into wrapping SP native classes
from the parser API into appropriate framework classes and interfaces (this
should get much better with a standard parser API). If you have a generic
composite object machine built in, just map the parser events into your
tree classes (nodes) and you have a representation of the configuration
information in memory.
The next step is to add some wrappers for convenience, e.g. a tiny query
interface (findElementByName() etc.) so your clients can avoid hardcoding
element lookups, and a value class with some conversion functions.
During boot your framework pulls in the configuration documents, the
parser validates the content and hands it off to "PartBuilder" instances
that instantiate the proper objects, and off you go.
The entity manager in the SP toolkit even enables you to pull configuration
information from some server without the client even noticing.
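
A minimal sketch of the tree, query wrapper and value class described
above. It uses Python's standard ElementTree parser as a stand-in for SP
(it does not validate against a DTD), and the element names, ConfigTree and
Value are hypothetical names chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration document; all element names are made up.
CONFIG_XML = """
<config>
  <server name="primary">
    <port>8080</port>
    <timeout unit="s">2.5</timeout>
  </server>
  <server name="backup">
    <port>8081</port>
  </server>
</config>
"""


class Value:
    """Wraps an element's text and offers a few conversion functions."""

    def __init__(self, text):
        self.text = (text or "").strip()

    def as_int(self):
        return int(self.text)

    def as_float(self):
        return float(self.text)


class ConfigTree:
    """Tiny query layer so clients do not hardcode tree navigation."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def find_elements_by_name(self, name):
        # iter() walks the whole tree in document order.
        return list(self.root.iter(name))

    def value_of(self, element, child_name):
        child = element.find(child_name)
        return None if child is None else Value(child.text)


config = ConfigTree(CONFIG_XML)
servers = config.find_elements_by_name("server")
ports = [config.value_of(s, "port").as_int() for s in servers]
```

Hierarchies and repeated tokens of the same name come for free with the
tree; a missing optional element shows up as None instead of a parse error.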

This way you end up with well-defined configuration information that can
still be highly specialized per customer. Its real power shows when you
imagine having hundreds or thousands of installations at customer sites
(some even configured there by service teams) and you want to ship a new
version of your software. Can you integrate the existing information during
installation? Across releases and possibly extensive modifications? If you
used a copy/paste/change approach to create new customer configuration
information it's now time to look for a new job...

Force: Use dynamic information safely.

Statically typed languages sometimes force developers to use untyped
information to avoid changes in interfaces. Examples are:
getValueByName(String name)  etc. In effect one is working around the
static type system.

Solution:

Semantic data streams or the composite message pattern are easily
implemented using the basic tree model from above. You can transfer whole
trees or just parts. The factory that generates these types (they ARE
types because there is a DTD for them) makes sure that they are created
properly. Due to their
self-describing nature the structures can change without breaking existing
clients. Applications for this are externalization, serialization, event
and object bus systems.
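
To illustrate the "change without breaking existing clients" point, here
is a small sketch. The message layout, build_order_message and read_amount
are hypothetical; a real system would additionally validate the message
against its DTD, which ElementTree does not do.

```python
import xml.etree.ElementTree as ET


def build_order_message(order_id, amount, extras=None):
    """Factory: the only place where messages of this type are created."""
    msg = ET.Element("order", {"id": str(order_id)})
    ET.SubElement(msg, "amount").text = str(amount)
    # Newer senders may add further elements without touching old ones.
    for name, value in (extras or {}).items():
        ET.SubElement(msg, name).text = str(value)
    return ET.tostring(msg, encoding="unicode")


def read_amount(wire_text):
    """Old client: looks up only the element it knows about by name."""
    msg = ET.fromstring(wire_text)
    return float(msg.findtext("amount"))


old_style = build_order_message(1, 10.5)
new_style = build_order_message(2, 20.0, {"currency": "EUR"})
```

The old client reads both messages; it simply never asks for the element
that was added later.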

Force: Error messages must be language independent and unique.

Solution:

Describe your message catalogs in SGML/XML. Use the ID mechanism to name
the programmatic tokens that show up in source code. The parser is going
to tell you if somebody used the same token twice. The same applies if you
need a poor man's implementation repository with some trader
functionality, e.g. to automatically load classes in factories where the
client tells you what interface it wants and gives some hints about the
properties the object should have. You could map these properties via
introspection directly to beans, but every once in a while an indirection
is necessary, e.g. if you bought some beans whose properties have to be
mapped to your system's language.
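
A sketch of such a message catalog. A validating parser would report the
duplicate ID by itself; Python's ElementTree does not validate, so the
uniqueness check is written out by hand here as an assumption-laden
stand-in. The catalog layout and the token names are invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical catalog: one programmatic token, several languages.
CATALOG_XML = """
<catalog>
  <message id="E_DISK_FULL">
    <text lang="en">Disk is full</text>
    <text lang="de">Platte ist voll</text>
  </message>
  <message id="E_NO_ROUTE">
    <text lang="en">No route to host</text>
  </message>
  <message id="E_DISK_FULL">
    <text lang="en">Accidental second definition</text>
  </message>
</catalog>
"""


def duplicate_ids(xml_text):
    """Return all message ids that are defined more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(m.get("id") for m in root.iter("message"))
    return sorted(i for i, n in counts.items() if n > 1)


def lookup(xml_text, message_id, lang):
    """Return the text for a token in one language, or None."""
    root = ET.fromstring(xml_text)
    for message in root.iter("message"):
        if message.get("id") == message_id:
            for text in message.findall("text"):
                if text.get("lang") == lang:
                    return text.text
    return None
```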

Force: avoid copying of information in your system.

Many systems duplicate a lot of information in various components or
layers. Let's say there is a customer type in the analysis model. This
usually turns into a customer database table schema, a GUI resource
description of a customer view and some representation of the customer in
the "model" part, which is a C++ header or a Java class. Most of this
information is just a duplicate.

Solution:

Use SGML/XML descriptions for all these aspects and reference customer
information from one place. Write generic modules that read this
information at runtime.
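
A sketch of the idea: one hypothetical description of the customer type,
read at runtime by three small generic functions that derive the table
schema, the view labels and an empty model record. The attribute names
(kind, label) and the type mapping are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Single, hypothetical description of the customer type.
CUSTOMER_XML = """
<type name="customer">
  <field name="id" kind="integer" label="Customer no."/>
  <field name="name" kind="string" label="Name"/>
</type>
"""

# Assumed mapping from description kinds to SQL column types.
SQL_TYPES = {"integer": "INTEGER", "string": "VARCHAR(255)"}


def table_ddl(type_elem):
    """Database layer: derive the table schema from the description."""
    columns = ", ".join(
        f'{f.get("name")} {SQL_TYPES[f.get("kind")]}'
        for f in type_elem.findall("field")
    )
    return f'CREATE TABLE {type_elem.get("name")} ({columns})'


def form_labels(type_elem):
    """GUI layer: derive the labels of the customer view."""
    return [f.get("label") for f in type_elem.findall("field")]


def empty_record(type_elem):
    """Model layer: derive an empty record with one slot per field."""
    return {f.get("name"): None for f in type_elem.findall("field")}


customer = ET.fromstring(CUSTOMER_XML)
```

Adding a field to the description changes all three layers at once; nothing
has to be copied.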

Force:   Share information without coupling objects tightly.
Let's say you are doing some workflow. The workflow objects are part of a
tree (built from SGML/XML information) and child and parent nodes can
communicate with each other, using some fixed interfaces and some dynamic
ones (semantic data streams using DOM). But every once in a while some
information is created in a node that is useful for some other node that is
NOT directly connected to the first node. How can this node get the
information without linking both nodes?

Solution: Turn some information tree into a blackboard.
Create some SGML/XML instance that models the structure of the information
you want to share. The elements can be empty (there IS use for markup
without content(:-)). Load this tree into memory. Make your nodes also
implement an observer type interface. Now clients can do lookups and if
nothing is there yet, they can register for change. This has three
advantages:
- Publisher and subscriber are NOT directly coupled and can change any way
they want without affecting each other.
- There is no need to do sequential processing. The workflow tree will
settle into a correct state, but the path it takes is undetermined and
decoupled from the descriptive workflow logic. (This makes some people with
a strong procedural background a bit nervous.)
- The blackboard is highly structured and not chaotic. Debug routines can
print human-readable snapshots.
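
The blackboard can be sketched roughly as follows. The Blackboard class,
its method names and the slot names are hypothetical; the observer
interface is reduced to plain callbacks to keep the example short.

```python
import xml.etree.ElementTree as ET

# Instance with empty elements: it only models the structure to be shared.
BLACKBOARD_XML = "<workflow><creditCheck/><shippingAddress/></workflow>"


class Blackboard:
    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)
        self.observers = {}  # slot name -> list of callbacks

    def _slot(self, name):
        slot = self.root.find(name)
        if slot is None:
            raise KeyError(name)  # only slots from the instance exist
        return slot

    def lookup(self, name):
        return self._slot(name).text  # None while the slot is still empty

    def register(self, name, callback):
        self._slot(name)  # reject unknown slots early
        self.observers.setdefault(name, []).append(callback)

    def publish(self, name, value):
        self._slot(name).text = value
        for callback in self.observers.get(name, []):
            callback(value)

    def snapshot(self):
        """Human readable dump for debugging."""
        return ET.tostring(self.root, encoding="unicode")


board = Blackboard(BLACKBOARD_XML)
received = []

# Subscriber: nothing there yet, so register for change.
if board.lookup("creditCheck") is None:
    board.register("creditCheck", received.append)

# Publisher: some unrelated node fills in the slot later.
board.publish("creditCheck", "approved")
```

Neither side holds a reference to the other; both only know the slot name.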

Force: process error, trace and debug information automatically.
I guess everybody has seen that huge and unstructured mess created by
error, trace or debugging messages. In mission critical applications agents
are supposed to react to those kinds of messages.

Solution:

Write error, trace or debug information in SGML/XML. This can be well
formed information only. Don't allow anybody to write unstructured
information anywhere. They have to go to a factory, get a special type of
SGML/XML node and fill it in. Now it is easy for agents to find critical
information. To get to the information they let the output go through the
parser and use an SGMLApp implementation that does not build a tree but
processes the parser events on the fly. (Assuming that in this case the
information need not be represented as a tree).
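
As an illustration of processing parser events on the fly, here is a
sketch that uses Python's xml.sax event interface in place of SP's SGMLApp
(an assumption; the callbacks differ in detail). The log format and the
CriticalErrorAgent class are made up for the example.

```python
import xml.sax

# Hypothetical well-formed output stream of an application.
LOG_XML = """
<log>
  <trace component="db">opened connection</trace>
  <error component="db" severity="critical">connection lost</error>
  <debug component="ui">repaint</debug>
  <error component="ui" severity="minor">font missing</error>
</log>
"""


class CriticalErrorAgent(xml.sax.ContentHandler):
    """Reacts to critical errors only; no tree is ever built."""

    def __init__(self):
        super().__init__()
        self.alerts = []
        self._component = None
        self._buffer = None  # collects text while inside a critical error

    def startElement(self, name, attrs):
        if name == "error" and attrs.get("severity") == "critical":
            self._component = attrs.get("component")
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "error" and self._buffer is not None:
            self.alerts.append((self._component, "".join(self._buffer)))
            self._buffer = None


agent = CriticalErrorAgent()
xml.sax.parseString(LOG_XML.strip().encode("utf-8"), agent)
```

The agent does not care in which order the other messages arrive or how
many of them there are.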

Using the same convenience wrappers from above, the agents are totally
independent of any structural changes in the output stream (caused by
different execution order etc.) and will continue working. (I have seen
desperate moves to process e.g. Unix kernel and boot messages via handcoded
applications...)

Force: Translate from one domain language into a different one.
I suspect that about 50% of work in business programming goes into format
transformations between different COTS or other applications and databases.
One can view database schemas, interfaces and protocols as little domain
languages. Since SGML/XML information trees have enough descriptive power
to represent those, it is possible to build automatic translator
sub-frameworks for "data-schlepping".

Solution:

example: import server.
Frequently information in a new format has to be imported into a system
(e.g. DTA electronic commerce data, financial instruments data etc.).
Storage objects convert these formats into an SGML/XML representation. This
makes further processing independent of the physical data formats of both
the new source and the existing system. But it does not solve the
language problem itself: one format might call the customer "customer" and
the other one "BusinessPartner".

A translator framework provides wrappers that wrap the new information
tree (e.g. DTA info) into the internal language. Of course the mapping
process is driven by mapping information specified in SGML/XML. If more
than simple name mapping is necessary, the wrappers can be dynamically
configured with little action objects that can compute values etc. Of
course these are again configured using SGML/XML configurations.
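
A rough sketch of such a mapping-driven translator. The mapping format,
the element names and the cents_to_units action are all hypothetical; a
real framework would wrap the foreign tree instead of copying it, which is
skipped here for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping document: foreign names on the left, ours on the right.
MAPPING_XML = """
<mapping>
  <rename from="BusinessPartner" to="customer"/>
  <rename from="PartnerName" to="name"/>
  <rename from="Turnover" to="revenue" action="cents_to_units"/>
</mapping>
"""

# Little action objects for cases where renaming is not enough.
ACTIONS = {"cents_to_units": lambda text: str(int(text) / 100)}


def load_rules(mapping_text):
    rules = {}
    for rule in ET.fromstring(mapping_text).iter("rename"):
        rules[rule.get("from")] = (rule.get("to"), rule.get("action"))
    return rules


def translate(element, rules):
    """Copy a foreign tree into the internal language, node by node."""
    new_tag, action = rules.get(element.tag, (element.tag, None))
    result = ET.Element(new_tag, dict(element.attrib))
    text = element.text
    if action and text is not None:
        text = ACTIONS[action](text.strip())
    result.text = text
    result.tail = element.tail
    for child in element:
        result.append(translate(child, rules))
    return result


foreign = ET.fromstring(
    "<BusinessPartner><PartnerName>ACME</PartnerName>"
    "<Turnover>12550</Turnover></BusinessPartner>"
)
internal = translate(foreign, load_rules(MAPPING_XML))
```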

Force: Get information from OO-Analysis into the system
In every larger framework the gap between OO-Analysis and implementation is
huge. Direct mapping from an analysis class to code just leads to totally
inflexible systems. That's why e.g. Enterprise Java Beans treats
concurrency, persistence etc. as being "orthogonal" to an object's
implementation. This means that these are not implemented in the object
itself; they are provided by containers etc. The next thing that's going
to be pulled out of objects is business logic. (Our framework did this
already and used SGML/XML to describe the workflow.)

But what does this mean for the analysis information if it doesn't get
turned into code?

Solution:

Use analysis information to build up a meta-information layer. Use SGML/XML
to describe it. Now generic objects can interpret this information and
instantiate the necessary objects for processing. The meta-layer objects
are of course the same ones we used to implement the configuration
information, the trace facility etc.
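
How generic objects might interpret such a meta-layer can be sketched as
below. The workflow description, the Scale and Offset parts and the
PART_TYPES registry are invented for the example and only hint at what a
"PartBuilder" could do.

```python
import xml.etree.ElementTree as ET

# Hypothetical meta-information: which parts exist and how they are set up.
META_XML = """
<workflow>
  <step type="scale" factor="3"/>
  <step type="offset" amount="4"/>
</workflow>
"""


class Scale:
    def __init__(self, factor):
        self.factor = float(factor)

    def apply(self, value):
        return value * self.factor


class Offset:
    def __init__(self, amount):
        self.amount = float(amount)

    def apply(self, value):
        return value + self.amount


# Registry that maps names in the description to implementation classes.
PART_TYPES = {"scale": Scale, "offset": Offset}


def build_parts(meta_text):
    """Instantiate the proper objects from the description."""
    parts = []
    for step in ET.fromstring(meta_text).iter("step"):
        attrs = dict(step.attrib)
        part_class = PART_TYPES[attrs.pop("type")]
        parts.append(part_class(**attrs))
    return parts


def run(parts, value):
    for part in parts:
        value = part.apply(value)
    return value


result = run(build_parts(META_XML), 2)
```

Reordering or reconfiguring the steps means editing the description, not
the code.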

Conclusion:
For all these uses of SGML/XML basically the same software components were
used over and over again. And the really hard ones were written by James
Clark anyway (:-). This is reusable software and has the nice side effect
that after a while programmers get familiar with the interfaces and don't
have to learn new ones for new data formats all the time. I mean - what's
the difference between configuration information, external data formats,
blackboards etc? Just different DTDs.

But more important than reuse is the flexibility of software using SGML/XML
to represent meta-information. Bringing a system to a new release no
longer means: transform BLOB1 into BLOB2. It means transform XXX.dtd into
YYY.dtd - a defined and traceable process. Due to the self-describing
nature of SGML/XML information versioning becomes a defined and automated
process too. Different versions can be detected and automatic translators
can upgrade "legacy" objects. No longer do I have to have old classes in
the system for backward compatibility reasons only.
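
A small sketch of version detection with chained upgrade steps. It assumes
the version is carried in a "version" attribute on the root element (in
SGML/XML it could equally be derived from the document type), and the two
structural changes are invented for illustration.

```python
import xml.etree.ElementTree as ET


def upgrade_1_to_2(root):
    # Hypothetical change: <host> was renamed to <server>.
    for elem in list(root.iter("host")):
        elem.tag = "server"
    root.set("version", "2")


def upgrade_2_to_3(root):
    # Hypothetical change: <port> child became a port attribute.
    for server in list(root.iter("server")):
        port = server.find("port")
        if port is not None:
            server.set("port", port.text)
            server.remove(port)
    root.set("version", "3")


# One translator per known legacy version.
UPGRADES = {"1": upgrade_1_to_2, "2": upgrade_2_to_3}


def upgrade(xml_text):
    """Detect the version and apply translators until it is current."""
    root = ET.fromstring(xml_text)
    while root.get("version") in UPGRADES:
        UPGRADES[root.get("version")](root)
    return root


legacy = '<config version="1"><host><port>8080</port></host></config>'
current = upgrade(legacy)
```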

The bad news:

Past (bad) experience shows that the real problem with using SGML/XML in
software development is not a technical one. Using SGML/XML only makes
sense if everybody is willing to make information and assumptions
EXPLICIT so they can go into DTDs and instances. This seems to be a sore
point for many programmers who would rather see this hidden in code (just look
at the slow progress of pre/postcondition specification or semantic
interface definitions). And no, I don't have a solution for this one.

Merry Christmas and a Happy New Year,

Walter

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/