From Jon.Bosak at eng.Sun.COM Sat Aug 2 07:08:47 1997
From: Jon.Bosak at eng.Sun.COM (Jon Bosak)
Date: Mon Jun 7 16:58:11 2004
Subject: XML Dev Day schedule
Message-ID: <199708020506.WAA20598@boethius.eng.sun.com>

A fine assortment of technical presentations is in store for
participants in XML Developers Day (Le Centre Sheraton Hotel,
Montreal, Thursday, August 21). In fact, the "day" has had to be
extended into the evening to accommodate a wealth of reports from
early implementors of the new Web technology. This is going to be a
can't-miss event for anyone hoping to play a significant role in the
coming revolution.

Registration for XML Developers Day can be made through the page for
the 4th International HyTime Conference:

   http://www.gca.org/conf/hytime/hytime97.htm

Participants new to XML should note that in addition to the many
interesting presentations scheduled for the HyTime Conference (August
19-20), a tutorial on XML will be given on Monday, August 18 in the
same location.

Jon Bosak
Dev Day Chair

=========================================================
PRELIMINARY SCHEDULE: XML DEVELOPERS DAY, AUGUST 21, 1997
=========================================================

 9:00-9:05    Jon Bosak, Sun Microsystems
              Welcome

 9:05-9:30    David Megginson, Microstar
              Java Beans and Architectural Forms

 9:30-10:00   Lloyd Harding, Information Automation Assembly
              The Kona Proposal for Electronic Health Care Records

10:00-10:30   Henry Thompson, University of Edinburgh
              A Motivation for the Schema Component of XML-Data

10:30-11:00   ------------------------------ BREAK

11:00-11:30   Daniel Rivers-Moore, RivCom
              XML in the Delivery of Corporate Information

11:30-12:00   Patrick Gannon, CommerceNet
              XML in Component-based Commerce

12:00-1:30    ------------------------------ LUNCH

 1:30-2:00    John Tigue, Datachannel
              XAPI-J in Theory and Practice

 2:00-2:30    Jeffrey Olson, School of EECS, Washington State University
              Conceptual Knowledge Markup Language, an XML Application

 2:30-3:00    Henry Thompson, University of Edinburgh
              The Win95/NT Version of LT XML

 3:00-3:30    ------------------------------ BREAK

 3:30-4:00    Paul Trevithick, Bitstream
              Highly Designed Pages and Cross-Media Authoring with XML

 4:00-4:30    Sarah Slocombe, Apropos Toy & Tool Development
              A Java-based QuarkXpress-to-XML Converter

 4:30-5:00    David Slocombe and Rajiv Thanawala, Tata Infotech
              A Visual Recognition Approach to Legacy Document Conversion

 5:00-5:30    ------------------------------ BREAK

 5:30-6:00    Murray Maloney, Grif
              XML Editing: Well-formed Documents, CSS, and Namespaces

 6:00-6:30    Paul Grosso, ArborText
              Some Ideas for XML Editing Interfaces

 6:30-7:00    Jonathan Robie, POET Software
              An XML Document Component Database

 7:00-7:30    Jeff Eby, Chrystal Software
              XML and a Generic Repository Architecture

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@ic.ac.uk)

From neil at bradley.co.uk Sat Aug 2 10:38:48 1997
From: neil at bradley.co.uk (Neil Bradley)
Date: Mon Jun 7 16:58:11 2004
Subject: Specification Questions
Message-ID: <199708020838.JAA11135@andromeda.ndirect.co.uk>

Thanks for the feedback, it was very helpful. However, I STILL do not
understand the need for the brackets in the latter half of Mixed:

> > The second line of the rule for [50]Mixed is:
> >
> >   | '(' S? %( '#PCDATA' ) S? ')'
> >
> > I cannot understand the purpose of the inner brackets in this part
> > of the rule.
>
> I believe it is to allow parameter entity replacement at that spot:
>
>

I understand the explanation, but the first half of the same rule is
as follows:

   '(' S? %( %'#PCDATA' ( ..........

If %'#PCDATA' can appear here, why can't the second part of the rule
be similarly formulated:

   | '(' S? % '#PCDATA' S? ')'

Am I wrong in thinking this would allow a content of " ( %xyz; ) "?
> > There is also little written about interpretation of line-ending
> > codes. Although the standard states that white space and
> > line-ending codes are ignored in element content, nothing is said
> > regarding the age old problem of line-ending codes in mixed
> > content.
>
> The spec makes no special provision for whitespace at the beginning
> and end of elements. I believe that this is intended to be one of
> its simplifications over "regular" SGML. This seeming
> incompatibility is mitigated by an SGML TC which will allow XML
> to remain compatible with (post-TC) SGML.
>
> Paul Prescod

Is it up to the application to decide what to do with any leading line
ending code in these positions then?

I am pleased to be rid of the 'record' concept (using RS and RE)
defined for SGML, particularly as I have tended to use Mac and UNIX
systems which use a single character to end a line (albeit different
ones!). However, I still think there is too little information on the
effect of line ending codes in mixed content. Obviously the safe thing
to do is to make the content of all elements with a mixed content
model fit on a single line, as in:

This is a long paragraph.........................

But with large text blocks, created using text editors, people will continue to use line ending codes to make it readable on-screen. Normally, a break between words would be interpreted as a space when the block is paginated:

This is a long paragraph that is broken over two lines, with an implied space between 'two' and 'lines'.

Yet what happens when a comment or processing instruction appears on its own line?

This is a long paragraph that is broken over two lines, with an implied space between 'two' and 'lines'.

Is this interpreted as "two lines...", which reduces to "two lines"?

Neil.

-----------------------------------------------
Neil Bradley - Author of The Concise SGML Companion.
neil@bradley.co.uk
www.bradley.co.uk

From Peter at ursus.demon.co.uk Sat Aug 2 11:56:33 1997
From: Peter at ursus.demon.co.uk (Peter Murray-Rust)
Date: Mon Jun 7 16:58:12 2004
Subject: Specification Questions
Message-ID: <9075@ursus.demon.co.uk>

In message <199708020838.JAA11135@andromeda.ndirect.co.uk> "Neil Bradley" writes:
[...]
[Paul Prescod]
> > The spec makes no special provision for whitespace at the beginning
> > and end of elements. I believe that this is intended to be one of
> > its simplifications over "regular" SGML. This seeming
> > incompatibility is mitigated by an SGML TC which will allow XML
> > to remain compatible with (post-TC) SGML.

The spec is consistent over this, I think, and says that all characters
that are not markup should be passed to the application. This includes
whitespace.

My personal view is that without some central guidance at least, the
XML treatment of whitespace will cause problems and incompatibility for
two groups of people:
	- those who are familiar with SGML
	- those who are not familiar with SGML.

The first group are accustomed to SGML parsers (primarily James Clark's)
carrying out consistent operations on whitespace. This includes:
	- removing line-ends immediately after and before markup
	- translating markup into a small number of platform-independent
	  codes (e.g. ' ' and '\n').

The second group will be familiar with HTML where all whitespace is
normalised according to various rules of varying consistency between
useragents/browsers. Apart from characters within
<PRE> and related markup, all whitespace is
normalised to single spaces, and line-ends are inserted according to
the user-agent software, not the document's content. Treatment of 'special'
characters (e.g. &nbsp; and other escaped characters or entities) is
probably inconsistent.  However, in general, whitespace is not a current
concern of the second group.

***Both groups are in for a serious problem with XML unless there is some 
central guidance.  Otherwise we are at the mercy of any software implementor.
***


What whitespace characters can be passed to the application? Regardless of 
what is done with it, is CR+LF treated in the same way as LF or CR alone
in a document?  


If not, we shall appear to be in for variations according to what platforms 
the document is created on.  It will be no use telling people that this is 
what the spec says - I had always assumed that one of the attractions of
SGML was that it removed platform-dependent documents.  But reading 
XML-lang [2] suggests that CR and CR+LF produce different results.

The result of parsing, therefore, passes original whitespace to the 
application.  Thus:

two spaces

and

two spaces

are different documents. So are:

no line feeds

and

no line feeds

The first will confuse anyone accustomed to HTML only. The second will also confuse them, and in addition will confuse some current users of SGML. > > > > Paul Prescod > > Is it up to the application to decide what to do with any leading line > ending code in these positions then? > > I am pleased to be rid of the 'record' concept (using RS and RE) > defined for SGML, particularly as I have tended to use Mac and UNIX > systems which use a single character to end a line (albeit different > ones!). However, I still think there is too little information on the > effect of line ending codes in mixed content. Obviously the safe thing > to do is to make the content of all elements with a mixed content > model fit on a single line, as in: > >

This is a long paragraph.........................

> > But with large text blocks, created using text editors, people will > continue to use line ending codes to make it readable on-screen. > Normally, a break between words would be interpreted as a space when > the block is paginated: > >

This is a long paragraph that is broken over two > lines, with an implied space between 'two' and 'lines'.

Yes. Most people will want to work this way. Very long lines are a menace for many types of software. We must assume (and in many cases encourage) people will read and even edit XML documents with non-XML tools. > > Yet what happens when a comment or processing instruction > appears on its own line? > >

This is a long paragraph that is broken over two > > lines, with an implied space between 'two' and 'lines'.

> Is this interpreted as "two lines...", which reduces
> to "two lines"?

No. It reduces (I think) to:
"...two lines..."

If there is one single 'obvious' issue which will prevent the take-up
of XML by 'ordinary' people (like myself) it is whitespace. The present
position on whitespace is:
	- the rules are clear but not prescriptive
	- the rules are non-intuitive to most people
	- the rules allow many different ways of processing a given
	  document
	- the role of whitespace in a given document will depend on the
	  software used to process it

The philosophy of the XML-lang authors is consistently:
	- whitespace is a problem for the application, not the spec.
	- there is no generic way of treating whitespace

[I should make it clear that this issue has been debated at great
length, and that the present position is the considered opinion of many
experts. I accept it, although I think it will be difficult to work
with in practice.]

Without consistent treatment, a document author has to ask 'which
application is going to process my document?' It means, for example,
that the way that whitespace is treated in MathML may be different from
that in CML and FooML and ... It effectively destroys the possibility
of (sub)document re-use, without a generally agreed convention.

I know that XML-lang authors read this group and may therefore take
some of these points on board.

	P.

> Neil.
>
> -----------------------------------------------
> Neil Bradley - Author of The Concise SGML Companion.
> neil@bradley.co.uk
> www.bradley.co.uk

--
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

From tbray at textuality.com Sat Aug 2 21:03:46 1997
From: tbray at textuality.com (Tim Bray)
Date: Mon Jun 7 16:58:12 2004
Subject: Specification Questions
Message-ID: <3.0.32.19970802115011.0089c5e0@pop.intergate.bc.ca>

At 09:51 AM 02/08/97 GMT, Peter Murray-Rust wrote:
>
>What whitespace characters can be passed to the application? Regardless of
>what is done with it, is CR+LF treated in the same way as LF or CR alone
>in a document?
>

All bytes that are not markup are data, and passed to the application.

Yes, this will be surprising to people who are used to HTML. Too bad -
HTML's behavior is unacceptable for many classes of applications.

It would be surprising to those who understand the 8879 rules, but
experience shows that this group includes only about a dozen people,
and they disagree.

The rule given above has the virtue that it is short, simple, and
easily understood by everyone. We spent a lot of time on this, and
it's the only sane way to go.
-Tim

From john at datachannel.com Sun Aug 3 06:21:34 1997
From: john at datachannel.com (John Tigue)
Date: Mon Jun 7 16:58:12 2004
Subject: Xapi-J: an architectural detail (long)
Message-ID: <33E407E3.37404793@datachannel.com>

This note explains some of the internal implementation details of Xapi-J
compliant processors. If all you want to do is use an Xapi-J processor,
you do not need to concern yourself with these details. This note is
intended for people who are actually writing Xapi-J processors.

One of the nice features of Java is the clear distinction between
inheritance and interfaces. Xapi-J tries to leverage Java interfaces to
provide processor users a simple object model and processor implementors
wide latitude in regards to the processor internals.

Folks don't really have a problem grasping the XML object model:

To get a new XML processor object instance:
    xml.XMLProcessor xmler = new xml.XMLProcessor();

To have a processor read a document:
    xml.IDocument aDocument = xmler.readXML( someInfoSource );

To get the root of a document:
    xml.IElement anElement = aDocument.getRoot();

To get an element's attributes:
    java.util.Enumeration someAttributes = anElement.getAttributes();

That's easy to understand. It is a very simple object model which can be
mapped onto many of the current XML processors without requiring major
rewrites.

The greatest confusion centers around the mechanism used to hide the
specifics of the underlying implementation of the XML processor. This is
the only part of Xapi-J which actually involves real classes as opposed
to simply interfaces. Xapi-J does not contain an XML processor, it
simply says what one could look like. It is up to others to actually
supply the working code which is accessed through the Xapi-J interfaces.

One of the goals of Xapi-J is to create an architecture which (although
powerful and flexible) makes simple things simple. Navigating an XML
document object should be simple and as can be seen from the above code
fragments, it is. And getting an XML processor should be simple. That it
is. If a JVM comes with an installed XML processor, in the interests of
making things easy for developers, that processor should be used by
default. So a developer could simply do a "new xml.XMLProcessor();" and
expect that an XMLProcessor will be instantiated and usable. So if, say,
Microsoft wished to package their JVM with an XMLProcessor, they could
tweak the default constructor for xml.XMLProcessor to where it would
instantiate a com.ms.xml.Parser by default.

(I have tested that Xapi-J can be implemented on top of msxml. I have
the code on my hard drive. If anyone is interested drop me an email and
I'll give you the classes. With this interface adaptor, a developer
could write Java applets which save some download time by using the MS
parser which will be on the IE4 client and only downloading the light
weight adaptor. I would not suggest this. As I have mentioned in an
earlier posting I feel that the msxml object model is seriously flawed.
Correcting for it required some non-optimal efficiency code.)

A good architecture makes simple things simple but it doesn't limit a
developer. Say a developer wanted to use an XML processor which was
tweaked for parsing MathML documents (call it MathMLProcessor). Perhaps
a MathMLProcessor could only understand that particular XML application
but via this specialization was able to obtain greater performance than
a general purpose XML processor. It would be great if the developer
could specify that when a "new xml.XMLProcessor()" call occurs a
MathMLProcessor should be instantiated.

Xapi-J allows for this via the following method in the xml.XMLProcessor
class:

    public static synchronized void setIXMLProcessorFactory(
            IXMLProcessorFactory factorySettee ) throws XMLException

The method signature is that way because:

public: accessible from other packages
static: applies to the class in general not a particular instance
synchronized: thread-safe access to a static method is usually advisable
void: standard JavaBean accessor method signature design pattern is:
    TypeOfX getX() AND void setX( TypeOfX xToBe )
setIXMLProcessorFactory: this method sets the class's
    IXMLProcessorFactory
IXMLProcessorFactory: an Xapi-J interface for objects which can be asked
    to create objects which implement the interface IXMLProcessor.
    During "new xml.XMLProcessor()" the factory will be asked to
    instantiate an object which implements IXMLProcessor
factorySettee: The object which is to be assigned as the factory
throws XMLException: a general XML exception object; might be thrown
    if the factory had already been set (a security concern expressed
    in the regular JDK fashion)

So the developer could do the following:

    XMLProcessor.setIXMLProcessorFactory( new MathMLProcFactory() );
    XMLProcessor xmler = new XMLProcessor();

Here the developer using an Xapi-J compliant processor needs to do just
one special line of code (tell the XMLProcessor class that it should ask
the specified MathMLProcFactory object to create IXMLProcessor's). After
that all the implementation specific details of the MathMLProcessor are
hidden behind the Xapi-J interfaces i.e. just do a "new XMLProcessor()"
and access the document through Xapi-J interfaces.

This is possible because even though the class XMLProcessor is the only
real class in Xapi-J, it is essentially hollow. An XMLProcessor instance
is not really an XML processor. Xapi-J does not include an XML
processor, just the interface to one. All an XMLProcessor does is act as
a proxy to an object which implements IXMLProcessor.
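
The hollow-proxy arrangement described above can be tried out in a
self-contained form. What follows is only a sketch under stated
assumptions, not the Xapi-J source: readXML() returns a String here
rather than an IDocument to keep it short, the body of MathMLProcessor
is a made-up stub, the IllegalStateException for a missing factory is my
own addition, and ProxyDemo is a hypothetical driver class.

```java
// Sketch only: minimal stand-ins for the Xapi-J types named in this note.
class XMLException extends Exception {
    XMLException(String message) { super(message); }
}

interface IXMLProcessor {
    // Assumption: the real method returns an IDocument, not a String.
    String readXML(Object xmlSource) throws XMLException;
}

interface IXMLProcessorFactory {
    IXMLProcessor createIXMLProcessor();
}

// The "hollow" class: it owns no parsing logic and only forwards calls.
class XMLProcessor {
    private static IXMLProcessorFactory processorFactory;

    public static synchronized void setIXMLProcessorFactory(
            IXMLProcessorFactory factorySettee) throws XMLException {
        processorFactory = factorySettee;
    }

    private final IXMLProcessor implementation;

    public XMLProcessor() {
        IXMLProcessorFactory factory;
        synchronized (XMLProcessor.class) {
            factory = processorFactory;
        }
        if (factory == null) {
            // Assumption: the note does not say what happens with no factory.
            throw new IllegalStateException("no IXMLProcessorFactory set");
        }
        this.implementation = factory.createIXMLProcessor();
    }

    public String readXML(Object xmlSource) throws XMLException {
        return implementation.readXML(xmlSource);
    }
}

// Hypothetical specialised processor; a real one would parse the source.
class MathMLProcessor implements IXMLProcessor {
    public String readXML(Object xmlSource) {
        return "MathML document from " + xmlSource;
    }
}

class MathMLProcFactory implements IXMLProcessorFactory {
    public IXMLProcessor createIXMLProcessor() {
        return new MathMLProcessor();
    }
}

class ProxyDemo {
    public static void main(String[] args) throws XMLException {
        XMLProcessor.setIXMLProcessorFactory(new MathMLProcFactory());
        XMLProcessor xmler = new XMLProcessor();
        System.out.println(xmler.readXML("example.xml"));
    }
}
```

The calling code in ProxyDemo never names MathMLProcessor. Swapping in
a different processor would mean changing only the factory passed to
setIXMLProcessorFactory().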
The IXMLProcessor object is instantiated by the above mentioned factory.
So in the source code for the XMLProcessor class we see something like
the following code fragments:

    // Class static factory code:
    private static IXMLProcessorFactory processorFactory;

    public static synchronized void setIXMLProcessorFactory(
            IXMLProcessorFactory factorySettee ) throws XMLException
    {
        processorFactory = factorySettee;
    }

    // Instance constructor code:
    private IXMLProcessor implementation;

    public XMLProcessor ()
    {
        this.implementation = processorFactory.createIXMLProcessor();
    }

    // instance action code:
    public IDocument readXML( Object xmlSource ) throws XMLException
    {
        return implementation.readXML( xmlSource );
    }

The execution sequence looks like:
1. The factory is set via XMLProcessor.setIXMLProcessorFactory().
2. Later, a "new XMLProcessor()" happens.
3. In the constructor the factory is asked to return an IXMLProcessor.
4. The IXMLProcessor object is assigned to the field "implementation".
5. Later, a "readXML()" call happens.
6. In readXML(), the XMLProcessor object, acting as a proxy, passes
   the request onto its IXMLProcessor and then,
7. The XMLProcessor object returns whatever is returned to it from its
   IXMLProcessor. I.e. class IXMLProcessor is the real worker.

So the phrase "Xapi-J contains no XML processor" could more precisely be
stated as: Xapi-J does contain a class XMLProcessor but it does not
contain an implementation of the interface IXMLProcessor which is the
real worker/processor in the Xapi-J architecture.

The above is a convoluted dance but to the developer who is simply using
an Xapi-J compliant XML processor it looks really simple on the outside.
(For a very similar "design pattern" see java.net.Socket et al.) And
only one API has to be learned to work with any Xapi-J compliant
processor.

--
John Tigue
Sr. Software Architect
DataChannel
http://www.datachannel.com
jtigue@datachannel.com
206-462-1999

From Peter at ursus.demon.co.uk Sun Aug 3 15:58:14 1997
From: Peter at ursus.demon.co.uk (Peter Murray-Rust)
Date: Mon Jun 7 16:58:12 2004
Subject: Specification Questions
Message-ID: <9091@ursus.demon.co.uk>

In message <199708020838.JAA11135@andromeda.ndirect.co.uk> "Neil Bradley" writes:
[...]
>

This is a long paragraph that is broken over two > > lines, with an implied space between 'two' and 'lines'.

> Is this interpreted as "two lines...", which reduces
> to "two lines"?

Some additional - hopefully constructive - thoughts on whitespace.

The XML-lang spec does not (and I suspect will not) give detailed
guidance on how whitespace will be managed. My impression is that it is
up to implementers and/or groups like this to come up with particular
solutions. My worry is that these will be inconsistent and not
inter-operable.

*** Therefore I propose that those on XML-DEV who care about this
problem come up with some guidelines for implementers. ***

XML does NOT treat whitespace like SGML and does NOT behave like HTML
(although it can be configured to do so). As far as I see them, the
rules are:

'All characters that are not markup are passed to the application'.
(This is independent of any value of XML-SPACE (see below), processing
instructions, stylesheets, etc.) These characters include HT, CR, LF,
SP, and probably a number of other Unicode 'whitespace' characters.
What the application does with them is *undefined* in XML-lang.

Note that this means that CR and LF are passed as separate characters.
No normalisation takes place. Therefore

	Line one\n\rline two

is different from

	Line one\nline two

even if they are visually similar on various text editors/displays, etc.
(My impression was that SGML normalised these two strings to the same
ESIS output - is that right?). This means that the author/processor
'contract' has to be aware of this.

Note also that *all* line-ends are passed (even immediately before/after
markup) unlike SGML. Therefore:

	line one

and

	line one

are different. Note also that:

	baz

is different from

	baz

The latter contains two pseudo-elements which contain only whitespace
(line-end characters) and FOO therefore has three children.

[Note that to make documents readable, the following trick can be used:

	baz

since whitespace within the tag is ignored. I do not think newcomers
will adopt this easily, and I suspect it can lead to errors in document
editing.]

*** In some cases the document author and the application author are
both aware of this problem and so the whitespace characters inserted by
the author will be processed in the way that they expect. However, in
most cases I suspect this will NOT be true and that authors will
inadvertently create documents that are processed differently ***

XML provides an attribute XML-SPACE (local to an element BUT inherited
by its children) which can have three values:
	- #IMPLIED (no signals about whitespace handling)
	- PRESERVE (applications preserve all the whitespace)
	- DEFAULT (the *application's* default white-space processing
	  modes are acceptable for this element).

PRESERVE seems clear. All whitespace is passed to the application. The
others seem to be dangerous unless there are some general conventions.

[Note also that XML parsers or processors have to ensure that children
inherit the XML-SPACE attributes of their parents. Where does this get
done? In the parser? (It's part of XML-lang), in the processor - in
which case there is ample scope for inconsistent treatment...
Inheritance is already required in two places - XML-SPACE and
XML-ATTRIBUTES (XML-link). This is a generic mechanism and presumably
should be implemented in some package independently of the application.
Comments?]

If possible, we should propose a *general* default mechanism for
whitespace handling for XML-SPACE="DEFAULT". If everyone adopts this, it
will greatly reduce this problem. Is this a reasonable strategy?

If so, we can propose that the DEFAULT mode for any whitespace
processing is something along these lines (similar to HTML?).

Within an element with XML-SPACE="DEFAULT":
	- All whitespace sequences are mapped into a single space
	  character.
	- All whitespace pseudo-elements are ignored (i.e. whitespace
	  between markup).
	- All leading and trailing whitespace in #PCDATA is ignored.
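
To make the three proposed rules concrete, here is a minimal sketch of
how an application might apply them to one chunk of #PCDATA. It is an
illustration only: the class and method names (WhitespaceFolding,
foldPcdata, isIgnorable) are hypothetical, and the whitespace set is
assumed to be just HT, LF, CR and SP, leaving aside the other Unicode
'whitespace' characters mentioned earlier.

```java
// Sketch of the proposed XML-SPACE="DEFAULT" folding for one #PCDATA chunk.
class WhitespaceFolding {

    // Assumption: only HT, LF, CR and SP count as whitespace here.
    static boolean isWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    // Rules 1 and 3: each whitespace run becomes one space, and leading
    // and trailing whitespace is dropped.
    static String foldPcdata(String raw) {
        StringBuilder out = new StringBuilder();
        boolean pendingSpace = false;
        for (char c : raw.toCharArray()) {
            if (isWhitespace(c)) {
                // Remember a run only once some text has been emitted,
                // so leading whitespace never produces a space.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.append(c);
            }
        }
        // A run still pending here was trailing whitespace and is dropped.
        return out.toString();
    }

    // Rule 2: a chunk holding nothing but whitespace (a whitespace
    // pseudo-element between markup) is ignored by the application.
    static boolean isIgnorable(String raw) {
        return foldPcdata(raw).isEmpty();
    }
}
```

With these definitions, foldPcdata("  this \r\n is\ta  bar \n") gives
"this is a bar", and a chunk consisting only of line-ends is reported
as ignorable. Note that the sketch treats each chunk in isolation; how
the rules should interact across inline markup in mixed content is
exactly the kind of question a shared convention would need to settle.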
Does this cover everything? Is it workable?

Example:

	this isa bar

folds to:

	this is a bar

[Note that the Xpointer STRING syntax and the use of pseudo-elements
works on the *raw* data (i.e. all non-markup characters). Therefore the
application has to have access to this - it has to maintain a PRESERVEd
version of the document as well as (say) displaying or transforming a
DEFAULTed document.]

I think it's important to address this, since otherwise I predict we
shall have considerable confusion, especially when implementors of
authoring or processing software have not thought this through
completely.

	P.

--
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

From Peter at ursus.demon.co.uk Sun Aug 3 15:58:25 1997
From: Peter at ursus.demon.co.uk (Peter Murray-Rust)
Date: Mon Jun 7 16:58:12 2004
Subject: Xapi-J: an architectural detail (long)
Message-ID: <9092@ursus.demon.co.uk>

John,
	Thanks very much for this - including keeping the momentum of
this effort. I encourage other members of this list to react to this
posting - John has obviously worked very hard at this.

In message <33E407E3.37404793@datachannel.com> john@datachannel.com (John Tigue) writes:
[...]
>
> Folks don't really have a problem grasping the XML object model:
>
> To get a new XML processor object instance:
>     xml.XMLProcessor xmler = new xml.XMLProcessor();
>
> To have a processor read a document:
>     xml.IDocument aDocument = xmler.readXML( someInfoSource );
>
> To get the root of a document:
>     xml.IElement anElement = aDocument.getRoot();
>
> To get an element's attributes:
>     java.util.Enumeration someAttributes = anElement.getAttributes();
>
> That's easy to understand. It is a very simple object model
> which can be mapped onto many of the current XML processors without
> requiring major rewrites.

I follow all this. Can we also go one step further and say how we get
the children of an Element. I am assuming also that (say) the DTD is
not a child of root in this model - do you have proposals for all this?
If so, please post them :-) so we can get it finished - we keep going
round and round on this ...

> [...]
>
> (I have tested that Xapi-J can be implemented on top of msxml. I have

Excellent! What are your thoughts about NXP and Lark?

[...]
>
> The execution sequence looks like:
> 1. The factory is set via XMLProcessor.setIXMLProcessorFactory().
> 2. Later, a "new XMLProcessor()" happens.
> 3. In the constructor the factory is asked to return an IXMLProcessor.
> 4. The IXMLProcessor object is assigned to the field "implementation".
> 5. Later, a "readXML()" call happens.
> 6. In readXML(), the XMLProcessor object, acting as a proxy, passes
>    the request onto its IXMLProcessor and then,
> 7. The XMLProcessor object returns whatever is returned to it from its
>    IXMLProcessor. I.e. class IXMLProcessor is the real worker.
>
> So the phrase "Xapi-J contains no XML processor" could more precisely
> be stated as: Xapi-J does contain a class XMLProcessor but it does not
> contain an implementation of the interface IXMLProcessor which is the
> real worker/processor in the Xapi-J architecture.
>
> The above is a convoluted dance but to the developer who is simply
> using an Xapi-J compliant XML processor it looks really simple on the
> outside. (For a very similar "design pattern" see java.net.Socket et
> al.) And only one API has to be learned to work with any Xapi-J
> compliant processor.

I think I have followed John's logic and proposal, and suggest that we
take this as a concrete proposal. Since it's only likely to be used by
a smallish number of people, its apparent complexity is acceptable.
For example, JUMBO is able to use more than one parser, but I have to delve into each one to see how to extract the correct aprts. This would make it easier overall. Assuming we accept this I'd like us also to tackle the question of Nodes, Elements, etc. Until this is done it's difficult to build application software with interchangeable parts. For example, there is a lot of generic stuff (see my posting on whitespace) that an XML application (?processor) has to implement, and hopefully we can isolate and standardise on that. Once again, thanks John. P. -- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From john at datachannel.com Mon Aug 4 00:11:29 1997 From: john at datachannel.com (John Tigue) Date: Mon Jun 7 16:58:12 2004 Subject: Xapi-J: an architectural detail References: <9092@ursus.demon.co.uk> Message-ID: <33E502A7.80A66A68@datachannel.com> Peter Murray-Rust wrote: > > > To get an element's attributes: > > java.util.Enumeration someAttributes = > anElement.getAttributes(); > > > I follow all this. Can we also go one step further and say how we get > > the children of an Element. To get an element's children: java.util.Enumeration children = anElement.getContents(); This method returns an Enumeration, each object of which implements IContent. The below paragraphs explain IContent et al. An XML document can be represented as a tree. In an XML document object model there are things which are containers (e.g. a document is a container and so is an element) and also things which are the content of a container (e.g. a chunk of text is a content or even a element can be, in the case of one element within another). 
To model these there are the IContainer and IContent interfaces. The full source follows: public interface IContainer { public Enumeration getContents(); public void insertContent( IContent aContent, IContent preceedingContent ); public void appendContent( IContent aContent ); public void removeContent( IContent aContent ); } public interface IContent { public void setParent( IContainer aContainer ); public IContainer getParent(); public String getData(); } These interfaces only express the methods for navigating a tree. A particular class of objects would need to have some more methods to be interesting. For example, the interface for an element is IElement. The full source follows: public interface IElement extends IContent, IContainer { public String getType(); public void setType( String aType ); public void addAttribute( String name, String value ); public void removeAttribute( String name ); public IAttribute getAttribute( String attributeName ); public java.util.Enumeration getAttributes(); } The above states that an IElement can be a container and/or a content and also has some other methods particular to being an element. So although IElement does not directly have a method called getContents(), it gets the method from its superinterface IContainer. (Note that the Xapi-J method getType() follows the terminology of XML-LANG and as such it implies completely different semantics than com.ms.xml.Element.getType(). Xapi-J's getType() returns a String which is the "Name" from production [33] of the spec. For example, in the following: red The spec clearly says "The Name in the start-and end-tags gives the element's type" so for the above example in Xapi-J getType() would return a String with the value "color" not an int with the value 1 (i.e. MS's ELEMENT constant). Microsoft has chosen an independent model in which most objects in a document are com.ms.xml.Element and the particular flavor of "Element" is determined through the getType() method. 
In that model all of the following are "Element" types: DOCUMENT, ELEMENT,
PCDATA, PI, BETA, COMMENT, and CDATA.)

> > (I have tested that Xapi-J can be implemented on top of msxml. I have
>
> Excellent! What are your thoughts about NXP and Lark?

Lark maps very easily to Xapi-J. Xapi-J was designed by taking all the best
ideas from the existing processors so the mappings are straightforward.
NXP is pretty much the standard when it comes to ESIS output so it defines
that part of Xapi-J, making the mapping essentially direct. The only new
part is the stuff mentioned in the posting which started this thread: how
does a developer instantiate a processor through the Xapi-J interfaces.
After that it's the regular old NXP stuff.

Note that since Xapi-J is pretty much just a bunch of interfaces, this work
can easily be fit into a full grove model. The objects in the grove could
implement their grove interfaces and, if desirable, also implement the
earlier Xapi-J interfaces. A full grove model is being worked on by others
so making Xapi-J a full grove model would be a duplication of effort. The
main goal of Xapi-J is simply to make things easier for developers using
the current crop of processors.

-- 
John Tigue
Sr. Software Architect
DataChannel
http://www.datachannel.com
jtigue@datachannel.com
206-462-1999
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vcard.vcf
Type: text/x-vcard
Size: 263 bytes
Desc: Card for John Tigue
Url : http://mailman.ic.ac.uk/pipermail/xml-dev/attachments/19970803/be6c700d/vcard.vcf

From Peter at ursus.demon.co.uk  Mon Aug  4 09:01:09 1997
From: Peter at ursus.demon.co.uk (Peter Murray-Rust)
Date: Mon Jun  7 16:58:12 2004
Subject: Xapi-J: an architectural detail
Message-ID: <9106@ursus.demon.co.uk>

In message <33E502A7.80A66A68@datachannel.com> john@datachannel.com (John Tigue) writes:
[...]
> Lark maps very easily to Xapi-J.
Xapi-J was designed by taking all the > best ideas from the existing processors so the mappings are > straight-forward. NXP is pretty much the standard when it comes to ESIS > output so it defines that part of Xapi-J making the mapping essentially > direct. The only new part is the stuff mentioned in the posting which > started this thread: how does a developer instantiate a processor > through the Xapi-J interfaces. After that it's the regular old NXP > stuff. Sounds good to me. I am particulalry impressed by the fact that you can make it work with the various parsers, even if they take different approaches with different terms. What is your timescale for putting it all together? Are there any places where you need more feedback from the list? FWIW it gets my vote :-) P. -- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From richard at light.demon.co.uk Mon Aug 4 12:24:12 1997 From: richard at light.demon.co.uk (Richard Light) Date: Mon Jun 7 16:58:12 2004 Subject: Xapi-J: an architectural detail In-Reply-To: <33E502A7.80A66A68@datachannel.com> Message-ID: <1r1HGKAKwa5zEwZY@light.demon.co.uk> In message <33E502A7.80A66A68@datachannel.com>, John Tigue writes > >An XML document can be represented as a tree. In an XML document object >model there are things which are containers (e.g. a document is a >container and so is an element) and also things which are the content of >a container (e.g. a chunk of text is a content or even a element can be, >in the case of one element within another). To model these there are the >IContainer and IContent interfaces. 
The full source follows: > >public interface IContainer > { > public Enumeration getContents(); > public void insertContent( IContent aContent, IContent >preceedingContent ); > public void appendContent( IContent aContent ); > public void removeContent( IContent aContent ); > } > >public interface IContent > { > public void setParent( IContainer aContainer ); > public IContainer getParent(); > public String getData(); > } These interfaces are mirrored in the SGML/XML Property Set. In that, everything is a 'node', each with its own name and a set of properties. One of those properties is 'subnode' - having a subnode property makes a node, de facto, into a Container in your terminology. The complete XML document can be represented as a 'grove' (tree structure) of these nodes. The parent-child relationship between elements of the XML document is more specific than this. The full grove includes things like the DTD and processing instructions, which are nodes in the grove structure but do not exhibit 'parent-child' relationships to anything else. Nodes have some 'intrinsic properties', which apply whatever their particular type might be. (Again, this mirrors your thinking very closely.) These intrinsic properties are: object Node property ClassNm ; the name of the node's class property GrovRoot ; the root of the grove of which the node forms a part property SunPNs ; the names of all the subnode properties exhibited by the node property AllPNs ; the names of all the properties exhibited by the node property ChildPN ; the name of the children property, when this class of node has children property DataPN : the data property name (i.e. 
'char' or 'string'), when this class of node contains data property DSepPN ; the data separator property name property Parent ; the node's parent property TreeRoot ; the root of the parent-children tree [not the same as the 'grove root'] property Origin ; the node that that this node as one of its subnode properties property OTSRelPN ; the origin-to-subnode relationship property name I've given the full set of intrinsic node properties, really just to point out that all of this modeling has already been done before. Much of it is too detailed (and perhaps one level too abstract) to apply to Xapi-J. However, I'm concerned that Xapi-J developers shouldn't just ignore the SGML property set and invent their own version. Expressing the only intrinsic property (parent) that is relevant to this discussion leads to: public interface XMLnode { public XMLnode parent(); } We could add in a couple of extra intrinsic properties, so you can get to the grove root and its origin from any node: public interface XMLnode { public XMLnode parent(); public XMLnode grovroot(); public XMLnode origin(); } I don't think we need separate IContainer and IContent interfaces - what's wrong with just INode (or XMLnode, as I have it)? >These interfaces only express the methods for navigating a tree. A >particular class of objects would need to have some more methods to be >interesting. For example, the interface for an element is IElement. The >full source follows: > >public interface IElement extends IContent, IContainer > { > public String getType(); > public void setType( String aType ); > public void addAttribute( String name, String value ); > public void removeAttribute( String name ); > public IAttribute getAttribute( String attributeName ); > public java.util.Enumeration getAttributes(); > } > >The above states that an IElement can be a container and/or a content >and also has some other methods particular to being an element. 
So >although IElement does not directly have a method called getContents(), >it gets the method from its superinterface IContainer. We can do the same thing here: public interface XMLelement extends XMLnode { public String gi(); public void setType( String aType ); public void addAttribute( String name, String value ); public void removeAttribute( String name ); public XMLattribute getAttribute( String attributeName ); public XMLattlist atts(); } Notice that I've left the middle four declarations more or less unchanged, for the following reason: There is definitely a useful distinction here, between those things which are _properties_ of a node within an XML document, like the GI of an element or its list of declared attributes, and _operations_ which the API lets you carry out on that node. The SGML/XML property set is entirely about the properties of an existing instance. It provides no framework or precedent for API commands which _alter_ that instance, like SetType (which assigns or changes the GI of an element). There, we are rather more on our own! I'm not sure if the Java API provides for a more elegant way of specifying a property than the one I've dreamt up - if it does, we should use it. Hope this helps. Richard Light SGML and Museum Information Consultancy richard@light.demon.co.uk 3 Midfields Walk Burgess Hill West Sussex RH15 8JA U.K. tel. 
(44) 1444 232067 xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From john at datachannel.com Mon Aug 4 18:54:11 1997 From: john at datachannel.com (John Tigue) Date: Mon Jun 7 16:58:12 2004 Subject: Xapi-J: an architectural detail References: <1r1HGKAKwa5zEwZY@light.demon.co.uk> Message-ID: <33E609C6.6EEE0CB9@datachannel.com> Richard Light wrote: > In message <33E502A7.80A66A68@datachannel.com>, John Tigue > writes > > > >An XML document can be represented as a tree. In an XML document > object > >model there are things which are containers (e.g. a document is a > >container and so is an element) and also things which are the content > of > >a container (e.g. a chunk of text is a content or even a element can > be, > >in the case of one element within another). To model these there are > the > >IContainer and IContent interfaces. The full source follows: > > > >public interface IContainer > > { > > public Enumeration getContents(); > > public void insertContent( IContent aContent, IContent > >preceedingContent ); > > public void appendContent( IContent aContent ); > > public void removeContent( IContent aContent ); > > } > > > >public interface IContent > > { > > public void setParent( IContainer aContainer ); > > public IContainer getParent(); > > public String getData(); > > } > > These interfaces are mirrored in the SGML/XML Property Set. In that, > everything is a 'node', each with its own name and a set of > properties. > One of those properties is 'subnode' - having a subnode property makes > a > node, de facto, into a Container in your terminology. The complete > XML > document can be represented as a 'grove' (tree structure) of these > nodes. > I agree that grove is the way to go. 
I'm just trying to get all the current processors on the same track before we move towards the grove work. > The parent-child relationship between elements of the XML document is > more specific than this. The full grove includes things like the DTD > and processing instructions, which are nodes in the grove structure > but > do not exhibit 'parent-child' relationships to anything else. > How will we represent the DTD in order to reflect the effects of the Bray Namespace Proposal? > Nodes have some 'intrinsic properties', which apply whatever their > particular type might be. (Again, this mirrors your thinking very > closely.) These intrinsic properties are: > > object Node > property ClassNm ; the name of the node's class > property GrovRoot ; the root of the grove of which the node forms a > part > property SunPNs ; the names of all the subnode properties > exhibited > by the node > property AllPNs ; the names of all the properties exhibited by the > > node > property ChildPN ; the name of the children property, when this > class > of node has children > property DataPN : the data property name (i.e. 'char' or > 'string'), > when this class of node contains data > property DSepPN ; the data separator property name > property Parent ; the node's parent > property TreeRoot ; the root of the parent-children tree [not the > same > as the 'grove root'] > property Origin ; the node that that this node as one of its > subnode > properties > property OTSRelPN ; the origin-to-subnode relationship property name > > I've given the full set of intrinsic node properties, really just to > point out that all of this modeling has already been done before. > Much > of it is too detailed (and perhaps one level too abstract) to apply to > > Xapi-J. However, I'm concerned that Xapi-J developers shouldn't just > ignore the SGML property set and invent their own version. > Ignoring the SGML property set would be just plain stupid. I like to drive cars not re-invent wheels. 
> Expressing the only intrinsic property (parent) that is relevant to > this > discussion leads to: > > public interface XMLnode > { > public XMLnode parent(); > } > > We could add in a couple of extra intrinsic properties, so you can get > > to the grove root and its origin from any node: > > public interface XMLnode > { > public XMLnode parent(); > public XMLnode grovroot(); > public XMLnode origin(); > } > I absolutely agree that the Xapi-J interfaces are not done. I have tried to bring the current processors together while mapping out the basics of the object model. We will need to add more properties as you point out. One thing I would like to see is that we return appropriate objects as much as possible. One particular processor out there does a getAttribute() where you pass in a String and get back a String. I think an IAttribute should be returned. This way other convenience methods of the returned class can be used. For example something like isPercent() or isNumeric() for an attribute not to mention all the properties of say a character. > I don't think we need separate IContainer and IContent interfaces - > what's wrong with just INode (or XMLnode, as I have it)? > We could do that. Or maybe both with something like the following: public interface XMLNode extends IContainer, IContent { ... } I went with IContainer and IContent because I can do more precise polymorphic message handling such that the receiving method can make more assumptions about what the passed object can do without casting to the exact class. Casting in Java is a runtime cost (b/ of security) so more expensive. > >These interfaces only express the methods for navigating a tree. A > >particular class of objects would need to have some more methods to > be > >interesting. For example, the interface for an element is IElement. 
> The > >full source follows: > > > >public interface IElement extends IContent, IContainer > > { > > public String getType(); > > public void setType( String aType ); > > public void addAttribute( String name, String value ); > > public void removeAttribute( String name ); > > public IAttribute getAttribute( String attributeName ); > > public java.util.Enumeration getAttributes(); > > } > > > >The above states that an IElement can be a container and/or a content > > >and also has some other methods particular to being an element. So > >although IElement does not directly have a method called > getContents(), > >it gets the method from its superinterface IContainer. > > We can do the same thing here: > > public interface XMLelement extends XMLnode > { > public String gi(); > public void setType( String aType ); > public void addAttribute( String name, String value ); > public void removeAttribute( String name ); > public XMLattribute getAttribute( String attributeName ); > public XMLattlist atts(); > } > I generally argree. I went for getGI() and setGI() at one point but the spec forced getType() and setType(). Plus I believe that the work we produce here will filter down to folks who are far less preoccupided with XML. For them the term "generic identifier" or even "gi" would be less readily grasped than "type". Either way, by following the get/set naming convention we map to JavaBeans. Slightly more wordy than X() and setX() but the builder tools are geared for recognising getX() and setX(). > Notice that I've left the middle four declarations more or less > unchanged, for the following reason: > > There is definitely a useful distinction here, between those things > which are _properties_ of a node within an XML document, like the GI > of > an element or its list of declared attributes, and _operations_ which > the API lets you carry out on that node. > > The SGML/XML property set is entirely about the properties of an > existing instance. 
It provides no framework or precedent for API
> commands which _alter_ that instance, like SetType (which assigns or
> changes the GI of an element). There, we are rather more on our own!

At first setType() might seem less than useful. And perhaps type should be
a parameter to the constructor and not modifiable (more on that later). I
got caught in a Java-specific detail related to the following:

Class.forName("SomeClass").newInstance()

With this code Java objects can be instantiated from a String of the class'
name. That's handy for object serialization amongst other things; for
example, say you had a repository of classes for specific element types and
you want to instantiate one during a parse. The point is that in Java
newInstance() only works with the default constructor; parameters cannot be
passed in. So there is need for a separate method for setting the type of
the element.

If we wanted to make the type immutable then perhaps we could specify that
the member field "type" can only be set once. This type of behavior shows
up a lot in the JDK. Inside the property setter the field is checked for
null; if it is not null, an exception is thrown. Also in the JDK we see
String and StringBuffer, where String is immutable and StringBuffer is
where strings can be dynamically built up. Perhaps something like that for
Xapi-J.

> I'm not sure if the Java API provides for a more elegant way of
> specifying a property than the one I've dreamt up - if it does, we
> should use it.

The only point I'm sure of is the getX() and setX() "design pattern". Most
Java devs casually consuming XML will use a JavaBean and we should plan for
that architecture.

> Hope this helps.

Definitely. Thanks.

> Richard Light
> SGML and Museum Information Consultancy
> richard@light.demon.co.uk
> 3 Midfields Walk
> Burgess Hill
> West Sussex RH15 8JA
> U.K.
> tel.
(44) 1444 232067 > > xml-dev: A list for W3C XML Developers > Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ > To unsubscribe, send to majordomo@ic.ac.uk the following message; > unsubscribe xml-dev > List coordinator, Henry Rzepa (rzepa@ic.ac.uk) -- John Tigue Sr. Software Architect DataChannel http://www.datachannel.com jtigue@datachannel.com 206-462-1999 -------------- next part -------------- A non-text attachment was scrubbed... Name: vcard.vcf Type: text/x-vcard Size: 263 bytes Desc: Card for John Tigue Url : http://mailman.ic.ac.uk/pipermail/xml-dev/attachments/19970804/bfb1fb6e/vcard.vcf From andrewl at microsoft.com Tue Aug 5 00:03:26 1997 From: andrewl at microsoft.com (Andrew Layman) Date: Mon Jun 7 16:58:13 2004 Subject: Process: Subjects of Messages (was "A question and a proposal") Message-ID: <7BB61B44F197D011892800805FD4F7920133B7A8@RED-03-MSG.dns.microsoft.com> It would be helpful if authors would give their messages titles that are meaningful descriptions of the substantive contents of the message. Something like "A question and a proposal" is cute, but useless for filing. --Andrew Layman AndrewL@microsoft.com xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From akirkpatrick at ims-global.com Tue Aug 5 10:14:39 1997 From: akirkpatrick at ims-global.com (akirkpatrick@ims-global.com) Date: Mon Jun 7 16:58:13 2004 Subject: Xapi-J: an architectural detail Message-ID: I really like the combination of IContent and IContainer. The only question I have is how an element can query its context in an efficient way? For example, how can I find the previous element without referring to the parent container. Presumably then the parent would have to enumerate all its children to find the previous content to the element in question. 
Obviously a particular application can record the previous element in a variable but then you get to more complex contexts, like "what is the previous of my parent". Any thoughts? Alfie. xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From neil at bradley.co.uk Tue Aug 5 11:49:56 1997 From: neil at bradley.co.uk (Neil Bradley) Date: Mon Jun 7 16:58:13 2004 Subject: Specification Questions Message-ID: <199708050949.KAA07792@andromeda.ndirect.co.uk> Reply-to: Peter@ursus.demon.co.uk (Peter Murray-Rust) > Some additional - hopefully constructive - thoughts on whitespace. > > The XML-lang spec does not ( and I suspect will not) give detailed guidance > on how whitespace will be managed. My impression is that it is up to > implementers and/or groups like this to come up with particular solutions. > My worry is that these will be inconsistent and not inter-operable. I agree totally. This was my original concern. > *** > Therefore I propose that those on XML-DEV who care about this problem come > up with some guidelines for implementers. > *** I very much hope this happens. > XML does NOT treat whitespace like SGML and does NOT behave like HTML > (although it can be configured to do so). As far as I see them, the rules > are: > > 'All characters that are not markup are passed to the application'. (This > is independent of any value of XML-SPACE (see below), processing instructions, > stylesheets, etc.) These characters include HT, CR, LF, SP, and probably > a number of other Unicode 'whitespace' characters. What the application > does with them is *undefined* in XML-lang. > > Note that this means that CR and LF are passed as separate characters. No > normalisation takes place. 
Therefore
>
> Line one\n\rline two
>
> is different from
>
> Line one\nline two
>
> even if they are visually similar on various text editors/displays, etc.
> (My impression was that SGML normalised these two strings to the same
> ESIS output - is that right?).
>
> This means that the author/processor 'contract' has to be aware of this.

I think all applications should be expected to accept either or both
characters in sequence as a line end signal, so that platform dependencies
can be eliminated. If there is no good reason to omit this task from the
XML-processor itself, I think it should be done there.

> *** In some cases the document author and the application author are both
> aware of this problem and so the whitespace characters inserted by the
> author will be processed in the way that they expect. However, in most cases
> I suspect this will NOT be true and that authors will inadvertently create
> documents that are processed differently ***
>
> XML provides an attribute XML-SPACE (local to an element BUT inherited by
> its children) which can have three values:
> - #IMPLIED (no signals about whitespace handling)
> - PRESERVE (applications preserve all the whitespace)
> - DEFAULT (the *application's* default white-space processing modes
> are acceptable for this element).
>
> PRESERVE seems clear. All whitespace is passed to the application. The
> others seem to be dangerous unless there are some general conventions.
> If possible, we should propose a *general* default mechanism for whitespace
> handling for XML-SPACE="DEFAULT". If everyone adopts this, it will greatly
> reduce this problem. Is this a reasonable strategy?

I believe so. In addition, can we not put 'XML-SPACE (PRESERVE|IMPLIED)
"PRESERVE"' in an attribute declaration for an element which will always
have reserved content? It is common practice for a DTD to have some kind of
pre-formatted element, such as HTML's '<PRE>'.
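
The line-end handling suggested earlier in this message (accept CR, LF, or
the two in sequence as a single line end) might be sketched roughly as
follows. This is only an illustration: the class and method names are
hypothetical, and treating both CR LF and LF CR as one line end is an
assumption read into "either or both characters in sequence".

```java
// Hypothetical helper, not part of any processor discussed on the list.
final class LineEnds {
    /** Maps CR, LF, CR LF and LF CR (assumption) to a single '\n'. */
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\r' || c == '\n') {
                out.append('\n');
                // If the next character is the other half of a pair, skip it.
                char other = (c == '\r') ? '\n' : '\r';
                if (i + 1 < text.length() && text.charAt(i + 1) == other) {
                    i++;
                }
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String a = normalize("Line one\n\rline two");
        String b = normalize("Line one\nline two");
        System.out.println(a.equals(b)); // prints "true"
    }
}
```

With this in place the two strings from the quoted example compare equal,
while two identical characters in a row (LF LF) still count as two line ends.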


> If so, we can propose that the DEFAULT mode for any whitespace processing is
> something along the lines (similar to HTML?).  Within an element with
> XML-SPACE="DEFAULT"
> 

> All whitespace sequences are mapped into a single space character.
Agreed.

> All whitespace pseudo-elements are ignored (i.e. whitespace between markup)

Ummm. what about 'the <B>bold</B> <I>italic</I> styles...'?

> All leading and trailing whitespace in #PCDATA is ignored.

I think all applications should remove leading and trailing CR and LF
characters in a mixed content element. But not SP or HT, as this would
be undesirable in the following fragment:

A <B> bold </B> word.

Although an unusual layout, some people may use it, and it would be
unfortunate if it resulted in 'Aboldword'.


> Example:
> 
>  this
> 
> isa 
DID YOU INTEND A SPACE SOMEWHERE BETWEEN 'is' AND 'a'?
> bar
> 
> 
> folds to:
> this is a bar
> 
> I think it's important to address this, since otherwise I predict we shall
> have considerable confusion, especially when implementors of authoring or
> processing software have not thought this through completely.

Again, I agree, and I think it will be possible to achieve this with 
a bit more discussion in this forum.
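
To make the two candidate rules concrete, here is a small sketch. It assumes
whitespace means SP, HT, CR and LF only, and all names are hypothetical.
fold() follows the quoted proposal (collapse each run to one space, drop
leading and trailing whitespace); stripLineEnds() follows the narrower
suggestion above (drop only leading and trailing CR and LF).

```java
// Hypothetical helper illustrating two whitespace rules from this thread.
final class WhitespaceFolder {
    static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /** Collapses whitespace runs to one space; trims both ends. */
    static String fold(String pcdata) {
        StringBuilder out = new StringBuilder();
        boolean pendingSpace = false;
        for (int i = 0; i < pcdata.length(); i++) {
            char c = pcdata.charAt(i);
            if (isSpace(c)) {
                // Leading whitespace is ignored because nothing is emitted yet.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                }
                out.append(c);
                pendingSpace = false;
            }
        }
        return out.toString(); // a trailing pending space is never written
    }

    /** Removes only leading and trailing CR and LF; SP and HT survive. */
    static String stripLineEnds(String pcdata) {
        int start = 0;
        int end = pcdata.length();
        while (start < end && isLineEnd(pcdata.charAt(start))) {
            start++;
        }
        while (end > start && isLineEnd(pcdata.charAt(end - 1))) {
            end--;
        }
        return pcdata.substring(start, end);
    }

    private static boolean isLineEnd(char c) {
        return c == '\r' || c == '\n';
    }

    public static void main(String[] args) {
        System.out.println(fold(" this\n\nis a\nbar\n"));             // prints "this is a bar"
        System.out.println("[" + stripLineEnds("\n bold \n") + "]"); // prints "[ bold ]"
    }
}
```

The second call shows why the narrower rule matters for mixed content: the
spaces around "bold" are kept, so neighbouring words do not run together.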

> Peter Murray-Rust, domestic net connection

Neil.

-----------------------------------------------
Neil Bradley - Author of The Concise SGML Companion.
neil@bradley.co.uk
www.bradley.co.uk



From ricko at allette.com.au  Tue Aug  5 15:12:30 1997
From: ricko at allette.com.au (Rick Jelliffe)
Date: Mon Jun  7 16:58:13 2004
Subject: XML and whitespace: lets just dump CR and LF!
Message-ID: <199708051317.XAA23619@jawa.chilli.net.au>

> From: Neil Bradley 
> Reply-to:      Peter@ursus.demon.co.uk (Peter Murray-Rust)
 
> > Therefore I propose that those on XML-DEV who care about this problem come
> > up with some guidelines for implementers. 
 
> I very much hope this happens.
  
> > This means that the author/processor 'contract' has to be aware of this.
 

Can I suggest a very different tack?

The problem with CR/LF is one of overloading not of translation or contracts. 
They have too many meanings.  In particular they function both as 
record-start/-end characters and as new-lines.

I suggest that the following approach should be taken. (I think it is the only
realistic solution, especially if we assume that 1) data is usually generated
by applications, 2) humans only check and tweak data, 3) we want operating
system and character set independence, 4) line-breaking is generally done by
clients ...so CR/LF is basically a convenience for fitting data into editors,
not for the purposes of output.)

**A) XML applications should ignore *ALL* CR and LF as a bad joke.  They should
be there entirely for formatting the raw text into nice, eye-sized records.
So CR and LF should never be converted to spaces. (This approach was the
one taken by Interleaf, and I have come to appreciate it.) If you need a
space, then start the new line with it!  (A space ending the previous line is
difficult to see.)

**B) XML applications should mandate the use of the unambiguous Unicode characters
	-- LINE SEPARATOR (U+2028)

	-- PARAGRAPH SEPARATOR (U+2029)


So if I want to do the equivalent of HTML 
  X
XML can have:
  X


or even

 
 
X


&x2028;

And it can do this with the text conventions of any operating system.

I certainly think that CR/LF should not be of interest to XML-lang. And I think
they should be of marginal interest to XML applications too. Let's dump them!
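
A rough sketch of what points A and B could look like on the application
side, assuming the text has already been read into a Java String. The class
and method names are hypothetical; only the two Unicode code points come from
the proposal itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "ignore CR/LF, break on Unicode separators".
final class SeparatorSketch {
    static final char LINE_SEPARATOR = '\u2028';
    static final char PARAGRAPH_SEPARATOR = '\u2029';

    /** Point A: every CR and LF is dropped, never turned into a space. */
    static String dropRecordEnds(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != '\r' && c != '\n') {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Point B: real breaks come only from the given separator character. */
    static List<String> split(String raw, char separator) {
        String text = dropRecordEnds(raw);
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == separator) {
                parts.add(text.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(text.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        // The space that begins the second record is what joins "one" and "two".
        String raw = "one\r\n two" + LINE_SEPARATOR + "three\n";
        System.out.println(split(raw, LINE_SEPARATOR)); // prints "[one two, three]"
    }
}
```

The same split() call with PARAGRAPH_SEPARATOR would give paragraph-level
chunks; nothing here depends on the line-end convention of the operating system.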


Rick Jelliffe 



From tbray at textuality.com  Tue Aug  5 16:10:39 1997
From: tbray at textuality.com (Tim Bray)
Date: Mon Jun  7 16:58:13 2004
Subject: XML and whitespace: lets just dump CR and LF!
Message-ID: <3.0.32.19970805070023.0081f720@pop.intergate.bc.ca>

At 11:13 PM 05/08/97 +1000, Rick Jelliffe wrote:
>**A) XML applications should ignore *ALL* CR and LF as a bad joke......
>
>I certainly think that CR/LF should be not of interest to XML-lang...
>Lets dump them!

Heh-heh.  If you go look in the proceedings of the 1988 Usenix conference,
you'll find a paper I wrote, on the Oxford English Dictionary project,
which has a section entitled

 '\n' Considered Harmful

I'd love to lose the record-end silliness.  Trouble is, we're stuck with
it until we have better editing tools. -T.



From clloyd at gorge.net  Tue Aug  5 16:49:14 1997
From: clloyd at gorge.net (Chris Lloyd)
Date: Mon Jun  7 16:58:13 2004
Subject: Xapi-J: an architectural detail
In-Reply-To: 
Message-ID: <3.0.1.32.19970805074603.006bcb18@gorge.net>

akirkpatrick wrote:
>I really like the combination of IContent and IContainer.
>The only question I have is how an element can query
>its context in an efficient way? For example, how can
>I find the previous element without referring to the parent
>container. Presumably then the parent would have to
>enumerate all its children to find the previous content
>to the element in question. Obviously a particular
>application can record the previous element in a variable
>but then you get to more complex contexts, like "what
>is the previous of my parent".
>
This is where the next step is needed. Tree Iterators can provide efficient
and well abstracted mechanisms for walking the XML tree. Everyone is still
stuck on the schema part of Xapi-J and that is fine. After that is done
then it's time to add classes specifically for navigation.

Keep the schema simple. Don't add members for the previous child, etc. It
is unnecessary and complex to maintain.

Over the past 2 years, we have been developing an object database system
for SGML. We have gone through the same thought processes as are going on
with xapi-j right now. I think there are a few design considerations to
keep in mind if you want to use iterator classes with the xapi-j schema and
I think eventually you will.

The idea of inheriting from IContainer is a good one. Polymorphism is very
useful when it comes time to write navigation classes. A base class for all
objects in the tree is very important!! We'll call this INode.

It then becomes useful to break the type of nodes into 2 classes:
IContainer and IProperty. An IProperty is always a leaf node of the tree
and an IContainer is not. After that you add your concrete classes such as
IElement.
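
The layering just described could be sketched like this. INode, IContainer
and IProperty are the names used in this message; the concrete Element and
Attribute classes, their fields, and the use of java.util.List instead of
Enumeration are assumptions made only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the INode / IContainer / IProperty split.
final class HierarchySketch {
    /** Common base for everything in the tree; deliberately empty. */
    interface INode { }

    /** A leaf: it never has children. */
    interface IProperty extends INode { }

    /** A non-leaf: it exposes its children. */
    interface IContainer extends INode {
        List<INode> getContents();
    }

    /** Assumed leaf class holding an attribute name and value. */
    static final class Attribute implements IProperty {
        final String name;
        final String value;

        Attribute(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Assumed concrete element class. */
    static final class Element implements IContainer {
        final String type;
        private final List<INode> contents = new ArrayList<>();
        private final List<Attribute> attributes = new ArrayList<>();

        Element(String type) {
            this.type = type;
        }

        void appendContent(INode node) {
            contents.add(node);
        }

        void addAttribute(String name, String value) {
            attributes.add(new Attribute(name, value));
        }

        List<Attribute> getAttributes() {
            return attributes;
        }

        @Override
        public List<INode> getContents() {
            return contents;
        }
    }

    /** The leaf test relies on the interface only, not on a concrete class. */
    static boolean isLeaf(INode node) {
        return node instanceof IProperty;
    }

    public static void main(String[] args) {
        Element color = new Element("color");
        color.addAttribute("space", "rgb");
        System.out.println(isLeaf(color));                        // prints "false"
        System.out.println(isLeaf(color.getAttributes().get(0))); // prints "true"
    }
}
```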

John Tigue wrote:

>These interfaces only express the methods for navigating a tree. A
>particular class of objects would need to have some more methods to be
>interesting. For example, the interface for an element is IElement. The
>full source follows:


>public interface IElement extends IContent, IContainer
>    {
>     public String getType();
>     public void setType( String aType );
>     public void addAttribute( String name, String value );
>     public void removeAttribute( String name );
>     public IAttribute getAttribute( String attributeName );
>     public java.util.Enumeration getAttributes();
>     }

In the above example the returned interface IAttribute would inherit from
IProperty because it is a leaf node.

A Tree Iterator would already know the structure of an element when it walks
over it and would know how to retrieve the attributes. When it walks on to
an attribute, it knows it's a leaf node because it inherits from IProperty.

Again I stress that every XML object in the tree should inherit from a
single base class even if the base class does not provide any common
interfaces to its concrete classes. In this way, any XML object can be
passed via a base class reference (whoops, I almost said pointer). It is
trivial to implement a fast, safe-casting mechanism that uses polymorphism
for casting.
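
One possible shape for such a safe-casting mechanism, as a sketch only: the base type declares an asElement() method that returns null unless the object really is an element. The asElement() method, SimpleElement, SimpleText and SafeCastDemo are hypothetical names, and the default-method syntax assumes a Java version much later than the one under discussion here.

```java
// Hypothetical sketch: "asElement()" answers null unless the object really is
// an element, so callers never perform an unchecked downcast themselves.
class SafeCastDemo {
    static String describe(INode node) {
        IElement element = node.asElement(); // resolved by polymorphic dispatch
        return (element != null) ? "element:" + element.getType() : "not an element";
    }

    public static void main(String[] args) {
        System.out.println(describe(new SimpleElement("para")));
        System.out.println(describe(new SimpleText()));
    }
}

interface INode {
    default IElement asElement() { return null; } // assumed method, not from Xapi-J
}

interface IElement extends INode {
    String getType();

    @Override
    default IElement asElement() { return this; }
}

class SimpleElement implements IElement {
    private final String type;
    SimpleElement(String type) { this.type = type; }
    public String getType() { return type; }
}

class SimpleText implements INode { }
```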

This way, we can later add navigation classes that leverage the polymorphic
nature of the XML tree.

Chris Lloyd
POET Software








>Any thoughts?
>Alfie.


From andrewl at microsoft.com  Tue Aug  5 18:55:19 1997
From: andrewl at microsoft.com (Andrew Layman)
Date: Mon Jun  7 16:58:13 2004
Subject: Linking and Query question
Message-ID: <7BB61B44F197D011892800805FD4F7920133B7BF@RED-03-MSG.dns.microsoft.com>

Can I create a link that, in effect, contains a query so that it
references one document among a set? For example, if I know that several
versions of a document exist, and I want to reference the latest
version, but I'm willing to accept either of the two prior versions, can
I express that?  If so, how?  Thanks.

--Andrew Layman
   AndrewL@microsoft.com




From tbray at textuality.com  Tue Aug  5 19:10:53 1997
From: tbray at textuality.com (Tim Bray)
Date: Mon Jun  7 16:58:13 2004
Subject: Linking and Query question
Message-ID: <3.0.32.19970805100443.008bd6e0@pop.intergate.bc.ca>

At 09:54 AM 05/08/97 -0700, Andrew Layman wrote:
>Can I create a link that, in effect, contains a query so that it
>references one document among a set? For example, if I know that several
>versions of a document exist, and I want to reference the latest
>version, but I'm willing to accept either of the two prior versions, can
>I express that?  If so, how?  Thanks.

XML-link has no versioning machinery built in... this would be in the
territory of the WebDAV work, if anywhere.  I think (but am not sure)
that there is some machinery for this in the URN work.  Note that versioning
in the general case is a horribly complex problem and tends to have all
sorts of application-specific requirements, so I wouldn't bet too much
in finding a good general solution. -T.



From john at datachannel.com  Tue Aug  5 19:11:44 1997
From: john at datachannel.com (John Tigue)
Date: Mon Jun  7 16:58:13 2004
Subject: Xapi-J: an architectural detail
References: <3.0.1.32.19970805074603.006bcb18@gorge.net>
Message-ID: <33E75F5F.CC7EC38E@datachannel.com>

Chris Lloyd wrote:

> akirkpatrick wrote:
> >I really like the combination of IContent and IContainer.
> >The only question I have is how an element can query
> >its context in an efficient way? For example, how can
> >I find the previous element without referring to the parent
> >container. Presumably then the parent would have to
> >enumerate all its children to find the previous content
> >to the element in question. Obviously a particular
> >application can record the previous element in a variable
> >but then you get to more complex contexts, like "what
> >is the previous of my parent".
> >
> This is where the next step is needed. Tree Iterators can provide efficient
> and well abstracted mechanisms for walking the XML tree. Everyone is still
> stuck on the schema part of Xapi-J and that is fine. After that is done
> then it's time to add classes specifically for navigation.

> Keep the schema simple. Don't add members for the previous child, etc. It
> is unnecessary and complex to maintain.

I agree. I think we should follow the Visitor design pattern. Quoting
from Gamma's _Design_Patterns_: "Intent: Represent an operation to be
performed on the elements of an object structure. Visitor lets you
define a new operation without changing the classes of the elements on
which it operates." Here the operation is tree iteration.
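
To make the idea concrete, here is a rough sketch of the Visitor pattern applied to a tiny document tree. It is an illustration under assumed names only: INodeVisitor, Element, Text, TypeCollector and the accept() method are hypothetical and are not proposed Xapi-J interfaces. The new operation (collecting element type names) lives entirely in the visitor; the node classes are not changed to support it.

```java
import java.util.ArrayList;
import java.util.List;

class VisitorDemo {
    // A new operation added without touching the node classes.
    static class TypeCollector implements INodeVisitor {
        final List<String> types = new ArrayList<>();
        public void visitElement(Element e) { types.add(e.type); }
        public void visitText(Text t) { /* text has no type name */ }
    }

    public static void main(String[] args) {
        Element root = new Element("doc");
        Element para = new Element("para");
        root.children.add(para);
        para.children.add(new Text("hello"));
        TypeCollector collector = new TypeCollector();
        root.accept(collector);
        System.out.println(collector.types);
    }
}

interface INodeVisitor {
    void visitElement(Element e);
    void visitText(Text t);
}

interface INode {
    void accept(INodeVisitor visitor);
}

class Text implements INode {
    final String data;
    Text(String data) { this.data = data; }
    public void accept(INodeVisitor visitor) { visitor.visitText(this); }
}

class Element implements INode {
    final String type;
    final List<INode> children = new ArrayList<>();
    Element(String type) { this.type = type; }
    public void accept(INodeVisitor visitor) {
        visitor.visitElement(this);
        for (INode child : children) {
            child.accept(visitor); // traversal in document order
        }
    }
}
```

In this sketch the traversal order is fixed inside accept(); a design that needs several walking strategies would move that part out of the node classes.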

>
>
> Over the past 2 years, we have been developing an object database system
> for SGML. We have gone through the same thought processes as are going on
> with xapi-j right now. I think there are a few design considerations to
> keep in mind if you want to use iterator classes with the xapi-j schema and
> I think eventually you will.
>
> The idea of inheriting from IContainer is a good one. Polymorphism is very
> useful when it comes time to write navigation classes. A base class for all
> objects in the tree is very important!! We'll call this INode.
>

public interface INode    {
    // What do we put in here?
    }

> 
> Again I stress that every XML object in the tree should inherit from a
> single base class even if the base class does not provide any common
> interfaces to its concrete classes. In this way, any XML object can be
> passed via a base class reference (whoops, I almost said pointer). It is
> trivial to implement a fast, safe-casting mechanism that uses polymorphism
> for casting.
>

So the interfaces in Xapi-J would extend INode like this?

public interface INode {...}

public interface IContainer extends INode{...}

public interface IElement extends IContainer {...}

This way an IElement is also an INode so passing via base interface can
be done for any object in the model. We're still dealing purely with
interfaces so vendors are still free to implement their own base
classes. This also could be mapped to CORBA, DCOM, and others.

> This way, we can later add navigation classes that leverage the polymorphic
> nature of the XML tree.
> 

--
John Tigue
Sr. Software Architect
DataChannel
http://www.datachannel.com
jtigue@datachannel.com
206-462-1999

From eliot at isogen.com  Tue Aug  5 20:10:29 1997
From: eliot at isogen.com (W. Eliot Kimber)
Date: Mon Jun  7 16:58:13 2004
Subject: Linking and Query question
Message-ID: <3.0.32.19970805130657.00b1692c@swbell.net>

At 09:54 AM 8/5/97 -0700, Andrew Layman wrote:
>Can I create a link that, in effect, contains a query so that it
>references one document among a set? For example, if I know that several
>versions of a document exist, and I want to reference the latest
>version, but I'm willing to accept either of the two prior versions, can
>I express that?  If so, how?  Thanks.

If the reference to a document is via an entity reference, the query can be
part of the system ID for the document.  As system IDs in XML are always
URLs, if you have a way of expressing the query in an URL, you can do it
that way.  If not, then the short answer is "no" (unless there's some
aspect of URLs or TEI extended pointers I've overlooked, which is quite
possible).

In a general SGML system, there are three basic approaches:

1. Define your own application-specific addressing syntax and semantics and
use it, hoping tools will support it or providing your own support (because
the scope of use is totally within your control).

2. Use Formal System Identifiers and make the query part of an entity's
system ID.

3. Use query addressing and make the query part of a direct or indirect
address  (that does not use a declared entity).

The only difference between these three approaches is that two and three
are done within the framework of standardized definitional mechanisms
defined by ISO/IEC 10744:1997 while one is not.  In all three cases you
still have to implement support for the query and provide the necessary
integration with the tools you're using (browser to repository, editor to
repository, etc.).

The Formal System Identifier Definition Requirements (FSIDR) facility of
ISO/IEC 10744:1997 (Annex A.6, reviewable at
http://www.drmacro.com/hythtml/clause-A.6.html) provides a syntax for
associating repository-specific attributes with system IDs.

For example, say you have a repository with a "version" property for
storage objects.  You can refer to this property by declaring the
repository as a "storage manager" and providing an attribute (or
attributes) for specifying the version you want, something like this:








][=]?)?[0-9]+(\.[0-9]+)?"
              Prefixes for version number:
              <    Anything less than specified version
              >    Anything greater than specified version
              <=   Anything less than or equal to specified version
              >=   Anything greater than or equal to specified version
              If no prefix specified, only specified version is used.
           --
     CDATA #IMPLIED  -- Default: latest version --
>

Obviously, these declarations can be provided by the storage manager
provider and used by reference from documents--you wouldn't expect authors
to type these things themselves (or even necessarily be aware of their
presence or use).
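
As an illustration of how a storage manager might interpret the version prefixes listed in the declaration comment above, here is a small sketch. It is not taken from any real repository API: the VersionSpec class and its methods are hypothetical, and it assumes versions have the form "major" or "major.minor" and that the minor part is compared as an integer.

```java
class VersionSpec {
    // True if 'candidate' (e.g. "1.2") satisfies 'spec' (e.g. ">=1.2").
    static boolean matches(String spec, String candidate) {
        String op = "";
        if (spec.startsWith("<=") || spec.startsWith(">=")) {
            op = spec.substring(0, 2);
        } else if (spec.startsWith("<") || spec.startsWith(">")) {
            op = spec.substring(0, 1);
        }
        int cmp = compare(candidate, spec.substring(op.length()));
        switch (op) {
            case "<":  return cmp < 0;
            case ">":  return cmp > 0;
            case "<=": return cmp <= 0;
            case ">=": return cmp >= 0;
            default:   return cmp == 0; // no prefix: only the specified version
        }
    }

    // Compares "major" or "major.minor" strings numerically (an assumption).
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        if (x[0] != y[0]) {
            return Integer.compare(x[0], y[0]);
        }
        return Integer.compare(x[1], y[1]);
    }

    private static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int minor = (parts.length > 1) ? Integer.parseInt(parts[1]) : 0;
        return new int[] { Integer.parseInt(parts[0]), minor };
    }

    public static void main(String[] args) {
        System.out.println(matches(">=1.2", "1.3"));
        System.out.println(matches("<1.2", "1.2"));
    }
}
```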

You then invoke the storage manager by treating the notation name as an
element type name within the system ID:

1.2'>mydoc.xml" CDATA SGML >

As the semantics of the tags within a system ID are well defined by the
FSIDR, it is probably reasonable for XML systems to treat the tag name as a
repository notation name even when the formal declarations are not present.
 If the storage manager name is well understood (e.g., "URL"), there's no
problem.  It's probably also reasonable to assume that storage manager
names are generally unique and therefore processing can be associated with
the names directly (rather than by requiring a notation declaration with a
public ID).  This is analogous to being able to map entities by entity name
within an SGML Open catalog.

A processor would provide a way to associate the storage manager notation
MyDocManager with that storage manager's API (i.e., the integrator of the
storage manager would register a DLL or DLL entry point with the notation's
public identifier).  The processor would then pass the value of the version
attribute and the data following the MyDocManager start tag to the API.

If you're not addressing the document as an entity but using some other
query, I don't think XML Link provides a way to do this (because it doesn't
generalize the notion of addressing by query).

The HyTime architecture does generalize addressing by query such that you
can declare a query notation with whatever semantics you want and then use
that query.  The only requirement is that the result of the query be a list
of nodes in groves.  In DOM terms this would mean you get back objects
conforming to the DOM model, rather than the unparsed data of the document
addressed. (All addressing is in terms of the results of parsing, not the
unparsed source.)

For example, to create and use such a query, you could do something like this:




<-- Now declare an element type that uses this query notation for 
    addressing: -->




...

See My document...

A HyTime aware processor interprets the above as follows:

1. Sees that Doclink is a hyperlink. Looks for the required (by HyTime)
"anchrole" attribute, from which it will determine the names of the
attributes used to address the anchors (they are the same as the anchor
role names).

2. Sees that "refmark" is a self anchor, so no addressing attribute is
needed for it. Sees that second role is "document". Looks for attribute
named "document".

3. Finds attribute named "document". Looks for attribute named "loctype"
(location type) to see if a location type has been associated with this
attribute (without location type, the HyTime engine has no way of knowing
what form of addressing is being used [unless the attribute is declared as
IDREF(s) or ENTITY/ENTITIES]).

4. Finds a loctype attribute and sees that the document attribute is a
query location that uses the notation named "MyDocQuery".

5. Looks to see if a notation named MyDocQuery has been declared. It has.

6. Passes the value of the document attribute to the MyDocQuery API (again,
registered using whatever integration API the browser provides). The
processor (my document manager in this case) interprets the query and
provides a response.

7. Waits until it gets a response, which had better be a list of objects in
an object model it understands (e.g., grove nodes, DOM objects, etc.).

8. Assuming it gets a response, enables traversal to the returned objects.

XML Link removes the need for the above general processing by providing a
fixed set of query notations that XML Link recognizes (URLs and TEI
extended pointers). However, this limits your ability to do things these
two query notations don't provide for.
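
The dispatch in steps 3 through 7 above could be sketched roughly as follows. This is a simplified illustration, not a HyTime engine API: QueryDispatchDemo, register() and resolve() are hypothetical names, the attribute and loctype tables are plain maps, and the returned "nodes" are just strings standing in for grove nodes or DOM objects.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class QueryDispatchDemo {
    // Notation name -> handler registered by the integrator (steps 5 and 6).
    private final Map<String, Function<String, List<String>>> handlers = new HashMap<>();

    void register(String notationName, Function<String, List<String>> handler) {
        handlers.put(notationName, handler);
    }

    // attributes: the link element's attributes.
    // loctype: which query notation each addressing attribute uses (steps 3 and 4).
    List<String> resolve(Map<String, String> attributes,
                         Map<String, String> loctype, String role) {
        String query = attributes.get(role);
        if (query == null) {
            throw new IllegalArgumentException("no addressing attribute named " + role);
        }
        String notation = loctype.get(role);
        Function<String, List<String>> handler = handlers.get(notation);
        if (handler == null) {
            throw new IllegalStateException("no handler registered for notation " + notation);
        }
        return handler.apply(query); // steps 6 and 7: pass the query, get nodes back
    }

    public static void main(String[] args) {
        QueryDispatchDemo processor = new QueryDispatchDemo();
        processor.register("MyDocQuery", query -> Arrays.asList("result of " + query));
        Map<String, String> attributes = new HashMap<>();
        attributes.put("document", "latest");
        Map<String, String> loctype = new HashMap<>();
        loctype.put("document", "MyDocQuery");
        System.out.println(processor.resolve(attributes, loctype, "document"));
    }
}
```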
Note also that the XML Link specification can be defined in terms of the
HyTime generalizations such that any general-purpose HyTime engine can
process XML Link documents (and you would expect HyTime engines to have
built-in support for XML Link so that there would be no additional
integration required to process XML Link documents).

The HyTime mechanism has no "magic"--it just provides a framework within
which the integration you'd have to do in any case can be done. It simply
provides a way to name things (queries, storage managers) with
universally-unique names (public IDs) and associate these universal names
with local names (notation names). This framework standardizes the formal
declaration of what you're doing and (hopefully) makes the integration
mechanism consistent across tools, which should make integration easier.
It doesn't remove the need for tools to be plugged together by humans
(either directly or through the definition of API standards like the DOM or
CORBA or ODBC).

Cheers,

Eliot


From andrewl at microsoft.com  Tue Aug  5 21:12:28 1997
From: andrewl at microsoft.com (Andrew Layman)
Date: Mon Jun  7 16:58:13 2004
Subject: Linking and Query question
Message-ID: <7BB61B44F197D011892800805FD4F7920133B7D4@RED-03-MSG.dns.microsoft.com>

Thanks. I like the power and exactness of the example you showed using
FSID. Now I need to find a way to integrate that with a URI scheme.

--Andrew Layman
   AndrewL@microsoft.com


From eliot at isogen.com  Tue Aug  5 21:35:39 1997
From: eliot at isogen.com (W. Eliot Kimber)
Date: Mon Jun  7 16:58:13 2004
Subject: Linking and Query question
Message-ID: <3.0.32.19970805142954.00b09c38@swbell.net>

At 12:11 PM 8/5/97 -0700, Andrew Layman wrote:
>Thanks. I like the power and exactness of the example you showed using
>FSID. Now I need to find a way to integrate that with a URI scheme.

Cool. Let me know if I can be of assistance in any way.

Cheers,

Eliot


From ricko at allette.com.au  Tue Aug  5 22:39:34 1997
From: ricko at allette.com.au (Rick Jelliffe)
Date: Mon Jun  7 16:58:13 2004
Subject: XML and whitespace: lets just dump CR and LF!
Message-ID: <199708052044.GAA01690@jawa.chilli.net.au>

> From: Tim Bray

> Heh-heh. If you go look in the proceedings of the 1988 Usenix conference,
> you'll find a paper I wrote, on the Oxford English Dictionary project,
> which has a section entitled
>
>   '\n' Considered Harmful
>
> I'd love to lose the record-end silliness. Trouble is, we're stuck with
> it until we have better editing tools. -T.

I'm not saying to ban the characters, merely to say give them no
significance for an application. So we can still use our existing editing
tools. For example, using vi or sed to add the unambiguous newline to an
existing file, which will be stuck in an HTML-like <PRE>, it is merely a
rule like
   1,$s/$/\&#x2028;/
which is trivial.

We can do this only because we are using ISO 10646 as the document character
set: since we have the chance to clear up the mess with a simple convention,
why not take it!
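
For readers who prefer code to editor commands, the convention might be sketched like this. It is only an illustration under stated assumptions: LineSeparatorDemo and its two methods are hypothetical, the input is assumed to use LF record ends, and the separator is written as the character reference &#x2028; in the same spirit as the vi rule above.

```java
class LineSeparatorDemo {
    static final String LINE_SEPARATOR_REF = "&#x2028;"; // reference to U+2028

    // Authoring side: mark the end of every record with an explicit
    // LINE SEPARATOR reference, leaving the LF in place for editors.
    static String markLineEnds(String text) {
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n")) {
            out.append(line).append(LINE_SEPARATOR_REF).append('\n');
        }
        return out.toString();
    }

    // Application side: CR and LF carry no significance, so drop them
    // (they are never converted to spaces).
    static String stripRecordEnds(String text) {
        return text.replace("\r", "").replace("\n", "");
    }

    public static void main(String[] args) {
        String marked = markLineEnds("first line\nsecond line\n");
        System.out.print(marked);
        System.out.println(stripRecordEnds(marked));
    }
}
```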


Rick Jelliffe



From clloyd at gorge.net  Wed Aug  6 00:07:44 1997
From: clloyd at gorge.net (Chris Lloyd)
Date: Mon Jun  7 16:58:13 2004
Subject: Xapi-J: an architectural detail
Message-ID: <3.0.1.32.19970805150432.006ba560@gorge.net>

>John Tigue wrote:
>>
>>I agree. I think we should follow the Visitor design pattern. Quoting
>>from Gamma's _Design_Patterns_: "Intent: Represent an operation to be
>>performed on the elements of an object structure. Visitor lets you
>>define a new operation without changing the classes of the elements on
>>which it operates." Here the operation is tree iteration.
>

Yes, our whole system is based on "Design Patterns". We use the visitor
pattern for formatting output and for operations where we walk a subtree
from stem to stern. They are useful and easy to implement. Visitors are
good for moving an operation outside a class. They are not so good for
defining and extending complex walking tasks.

We find it necessary to have iterators for complex walking tasks. You can
let an iterator drive a visitor as well. We are doing very complex queries
right now using iterators, algorithms, functions, and operators. The last
three patterns are borrowed from STL. The problem with the last three
patterns for Java is that they are template driven. A very complex tree
walking query/algorithm can be formulated in a single line of C++. I'm
sure they could be adapted to Java.

We are dealing with tree versioning as well, so we have trees within trees.
You just can't expect to walk this stuff without some well abstracted
navigation patterns.

>I wrote:
>> The idea of inheriting from IContainer is a good one. Polymorphism is very
>> useful when it comes time to write navigation classes. A base class for all
>> objects in the tree is very important!! We'll call this INode.
>>

>
>public interface INode    {
>    // What do we put in here?
>    }
>

It is not necessary for anything to be in the base class. It is just
necessary for there to be one.

For example: An Iterator has a very simple interface

public interface ITreeIterator
{
	// Java interfaces cannot declare constructors; an implementing class
	// would take (INode iNode, ICursor iCursor, INodeIterFactory iFactory).
	boolean next(); // walk to the next node
	INode current(); // return the current node
	long GetIterLevel(); // how many tags deep we are from the starting position
}

We construct the iterator with a current node which can be the document
root or any object in the tree. We use a cursor which externalizes the
walking algorithm (forward, backward, follow links) and we use a Factory
which provides the algorithms for walking each object type in the tree.

This iterator will return different objects in the tree or maybe even walk
the tree differently depending on the factory and cursor that it is
constructed with. 

This iterator will not work without a common base class because the
iterator knows NOTHING about the types of objects in the tree. It only
knows what an INode is.
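
A much simplified sketch of such an iterator follows. It is an assumption-laden illustration rather than the POET design: the cursor is left out (the walk is always forward, depth first), INodeIterFactory is reduced to a single hypothetical childrenOf() method, and DepthFirstIterator and Branch are made-up names. The iterator itself only ever sees INode; all knowledge of concrete types sits in the factory.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

class DepthFirstIterator {
    private final INodeIterFactory factory;
    private final Deque<INode> pending = new ArrayDeque<>();
    private final Deque<Long> pendingLevels = new ArrayDeque<>();
    private INode current;
    private long level;

    DepthFirstIterator(INode start, INodeIterFactory factory) {
        this.factory = factory;
        pending.push(start);
        pendingLevels.push(0L);
    }

    // Walk to the next node in document order; false when nothing is left.
    boolean next() {
        if (pending.isEmpty()) {
            return false;
        }
        current = pending.pop();
        level = pendingLevels.pop();
        List<INode> children = factory.childrenOf(current);
        // Push in reverse so that the first child is visited first.
        for (int i = children.size() - 1; i >= 0; i--) {
            pending.push(children.get(i));
            pendingLevels.push(level + 1);
        }
        return true;
    }

    INode current() { return current; }

    long getIterLevel() { return level; }

    public static void main(String[] args) {
        Branch root = new Branch("root");
        root.kids.add(new Branch("child"));
        // The factory, not the iterator, knows that a Branch has kids.
        INodeIterFactory factory = node ->
            (node instanceof Branch) ? ((Branch) node).kids : Collections.<INode>emptyList();
        DepthFirstIterator it = new DepthFirstIterator(root, factory);
        while (it.next()) {
            System.out.println(it.getIterLevel() + " " + ((Branch) it.current()).name);
        }
    }
}

interface INode { }

// Assumed shape: the factory supplies the per-type walking knowledge.
interface INodeIterFactory {
    List<INode> childrenOf(INode node);
}

class Branch implements INode {
    final String name;
    final List<INode> kids = new ArrayList<>();
    Branch(String name) { this.name = name; }
}
```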

>So the interfaces in Xapi-J would extend INode like this?
>
>public interface INode {...}
>
>public interface IContainer extends INode{...}
>
>public interface IElement extends IContainer {...}
>
>This way an IElement is also an INode so passing via base interface can
>be done for any object in the model. We're still dealing purely with
>interfaces so vendors are still free to implement their own base
>classes. This also could be mapped to CORBA, DCOM, and others.

YES!

Chris Lloyd
POET Software



From h.rzepa at ic.ac.uk  Wed Aug  6 09:46:07 1997
From: h.rzepa at ic.ac.uk (Rzepa, Henry)
Date: Mon Jun  7 16:58:13 2004
Subject: XML-DEV Digest
Message-ID: 

 A number of people have asked for a digest of this list. I forwarded this
request to our postmaster. As soon as it is actioned, I will let this list know
the details.

Dr Henry Rzepa,  Dept. Chemistry,  Imperial College,  LONDON SW7 2AY;
mailto:rzepa@ic.ac.uk; Tel  (44) 171 594 5774; Fax: (44) 171 594 5804.
URL: http://www.ch.ic.ac.uk/rzepa/ 





From Peter at ursus.demon.co.uk  Wed Aug  6 17:19:29 1997
From: Peter at ursus.demon.co.uk (Peter Murray-Rust)
Date: Mon Jun  7 16:58:13 2004
Subject: Xapi-J: an architectural detail
Message-ID: <9160@ursus.demon.co.uk>

In message <33E75F5F.CC7EC38E@datachannel.com> john@datachannel.com (John Tigue) writes:

Firstly many thanks to John for driving this forward and for the positive replies
from several others. Let's make sure that we get closure on this fairly shortly
as we don't want to fall back to where we were 4-5 months ago with a lot of
enthusiasm and no final outcome.

I know we are all doing this on a voluntary basis, but if we get it right this 
time we save a lot of problems later.  I have put my JUMBO development on hold
because I really want to get it on top of a decent architecture.  We need to
know precisely what an Element, Node, etc. are :-)

I get the impression from John and others that it is possible to create an
API which does not necessarily support the property set today, but is capable
of doing it in the future without rewriting.  If so, then perhaps John and
others could suggest where they intend to freeze the current API. If we don't
set some limits now, there is the danger that we try to be too ambitious.

As soon as an API is established, a benefit will be that we can start to think
about what other features of XML processing need to be covered in a generic
manner.  I found that quite a lot of the JUMBO implementation was generic
(e.g. checking semantic validity, 'inheritance/default' implementation, etc.),
which is not trivial and should be isolated as far as possible from applications.



> 
> Chris Lloyd wrote:
[...]
> > This is where the next step is needed. Tree Iterators can provide efficient
> > and well abstracted mechanisms for walking the XML tree. Everyone is still
> > stuck on the schema part of Xapi-J and that is fine. After that is done
> > then it's time to add classes specifically for navigation.
> 
> > Keep the schema simple. Don't add members for the previous child, etc. It
> > is unnecessary and complex to maintain.

I agree with this - it should be possible to add these in at a later stage
(e.g. by subclassing in Java).  I have a lot of stuff in JUMBO that implements
treewalking (e.g. TEI Xptrs) and even tree-editing, and my Tree/Node classes 
can have up to 100 methods each. We want to avoid this at this stage :-)

> 
> I agree. I think we should follow the Visitor design pattern. Quoting
> from Gamma's _Design_Patterns_: "Intent: Represent an operation to be
> performed on the elements of an object structure. Visitor lets you
> define a new operation without changing the classes of the elements on
> which it operates." Here the operation is tree iteration.
> 
> >
> >
> > Over the past 2 years, we have been developing an object database system
> > for SGML. We have gone through the same thought processes as are going on
> > with xapi-j right now. I think there are a few design considerations to
> > keep in mind if you want to use iterator classes with the xapi-j schema and
> > I think eventually you will.
> >
> > The idea of inheriting from IContainer is a good one. Polymorphism is very
> > useful when it comes time to write navigation classes. A base class for all
> > objects in the tree is very important!! We'll call this INode.
> >

I tend to support this. It makes general management such as editing and display
easier, even if the Node objects are not of the same class.


	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/



From Peter at ursus.demon.co.uk  Wed Aug  6 17:19:57 1997
From: Peter at ursus.demon.co.uk (Peter Murray-Rust)
Date: Mon Jun  7 16:58:14 2004
Subject: Specification Questions
Message-ID: <9162@ursus.demon.co.uk>

In message <199708050949.KAA07792@andromeda.ndirect.co.uk> "Neil Bradley" writes:
> 
> 
> Reply-to:      Peter@ursus.demon.co.uk (Peter Murray-Rust)
> 
> > Some additional - hopefully constructive - thoughts on whitespace.
> > 
> > The XML-lang spec does not ( and I suspect will not) give detailed guidance
> > on how whitespace will be managed.  My impression is that it is up to 
> > implementers and/or groups like this to come up with particular solutions.
> > My worry is that these will be inconsistent and not inter-operable.
> 
> I agree totally. This was my original concern.
> 
> > ***
> > Therefore I propose that those on XML-DEV who care about this problem come
> > up with some guidelines for implementers. 
> > ***
> 
> I very much hope this happens.
> 
[...]
> 
> I think all applications should be expected to treat either or both 
> characters in sequence as a line end signal, so that platform 
> dependencies can be eliminated. If there is no good reason to omit 
> this task from the XML-processor itself, I think it should be done 
> there.
> 
> 
[...]
> 
> I believe so. In addition, can we not put 'XML-SPACE 
> (PRESERVE|IMPLIED) "PRESERVE" in an attribute declaration for an 
            ^^^^^^^
I think you meant DEFAULT - #IMPLIED is when no value is given.

> element which will always have reserved content. It is common 
> practice for a DTD to have some kind of pre-formatted element, such 
> as HTML's '<PRE>'.
> 
> 
> > If so, we can propose that the DEFAULT mode for any whitespace processing is
> > something along the lines (similar to HTML?).  Within an element with
> > XML-SPACE="DEFAULT"
> > 
> 
> > All whitespace sequences are mapped into a single space character.
> Agreed.
> 
> > All whitespace pseudo-elements are ignored (i.e. whitespace between markup)
> 
> Ummm. what about 'the bold  italic styles...'?
> 
> > All leading and trailing whitespace in #PCDATA is ignored.
> 
> I think all applications should remove leading and trailing CR and LF
> characters in a mixed content element. But not SP or HT, as this would
> be undesirable in the following fragment:
> 
> A  bold  word.
> 
> Although an unusual layout, some people may use it, and it would be
> unfortunate if it resulted in 'Aboldword'.
> 
OK - I had overlooked this.

Taking account of other posts on this subject here and elsewhere, there seems to
be a positive view that a set of Guidelines/Best Practice/Generally Agreed 
Conventions should be developed, and that XML-DEV is probably the right place.

It's also clear that the more of this that can be done before the XMLProcessor
output gets to the *specific* application - e.g. a browser or transformer - the
better.  We seem to be looking at a filter or layer immediately after/on_top_of
the XMLProcessor.  At the ESIS stream level we could have:

Document ->[Parser] -> ESIS -> [XMLWhitespace] -> NewESIS -> [Application]

and at the API level something that either sits on top of the EventStream or
the  final TreeFactory (or whatever it's called).
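
As a rough sketch of what such an [XMLWhitespace] layer might do, the filter below passes markup events through untouched and collapses whitespace runs in text events unless PRESERVE is in force. The event representation is a deliberate simplification and an assumption (plain strings, with anything starting with '<' treated as markup), and WhitespaceFilter with its methods is a hypothetical name rather than part of any existing parser API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WhitespaceFilter {
    // DEFAULT mode: every run of space, tab, CR or LF becomes one space.
    // PRESERVE mode: everything goes through unchanged.
    static String normalize(String text, boolean preserve) {
        if (preserve) {
            return text;
        }
        return text.replaceAll("[ \\t\\r\\n]+", " ");
    }

    // Simplified event stream: strings starting with '<' stand for markup
    // events, everything else is character data.
    static List<String> filter(List<String> events, boolean preserve) {
        List<String> out = new ArrayList<>();
        for (String event : events) {
            boolean markup = event.startsWith("<");
            out.add(markup ? event : normalize(event, preserve));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("<p>", "This is   a long\n space", "</p>");
        System.out.println(filter(events, false));
        System.out.println(filter(events, true));
    }
}
```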

(There is a difficulty in filtering any document, in that XPtrs in XML-LINK
would appear to have to operate on the unfiltered document (although this is
not specifically stated, it's implied).  So it might have to be that the 
stream or tree contained 'significant' and 'non-significant' whitespace, and 
that the application would have to be able to recognise the flag.  All Xptr
activity has to take place on *all* whitespace (although I don't think this
is pretty).)

The current switch PRESERVE is clear (everything goes through).  It would go
against the spec if it didn't do this. That means (I suppose) that CR+LF is 
different from LF - that's the price paid for PRESERVE. The other option DEFAULT
cannot map onto a set of actions that we all agree for all documents. Therefore
we have to give DEFAULT some hints at the *document* level - presumably through
PIs.

Can we propose, therefore, a set of PIs that would control whitespace 
processing? I would hope that we could keep this to a very small number 
(ca. 3-4).  Is it too simple to suggest that there are two types of markup
(STRUCTURE and TEXT) that need to normalise whitespace?  The former would
deal with things like:

  
  

where the author did not intend there to be any whitespace, and the second
would deal with

This is a long space in a paragraph.

where all whitespace would be normalised to a single space as in HTML? Where a document contained both, the author could use a PI to switch between them. If we could come up with a very simple set of options, it might make it sufficiently simple that a standard filter could be devised, or the application programmer had a much simpler strategy. Is consensus possible? P. -- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Peter at ursus.demon.co.uk Wed Aug 6 17:20:08 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:14 2004 Subject: XML and whitespace: lets just dump CR and LF! Message-ID: <9163@ursus.demon.co.uk> In message <199708051317.XAA23619@jawa.chilli.net.au> "Rick Jelliffe" writes: [...] > > I suggest that the following approach should be taken. (I think it is the only > realistic solution, especially if we assume that 1) > data is usually generated by applications, Although this will be partly true, I think we still have to expect people to use text editors for a year or two yet :-). [It's how I create most of my XML at present :-)]. > 2) humans only check and tweak data; Yes. XML must certainly be tweakable. So it mustn't have to have lines 1000 chars long :-) > 3) we want operating system > and character set independence, critical :-) 4) line-breaking is generally done by clients > ...so CR/LF is basically a convenience for fitting data into editors, > not for the purposes of output.) Yes. > > **A) XML applications should ignore *ALL* CR and LF as a bad joke. They should > be entirely there for formatting the raw text into nice, eye-sized records. > So CR and LF should never be converted to spaces. 
(This approach was the
> one taken by Interleaf, and I have come to appreciate it.) If you need a
> space, then start the new line with it! (Ending the previous line is difficult
> to see.)

Appeals to me :-)

>
> **B) XML applications should mandate the use of the unambiguous Unicode characters
> -- LINE SEPARATOR &#x2028;
> -- PARAGRAPH SEPARATOR &#x2029;
 > This makes sense unless someone finds a flaw in it... P. -- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From sarah at attd.com Wed Aug 6 21:55:13 1997 From: sarah at attd.com (Sarah Slocombe) Date: Mon Jun 7 16:58:14 2004 Subject: Xapi-J: an architectural detail Message-ID: <3.0.32.19970806155554.006b6f00@mail.lglobal.com> Greetings! I've been following this thread with great interest. I'm trying to piece together the suggestions so far but I wonder if I've muddled it already. Perhaps I should just wait a bit longer but things are really starting to get exciting now! As I understand it, we've got: public interface INode{ public INode getParent(); public void setParent(INode aContainer); } (Or is INode ONLY so things have a common base class/ interface, and shouldn't have any methods? Or does an IContainer never need to deal with parents? Or ought even parent stuff to be handled by iterators?) public interface IContainer extends INode{ public Enumeration getContents(); public void insertContent(IContent aContent, IContent preceedingContent); public void appendContent(IContent aContent); public void removeContent(IContent aContent); } public interface IContent extends INode{ public String getData(); } public interface IElement extends IContent, IContainer{ public String getType(); public void setType(String aType); public void addAttribute(String name, String value); public void removeAttribute(String name); public IAttribute getAttribute(String attributeName); public java.util.Enumeration getAttributes(); } So far so good? Now what about IAttribute? 
John Tigue's shown: public interface IAttribute{ public String getName(); public void setName(String aName); public String getValue(); public void setValue(); } Ought this to inherit from IContent? Chris Lloyd spoke of IContainer vs. IProperty -- are IContent and IProperty the same thing? Thanks for any help. Sarah Slocombe sarah@attd.com xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From john at datachannel.com Wed Aug 6 22:49:28 1997 From: john at datachannel.com (John Tigue) Date: Mon Jun 7 16:58:14 2004 Subject: Xapi-J: an architectural detail References: <3.0.32.19970806155554.006b6f00@mail.lglobal.com> Message-ID: <33E8E3E4.AAA99E5@datachannel.com> Sarah Slocombe wrote: > > So far so good? Now what about IAttribute? John Tigue's > shown: > > public interface IAttribute{ > public String getName(); > public void setName(String aName); > public String getValue(); > public void setValue(); > } > > Ought this to inherit from IContent? Chris Lloyd spoke of > IContainer vs. IProperty -- are IContent and IProperty the > same thing? > IContent is for things in things so I think IAttribute would extend INode and maybe an IProperty but not IContent as it was initially designed. We haven't nailed down which interfaces are in Xapi-J. I think Chris was saying that IProperty is for leaves in the parse tree. I've been maintaining a site which discusses the Xapi-J interfaces at http://www.datachannel.com/ChannelWorld/xml/dev. You can find the other Xapi-J interfaces there. As for stopping at some point and labeling what we have as Xapi-J 1.0, I think we are very close to a point where we can do that. The real work is having XML processor providers implementing Xapi-J. I have a functional XML processor which complies to Xapi-J which I've been using to test the concepts. 
It doesn't reflect the latest stuff like INode. I'll rev the site and the example processor this weekend. -- John Tigue Sr. Software Architect DataChannel http://www.datachannel.com jtigue@datachannel.com 206-462-1999 -------------- next part -------------- A non-text attachment was scrubbed... Name: vcard.vcf Type: text/x-vcard Size: 263 bytes Desc: Card for John Tigue Url : http://mailman.ic.ac.uk/pipermail/xml-dev/attachments/19970806/0f3d455d/vcard.vcf From andrewl at microsoft.com Thu Aug 7 00:16:20 1997 From: andrewl at microsoft.com (Andrew Layman) Date: Mon Jun 7 16:58:14 2004 Subject: XML-Link: Relative URL expansion Message-ID: <7BB61B44F197D011892800805FD4F7920133B7FF@RED-03-MSG.dns.microsoft.com> If an XML-Link element has a relative URL in its href attribute, what is used as the base for resolving the URL? I presume the URL of the containing document. Is this correct? --Andrew Layman AndrewL@microsoft.com xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From tbray at textuality.com Thu Aug 7 00:26:59 1997 From: tbray at textuality.com (Tim Bray) Date: Mon Jun 7 16:58:14 2004 Subject: XML-Link: Relative URL expansion Message-ID: <3.0.32.19970806152015.00830410@pop.intergate.bc.ca> At 03:15 PM 06/08/97 -0700, Andrew Layman wrote: >If an XML-Link element has a relative URL in its href attribute, what is >used as the base for resolving the URL? I presume the URL of the >containing document. Is this correct? More properly "containing resource", but yes. Check the XML spec, section 4.3.2, for some more details. In this connection I'd also recommend a look at RFC 1808. -T. 
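As a rough illustration of the resolution behaviour discussed above (a relative href is resolved against the URI of the containing resource), here is a minimal Java sketch. It is not taken from the XML-Link draft or from any XML processor: the class name RelativeHref, the method resolveHref, and the example.org addresses are all hypothetical. It relies on java.net.URI, which implements the RFC 2396 resolution rules (the successor to RFC 1808), so corner cases may differ slightly from RFC 1808 itself.

```java
import java.net.URI;

// Hypothetical helper, not part of XML-Link or Xapi-J.
final class RelativeHref {

    // Resolves an href against the URI of the containing resource.
    // An absolute href is returned unchanged; a relative one is merged
    // with the base path and "." / ".." segments are normalised.
    public static URI resolveHref(String containingResource, String href) {
        return URI.create(containingResource).resolve(href);
    }

    public static void main(String[] args) {
        // Assumed address of the document that contains the link.
        String base = "http://www.example.org/docs/spec/index.xml";

        System.out.println(resolveHref(base, "chapter2.xml"));
        // prints http://www.example.org/docs/spec/chapter2.xml
        System.out.println(resolveHref(base, "../images/fig1.gif"));
        // prints http://www.example.org/docs/images/fig1.gif
        System.out.println(resolveHref(base, "/top.xml"));
        // prints http://www.example.org/top.xml
    }
}
```

The only input the sketch needs is the address of the containing resource, which is why the base has to be known to whatever component follows the link.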
xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From andrewl at microsoft.com Thu Aug 7 00:54:02 1997 From: andrewl at microsoft.com (Andrew Layman) Date: Mon Jun 7 16:58:14 2004 Subject: FW: First Draft of RDF, differences from my notes. Message-ID: <7BB61B44F197D011892800805FD4F7920133B80B@RED-03-MSG.dns.microsoft.com> After reading the RDF paper, I posted the following message to the RDF working group. Since the RDF paper is now posted to the XML dev mailing list, these comments are relevant in the new context. --Andrew Layman AndrewL@microsoft.com > -----Original Message----- > From: Andrew Layman > Sent: Friday, August 01, 1997 4:24 PM > To: w3c-labels-wg@w3.org; w3c-dsig-collect@w3.org > Subject: First Draft of RDF, differences from my notes. > > Thank you for the early draft of the paper. In reading it over, I've > found a number of points that differ from my recollection of our > Boston meeting. Perhaps my notes and memory are wrong on some of these > points (in which case I welcome correction) but it also appears that > some new features have crept into the document: > > 2. We only agreed on ablocks describing single resources. I > remember discussing having an RDF assertion block describe > characteristics of more than one resource, but concluding that this is > a difficult problem with great risk of user confusion. (I'm not > opposed to solving this problem; just want to note that we did not > solve it but left it for the future.) > > 2.4 I don't remember us ever finding a satisfactory way for the > ablock to actually contain its target resource (because the > subelements of an ablock are interpreted as properties of the ablock's > target). > > 2. We discussed the need for a small set of base data types, which > I believe were strings, numbers and dates/times. 
We also talked at
> length about the need to distinguish between a base semantic type such
> as date and a particular format such as ISO 8601. The sentence
> beginning "The domain of property values..." does not reflect dates or
> the semantic/format distinction.
>
> 3. I don't remember agreement on refTypeAttr. Did we agree, and I just don't
> have it in my notes?
>
> 3. We most definitely did not agree that the first namespace
> element sets a default namespace! We did agree, tentatively, that we
> might make the "as" attribute optional, where its omission could
> signal that it was to be the default namespace for its containing
> element (with the caveat that this needs more thought). We also
> discussed that a namespace attribute on the containing element might
> be a better way to achieve the same effect.
>
> 3. I remember discussing listItem, but don't remember ever nailing
> it down precisely or agreeing on it.
>
> Example 5.1.1. This simply needs to be clarified. I think what
> is meant is that an ablock with no href has as its implied target the
> entirety of the enclosing document.
>
> 5.2.3 The note at the bottom makes the assertion that a downlevel
> application can blindly concatenate together elements it does not
> understand. My recollection is that we discussed this, concluded that
> such a policy is dangerous and presumes to dictate processing. We did
> agree to investigate adding some standard attribute that might signal
> when such a policy is reasonable. We identified three values for such
> an attribute: (a) ignore the unknown element, (b) ignore the unknown
> tag, (c) application cannot process this element or any peer.
>
> I don't mean these comments to be interpreted as disagreements with
> any aspect of the RDF design, but rather as a report on differences
> between my notes and the current paper.
>
> --Andrew Layman
> AndrewL@microsoft.com
>
> -----Original Message-----
> From: Ralph R.
Swick [SMTP:swick@w3.org] > Sent: Friday, August 01, 1997 9:49 AM > To: w3c-labels-wg@w3.org; w3c-dsig-collect@w3.org > Subject: First draft of RDF specification for review > > The first draft of the Resource Description Framework Model and Syntax > specification (Lassila & Swick, eds.) is now ready for your review and > comment. > > http://www.w3.org/Member/9708/WD-rdf-syntax-970801.html > > I would like to ask this working group's permission to distribute > this draft to w3c-xml-sig. xml-sig is the forum where technical > discussions of XML are ocurring and they particularly need to see > our requirements for the namespace tag. The only reason I ask your > consent is that while xml-sig is a W3C Members forum, it has quite > a few non-Member invited experts. I will distribute this draft to > that list at 1600UTC on Monday, August 5 unless I hear serious > objections before then. > > Thanks to all who have contributed thus far, and to each of you who > will take the time to review and make suggestions for improvement. > > -Ralph and Ora xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From andrewl at microsoft.com Thu Aug 7 00:54:41 1997 From: andrewl at microsoft.com (Andrew Layman) Date: Mon Jun 7 16:58:14 2004 Subject: RDF Specification: Ambiguity of the ABLOCK Message-ID: <7BB61B44F197D011892800805FD4F7920133B80C@RED-03-MSG.dns.microsoft.com> --Andrew Layman AndrewL@microsoft.com > -----Original Message----- > From: Andrew Layman > Sent: Friday, August 01, 1997 4:55 PM > To: w3c-labels-wg@w3.org; w3c-dsig-collect@w3.org > Subject: RE: First draft of RDF specification for review > > The example shown in 5.1.4 shows an interesting aspect of RDF: > > > > 1 > 45 > 70 > > > > Color is a property with three sub-elements. However, it is not > written that way. 
Instead it is shown containing an ablock, which then > has three sub-elements. > > What is the target of this ablock? Section 5.1.1 implies that an > ablock without an href has as its target the containing document. > Here, the rule seems to be that the target is the immediate parent. > > Why do we need this ablock? Why do we not just have a color that > itself has three sub-elements, as in > > > 1 > 45 > 70 > > > I think the reason we don't is that the RDF rule about properties is > that they must be binary. That is, the target of the color property > must be a single object. In actuality here, we have what amounts to a > quaternary relation, so we have interposed this "ablock" element in > order to reify the quaternary relation. > > I don't think this is the same kind of ablock at all as used in 5.1.1. > In fact, I don't think that "ablock" is the right element. The literal > interpretation of 5.1.4 is that the target of the color relation is a > typeless thing with three properties. Should not the target be a > color? As in > > > > 1 > 45 > 70 > > > > We could also reach this conclusion by thinking about the colorHSV as > a datatype describing how to interpret its subelements to produce a > color. (This point has implications for general thinking about data > types.) > > --Andrew Layman > AndrewL@microsoft.com > > -----Original Message----- > From: Ralph R. Swick [SMTP:swick@w3.org] > Sent: Friday, August 01, 1997 9:49 AM > To: w3c-labels-wg@w3.org; w3c-dsig-collect@w3.org > Subject: First draft of RDF specification for review > > The first draft of the Resource Description Framework Model and Syntax > specification (Lassila & Swick, eds.) is now ready for your review and > comment. > > http://www.w3.org/Member/9708/WD-rdf-syntax-970801.html > > I would like to ask this working group's permission to distribute > this draft to w3c-xml-sig. 
xml-sig is the forum where technical > discussions of XML are ocurring and they particularly need to see > our requirements for the namespace tag. The only reason I ask your > consent is that while xml-sig is a W3C Members forum, it has quite > a few non-Member invited experts. I will distribute this draft to > that list at 1600UTC on Monday, August 5 unless I hear serious > objections before then. > > Thanks to all who have contributed thus far, and to each of you who > will take the time to review and make suggestions for improvement. > > -Ralph and Ora xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From andrewl at microsoft.com Thu Aug 7 00:57:41 1997 From: andrewl at microsoft.com (Andrew Layman) Date: Mon Jun 7 16:58:14 2004 Subject: FW: content: sequence? Message-ID: <7BB61B44F197D011892800805FD4F7920133B80D@RED-03-MSG.dns.microsoft.com> The following is a message to the RDF working group regarding sequence in RDF. This led to some subsequent discussion in which I argued that if sequence is a generally useful concept, 3a is the best answer. We also discussed the relative merits of indicating sequence on the containing element vs. the contained. --Andrew Layman AndrewL@microsoft.com > -----Original Message----- > From: Andrew Layman > Sent: Monday, August 04, 1997 2:55 PM > To: w3c-labels-wg@w3.org; w3c-dsig-collect@w3.org > Subject: RE: content: sequence? > > We did not reach agreement on how best to handle sequence in Boston, > though we did agree that there are times in RDF when sequence is > significant and other times when it is not. We discussed the > possibility of having an attribute on an element signalling to an > application when it could ignore sequence. This was generally agreed > to as a direction, but we did not agree on what the appropriate > default should be. 
> > There were three approaches discussed: > > 1. a. Sequences are always important on some (tbd) elements > (e.g. "list") and never on others. > b. Sequences are not important on some (tbd) elements (e.g. > "ablock"), but are significant on all others. > > 2. Sequence-significance could be indicated by an attribute, > required on elements defined by RDF, and presumably unavailable on > other elements. > > 3. Sequence-significance could be indicated by an attribute that > can be used on any element. If omitted, and if no default was given in > a schema, then > a. The application should follow the XML precedent > of treating sequence as significant (after all, it might be). > b. The application should treat sequence as > insignificant (after all, that takes less processing). > > Separately, we briefly discussed whether sequence-significance should > be lexically inherited, but this dissolved into the general difficulty > of lexical inheritance. > > By my calculation, the only options fully compatible with XML without > implying any sort of contextual processing or lexical inheritance are > 1a, 2 and 3a. > > --Andrew Layman > AndrewL@microsoft.com > > -----Original Message----- > From: Tim Bray [SMTP:tbray@textuality.com] > Sent: Saturday, August 02, 1997 12:15 PM > To: w3c-labels-wg@w3.org; w3c-dsig-collect@w3.org > Subject: content: sequence? > > The draft does not, unless, I missed it, allow for sequence in the RDF > model. This is going to be widely required in all sorts of classes of > metadata (examples on request). I don't think RDF 1.0 is worthwhile > without > sequence. > > Suggestion: RDF already has a list primitive. If I say > > > > PanoramaNavigatorNotepad > > > > > > then I think we have a sequenced property value. Does this work? 
> > Cheers, Tim Bray > tbray@textuality.com http://www.textuality.com/ +1-604-708-9592 xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From ricko at allette.com.au Thu Aug 7 18:46:54 1997 From: ricko at allette.com.au (Rick Jelliffe) Date: Mon Jun 7 16:58:14 2004 Subject: XML and whitespace: lets just dump CR and LF! Message-ID: <199708071652.CAA06410@jawa.chilli.net.au> > From: Eric Baatz > > XML applications should ignore *ALL* CR and LF as a bad joke. > > That doesn't seem reasonable from my point of view, although an option to do > so might be reasonable. For example, my XML application, which reads text > and speaks it, is likely to be fed existing text that is only lightly marked > up with XML and that uses CR/LF (or newlines) and whitespace to convey > important information. My application needs to see that information to > operate in an acceptable manner. For example, input could be narrative > paragraphs denoted by adjacent newlines (or CR/LF's), poetry (lots of > prosodic information is in the the breaks and whitespace), or columns of > text (such as newspapers) and numbers (such as spreadsheets) that have not > been reduced to a single logical flow of characters. Under the current proposals, white-space is preserved or defaulted. (This relates to labelling data for applications, not on how the application presents it.) So there is no way to indicate whether newlines are hard returns or soft returns. I think this hearkens back to XML last year, when the idea was around that XML without declarations would be mainly used for closed-systems, where the recieving end had been built with a specific DTD in mind. 
Now it seems that this is not a big factor in the WG's mind, as the XML-ATTRIBUTE discussion show: the WG wants to support systems that work with many DTDs, even if they are not declared. (I, of course, think this is a mistaken change in direction for XML, but I bow to collective wisdom.) Under a closed-system approach, it made sense to say "default" or "preserve", since "default" and "preserve" might have some determinate meaning. Under the new all-singing-all-dancing direction for XML, I think they make little sense. If XML-SPACE is just "preserve" or "default", then document instance's newline coventions must be tailored for each application. But what if we are processing against an architectural form? Then every instance must use the the newline conventions belonging to the meta-Document Type Definition. And what if you have different AFs active at different parts of the document, or even applicable concurrently on some elements? Then all the meta-DTD's newline conventions must match, or you must adopt different conventions at different parts of the document. A hard return should be explicitly marked up: whether it is an attribute or a PI or a
element or 
, it should not be stuck outside the element in CSS or DSSSL--it is part of the data, not an artifact of formatting. (I suppose that the Remappers will think it desirable to define a new standard XML attribute that specifies which convention you use (PI, attribute,
, character reference, entity reference) to signify hard returns, and then provide other attributes to let us cope with existing DTDs that have churlishly adopted their own, prior, conventions. But I think it is simpler to merely say "The only way to signify hard returns in XML is 
" ) If you have gotten rid of hard returns, then next we need to sort out newlines that are soft returns in data from newlines that are in (or "attributable to") markup or element content. For this distinction, XML-SPACE may be good enough, in a brutish way. But I think that the Interleaf option, of making newlines not significant for presentation, is superior, for the reasons given before. I would also add another: it may simplify indexing into character strings--if you decide "CR and LF are not significant for presentation or indexing" then you get rid of the problem of documents needing to tell you which newline conventions they have adopted: you don't care, and the users are free to translate between different conventions without impacting indexes into documents (all other things being equal). Rick Jelliffe P.S. An Omnimark program to markup an existing well-formed HTML-in-XML document would be merely to add to a XML normaliser: TRANSLATE "%n" WHEN ANCESTOR IS PRE OUTPUT "
%n" TRANSLATE "%n" OUTPUT "%n " This does not seem too complex at all. xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From shawnhsu at ARC.unm.EDU Thu Aug 7 21:53:23 1997 From: shawnhsu at ARC.unm.EDU (Xu, Xiang) Date: Mon Jun 7 16:58:14 2004 Subject: Publisher Seeking XML Authors Message-ID: <3.0.1.32.19970807135137.006adf58@arc.unm.edu> Hi: We are a computer book publishing company by the name of Bigi International USA. Asking who will be interested in writing books about XML. Please reply to us as soon as possible. Thanks Best Regards -Xiang Xu ================================= Xu, Xiang Bigi International USA Inc. email: shawnhsu@arc.unm.edu http://www.bigiintl.com Tel:(505)830-1443(O), (505)232-8223(H) FAX:(505)830-1448 2501 San Pedro Blvd., NE, Suite 208 Albuquerque, NM 87110, USA xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From john at datachannel.com Fri Aug 8 02:32:08 1997 From: john at datachannel.com (John Tigue) Date: Mon Jun 7 16:58:15 2004 Subject: DOM and Xapi-J Message-ID: <33EA6992.B6615F6D@datachannel.com> Some questions have arisen as to the possibility of conflicting overlap between the DOM and Xapi-J. For those areas where they do overlap, I see Xapi-J as eventually being a proper subset of the DOM. For example: The DOM is language independent. Xapi-J is Java only. The DOM is platform independent. Xapi-J is for the Java platform only. Xapi-J is designed to be a stylistically consistent extention to the JDK which embeds it even further into Java e.g. see the recent thread entitled "Xapi-J: an architectural detail" The DOM covers HTML and XML. Xapi-J only covers XML. 
Eventually, I would think that Xapi-J compliant processors would be seen as having a DOM-compliant object model of an XML document because they will eventually use the DOM's Java language bindings exactly. There are also many other features of the DOM requirements which are not reflected in Xapi-J. The parts of Xapi-J related to how a developer instantiates a processor and optionally get ESIS parse events out of one of these JavaBeans does not overlap with the DOM work. I think we can declare Xapi-J 1.0 complete at any time now. When the DOM is done I think Xapi-J should be reved to be a direct subset of the DOM's object model using the DOM's object model and method signatures exactly. That is the only part I see where there is overlap and it would be a shame to have two very similar but different object models of an XML document. The original goal of Xapi-J was to come up with a unified model/api for Java developers who are using/writing XML processors. To not reflect the work of the DOM WG would defeat the whole idea. -- John Tigue Sr. Software Architect DataChannel http://www.datachannel.com jtigue@datachannel.com 206-462-1999 -------------- next part -------------- A non-text attachment was scrubbed... Name: vcard.vcf Type: text/x-vcard Size: 263 bytes Desc: Card for John Tigue Url : http://mailman.ic.ac.uk/pipermail/xml-dev/attachments/19970808/3df6ada0/vcard.vcf From Peter at ursus.demon.co.uk Fri Aug 8 10:06:49 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:15 2004 Subject: DOM and Xapi-J Message-ID: <9260@ursus.demon.co.uk> In message <33EA6992.B6615F6D@datachannel.com> john@datachannel.com (John Tigue) writes: [...] > > I think we can declare Xapi-J 1.0 complete at any time now. When the DOM I think this is a great achievement, and I'd like to thank John both for the API and for continuing the momentum. Also thanks to everyone who has contributed ideas. 
This group is not, of course, part of the formal process of XML under W3C, but I believe that anyone involved in XML development will take Xapi-J as a central reference. I would suggest that those who have pages publicising XML resources should include this. John - is there now a definitive URL that should be used? > is done I think Xapi-J should be reved to be a direct subset of the > DOM's object model using the DOM's object model and method signatures > exactly. That is the only part I see where there is overlap and it would > be a shame to have two very similar but different object models of an > XML document. The original goal of Xapi-J was to come up with a unified > model/api for Java developers who are using/writing XML processors. To > not reflect the work of the DOM WG would defeat the whole idea. It seems clear to me that there will continue to be revisions to many parts of XML (we do not yet have a definitive version). So revision of Xapi-J to be consistent with DOM will be one of several such adjustments or extensions. I hope that other reference documents will come out of public debate on XML-DEV. I think there are going to be a large number of problems which are not defined by the spec and which are not felt appropriate for discussion by the WG (the formal W3C body) or the SIG (now not public). From time to time it is suggested that 'this is an implementation problem, - perhaps XML-DEV would be appropriate?' Some of us are concerned that uncoordinated (though well-meant) implementation of XML applications and tools will create a range of inconsistent approaches. At present XML-DEV is the only forum for discussing these and I think we have a critical role here. Obviously any contributions are voluntary, not part of the W3C process, but if we continue to come up with well-thought out documents or proposals they should have an important role. 
Some areas where I think guidance for implementers is critically needed NOW (in rough priority) are: - whitespace processing - error processing - treatment of defaults and inheritance - interpretation of XML-LINK constructs Volunteers? Should we adopt any sort of informal process? Thoughts? :-) Perhaps those who are able to be present at XML-DEV day might wish to discuss this? P. -- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Alain.Michard at inria.fr Fri Aug 8 13:57:44 1997 From: Alain.Michard at inria.fr (Alain Michard) Date: Mon Jun 7 16:58:15 2004 Subject: FPIs as locators in XML-links ? Message-ID: <199708081157.NAA21705@yana.inria.fr> I may miss something (and in that case thanks for the help!) , but I feel that the current specification of locator in XML-LINK don't open an easy way to avoid including URLs in XML documents. It is true that a URL can be considered as a unique id of one physical copy of a resource, but this id is transient : machines may be changed, and "publishers" (any entity putting content on the Web) may decide to migrate a document repository from one place to another. If locators in XML-LINKS are URLs, this implies that in case of change of URL, many authors -a- should ideally be notified in some way of the change if it is relevant for them (ie: if they have in their own documents links pointing to resources which have changed of URL) -b- have to retrieve all the document they have published which contain a link to the modified URL; -c- have to edit all these documents. That's in fact exactly the situation with the HTML-based Web. 
I feel that the SGML practice of using Public Identifiers, and of storing the mappings from PUBLIC identifiers to SYSTEM identifiers in a Catalogue file, greatly facilitates the management of large collections of documents:

- a "publisher" may distribute updates of his public catalogue to the community with which he shares a number of resources;
- the catalogue is the only file you need to edit when any Public ID gets associated with a new physical resource;
- in the case of mirror copies of Web sites, the catalogue may be an easy means of making your browser look for a given document at a given site, without having to specify it at each traversal of a link.

Moreover, including URLs in XML documents appears to contradict the general philosophy of SGML, which I guess could be summed up as "Ensure long-term life of documents".

I would be very interested to read some comments from SGML experts on the list, to help me understand the reasons why the XML draft specs exclude -so far- using Public IDs in links.

Best Regards

Alain Michard
Mediaculture - Direction du Développement
INRIA - Domaine de Voluceau BP 105
F-78153 Le Chesnay Cedex - France
Tel: +33 1 3963 5472 Fax: +33 1 3963 5114

From tbray at textuality.com Fri Aug 8 18:02:27 1997
From: tbray at textuality.com (Tim Bray)
Date: Mon Jun 7 16:58:15 2004
Subject: FPIs as locators in XML-links ?
Message-ID: <3.0.32.19970808084641.008d9320@pop.intergate.bc.ca>

At 01:58 PM 08/08/97 +0200, Alain Michard wrote:
>I would be very interested to read some comments from SGML experts on the
>list, to help me understand the reasons why the XML draft specs
>exclude -so far- using Public IDs in links.

Two reasons, really.
XML-Link is designed specifically for use in the context of the Web, and on the Web, things exist if they can be addressed by URI's, otherwise not.

Secondly, whereas PUBLIC identifiers are very interesting and useful, it is not the case that virtually every server and desktop in the world comes with excellent free machinery to use them across the network, which is the case with URLs. -Tim

From tbray at textuality.com Fri Aug 8 19:18:37 1997
From: tbray at textuality.com (Tim Bray)
Date: Mon Jun 7 16:58:15 2004
Subject: FPIs as locators in XML-links ?
Message-ID: <3.0.32.19970808101550.008dfaf0@pop.intergate.bc.ca>

At 01:08 PM 08/08/97 -0400, Paul Prescod wrote:
>> Two reasons, really. XML-Link is designed specifically for use in
>> the context of the Web, and on the Web, things exist if they
>> can be addressed by URI's, otherwise not.
>>
>> Secondly, whereas PUBLIC identifiers are very interesting and useful,
>> it is not the case that virtually every server and desktop in the
>> world comes with excellent free machinery to use them across the network,
>> which is the case with URLs. -Tim
>
>1. Don't these arguments apply equally to XML-Lang?

No. Links on the Web are based on URI's; that's a fact of life. If you want another kind of link that isn't, go ahead and build it, but our mandate was to build a Web-oriented hyperlinking facility. There is no Web machinery that knows anything about FPI's.

>2. You've argued why PUBLIC identifiers will sometimes not be useful.
>You haven't argued why they will *never* be useful. They were put into
>XML Lang because some argued that they will sometimes need them. That
>applies to XML-Link equally.

Yes, you've said this many times. So far, the WG membership is unconvinced.

>3.
What about entities declared through system identifiers? Why can't I >link to them through their entity names? Because that's not how things are done on the Web. Of course, you in XML you *can* say > What is the point of "binary" >entities if the "standard" linking and transclusion mechanism can't use >them? Or to go the other way, why wouldn't the standard linking and >transclusion mechanism be able to use the standard mechanism for mapping >external resources into document names? The key point is the use of the word "standard". The use of entities and PUBLIC identifiers is standard only in the world of SGML. For interoperation with the universe of Web documents, the only standard way to do things is the URI mechanism. To summarize, we were not trying to extend the SGML entity mechanism to do network hypertext; we were trying to extend the existing Web hypertext mechanism to be usable in XML. Anyhow, this argument is over. Sorry. -T. xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From neil at bradley.co.uk Sat Aug 9 12:15:01 1997 From: neil at bradley.co.uk (Neil Bradley) Date: Mon Jun 7 16:58:15 2004 Subject: 5 Whitespace Rules Message-ID: <199708091014.LAA11574@andromeda.ndirect.co.uk> I think it's time to pin down some rules or guidelines regarding the use of whitespace. I am not suggesting that the following is exhaustive or totally unambiguous, but maybe it is a starting point for discussion. I would really like to see a small list of rules such as the following being defined, as I am sure it will help avoid potentially damaging confusion arising when products arrive and prove to be incompatible. One of the problems of defining rules for XML has been the grouping of line-end codes with space separating characters under the 'S' rule. 
By separating these concepts, it is quite easy to define rules with are both backward compatible with SGML and HTML (very important in its own right) and also intuitive. While the idea of ignoring all line-end codes and manually inserting spaces at the start of each line to compensate is at first sight attractive, it is certainly not intuitive, and there are plenty of text files in existence (including SGML and HTML files, of course), which do not follow this convention. -------------------- An application should remove or transform whitespace characters received from the XML-processor according to the following 5 rules: RULE 1. Every CR and LF code is regarded as a line-end signal, except when it immediately follows the other code ([CR][LF] or [LF][CR]), in which case it is discarded (and is also ignored, so has no effect on calculations for the next character). This rule applies even in 'preserved' content. /* This rule standardizes input from documents prepared on Mac, Unix and MS-DOS/Windows platforms. [CR] ---> line-end [LF] ---> line-end [CR][LF] ---> line-end [LF][LF] ---> line-end, line-end [CR][CR] ---> line-end, line-end [CR][LF][CR][LF] ---> line-end, line-end (because both LF's are ignored) By including this rule in preserved content, we avoid alternate blank lines appearing in documents prepared on an MS-DOS system but viewed on another system. */ RULE 2. A line-end code (or codes) immediately following a start-tag, PI or declaration, or immediately preceding an end-tag, is discarded (except in preserved content). /* [CR][CR]

[CR]This is a para in a note.[CR]

becomes:

This is a para in a note.

But the CRs below are not removed (they are later converted to a space - see rule 4):

Here is an[CR] emphasised[CR] word.

becomes:

Here is an emphasised word.

*/ RULE 3. All other whitespace in element content is discarded. /* [SP][TAB]

This is a para in a note... becomes (in validated input):

This is a para in a note... Note that only the presence of spaces and tabs in element content, which is not common, will cause discrepancies between validated and non-validated processing. */ RULE 4. Line-end codes are discarded when preceded by a hard or soft ('°') hyphen (and a soft hyphen is also discarded). Remaining line-end codes are treated as spaces. /* A[CR] line-[CR] end code sep°[CR] erates lines. becomes: A line-end code seperates lines. */ RULE 5. Consecutive whitespace characters (including translated line-end codes) are reduced to a single space, except in preserved mode. /* These lines are divide by a space[SP][CR] and carriage[SP][TAB][SP]return. becomes: These lines are divided by a space and carriage return. */ ------------------------------ ----------------------------------------------- Neil Bradley - Author of The Concise SGML Companion. neil@bradley.co.uk www.bradley.co.uk xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From jamesr at steptwo.com.au Sat Aug 9 12:29:43 1997 From: jamesr at steptwo.com.au (James Robertson) Date: Mon Jun 7 16:58:15 2004 Subject: 5 Whitespace Rules In-Reply-To: <199708091014.LAA11574@andromeda.ndirect.co.uk> Message-ID: <3.0.2.32.19970809202726.00a9abe0@magna.com.au> At 23:13 8/08/97 +0000, you wrote: | | I think it's time to pin down some rules or guidelines regarding | the use of whitespace. I am not suggesting that the following is | exhaustive or totally unambiguous, but maybe it is a starting point | for discussion. I would really like to see a small list of rules such | as the following being defined, as I am sure it will help avoid potentially | damaging confusion arising when products arrive and prove to be | incompatible. 
| An application should remove or transform whitespace characters | received from the XML-processor according to the following 5 rules: [snip] Hear hear. These are practical, useful rules, and I can find no fault with them. They are certainly much more backwards-compatible than the suggested solution of ignoring all line-end characters. My vote: make it so ... J ------------------------- James Robertson Step Two Designs Newton & SGML Consultancy jamesr@steptwo.com.au "Beyond the Idea" xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From paul at arbortext.com Sat Aug 9 15:10:17 1997 From: paul at arbortext.com (Paul Grosso) Date: Mon Jun 7 16:58:15 2004 Subject: 5 Whitespace Rules Message-ID: <3.0.32.19970809060751.00698840@pophost.arbortext.com> At 23:13 1997 08 08 +0000, Neil Bradley wrote: >RULE 3. All other whitespace in element content is discarded. > >Note that only the presence of spaces and tabs in element content, >which is not common, will cause discrepancies between validated and >non-validated processing. This is the crux of the problem. As soon as you say something about element content, you get different results from the document when you process the DTD and when you don't. You don't say explicitly what happens when you don't process the DTD, but I assume your Rule 3 doesn't do anything in that case. Therefore, your Rule 5 will turn all line-end codes into a space, and it is extremely common to have line-end codes in element content. So your Rule 3 will cause you to end up with lots of spaces when you process in the absence of a DTD that you wouldn't get when you process in the presence of the DTD. > >RULE 4. Line-end codes are discarded when preceded by a hard >or soft ('°') hyphen (and a soft hyphen is also discarded). >Remaining line-end codes are treated as spaces. 
This might be a nice heuristic for incoming WP files, but it doesn't agree with SGML. If I had "a - b" in my document and a line-end happened to occur after the -, you'd turn my file into "a -b". paul xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From neil at bradley.co.uk Sat Aug 9 15:59:40 1997 From: neil at bradley.co.uk (Neil Bradley) Date: Mon Jun 7 16:58:15 2004 Subject: 5 Whitespace Rules Message-ID: <199708091359.OAA22023@andromeda.ndirect.co.uk> > Reply-to: Paul Grosso > At 23:13 1997 08 08 +0000, Neil Bradley wrote: > >RULE 3. All other whitespace in element content is discarded. > > > > >Note that only the presence of spaces and tabs in element content, > >which is not common, will cause discrepancies between validated and > > non-validated processing. > > This is the crux of the problem. As soon as you say something about > element content, you get different results from the document when > you process the DTD and when you don't. Yes, but as I say, the problem only arises if people put spaces or tabs in element content, which in my experience is very unusual. > You don't say explicitly what happens when you don't process the > DTD, but I assume your Rule 3 doesn't do anything in that case. > Therefore, your Rule 5 will turn all line-end codes into a space, > and it is extremely common to have line-end codes in element > content. So your Rule 3 will cause you to end up with lots of > spaces when you process in the absence of a DTD that you wouldn't > get when you process in the presence of the DTD. No, Rule 2 has already dispensed with these CR and LF codes. I should have made it clear that this rule applies to non-validated input. So... [CR] [CR]

[CR] This is a para in a note[CR]

[CR]
[CR] ... becomes

This is a para in a note

... ...before Rules 3 and 5 are applied. This was my whole point about separating line-end code processing from spacing character processing. > > > >RULE 4. Line-end codes are discarded when preceded by a hard or > >soft ('°') hyphen (and a soft hyphen is also discarded). > >Remaining line-end codes are treated as spaces. > > This might be a nice heuristic for incoming WP files, but it doesn't > agree with SGML. If I had "a - b" in my document and a line-end > happened to occur after the -, you'd turn my file into "a -b". Yes, well, I can only suggest this is unlikely to happen, and in any case Rule 4 is only a suggestion for paginating applications. I am open to suggestions here, but for now I am far more concerned about the Rules 1 to 3. > paul Neil. ----------------------------------------------- Neil Bradley - Author of The Concise SGML Companion. neil@bradley.co.uk www.bradley.co.uk xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From ak117 at freenet.carleton.ca Sat Aug 9 16:02:51 1997 From: ak117 at freenet.carleton.ca (David Megginson) Date: Mon Jun 7 16:58:15 2004 Subject: PSGML-XML Message-ID: <199708091359.JAA00319@localhost> A couple of weeks ago, I patched PSGML to add an XML mode that enables XML-specific delimiters, parsing, and error-reporting (in other words, it's a real, native XML DTD-driven editor). (QUERY: IS THIS THE FIRST NATIVE XML EDITOR AVAILABLE?) I'm waiting to hear back from Lennart Staflin on integrating this into the main distribution; in the mean time, I'm looking for some alpha testers who meet the following criteria: 1) You are familiar with both SGML and XML. 2) You are an intermediate to advanced Emacs user (as a minimum, you should know how to byte-compile modules, modify the load path, and set start-up variables). 
3) You are currently using PSGML 1.0.1 and are familiar with its commands. If you're interested, please send me a message, and I'll send you the patches against PSGML 1.0.1 next week. *** I am _not_ prepared to provide help on Emacs configuration (etc.) at the alpha stage, so please don't reply unless you are either an experienced Emacs user or you have easy access to one. For your information, here are the current features: ************************************************************************ XML FEATURES CURRENTLY SUPPORTED ************************************************************************ - understands "/>" TAGC for empty elements, and inserts it by default - requires "?>" PIC for processing instructions - always quotes attribute value literals - Reports the following DTD errors: * use of AND-connector in content model in element declaration * use of name group for element type in element declaration * use of omitted tag minimization in element declaration * use of CDATA or RCDATA declared content * use of inclusion or exclusion exceptions * declaration of external CDATA, SDATA, or SUBDOC entities * declaration of internal CDATA, SDATA, PI, STARTTAG, ENDTAG, MS, or MD entities * declaration of data attributes * use of name group for associated element type in ATTLIST * declaration of NAME, NAMES, NUMBER, NUMBERS, NUTOKEN, or NUTOKENS attributes * declaration of #CURRENT or #CONREF attributes * a public identifier that is not accompanied by a system identifier - Reports the following general errors: * data entity references in data * nested comments (enforces XML-style comments) * use of tag minimization ************************************************************************ XML FEATURES NOT YET SUPPORTED ************************************************************************ - allow SYSIDs to be URLs - validate that mixed content follows XML restrictions - validate that marked sections in DTD are either INCLUDE or IGNORE - validate that marked 
sections in content are CDATA (no parameter entities) - validate that XML declaration is present - probably many others that I've missed All the best, David -- David Megginson ak117@freenet.carleton.ca Microstar Software Ltd. dmeggins@microstar.com http://home.sprynet.com/sprynet/dmeggins/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From john at datachannel.com Sun Aug 10 06:46:53 1997 From: john at datachannel.com (John Tigue) Date: Mon Jun 7 16:58:15 2004 Subject: XML sample application References: <852564DB.0067F3F4.00@bna-03.bna.com> Message-ID: <33ED4848.E15DDF93@datachannel.com> sdarya@bna.com wrote: > I tried the demo on MS IE 3 (Windows95). I get a program exception > error > and IE crashes. Do I have to have Netscape? > > If you are referring to the Java applet at http://www.datachannel.com/xml/viewer, it has been shown to work on all major browsers on all major platforms. This XML viewer has been thoroughly tested and tech-supported past all serious problems. Please do not post tech support questions about DataChannel demo code to xml-dev; I do not believe that they are interested in such matters. If anyone has any questions, please e-mail me directly at john@datachannel.com. It will be my pleasure to help get the viewer running on your machine. -- John Tigue Sr. Software Architect DataChannel http://www.datachannel.com jtigue@datachannel.com 206-462-1999 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: vcard.vcf Type: text/x-vcard Size: 263 bytes Desc: Card for John Tigue Url : http://mailman.ic.ac.uk/pipermail/xml-dev/attachments/19970810/f80bc645/vcard.vcf From murata at apsdc.ksp.fujixerox.co.jp Mon Aug 11 06:11:23 1997 From: murata at apsdc.ksp.fujixerox.co.jp (MURATA Makoto) Date: Mon Jun 7 16:58:15 2004 Subject: XML-Link: Relative URL expansion In-Reply-To: <3.0.32.19970806152015.00830410@pop.intergate.bc.ca> Message-ID: <9708110411.AA01143@lute.apsdc.ksp.fujixerox.co.jp> Tim Bray writes: >More properly "containing resource", but yes. Check the XML spec, section >4.3.2, for some more details. Let me point out a minor issue. When the XML document is not stored in anything but directly appears in the stream given to the XML parser, we do not know what is the "containing resource". Probably, relative URL's in such XML documents are errors? MURATA Makoto (FAMILY Given) Fuji Xerox Information Systems Tel: 044-812-7230 Fax: 044-812-7231 E-mail: murata@apsdc.ksp.fujixerox.co.jp xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From neil at bradley.co.uk Mon Aug 11 11:48:30 1997 From: neil at bradley.co.uk (Neil Bradley) Date: Mon Jun 7 16:58:15 2004 Subject: Whitespace rules (v2) Message-ID: <199708110948.KAA20836@andromeda.ndirect.co.uk> Due to some useful feedback, and further thoughts of my own, I would like to amend my list of 5 whitespace rules in a few respects. For people who read the previous set of rules, the corrections are: a) block-enclosing elements must be identified via list or style sheet b) PI, Comment and empty element processing has totally changed c) all rules explicitly apply to both validating and non-validating applications d) the rules are explicitly to be applied in sequence The new rules can be summarized as: 1. normalize line-end codes 2. 
Remove block surrounding whitespace
3. Remove leading/trailing block line-ends
4. Join lines and de-hyphenate
5. Remove surplus spaces in text

------WHITESPACE RULES------

A formatting application should remove or transform whitespace characters received from the XML-processor according to the following 5 rules. These rules are to be applied in sequence, by both validating and non-validating applications.

Note 1: PI's, comments and empty elements may be removed at any point in the process.

Note 2: in some cases, 'line-end' codes (CR and LF) are distinguished from 'spacing' characters (SP and TAB), but the term 'whitespace' continues to indicate all these characters.

----------
RULE 1. Every line-end code is regarded as a line terminator, except when it immediately follows the other code ([CR] following [LF] or [LF] following [CR]), in which case it is discarded (and is also ignored, so has no effect on calculations for the next character). This rule also applies in 'preserved' content.
---
Note: this rule standardizes input from documents prepared on Mac, Unix and MS-DOS/Windows platforms.

[CR] ---> line-end
[LF] ---> line-end
[CR][LF] ---> line-end
[LF][CR] ---> line-end
[LF][LF] ---> line-end, line-end
[CR][CR] ---> line-end, line-end
[CR][LF][CR][LF] ---> line-end, line-end (because both LF's are ignored)

Note: by including this rule in preserved content, we avoid alternate blank lines appearing in documents prepared on an MS-DOS system but viewed on another system.

----------
RULE 2. All whitespace preceding the start-tag and following the end-tag of a 'block enclosing' element is discarded.
---
Note: a non-validating application must refer to a style sheet or configuration file to identify 'block enclosing' elements (perhaps by applying this rule to elements not specified as in-line elements).
As a validating application cannot easily determine this rule from the content model (the first mixed content element in the hierarchy is block enclosing, as well as all outer layers), it may choose the same approach. Note: [SP][SP][TAB]

This is a[SP]para... becomes:

This is a[SP]para and:

Para 1.

[CR]

Para 2.

becomes:

Para 1.

Para 2.

Note: If PI's, comments or empty elements remain in the data stream, they are deemed transparent to this process, so: [SP]

Some text... becomes:

Some text...
----------
RULE 3. A sequence of one or more line-end codes immediately following a start-tag, or immediately preceding an end-tag, is discarded (except in preserved content).
---
Note: [CR]

[CR] This is a para in a note.[CR]

becomes:

This is a para in a note.

Note: If PI's, comments or empty-elements remain in the data stream, they are deemed transparent to this process, so:

[CR] some text... becomes:

some text...
----------
RULE 4. A remaining line-end code is converted into a space, except when it is preceded by a normal (hard) hyphen, or by a soft hyphen ('°'), in which case it is removed (a soft hyphen is also then removed).
---
Note:

A[CR]
line-[CR]
end code sep°[CR]
arates lines.

becomes:

A line-end code separates lines.

Note: PI's, comments and empty elements are treated as text, so:

Some[CR] [CR] text. becomes:

Some[SP][SP]text.

Note: if a space is required after the hyphen, it must be inserted before the line-end:

4 -[SP][CR]
3 = 1

becomes:

4 -[SP][SP]3 = 1

----------
RULE 5. Consecutive whitespace characters (including translated line-end codes) are reduced to a single space, except in preserved mode.
---
Note:

4 -[SP][SP]3 = 1

becomes:

4 -[SP]3 = 1

Note: if PI's, comments or empty elements are removed after rule 5:

Some[SP][SP]text. has already become:

Some[SP][SP]text. but now becomes:

Some[SP]text. Note: Multiple spaces can be preserved using the non-break space character (' ').

Some   spaces. ------------------------------ ----------------------------------------------- Neil Bradley - Author of The Concise SGML Companion. neil@bradley.co.uk www.bradley.co.uk xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From agreene at bitstream.com Mon Aug 11 14:40:38 1997 From: agreene at bitstream.com (Andrew Greene) Date: Mon Jun 7 16:58:15 2004 Subject: Whitespace rules (v2) In-Reply-To: <199708110948.KAA20836@andromeda.ndirect.co.uk> (neil@bradley.co.uk) Message-ID: <19970811123638.AAA6033@AGREENE-PC.bitstream.com> I'm troubled by one aspect of that suggestion: > RULE 4. A remaining line-end code is converted into a space, except > when it is preceded by a normal (hard) hyphen, or by a soft hyphen > ('°'), in which case it is removed (a soft hyphen is also then > removed). ^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^ That could alter the semantics of the data stream. The incoming data stream may have been broken at that point, but we don't want to lose the fact that such a break is legal -- it may be required again down- stream. So, using your example, I think that > A[CR] > line-[CR] > end code sep°[CR] > arates lines. should become A line-end code sep°arates lines. and not, as you suggest, > A line-end code seperates lines. An individual application may choose to ignore soft hyphens when it displays (or otherwise handles) the data. Does that make sense? 
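The two readings of Rule 4 discussed in this thread can be compared with a short Python sketch. It is only an illustration under stated assumptions, not part of either proposal: the function names are hypothetical, the '°' shown in the archived text is assumed to stand for a soft hyphen (U+00AD), and 'preserved' content and markup are ignored entirely.

```python
import re

# Assumption: the '°' in the archived examples stands for a soft hyphen.
SOFT_HYPHEN = "\u00ad"


def normalize_line_ends(text: str) -> str:
    """Rule 1: CR, LF, CR+LF and LF+CR each become one line-end (LF)."""
    # The two-character pairs come first, so the second code of a pair
    # is discarded instead of counting as another line-end.
    return re.sub(r"\r\n|\n\r|\r|\n", "\n", text)


def join_lines(text: str, keep_soft_hyphen: bool) -> str:
    """Rule 4: drop a line-end after a hyphen; other line-ends become spaces."""
    if keep_soft_hyphen:
        # Variant suggested in the reply: remove the line-end only.
        text = text.replace(SOFT_HYPHEN + "\n", SOFT_HYPHEN)
    else:
        # Rule 4 as proposed: remove the line-end and the soft hyphen.
        text = text.replace(SOFT_HYPHEN + "\n", "")
    text = text.replace("-\n", "-")  # hard hyphen: remove the line-end only
    return text.replace("\n", " ")


def collapse_spaces(text: str) -> str:
    """Rule 5: reduce runs of spaces and tabs to a single space."""
    return re.sub(r"[ \t]+", " ", text)


def apply_rules(text: str, keep_soft_hyphen: bool = False) -> str:
    """Apply Rules 1, 4 and 5 in sequence (Rules 2 and 3 need markup, so are skipped)."""
    return collapse_spaces(join_lines(normalize_line_ends(text), keep_soft_hyphen))


sample = "A\r\nline-\r\nend code sep" + SOFT_HYPHEN + "\r\narates lines."
print(apply_rules(sample))
print(repr(apply_rules(sample, keep_soft_hyphen=True)))
```

With `keep_soft_hyphen=True` the break opportunity survives in the output, so a downstream formatter can still choose to hyphenate there; with the default it is lost. Note that the hard-hyphen branch also turns "a -" followed by a line-end and "b" into "a -b", which is the case raised earlier in the thread.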
- Andrew Greene xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From m.hampson at ic.ac.uk Mon Aug 11 15:18:39 1997 From: m.hampson at ic.ac.uk (m.hampson@ic.ac.uk) Date: Mon Jun 7 16:58:16 2004 Subject: Testing digest - please ignore Message-ID: Testing digest - please ignore -- +--------------------------------------------------------------------+ | Martyn Hampson | Tel: 0171 594 6973 | | Imperial College | Fax: 0171 594 6958 | | Computer Centre | E-Mail: M.Hampson@ic.ac.uk | | London SW7 2BP, ENGLAND | "Don't just do something, sit there!" | +--------------------------------------------------------------------+ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From paul at arbortext.com Mon Aug 11 16:54:14 1997 From: paul at arbortext.com (Paul Grosso) Date: Mon Jun 7 16:58:16 2004 Subject: Whitespace rules (v2) Message-ID: <3.0.32.19970811093956.006e1d78@pophost.arbortext.com> At 22:48 1997 08 10 +0000, Neil Bradley wrote: >---------- >RULE 2. All whitespace preceding the start-tag and following the end-tag >of a 'block enclosing' element is discarded. >--- >Note: a non-validating applications must refer to a style sheet or >configuration file to identify 'block enclosing' elements (perhaps by >applying this rule to elements not specified as in-line elements). >As a validating application cannot easily determine this rule from the >content model (the first mixed content element in the hierarchy is >block enclosing, as well as all outer layers), it may choose the same approach. > >Note: > > [SP][SP][TAB]

This is a[SP]para... > >becomes: > >

This is a[SP]para > >and: > >

Para 1.

[CR] >

Para 2.

> >becomes: > >

Para 1.

Para 2.

What if a block enclosing element is contained within a block enclosing element? You appear to be trying to use different terms to describe what is effectively the issue of element content versus mixed content.

How is requiring a style sheet or configuration file to indicate which elements are "block enclosing" different from having a DTD or partial set of declarations to indicate which elements have element content?

paul

From capt at augusta.inf.elte.hu Mon Aug 11 18:35:42 1997
From: capt at augusta.inf.elte.hu (Miskovics Gabor)
Date: Mon Jun 7 16:58:16 2004
Subject: XML browser, stylesheet
Message-ID: <33EF3F9A.6E5C5BD8@augusta.inf.elte.hu>

Hi!

I'm looking for XML browsers, XML stylesheet DTDs, and XML stylesheets. Can anyone help me?

Bye, Capt

--
Miskovics Gabor
E-mail: capt@augusta.inf.elte.hu
Web: http://augusta.inf.elte.hu/~capt

From neil at bradley.co.uk Mon Aug 11 18:42:00 1997
From: neil at bradley.co.uk (Neil Bradley)
Date: Mon Jun 7 16:58:16 2004
Subject: Whitespace rules (v2)
Message-ID: <199708111641.RAA18361@andromeda.ndirect.co.uk>

Paul Grosso wrote:

> At 22:48 1997 08 10 +0000, Neil Bradley wrote:
> >----------
> >RULE 2. All whitespace preceding the start-tag and following the end-tag
> >of a 'block enclosing' element is discarded.
> >---
> >Note: a non-validating applications must refer to a style sheet or
> >configuration file to identify 'block enclosing' elements (perhaps by
> >applying this rule to elements not specified as in-line elements).
> >As a validating application cannot easily determine this rule from the > >content model (the first mixed content element in the hierarchy is > >block enclosing, as well as all outer layers), it may choose the same approach. > > What if a block enclosing element is contained within a block enclosing > element? You appear to be trying to use different terms to describe > what is effectively the issue of element content versus mixed content. > > How is requiring a style sheet or configuration file to indicate which > elements are "block enclosing" different from having a DTD or partial > set of declarations to indicate which elements have element content? The point about style-sheets etc is that even a non-validating formatting application will require one, and it can get its information from that source. A validating formatter can do the same thing, and it is arguably easier than referring to the DTD, which does not directly identify block enclosing elements. A Paragraph element with mixed content is a block enclosing element, but an embedded Emphasis element, also with mixed content, is not! Of course, block enclosing elements CAN be identified from the DTD, it is *just* a matter of finding the outer-most element with mixed content, and I am not ruling out this approach, just saying a validating processor "may choose the same approach" as a non-validating processor for convenience. I know this is far from ideal, and I hope someone can suggest something better. If not, I would still prefer this rule to nothing, or to ignoring all line-end codes. Neil. ----------------------------------------------- Neil Bradley - Author of The Concise SGML Companion. 
neil@bradley.co.uk www.bradley.co.uk xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From h.rzepa at ic.ac.uk Tue Aug 12 17:57:10 1997 From: h.rzepa at ic.ac.uk (Rzepa, Henry) Date: Mon Jun 7 16:58:16 2004 Subject: Digests for xml-dev Message-ID: Anyone wishing to receive weekly digests (on Monday) of the xml-dev list should subscribe as follows mailto:majordomo@ic.ac.uk the request subscribe xml-dev-digest (if possible, do NOT use the form subscribe xml-dev-digest yourothermailaddress, since I have to moderate such requests, and this may not happen instantly!) If you wish to STOP receiving daily postings, you should mailto:majordomo@ic.ac.uk the request unsubscribe xml-dev. Members of either list will be able to post messages to xml-dev@ic.ac.uk Dr Henry Rzepa, Dept. Chemistry, Imperial College, LONDON SW7 2AY; mailto:rzepa@ic.ac.uk; Tel (44) 171 594 5774; Fax: (44) 171 594 5804. URL: http://www.ch.ic.ac.uk/rzepa/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From jimg at digitalthink.com Wed Aug 13 21:01:51 1997 From: jimg at digitalthink.com (Jim Gindling) Date: Mon Jun 7 16:58:16 2004 Subject: Proceedings for the 4th International HyTime Conference? Message-ID: <01BCA7E0.487807B0.jimg@digitalthink.com> Hi all, Does anybody know if proceedings for the 4th International HyTime Conference (especially XML Developer's Day) can be obtained by us poor souls who are unable to attend? Thanks in advance. 
Jim Gindling DigitalThink Software Engineer xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From srn at techno.com Wed Aug 13 23:35:20 1997 From: srn at techno.com (Steven R. Newcomb) Date: Mon Jun 7 16:58:16 2004 Subject: Proceedings for the 4th International HyTime Conference? In-Reply-To: <01BCA7E0.487807B0.jimg@digitalthink.com> (message from Jim Gindling on Wed, 13 Aug 1997 11:58:57 -0700) Message-ID: <199708132130.RAA00829@bruno.techno.com> > Does anybody know if proceedings for the 4th International HyTime Conference > (especially XML Developer's Day) can be obtained by us poor souls who are > unable to attend? I can't speak for XML Developers' Day. Jon? As for the HyTime Conference, what we have done in the past is to accept anything any speaker wishes to provide to the public and place it on the Web, subject to some editing and added value if resources permit. You must realize, though, that getting such materials off the Web is a poor substitute for attending a conference, and not every speaker is able (for a variety of reasons) to publish everything. There is another issue here, too. The GCA can't function without revenue, and it's not clear that the practice of giving away HyTime conference proceedings can be continued indefinitely. In general, sales of conference proceedings represent a revenue stream for the GCA. It is possible that access to such things as HyTime and XML conference proceedings on the Web may eventually become a "GCA members only" (or even a pay-per-view!) privilege. But please note that I do not speak for the GCA on this or any other matter, nor have I received any indication that such a plan is under consideration. 
I'm just pointing out that Adam Smith's invisible hand can be expected to have its effect here at the appropriate time and in the appropriate way, once the XML and HyTime conferences have sufficient momentum. -Steve -- Steven R. Newcomb President voice +1 716 271 0796 TechnoTeacher, Inc. fax +1 716 271 0129 (courier: 23-2 Clover Park, Internet: srn@techno.com Rochester NY 14618) FTP: ftp.techno.com P.O. Box 23795 WWW: http://www.techno.com Rochester, NY 14692-3795 USA xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Jon.Bosak at eng.Sun.COM Wed Aug 13 23:58:18 1997 From: Jon.Bosak at eng.Sun.COM (Jon Bosak) Date: Mon Jun 7 16:58:16 2004 Subject: Proceedings for the 4th International HyTime Conference? In-Reply-To: <01BCA7E0.487807B0.jimg@digitalthink.com> (message from Jim Gindling on Wed, 13 Aug 1997 11:58:57 -0700) Message-ID: <199708132156.OAA26522@boethius.eng.sun.com> [Jim Gindling:] | Does anybody know if proceedings for the 4th International HyTime | Conference (especially XML Developer's Day) can be obtained by us poor | souls who are unable to attend? I don't know about the HyTime Conference, but the Dev Day presentations are specifically intended to be up-to-the-second reports, so there are no proceedings in the ordinary sense. Jon xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From pshams at hotmail.com Thu Aug 14 04:10:56 1997 From: pshams at hotmail.com (Parvez Shams) Date: Mon Jun 7 16:58:16 2004 Subject: XML parsers,browsers comparisn Message-ID: <19970814020943.12586.qmail@hotmail.com> Hello, I am working on a project with XML. We will be using Symposia for our "proof of concept" phase. 
I am curious to know whether anyone has compared the other available XML browsers, parsers, and processors. If such a resource is available, please let me know. Thank you for your help. Cheers, Parvez Shams From nmikula at edu.uni-klu.ac.at Fri Aug 15 11:55:59 1997 From: nmikula at edu.uni-klu.ac.at (Norbert Mikula) Date: Mon Jun 7 16:58:16 2004 Subject: Yet Another XML Article Message-ID: For those who are not on comp.text.sgml: http://www.ifi.uio.no/~larsga/download/xml/xml_eng.html Best regards, Norbert H. Mikula ===================================================== = SGML, XML, DSSSL, Intra- & Internet, AI, Java ===================================================== = mailto:nmikula@edu.uni-klu.ac.at = http://www.edu.uni-klu.ac.at/~nmikula ===================================================== From Martin.Beet at ncl.ac.uk Fri Aug 15 16:58:04 1997 From: Martin.Beet at ncl.ac.uk (Martin Beet) Date: Mon Jun 7 16:58:16 2004 Subject: purpose CDATA sections Message-ID: <33F46B94.2DF5@ncl.ac.uk> Hi I'm in the process of writing (yet) an(other) introduction to XML and I'm currently plodding through the standard. The only purpose of the CDATA section (CDSect) I can think of is for showing code examples. Am I missing something? Regards, Martin --------------- University of Newcastle Dept.
of Computing Science | Tel:+44 191 2226157 Claremont Tower, Newcastle upon Tyne, NE1 7RU, UK | Fax:+44 191 2228232 From ak117 at freenet.carleton.ca Fri Aug 15 17:13:11 1997 From: ak117 at freenet.carleton.ca (David Megginson) Date: Mon Jun 7 16:58:16 2004 Subject: purpose CDATA sections In-Reply-To: <33F46B94.2DF5@ncl.ac.uk> References: <33F46B94.2DF5@ncl.ac.uk> Message-ID: <199708151507.LAA02872@localhost> Martin Beet writes: > I'm in the process of writing (yet) an(other) introduction to XML and > I'm currently plodding through the standard. > > The only purpose of the CDATA section (CDSect) I can think of is for > showing code examples. Am I missing something? That's the general idea, but it's a little narrow. Here are a few uses of CDATA marked sections, off the top of my head: - source code - excerpts from system log files - user sessions with a shell (like bash or command.com) - sample XML markup - ASCII art - mathematical text and other special notations (such as embedded TeX) Here's a non-source-code example: If the teletype machine displays the following text, please leave the building as quickly as possible: Good luck with the introduction, David -- David Megginson ak117@freenet.carleton.ca Microstar Software Ltd.
dmeggins@microstar.com http://home.sprynet.com/sprynet/dmeggins/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From ebaatz at barbaresco.East.Sun.COM Fri Aug 15 17:16:56 1997 From: ebaatz at barbaresco.East.Sun.COM (Eric Baatz - Sun Microsystems Labs BOS) Date: Mon Jun 7 16:58:16 2004 Subject: purpose CDATA sections Message-ID: > The only purpose of the CDATA section (CDSect) I can think of is for > showing code examples. Am I missing something? By "code" do you mean XML markup? Text other than XML markup can contain characters that might be mistaken for XML and therefore should be escaped. It may be more convenient to stick the entire text into a CDATA rather than individually escaping each character that an XML processor is sensitive to. For example: ]]> Similarly for more specialized text, such as the native commands of a speech synthesizer (where I don't have any control over the syntax accepted by the synthesizer): ]]> The CDATA method may be easier to generate programatically and it may be viewed as more readable than individually escaping characters. Eric Baatz Sun Microsystems Laboratories 2 Elizabeth Drive, MS UCHL03-207 (508) 442-0257 Chelmsford, MA 01824 fax: (508) 250-5067 USA Internet: eric.baatz@east.sun.com xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From liamquin at interlog.com Sat Aug 16 07:27:37 1997 From: liamquin at interlog.com (Liam Quin) Date: Mon Jun 7 16:58:16 2004 Subject: Whitespace rules (v2) In-Reply-To: <199708110948.KAA20836@andromeda.ndirect.co.uk> Message-ID: On Sun, 10 Aug 1997, Neil Bradley wrote: > [...] > RULE 2. 
> All whitespace preceding the start-tag and following the end-tag > of a 'block enclosing' element is discarded. > --- > Note: a non-validating applications must refer to a style sheet or > configuration file to identify 'block enclosing' elements (perhaps by > applying this rule to elements not specified as in-line elements). No -- "blockness" is not at all the same as element content. For example, you have to allow for a run-in heading, which starts out looking like an HTML H3 (say) except that the rest of the paragraph follows on the same line. So it isn't a block in the paragraph sense. > As a validating application cannot easily determine this rule from the > content model (the first mixed content element in the hierarchy is > block enclosing, as well as all outer layers), it may choose the same > approach. I think this is too complicated, as well as being not 100% right. I don't think there's a single "right" solution. This is why it's best to allow the parser to pass _all_ whitespace back to the application, although it is certainly useful if a DTD-aware parser, even if it isn't validating, distinguishes element content whitespace from PCDATA whitespace in some way. More than this is a bad idea, I think. > Note: If PI's, comments or empty elements remain in the data stream, > they are deemed transparent to this process, so: > [SP]

Some text... > > becomes: > >

Some text... Note that if you have a very large comment, you might need a lot of lookahead here. > RULE 3. A sequence of one or more line-end codes immediately > following a start-tag, or immediately preceding an end-tag, are > discarded (except in preserved content). This means that This is very strange. becomes This isverystrange. or, if you format without distinguishing emphasis, This isverystrange. which I don't think is what you want. But SGML itself is broken in this regard. > RULE 4. A remaining line-end code is converted into a space, except when it is > preceded by a normal (hard) hyphen, or by a soft hyphen ('°'), > in which case it is removed (a soft hyphen is also then removed). > --- > Note: > > A[CR] > line-[CR] > end code sep°[CR] > erates lines. > > becomes: > > A line-end code seperates lines. Well, note that there is no hyphen in that paragraph!! The character "-" in ISO 8859-1 (Latin 1) and ASCII is _not_ a hyphen. It is a minus sign. The hyphen is 0255 octal (173 decimal). It is a hyphen, not a soft hyphen. There is no soft hyphen in Latin 1. I don't have the necessary copy of Unicode in front of me, but last time I checked (Unicode 1.1) it was the same in this regard, and also in having the ` character be a spacing grave accent, not a single quote. This should be done by applications. I wouldn't want your message: ---------- RULE 5. Consecutive whitespace characters (including translated turning into ----------RULE 5. Consecutive whitespace characters (including translated for example. > Note: Multiple spaces can be preserved using the non-break space > character (' '). > >

Some   spaces. Er, is this defined in Unicode or in ISO 10646?? Lee -- Liam Quin -- the barefoot typographer -- Toronto lq-text: freely available Unix text retrieval email address: l i a m q u i n at host: i n t e r l o g dot c o m xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Peter at ursus.demon.co.uk Sat Aug 16 17:18:21 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:16 2004 Subject: WD-xml-970807 (fwd) Message-ID: <9499@ursus.demon.co.uk> Forwarded message follows: >From Dan Connolly (W3C): > > Please distribute this announcement far and wide. > > ============ > http://www.w3.org/TR/ > > Extensible Markup Language (XML) > 7 August 1997, Tim Bray, Jean Paoli, C.M. Sperberg-McQueen > ============ > > http://www.w3.org/TR/WD-xml-970807 > http://www.w3.org/TR/WD-xml-970807.html > http://www.w3.org/TR/WD-xml-970807.xml > http://www.w3.org/TR/WD-xml-970807.ps > http://www.w3.org/TR/WD-xml-970807.ps.zip > [...] > > -- > Dan Connolly, W3C Architecture Domain Lead > http://www.w3.org/People/Connolly/ > > -- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From neil at bradley.co.uk Sat Aug 16 19:52:00 1997 From: neil at bradley.co.uk (Neil Bradley) Date: Mon Jun 7 16:58:16 2004 Subject: Whitespace rules (v2) Message-ID: <199708161751.SAA28294@andromeda.ndirect.co.uk> Dear Liam, Thanks for the feedback. > > [...] > > RULE 2. All whitespace preceding the start-tag and following the end-tag > > of a 'block enclosing' element is discarded. 
> > --- > > Note: a non-validating applications must refer to a style sheet or > > configuration file to identify 'block enclosing' elements (perhaps by > > applying this rule to elements not specified as in-line elements). > > No -- "blockness" is not at all the same as element content. > For example, you have to allow for a run-in heading, which starts out > looking like an HTML H3 (say) except that the rest of the paragraph > follow on on the same line. So it isn't a block in the paragraph sense. > > > As a validating application cannot easily determine this rule from the > > content model (the first mixed content element in the hierarchy is > > block enclosing, as well as all outer layers), it may choose the same > > approach. > > I think this is too complicated, as well as being not 100% right. > I don't think there's a single "right" solution. This is why it's > best to allow the parser to pass _all_ whitespace back to the application, > although it is certainly useful if a DTD-aware parser, even if it isn't > validating, distinguishes element content whitespace from PCDATA whitespace > in some way. Note that these rules are intended for the application, not the parser, or any other part of the XML processor. As I state at the top of the rules, "A formatting application should......according to the following 5 rules". > > Note: If PI's, comments or empty elements remain in the data stream, > > they are deemed transparent to this process, so: > > [SP]

Some text... > > > > becomes: > > > >

Some text... > > Note that if you have a very large comment, you might need a lot of > lookahead here. Actually no, because the application would already KNOW that it is currently in block content. > > RULE 3. A sequence of one or more line-end codes immediately > > following a start-tag, or immediately preceding an end-tag, are > > discarded (except in preserved content). > > This means that > This is > very > strange. > > becomes > This isverystrange. > > or, if you format withut distinguishing emphasis, > This isverystrange. > > which I don't think is what you want. > > But SGML itself is broken in this regard. I know, and as it is impossible to cover all angles. I think your example is one of the least likely things to happen in reality, and if necessary document authors must be educated to avoid it. I am open to other suggestions, of course. I am only trying to get detailed discussions rolling. For example, we could get rid of both rules 2 and 3, and improve rule 5 to say that all surrounding white space is removed. > > RULE 4. A remaining line-end code is converted into a space, except when it is > > preceded by a normal (hard) hyphen, or by a soft hyphen ('°'), > > in which case it is removed (a soft hyphen is also then removed). > > --- > > Note: > > > > A[CR] > > line-[CR] > > end code sep°[CR] > > erates lines. > > > > becomes: > > > > A line-end code seperates lines. > > Well, note that there is no hyphen in that paragraph!! > The character "-" in ISO 8859-1 (Latin 1) and ASCII is _not_ a hyphen. > It is a minus sign. Well, most people in the past have used it as a hyphen in text documents, which I think is the important point here. Also, my source tells me that this character is the official ISO hyphen - but my source may be wrong. > The hyphen is 0255 octal (173 decimal). It is a hyphen, not a soft hyphen. > There is no soft hyphen in Latin 1 OK. I will take your word on this. Again, my source of information may be wrong. 
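The code points under discussion can be checked against the character database that ships with Python. This is a minimal sketch and not part of the original thread; the labels and the helper name `describe` are made up for illustration, and the names reported are those of whatever Unicode version the interpreter bundles, which may not match the ISO 8859-1 or Unicode 1.1 wording the posters were working from in 1997.

```python
import unicodedata

# Characters mentioned in the thread, keyed by the label the posters used.
candidates = {
    'the "-" key': "\u002d",
    "octal 0255 (decimal 173)": "\u00ad",
    "non-break space": "\u00a0",
    "backquote": "\u0060",
}


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Octal 0255 and decimal 173 refer to the same code point.
assert 0o255 == 173 == ord("\u00ad")

for label, ch in candidates.items():
    print(f"{label}: {describe(ch)}")
```

In current Unicode data these four characters are named HYPHEN-MINUS, SOFT HYPHEN, NO-BREAK SPACE and GRAVE ACCENT respectively.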
> I don't have the necessary copy of Unicode in front of me, but last time > I checked (Unicode 1.1) it was the same in this regard, and also in having > the ` character be a spacing grave accent, not a single quote. > > This should be done by applications. I wouldn't want your message: It is being done by the application. What "wouldn't you want your message:"? > ---------- > RULE 5. Consecutive whitespace characters (including translated > turning into > ----------RULE 5. Consecutive whitespace characters (including translated > for example. > > > Note: Multiple spaces can be preserved using the non-break space > > character (' '). > >

Some   spaces. > Er, is this defined in Unicode or in ISO 10646?? Don't know. I have it as a non-breaking space, which I am 'liberally' interpreting here as a required space (if it can't be broken over lines, it must be pretty important). If Unicode has a more explicit required space character, then fine, let's use that. > Lee Neil. ----------------------------------------------- Neil Bradley - Author of The Concise SGML Companion. neil@bradley.co.uk www.bradley.co.uk xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Peter at ursus.demon.co.uk Sat Aug 16 19:56:46 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:16 2004 Subject: Whitespace rules (v2) Message-ID: <9502@ursus.demon.co.uk> Firstly many thanks to neil for posting these proposed rules and those who have answered. On balance (I am an optimist!) I think there is something desirable and achievable here. I think a lot of us feel there has to be some guidance on whitespace and I think Neil has covered much of the ground. I think what is achievable is a set of rules at the 80/20 level (80% of XML-DEV'ers think they are 80% useful). There are certainly areas where there will be disagreement - this was a voluminous topic on XML-WG last autumn. XML-DEV has the advantage and disadvantage that it has no formal standing, so those who don't like anything that comes out of it can ignore it :-). So if we can come up with a set of rules and a label for them, application developers can use them (or not) as they wish. An advantage is that because all discussion is publicly archived, we can always point back and say 'that is why we suggested X'. If a set of rules *does* emerge, then how can we generally inform an application that it should take them as DEFAULT? I assume this is through a PI: ... 
The whitespace[CR][LF]is normalised So I think we need a mechanism from XML-WG to show the application where it should get its DEFAULT processing mechanism from. Specific points: [Rule 1 - normalisation] I think it's essential to have something like Neil's proposal for [CR][LF] In message Liam Quin writes: > On Sun, 10 Aug 1997, Neil Bradley wrote: > > > [...] > > RULE 2. All whitespace preceding the start-tag and following the end-tag > > of a 'block enclosing' element is discarded. > > --- > > Note: a non-validating applications must refer to a style sheet or > > configuration file to identify 'block enclosing' elements (perhaps by > > applying this rule to elements not specified as in-line elements). > > No -- "blockness" is not at all the same as element content. > For example, you have to allow for a run-in heading, which starts out > looking like an HTML H3 (say) except that the rest of the paragraph > follow on on the same line. So it isn't a block in the paragraph sense. > > > As a validating application cannot easily determine this rule from the > > content model (the first mixed content element in the hierarchy is > > block enclosing, as well as all outer layers), it may choose the same > > approach. > > I think this is too complicated, as well as being not 100% right. > I don't think there's a single "right" solution. This is why it's > best to allow the parser to pass _all_ whitespace back to the application, > although it is certainly useful if a DTD-aware parser, even if it isn't > validating, distinguishes element content whitespace from PCDATA whitespace > in some way. I agree with Liam - I didn't understand 'blockness'. I also think that whatever is done here has to be independent of stylesheets and DTDs. The average hacker like me simply won't undertsand the subtleties. > > More than this is a bad idea, I think. > > > > Note: If PI's, comments or empty elements remain in the data stream, > > they are deemed transparent to this process, so: > > [SP]

Some text... > > > > becomes: > > > >

Some text... > > Note that if you have a very large comment, you might need a lot of > lookahead here. I would assume that this processing takes place in the application, not the parser. How/whether comments are passed to the application is part of the parser API. I assume that at this stage the comment is recognised as a single chunk which can be deleted with/out surrounding whitespace as required. > > > RULE 3. A sequence of one or more line-end codes immediately > > following a start-tag, or immediately preceding an end-tag, are > > discarded (except in preserved content). > > This means that > This is > very > strange. > > becomes > This isverystrange. > > or, if you format withut distinguishing emphasis, > This isverystrange. > > which I don't think is what you want. > > But SGML itself is broken in this regard. This one is tough. Please criticise my current view :-). SGML documents seem to use markup as structure in some places (e.g. OL/LI in HTML) or event streams (e.g. EM, B in HTML). Authors/readers expect different processing modes from these types. The example above is best treated as structuring markup (P) containg an event stream (#PCDATA|EM)* [sorry for abbreviations]. So we have to indicate to the processor that P is structuring and that whitespace after
<P> or before </P>
is irrelevant, and that its content is an event stream where all whitespace is normalised to a single space (cf HTML.) Therefore can we have something like this: This isverystrange. (I am sure there are cleaner ways of doing this, especially declaring this for all s). The question is whether a model like this meets the 80/20 rule. > > > RULE 4. A remaining line-end code is converted into a space, except when it is > > preceded by a normal (hard) hyphen, or by a soft hyphen ('°'), > > in which case it is removed (a soft hyphen is also then removed). > > --- I have to argue against this :-(. A hyphen is indistinguishable from a minus to lots of people. There are also many cases where people may wish to end a line with a minus: CL- H+ Since we are normalising whitespace, then lines can always be arranged so that hyphens are unnecessary. Let's see if there is a solution which is simple, covers most of the common problems and which is intuitively obvious to the webhackers who graduate from HTML. We clearly need something more than
 and 
, but it shouldn't be more than, say, twice as complex. I think we are a long way towards that. P. -- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From tbray at textuality.com Sat Aug 16 19:59:48 1997 From: tbray at textuality.com (Tim Bray) Date: Mon Jun 7 16:58:16 2004 Subject: Whitespace rules (v2) Message-ID: <3.0.32.19970816105650.008fda00@pop.intergate.bc.ca> I gotta say that it's noble of you guys to take aim at this particular problem, but you should bear in mind that it's really really really hard. The original goal as stated in SGML was to ignore white space "caused by markup" by which they meant "used to prettyprint markup". A worthy goal, but in fact most people would agree that the rules you have to write to achieve this are horrendously complicated and some would argue that SGML never actually did get it right. We spent a huge amount of time on this in the XML committee and eventually decided that if simple rules could be written, we weren't smart enough to figure them out. So good luck, don't expect it to be easy, but if you get it right the world will be grateful. -Tim xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Peter at ursus.demon.co.uk Sun Aug 17 00:08:29 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:16 2004 Subject: Whitespace rules (v2) Message-ID: <9506@ursus.demon.co.uk> In message <3.0.32.19970816105650.008fda00@pop.intergate.bc.ca> Tim Bray writes: Thanks very much for your support, Tim. 
We believe that XML-DEV has a role in coming up with workable pragmatic solutions to 'parts' of the XML process. Getting those all right at once (i.e. for the spec) may be impossible; getting a few of them mainly right may be a useful step. > I gotta say that it's noble of you guys to take aim at this particular > problem, but you should bear in mind that it's really really really > hard. The original goal as stated in SGML was to ignore white > space "caused by markup" by which they meant "used to prettyprint > markup". A worthy goal, but in fact most people would agree that > the rules you have to write to achieve this are horrendously complicated > and some would argue that SGML never actually did get it right. I'd agree with this. And XML does not work in precisely the same way as SGML here. It's most useful IMO to proceed on the basis that most XML-DEV'ers will not understand the niceties of SGML whitespace but *will be prepared to work to a (fairly) simple set of rules*. If we go for an 80/20 solution (i.e. 80% of users/applications find it useful 80% of the time), that solves 64% - a reasonable starting point... > > We spent a huge amount of time on this in the XML committee and Yes. And it's essential we don't go round this loop again. It will always be possible to pick holes in a proposed set of rules - so we have to accept there will be holes from the start. Just minimise their size and point them out. > eventually decided that if simple rules could be written, we weren't > smart enough to figure them out. I don't think there *is* a solution in terms that a cast-iron spec could contemplate (any more than there is one universal DTD). We have to seek a compromise solution. > > So good luck, don't expect it to be easy, but if you get it right > the world will be grateful.
-Tim Obviously there will be applications which come 'out-of-the-box' - the authoring and processing tools are already written and validated, and most people won't need to see the intermediate XML text. Maybe CDF is in this category. I think we are aiming at those documents which might be processed by generic XML processors, or composed of cut-n-paste from a variety of sources (or both). For example, in a combined MathML and CML document, it is reasonable to expect the whitespace processing to be openly declared, easily implementable and (hopefully) easy to understand. I think we can aim for one (or possibly two) protocols that service 'most' applications. With those there would be simple guidelines for authors (of documents and of processing software). Firstly there are some 'gotchas'. I don't think anyone *wants* CR/LF problems to be platform-dependent. So we have to address this independently of other complications. IMO most XML documents will fall into the categories: (a) precise whitespace matters (PRESERVE or <(HTML)PRE>). The main problem with using this is the CR/LF one. (b) text-like, where markup is for formatting (mixed content, event-stream processing). (c) structured, often with pretty-printing (i.e. redundant whitespace) (element content). (d) mixtures of (b) and (c). This would be common in technical documents with a mixture of 'text' and 'non-textual' structured information. I believe we can come up with simple rules for b/c/d which are reasonably intuitive to the webhacker and also cover a wide enough range of applications. P. 
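The line-end point and the document categories above can be sketched as a small post-parser step. This is a minimal Python sketch, not something proposed in the thread: the function names and the mode labels "preserve", "text" and "structured" are hypothetical, and treating CR LF and a lone CR as equivalent to a single LF is an assumption about what a platform-independent rule might look like.

```python
import re


def normalise_line_ends(text: str) -> str:
    """Map CR LF and lone CR to LF so results do not depend on the platform."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalise_to_single_space(text: str) -> str:
    """Collapse each run of space, tab, CR and LF to one space (category b)."""
    return re.sub(r"[ \t\r\n]+", " ", text)


def discard_if_all_whitespace(text: str) -> str:
    """Drop a chunk that is nothing but pretty-printing whitespace (category c)."""
    return "" if text.strip(" \t\r\n") == "" else text


def process(text: str, mode: str) -> str:
    """Apply the assumed handling for one chunk of character data."""
    text = normalise_line_ends(text)  # applied in every mode
    if mode == "preserve":            # category (a): keep exact whitespace
        return text
    if mode == "text":                # category (b): mixed content
        return normalise_to_single_space(text)
    if mode == "structured":          # category (c): element content
        return discard_if_all_whitespace(text)
    raise ValueError(f"unknown mode: {mode!r}")
```

Category (d) would apply the "text" or "structured" mode element by element. How an application learns which mode applies to which element is the open question in this thread, so the sketch leaves that choice to the caller.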
-- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From neil at bradley.co.uk Sun Aug 17 09:43:33 1997 From: neil at bradley.co.uk (Neil Bradley) Date: Mon Jun 7 16:58:17 2004 Subject: Whitespace rules (v2) Message-ID: <199708170743.IAA28970@andromeda.ndirect.co.uk> Peter Murray-Rust wrote: > If a set of rules *does* emerge, then how can we generally inform an application > that it should take them as DEFAULT? I assume this is through a PI: I was hoping that relevant applications (mainly browsers and typesetting systems) will ALWAYS assume the rules that are finally determined, except where preserved content (or some other set of rules) is explicitly actioned. > I agree with Liam - I didn't understand 'blockness'. I also think that whatever > is done here has to be independent of stylesheets and DTDs. The average hacker > like me simply won't undertsand the subtleties. I am merely trying to distinguish in-line elements from other elements. An in-line element implies no line-breaks above or below it. A 'Block' element therefore DOES imply such a break. I do not use the terms element and mixed content here, because it is not quite the same thing. As I have said before, a Para element is a 'block' element, and has mixed content, but an Emph element is an 'in-line' element, yet also has mixed content. All style sheets, including CSS, understand the concept of in-line and block elements. Any whitespace surrounding a block element MUST be irrelevant. Liam raised the issue of a half-way element type, such as a header which implies a line-break before it, but not after, so that following text will appear on the same line. This one is tricky. Suggestions anybody? 
> I would assume that this processing takes place in the application, not the > parser. How/whether comments are passed to the application is part of the > parser API. I assume that at this stage the comment is recognised as a single > chunk which can be deleted with/out surrounding whitespace as required. As I say at the top of the rules, ALL these rules are applied by the application, not the XML processor. > This one is tough. Please criticise my current view :-). SGML documents seem > to use markup as structure in some places (e.g. OL/LI in HTML) or > event streams (e.g. EM, B in HTML). Authors/readers expect different processing > modes from these types. The example above is best treated as structuring > markup (P) containg an event stream (#PCDATA|EM)* [sorry for abbreviations]. > So we have to indicate to the processor that P is structuring and that > whitespace after
<P> or before </P>
is irrelevant, and that its content is an > event stream where all whitespace is normalised to a single space (cf HTML.) > Therefore can we have something like this: > > > > This isverystrange. > > I think that, ultimately, some combinations of markup will always break whatever rules we come up with. We must ensure that only obscure, non-intuitive combinations do this, then just shout from the rooftops that these combinations are not to be used. > > > > > RULE 4. A remaining line-end code is converted into a space, except when it is > > > preceded by a normal (hard) hyphen, or by a soft hyphen ('°'), > > > in which case it is removed (a soft hyphen is also then removed). > > > --- > > I have to argue against this :-(. A hyphen is indistinguishable from a minus > to lots of people. There are also many cases where people may wish to end > a line with a minus: > > > CL- > H+ > > > > Since we are normalising whitespace, then lines can always be arranged so that > hyphens are unnecessary. My concern was to address existing text files, where hyphens are often used in this way. Maybe I am over-estimating this problem. Neil. ----------------------------------------------- Neil Bradley - Author of The Concise SGML Companion. neil@bradley.co.uk www.bradley.co.uk xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Peter at ursus.demon.co.uk Sun Aug 17 13:50:28 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:17 2004 Subject: Whitespace rules (v2) Message-ID: <9516@ursus.demon.co.uk> In message <199708170743.IAA28970@andromeda.ndirect.co.uk> "Neil Bradley" writes: > > Peter Murray-Rust wrote: > > > If a set of rules *does* emerge, then how can we generally inform an application > > that it should take them as DEFAULT? 
I assume this is through a PI: > > I was hoping that relevant applications (mainly browsers and > typesetting systems) will ALWAYS assume the rules that are finally > determined, except where preserved content (or some other set of > rules) is explicitly actioned. I think - along with TimB - that it is unrealistic to come up with a single set of rules that will serve every application. There was an enormous amount of discussion on the XML group last year and I take it as axiomatic that we cannot produce a set of rules which everyone agrees are: - simple to state - unambiguous - intuitive and easy to learn - universal (i.e. cover every situation) I think that XML will include applications beyond 'browsers and typesetting systems' although these will be the commonest. MathML and CML will have chunks of material which contain whitespace not used primarily as part of text. Here's a simple example: [HT]C H N Cl[CR][LF] [HT]O P Br[CR][LF] where the whitespace is used (a) for visual effect and potential ease in editing (b) as a delimiter (within ATOMS) [HT]=tab, for example. What I am after here is a convention that I can state which instructs the processor how to treat this whitespace. ***I do not wish to have to devise a specific convention for CML***. I want to be able to indicate that the W/S after is irrelevant, and that the whitespace in the ATOMS content is normalisable and used only as a delimiter of tokens. I expect that many other applications will use a similar approach, so I want to share the effort with them. Examples of metadata in XML have often been portrayed as prettyprinted and I expect that CML could use the same conventions. [BTW I think that there will be more human editing of XML files than is often assumed - and metadata is a good example. Prettyprinting is a useful tool in those cases.] I think that we can aim for a set of options that could be used by a post-parser processor.
Different applications (**or document authors**) could choose between them. Examples might be: - normaliseCRLF (Neil's Rule 1) - discardAllWS - normaliseToSingleSpace An author or application could then state which of these it was using. It might be that in the first instance we can only agree on (say) Rule 1, but this would be a useful start. > > > I agree with Liam - I didn't understand 'blockness'. I also think that whatever > > is done here has to be independent of stylesheets and DTDs. The average hacker > > like me simply won't undertsand the subtleties. > > I am merely trying to distinguish in-line elements from other > elements. An in-line element implies no line-breaks above or below > it. A 'Block' element therefore DOES imply such a break. I do not use > the terms element and mixed content here, because it is not quite the > same thing. As I have said before, a Para element is a 'block' > element, and has mixed content, but an Emph element is an 'in-line' > element, yet also has mixed content. All style sheets, including > CSS, understand the concept of in-line and block elements. Any > whitespace surrounding a block element MUST be irrelevant. It looks like the context, rather than the content is the significant feature. > > Liam raised the issue of a half-way element type, such as a header > which implies a line-break before it, but not after, so that > following text will appear on the same line. This one is tricky. > Suggestions anybody? > > > I would assume that this processing takes place in the application, not the > > parser. How/whether comments are passed to the application is part of the > > parser API. I assume that at this stage the comment is recognised as a single > > chunk which can be deleted with/out surrounding whitespace as required. > > As I say at the top of the rules, ALL these rules are applied by the > application, not the XML processor. Agreed. This discussion is about how the application behaves. 
The question is whether we can give it some generic instructions. I'd delete the word 'ALL' if it suggest that you either take all the rules or none. > > > This one is tough. Please criticise my current view :-). SGML documents seem > > to use markup as structure in some places (e.g. OL/LI in HTML) or > > event streams (e.g. EM, B in HTML). Authors/readers expect different processing > > modes from these types. The example above is best treated as structuring > > markup (P) containg an event stream (#PCDATA|EM)* [sorry for abbreviations]. > > So we have to indicate to the processor that P is structuring and that > > whitespace after

<P> or before </P>
is irrelevant, and that its content is an > > event stream where all whitespace is normalised to a single space (cf HTML.) > > Therefore can we have something like this: > > > > > > > > This isverystrange. > > > > > > I think that, ultimately, some combinations of markup will always > break whatever rules we come up with. We must ensure that only > obscure, non-intuitive combinations do this, then just shout from > the rooftops that these combinations are not to be used. It is clear that a set of guidelines and examples must accompany these rules. If necessary we may have to educate people to write XML like: (although I think if we have to go to this stage we have lost 95% of potential XML webhackers). [...] > > My concern was to address existing text files, where hyphens are > often used in this way. Maybe I am over-estimating this problem. I don't think we need to adress the conversion of existing non-XML files to XML in this discussion. The question is what the application does to the output of the XML parser. -------- WS is probably among the commonest problem that most newcomers to XML will face, so it's well worth trying to develop guidelines. P. -- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From tikvas at agentsoft.com Mon Aug 18 12:04:05 1997 From: tikvas at agentsoft.com (Tikva Schmidt) Date: Mon Jun 7 16:58:17 2004 Subject: Where can I find CDF dtd file? Message-ID: <33F81E15.899@agentsoft.com> I'd apprecciate it if someone would tell me where to find the CDF dtd file. Tikva Schmidt. -------------------------------------------------------------------- Tikva Schmidt. email: tikvas@agentsoft.co.il corp: Agentsoft Ltd. 
http://www.agentsoft.co.il Phone: 972-2-6480573
---------------------------------------------------------------------

From ak117 at freenet.carleton.ca Mon Aug 18 12:39:32 1997
From: ak117 at freenet.carleton.ca (David Megginson)
Date: Mon Jun 7 16:58:17 2004
Subject: Where can I find CDF dtd file?
In-Reply-To: <33F81E15.899@agentsoft.com>
References: <33F81E15.899@agentsoft.com>
Message-ID: <199708181038.GAA00192@localhost>

Tikva Schmidt writes:
> I'd appreciate it if someone would tell me where to find the
> CDF dtd file.

You could try putting one together from the excerpts in Microsoft's CDF white paper, but unfortunately, they contain many syntax errors. I wonder if there _is_ actually a DTD yet. I've done some pretty elaborate AltaVista searches (for the likely content of the DTD) and have turned up nothing so far.

All the best,
David

--
David Megginson ak117@freenet.carleton.ca
Microstar Software Ltd. dmeggins@microstar.com
http://home.sprynet.com/sprynet/dmeggins/

From agreene at bitstream.com Mon Aug 18 15:52:29 1997
From: agreene at bitstream.com (Andrew Greene)
Date: Mon Jun 7 16:58:17 2004
Subject: Conditional marked sections
Message-ID: <19970818134844.AAA2763@AGREENE-PC.bitstream.com>

Please forgive what I hope will turn out to be a foolish question, but upon rereading the XML spec, I was left unclear on the question of whether marked sections could be used in the document instance for anything except CDATA.

That is, in full SGML, you can say:

]> This is a section.
and when you run it through nsgmls, you get:

(EXAMPLE
-This is a marked section.
)EXAMPLE
C

But the XML spec implies that conditional inclusion of marked sections is only approved for the DTD, and not for the document instance itself; and that the only legal use of marked sections in the document instance is for CDATA. It is also implied that parameter entities are only valid within the DTD itself.

So, which is it? I'll admit that I'll be disappointed if conditional marked sections are restricted to the DTD.

Thanks,
Andrew Greene

From tbray at textuality.com Mon Aug 18 17:04:41 1997
From: tbray at textuality.com (Tim Bray)
Date: Mon Jun 7 16:58:17 2004
Subject: Conditional marked sections
Message-ID: <3.0.32.19970818080111.00908a70@pop.intergate.bc.ca>

At 09:48 AM 18/08/97 -0400, Andrew Greene wrote:
>Please forgive what I hope will turn out to be a foolish question, but
>upon rereading the XML spec, I was left unclear on the question of
>whether marked sections could be used in the document instance for
>anything except CDATA.

That's right; nothing except CDATA.

>So, which is it? I'll admit that I'll be disappointed if conditional
>marked sections are restricted to the DTD.

Sorry to disappoint.
-Tim xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From russc at watfac.org Tue Aug 19 00:16:25 1997 From: russc at watfac.org (Russell Chamberlain) Date: Mon Jun 7 16:58:17 2004 Subject: Whitespace rules (v2) Message-ID: <3.0.1.32.19970818181729.0069be80@watfac.org> In message <199708170743.IAA28970@andromeda.ndirect.co.uk> "Neil Bradley" writes: > > Peter Murray-Rust wrote: > >I think - along with TimB - that it is unrealistic to come up with s single >set of rules that will server every application. There was an enormous amount >of discussion on the XML group last year and I take it as axiomatic that we >cannot produce a set of rules which everyone agrees are: > - simple to state > - unambiguous > - intuitive and easy to learn > - universal (i.e. cover every situation) Axiomatic? Call me stubborn (you won't be the first), but I, for one, retain some hope. :-) > >I think that XML will include applications beyond 'browsers and typesetting >systems' although these will be the commonest. MathML and CML will have >chunks of material which contains whitespace not used primarily as part of >text. Here's a simple example: > > >[HT]C H N Cl[CR][LF] >[HT]O P Br[CR][LF] > > >where the whitespace is used (a) for visual effect and potential ease in >editing (b) as a delimiter (within ATOMS) [HT]=tab, for example. > >What I am after here is a convention that I can state which instructs the >processor how to treat this whitespace. ***I do not wish to have to devise >a specific convention for CML***. I want to be able to indicate that that >the W/S after is irrelevant, and that the whitespace in the ATOMS content >is normalisable and used only as a delimiter of tokens. > >I expect that many other applications will use a similar approach, so I want >to share the effort with them. 
Examples of metadata in XML have often been >portrayed as prettyprinted and I expect that CML could use the same conventions. >[BTW I think that there will be more human editing of XML files than is often >assumed - and metadata is a good example. Prettyprinting is a useful tool >in those cases.] > >I think that we can aim for a set of options that could be used by a post-parser >processor. Different applications (**or document authors**) could choose between >them. Examples might be: > - normaliseCRLF (Neil's Rule 1) > - discardAllWS > - normaliseToSingleSpace > >An author or application could then state which of these it was using. > >It might be that in the first instance we can only agree on (say) Rule 1, but >this would be a useful start. > >> >> > I agree with Liam - I didn't understand 'blockness'. I also think that whatever >> > is done here has to be independent of stylesheets and DTDs. The average hacker >> > like me simply won't undertsand the subtleties. >> >> I am merely trying to distinguish in-line elements from other >> elements. An in-line element implies no line-breaks above or below >> it. A 'Block' element therefore DOES imply such a break. I do not use >> the terms element and mixed content here, because it is not quite the >> same thing. As I have said before, a Para element is a 'block' >> element, and has mixed content, but an Emph element is an 'in-line' >> element, yet also has mixed content. All style sheets, including >> CSS, understand the concept of in-line and block elements. Any >> whitespace surrounding a block element MUST be irrelevant. > >It looks like the context, rather than the content is the significant >feature. > >> >> Liam raised the issue of a half-way element type, such as a header >> which implies a line-break before it, but not after, so that >> following text will appear on the same line. This one is tricky. >> Suggestions anybody? 
>

The idea of a "half-way" element type just highlights the fact that element nesting does not necessarily map nicely to block/paragraph structure in formatting applications. I like to say that block formatting _transcends_ element nesting -- there is no direct mapping. In my experience, a pair of lower-level concepts (e.g. "block start" and "block end") has proven quite useful. In the current discussion, the "blockness" of the elements might be described as follows:

           "block start"   "block end"
-----------------------------------------
Para           Yes            Yes
Emph           No             No
Hn             Yes            No

where:
"block start" - means start a block at the start of the element
"block end" - means end a block at the end of the element

A notation for describing whitespace handling must communicate the notion that whitespace processing is modal, and provide words for each mode and phrases for the transitions.

Let's consider Peter's tentative rules:

> - normaliseCRLF (Neil's Rule 1)

Please correct me if I am wrong, but this looks like a document-wide setting whose behaviour/interpretation isn't affected by the application type. A simple on/off PI setting could be used to set this.

The rest of the rules, though, could be applied on a per-element basis:

> - discardAllWS
> - normaliseToSingleSpace

I would add:

 - keepAllWS

(I haven't read every word of every post in this thread. Has this third one been discarded as a reasonable option? Even if it has, the rest of my discussion here isn't affected.)

Assuming that the three mutually exclusive rules (or _modes_) can be applied to any element, how can we specify this?

Would being able to specify one of the three modes on a per-element basis be powerful enough? If we used PIs to do this then some HTML tags, for example, might be listed as follows (just a hypothetical notation example, _not_ a final suggestion for notation):

Notes:

- HTML applications could just imply these rules.
- Any elements that aren't listed would just use the current mode, which depends on the context. - If the desired whitespace mode depends on something other than the current element (an attribute, say) then this mechanism won't be powerful enough. - Specifying the whitespace mode on a per-element basis should make this technique well-suited to architectural forms, though. - Russ PS - Should whitespace be blacklisted? ;-) xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From tfj at apusapus.demon.co.uk Tue Aug 19 00:39:17 1997 From: tfj at apusapus.demon.co.uk (Trevor Jenkins) Date: Mon Jun 7 16:58:17 2004 Subject: Other whitespace problems was Re: Whitespace rules (v2) In-Reply-To: <3.0.32.19970816105650.008fda00@pop.intergate.bc.ca> Message-ID: <199708182152.tfj.2174@apusapus.demon.co.uk> > The original goal as stated in SGML was to ignore white > space "caused by markup" by which they meant "used to prettyprint > markup". A worthy goal, but in fact most people would agree that > the rules you have to write to achieve this are horrendously complicated > and some would argue that SGML never actually did get it right. Whilst all the discussion upon "whitespace caused by markup" has been going-on I've had reason to look at whitespace within the various declarations. I have always been very wary of the separator rules for SGML declarations (as a computing scientist I find it odd that such separators have been hard-coded in the grammar rules themselves). I'm convinced that as they stand the separator rules in XML are ambiguous. I have been looking at the element declaration in particular and its abundance of Ss leads to ambiguity. 
As I read the grammar the following is ambiguous: At 09:52 PM 18/08/97 +0000, Trevor Jenkins wrote: > I'm >convinced that as they stand the separator rules in XML are >ambiguous. Yes; Michael Sperberg-McQueen and I both agree that these need some more work. If it weren't for the $#*!@#%#!ing Parameter Entities, all this would be simple and straightforward - designing a grammar for the SGML element declaration language is not exactly rocket science. But when you try to pollute the grammar by saying where you can and can't replace chunks of it with PE references, it all of a sudden gets hideously difficult. SGML gets around this with the clever device of the Ee (entity end) virtual token... which we in the XML gang thought was hopelessly unaesthetic; after some struggles with this particular problem, Ee is starting to look better. Mind you, of the 3 XML-lang co-editors, two (I and Jean Paoli) have voted against the existence of PEs at every opportunity; these votes are in some part self-serving. However, there can be no doubt that if you want to build and maintain 8879-style markup declarations, it's basically just not possible to do this without PE's. Sigh. Mind you, some of us have another solution for that... Another compromise would be to apply the internal-subset rule, i.e. you can have PE's but they have to replace whole declarations. There are other interim measures, i.e. you can only replace a whole content model; all involve severe limitations on PE usefulness as the payment for spec/grammar clarity. Anyhow, further grammar engineering is in order. One thing to think about is simply to drop the 'S' (space) nonterminal, write a couple of simple tokenization rules, and take it that way. CMSMcQ has investigated this at length, but it has problems too. Pardon me for whining; I'm sure we'll figure out something. 
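
A minimal sketch of the "drop S and write tokenization rules" idea, in Python. Everything here is an assumption made for illustration: the token classes, the name `tokenize`, and the restriction to a simple element declaration are not taken from the XML draft, and parameter-entity references (the hard part) are not handled at all. It only shows that once whitespace is consumed by a tokenizer, the declaration grammar no longer needs to say where S may appear.

```python
import re

# Assumed token classes for a declaration such as
#   <!ELEMENT para (#PCDATA | emph)*>
# Whitespace is recognised here and never reaches the grammar.
TOKEN_RE = re.compile(r"""
    (?P<WS>      [ \t\r\n]+ )
  | (?P<OPEN>    <!ELEMENT )
  | (?P<CLOSE>   > )
  | (?P<PCDATA>  \#PCDATA )
  | (?P<NAME>    [A-Za-z_:][A-Za-z0-9._:-]* )
  | (?P<PUNCT>   [()|,*+?] )
""", re.VERBOSE)


def tokenize(decl):
    """Return (kind, text) pairs for a declaration, dropping whitespace tokens."""
    tokens = []
    pos = 0
    while pos < len(decl):
        m = TOKEN_RE.match(decl, pos)
        if m is None:
            raise ValueError("unexpected character at offset %d: %r" % (pos, decl[pos]))
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


compact = tokenize("<!ELEMENT para (#PCDATA|emph)*>")
spaced = tokenize("<!ELEMENT  para\n  ( #PCDATA | emph ) * >")
print(compact == spaced)
```

Both spellings yield the same token list, so a grammar written over tokens cannot be ambiguous about separators. The price is that this sketch no longer enforces whitespace where it is required (for example between the keyword and the element name), which would need an extra rule.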
-Tim

From digitome at iol.ie Tue Aug 19 12:12:07 1997
From: digitome at iol.ie (Sean Mc Grath)
Date: Mon Jun 7 16:58:17 2004
Subject: Whitespace
Message-ID: <199708191011.LAA29289@GPO.iol.ie>

>> Peter Murray-Rust wrote:
>>
>>I think - along with TimB - that it is unrealistic to come up with a single
>>set of rules that will serve every application. There was an enormous amount
>>of discussion on the XML group last year and I take it as axiomatic that we
>>cannot produce a set of rules which everyone agrees are:
>> - simple to state
>> - unambiguous
>> - intuitive and easy to learn
>> - universal (i.e. cover every situation)
>

**Warning:** Rush of blood to the head follows. Get those flame throwers ready...

I know this whole white space thing was thrashed out at length some time ago but it worries me greatly that on XML-DEV the whole issue seems to be as problematic as it was before XML-Lang's rulings on whitespace handling were decided upon. It seems that the problem was not really solved - just pushed up a layer :-)

It just sounds wrong to me that white space handling is to be the subject of application conventions rather than part of the core XML parsing activity.

Anyway, I think everyone should be allowed to over-simplify the "White Space Problem" once in their lives! Here is my contribution:-

Ban mixed content. Mixed content is a markup minimization feature.

If you want a chunk of PCDATA in an XML doc, use the reserved element name.

I am data 1 I am data 2

Becomes

I am line 1I am line 2

If you need whitespace to be something other than whitespace, i.e. a newline to be a real newline to be passed on to the application, use an empty element type to represent it.
I am data 1 I am data 2 Give me five minutes to put on the asbestos suit and then you flame away.... xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From bdonoghoe at spin.net.au Tue Aug 19 15:46:42 1997 From: bdonoghoe at spin.net.au (Bill Donoghoe) Date: Mon Jun 7 16:58:17 2004 Subject: Whitespace Message-ID: <199708191344.XAA01627@spin.net.au> >Sean Mc Grath wrote: >>> Peter Murray-Rust's post removed to conserve space > >**Warning:** Rush of blood to the head follows. Get those flame throwers >ready... > >I know this whole white space thing was trashed out at length some time ago but >it worries me greatly that on XML-DEV the whole issue seems to be as problematic >as it was before XML-Lang's rulings on whitespace handling where decided upon. >It seems that the problem was not really solved - just pushed up a layer:-) > >It just sounds wrong to me that white space handling is to be the subject of >application conventions rather than part of the core XML parsing activity. > >Anyway, I think everyone should be allowed over-simplify the "White Space >Problem" >once in there lives! Here is my contribution:- > > >Ban mixed content. Mixed content is a markup minimization feature. > >If you want a chunk of PCDATA in an XML doc, use the >reserved element name. > > > I am data 1 > I am data 2 > > >Becomes >I am line 1I am line 2 > >If you need whitespace to be something other than whitespace- i.e. a >newline to be a real newline to be passed on to the application, use an >empty element type to represent it. > > > I am data 1 > I am data 2 > > > >Give me five minutes to put on the asbestos suit and then you flame >away.... > Instead of flaming you I will hope onto the bandwagon (can I borrow the asbestos suit for awhile). 
Firstly, to paraphrase some earlier comments, the "whitespace problem" has resulted from its dual personality.

Personality 1. The programmer's whitespace ("pretty printing") is used as a layout tool for visual editing of the markup and content. Besides, lots of editing applications won't allow lines over 250 characters.

Personality 2. The whitespace is part of the content, used because the author either wanted it that way or he/she could not see any other easy way to encode the information correctly.

SGML tried to cater for both personalities and it succeeded in a moderate fashion. The downside was that it is not an easy task to maintain and process SGML documents.

Now for some personal opinion on what I thought XML was all about. XML is an attempt to either simplify SGML (get rid of or change the bits which make it hard to understand/use/process) or extend HTML to deal with information content as well as presentation. I lean towards the former view, "SGML for the Web".

IMHO the current XML "whitespace handling" has not simplified the SGML situation significantly.

Here are some comments and slight variations on Sean's suggestion.

I believe that Sean's suggestion has plenty of merit. What is wrong with having some standard elements (,,) which are part of every XML DTD? If you didn't want users to have to author these tags then "normalisation" applications could be developed which could convert "raw" XML into the "normalised" version.

Example:

I am data 1 I am data 2

could be normalised to:

I am data 1 I am data 2

or

I am data 1 I am data 2

depending on the DTD declarations for the elements or a style sheet (?!!)

However, normalisation is not needed if the authors can be given tools which can produce the desired markup.

Thus, all whitespace in the "normalised" documents could be collapsed to a single space (because we removed personality 2 we are only left with pretty printing).

I will stop rambling now.
IMHO the solution lies in removing the dual personalities of whitespace at document authoring time (or at its interface to XML tools for documents tagged by human hand). Regards, Bill Regards, Bill Donoghoe bdonoghoe@acslink.net.au InfoTech (NSW) Pty Ltd mobile: 014 625 397 (in Australia) SGML/HyTime/DSSSL/XML Consultancy and Development xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From tms at ansa.co.uk Tue Aug 19 20:33:32 1997 From: tms at ansa.co.uk (Toby Speight) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace In-Reply-To: bdonoghoe@spin.net.au's message of Tue, 19 Aug 1997 23:44:08 +1000 (EST) References: <199708191344.XAA01627@spin.net.au> Message-ID: A non-text attachment was scrubbed... Name: not available Type: text/plain (pgp signed) Size: 2803 bytes Desc: not available Url : http://mailman.ic.ac.uk/pipermail/xml-dev/attachments/19970819/ceb413c8/attachment.bin From dgd at cs.bu.edu Tue Aug 19 23:57:37 1997 From: dgd at cs.bu.edu (David G. Durand) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace rules (v2) In-Reply-To: <3.0.1.32.19970818181729.0069be80@watfac.org> Message-ID: I observed with dismay that the issue of whitespace has surfaced on this list, after we finally gave it the wooden-stake-in-the-heart treatment on the WG discussion lists. As a chief proponent of the current method, I'll take a shot at explaining the rationale, as that is something that doesn't really fit in a standard, but actually helps a great deal in understanding one. I'm taking some recent notes on this list as a starting point. At 5:17 PM -0500 8/18/97, Russell Chamberlain wrote: >> Peter Murray-Rust wrote: >> >>I think - along with TimB - that it is unrealistic to come up with s single >>set of rules that will server every application. 
There was an enormous amount
>>of discussion on the XML group last year and I take it as axiomatic that we
>>cannot produce a set of rules which everyone agrees are:
>> - simple to state
>> - unambiguous
>> - intuitive and easy to learn
>> - universal (i.e. cover every situation)
>
>Axiomatic? Call me stubborn (you won't be the first), but I, for one,
>retain some hope. :-)

We all did at first. The problem is really the last point -- _universal_ -- and while I am tempted to agree with Peter, I do not, in fact, because I think the current method actually does satisfy all four points -- but not necessarily in the way that you would expect.

>>[Peter states in detail different policies on whitespace he might need in
>>different contexts.]
>>
>>What I am after here is a convention that I can state which instructs the
>>processor how to treat this whitespace. ***I do not wish to have to devise
>>a specific convention for CML***. I want to be able to indicate that
>>the W/S after is irrelevant, and that the whitespace in the ATOMS content
>>is normalisable and used only as a delimiter of tokens.

The problem with this is that there are a large number of ways that whitespace can be used: the "tokens" form mentioned at the end, for example, has never been proposed for XML.

>>I expect that many other applications will use a similar approach, so I want
>>to share the effort with them. Examples of metadata in XML have often been
>>portrayed as prettyprinted and I expect that CML could use the same conventions.

This sharing makes sense only when the sharing of effort is not imposing an unreasonable burden on others. The problem with whitespace is that the different possible policies are all unneeded by many applications. The typical browser/formatter may never need "token" style whitespace, and may implement such things by passing data to applets or other external processes that will handle them.
In fact, the need to write xml->xml transducers (SGML has taught us that this need never goes away) argues that it must be _possible_ to see all whitespace at least _some_ of the time, regardless of document. That's one reason that the current "pass all whitespace" model works. The other reason that it works is that you can always ignore data that you're not interested in (whitespace) but you can never get access to data that is hidden from you -- therefore the price of the convenience of "automatic whitespace removal" is an inability to see that space without using non-standard tools.

>>I think that we can aim for a set of options that could be used by a post-parser
>>processor. Different applications (**or document authors**) could choose between
>>them. Examples might be:
>> - normaliseCRLF (Neil's Rule 1)
>> - discardAllWS
>> - normaliseToSingleSpace

I agree that this is the right place for such processing to happen (between a parser and an application). I'm not yet sure whether these things are as reusable as people think. I do know that without the use of #FIXED attributes (so I could avoid markup in the instance) I would _not_ use these, but rather make sure that my application (or stylesheet language) had the ability to apply these policies on request, as needed.

>
>
>A notation for describing whitespace handling must communicate the notion
>that whitespace processing is modal, and provide words for each mode and
>phrases for the transitions.
>
>Let's consider Peter's tentative rules:
>
>> - normaliseCRLF (Neil's Rule 1)
>
>Please correct me if I am wrong, but this looks like a document-wide
>setting whose behaviour/interpretation isn't affected by the application
>type. A simple on/off PI setting could be used to set this.

One might want to do this only in specific elements. Say I'm piping some sub-elements to a stupid processor, and that requires a fixed line-end convention, but none of my other processing cares.
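
As a rough illustration of the "pass all whitespace, let the application decide" model, here is a small Python sketch of a post-parser step that applies one of the modes named in this thread on a per-element basis. The function names, the `MODES` table, and the decision to trim leading and trailing space in normaliseToSingleSpace are all assumptions made for the example; nothing here is defined by the XML draft.

```python
import re

# Mode names as floated in this thread; their exact semantics are assumed here.
KEEP_ALL_WS = "keepAllWS"
DISCARD_ALL_WS = "discardAllWS"
NORMALISE_TO_SINGLE_SPACE = "normaliseToSingleSpace"


def normalise_crlf(text):
    """Map CR LF pairs and bare CRs to a single LF (one reading of normaliseCRLF)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def apply_ws_mode(text, mode):
    """Apply one whitespace mode to character data the parser passed through untouched."""
    if mode == KEEP_ALL_WS:
        return text
    if mode == DISCARD_ALL_WS:
        return re.sub(r"[ \t\r\n]+", "", text)
    if mode == NORMALISE_TO_SINGLE_SPACE:
        # Assumption: runs collapse to one space and the ends are trimmed.
        return re.sub(r"[ \t\r\n]+", " ", text).strip()
    raise ValueError("unknown whitespace mode: " + mode)


# Hypothetical per-element policy, chosen by the application rather than the parser.
MODES = {"ATOMS": NORMALISE_TO_SINGLE_SPACE, "PRE": KEEP_ALL_WS}


def process(element_name, text, default=KEEP_ALL_WS):
    """Look up the element's mode, falling back to keeping all whitespace."""
    return apply_ws_mode(text, MODES.get(element_name, default))


atoms = "\tC H N Cl\r\n\tO P Br\r\n"
print(process("ATOMS", atoms).split(" "))
```

Because the parser hands over every character, each of these policies can be layered on afterwards; the reverse (recovering whitespace a parser has already discarded) is not possible.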
>
>The rest of the rules, though, could be applied on a per-element basis:
>
>> - discardAllWS
>> - normaliseToSingleSpace
>
>I would add:
>
> - keepAllWS
>
>(I haven't read every word of every post in this thread. Has this third one
>been discarded as a reasonable option? Even if it has, the rest of my
>discussion here isn't affected)

This is the option that XML universally adopts. That means that any other method can be implemented _by any processor that cares_. If one can imagine destroying the meaning of a document's content by the flattening of all whitespace strings to a single space, then you may need more elements in your content model, if you are not able to control the software that will process the document.

In other words the parser guarantees all WS will be visible to applications -- this makes designing and implementing WS dependent processing easy -- but since applications are _not_ constrained as to folding or other WS processing behaviour, document authors will have to be cautious in using significant whitespace. If you can't assume that applications that process your markup will do the right thing, then you should not play games with WS. This actually is not much of an issue for CML, since it's a reasonable assumption that any implementation of CML markup-display will have to do lots of special things, of which whitespace is the least.

[[[Geek note: I think that authors might be a little safer if significant WS is in a CDATA marked section. Since CDATA is essentially a quoting mechanism, applications should be more careful about such content.]]]

>Would being able to specify one of the three modes on a per-element basis
>be powerful enough? If we used PIs to do this then some HTML tags, for
>example, might be listed as follows (just a hypothetical notation example,
>_not_ a final suggestion for notation):
>
>
>
>
>
>Notes:
>
>- HTML applications could just imply these rules.
>
>- Any elements that aren't listed would just use the current mode, which
>depends on the context.
>
>- If the desired whitespace mode depends on something other than the
>current element (an attribute, say) then this mechanism won't be powerful
>enough.
>
>- Specifying the whitespace mode on a per-element basis should make this
>technique well-suited to architectural forms, though.

One way to see that this is inadequate is to think about typesetting, where you may need to consider the whitespace and adjacent typefaces independent of their placement with respect to markup, in order to correctly handle italic corrections and the like. This is something that authors frequently fail to get right, and that is probably best solved, 90% of the time, by smart software. (Let's not even consider the problem of punctuation in the same environments!)

I think XML's agnostic position is the correct one for the language. Authors should probably assume (unless they anticipate absolutely no re-use) that HTML-style draconian normalization might occur anywhere and use markup rather than whitespace, or at least CDATA sections. This position _may_ be moderated (a little) where a well-known DTD with well-defined WS rules can be used (like the TEI or HTML).

-- David

_________________________________________
David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com Boston University Computer Science \ Sr.
Analyst http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams --------------------------------------------\ http://dynamicDiagrams.com/ MAPA: mapping for the WWW \__________________________ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From digitome at iol.ie Wed Aug 20 00:39:30 1997 From: digitome at iol.ie (Sean Mc Grath) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace Message-ID: <199708192239.XAA01470@GPO.iol.ie> Paul Prescod made the point that Charles Goldfarb made the "ban mixed content" suggestion some time ago. In private correspondence, a number of other XML'ers have said likewise. Paul goes on to say that it was rejected as unwieldy at the time. I was not involved in XML at the time but the more I think about it the more "wieldy" Charles' idea seems. I think it speaks volumes for the merit of Charles' idea that the best and brightest brains in the SGML world have fought with this issue since the early days of XML without achieving (IMHO) the hoped-for breakthrough. If it is more complex than "I before E except after C" or "the right hand thumb rule", it is too complex IMHO. The PCDATA element trick is sooooo easy to understand! Mixed content SGML can be converted to this "mixed-content-free" format quite easily. XML started out aiming for simplicity. It has achieved this wonderfully well in a whole variety of areas but "the white space" is not one of them. If it is too late to revisit this I will have to console myself with the thought that the universe bifurcated when the white space decision was made. In some parallel universe, Charles' suggestion is simplifying XML for many people. Anyway, perhaps it is too late to revisit the mixed content problem. I hope not but will shut up when someone who knows what the position is tells me to.
Sean Sean Mc Grath sean@digitome.com Digitome Electronic Publishing http://www.digitome.com xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From tbray at textuality.com Wed Aug 20 01:03:23 1997 From: tbray at textuality.com (Tim Bray) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace Message-ID: <3.0.32.19970819160004.00917aa0@pop.intergate.bc.ca> At 11:12 PM 19/08/97 +0100, Sean Mc Grath wrote: >The PCDATA element trick is sooooo easy to understand! Mixed content >SGML can be converted to this "mixed-content-free" format quite easily. Hmm, let's say the GI is the null string.

<>Some text that is italicized<>.

Whitespace discussions cause brain damage. -T. xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From ricko at allette.com.au Sun Aug 24 16:44:56 1997 From: ricko at allette.com.au (Rick Jelliffe) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace Message-ID: <199708241452.AAA07286@jawa.chilli.net.au> From: Sean Mc Grath > If you need whitespace to be something other than whitespace- i.e. a > newline to be a real newline to be passed on to the application, use an > empty element type to represent it. > > I am data 1 > I am data 2 > Yes and no. is not needed in XML. ISO10646 includes characters which unambiguously represent line-breaks and paragraph breaks: U+2028 and U+2029. I am data 1
I am data 2 Any conventions for handling whitespace in XML do not need to address "hard returns". If someone wants a hard return, they can mark it up explicitly just using what XML already provides (by adopting ISO 10646). Similarly, XML-DEV does not need to make up any conventions to handle no-break spaces (  or  ) or "hard spaces" (ideographic space does not collapse:  ). Lets not make this more complicated than it is! Rick Jelliffe P.S. In the example quoted, I think probably is a closer description of the element rather than . xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From ricko at allette.com.au Sun Aug 24 17:05:00 1997 From: ricko at allette.com.au (Rick Jelliffe) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace rules (v2) Message-ID: <199708241512.BAA07702@jawa.chilli.net.au> > From: Liam Quin > The hyphen is 0255 octal (173 decimal). It is a hyphen, not a soft hyphen. > There is no soft hyphen in Latin 1. > I don't have the necessary copy of Unicode in front of me, but last time > I checked (Unicode 1.1) it was the same in this regard, and also in having > the ` character be a spacing grave accent, not a single quote. xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From ricko at allette.com.au Sun Aug 24 17:10:37 1997 From: ricko at allette.com.au (Rick Jelliffe) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace rules (v2) Message-ID: <199708241518.BAA07768@jawa.chilli.net.au> > From: Liam Quin > The hyphen is 0255 octal (173 decimal). It is a hyphen, not a soft hyphen. > There is no soft hyphen in Latin 1. 
> I don't have the necessary copy of Unicode in front of me, In both Unicode 1.0 and Unicode 2.0 ­ is called "soft hyphen" or "discretionary hyphen", so it is available, but perhaps not reliably supported by 8859-1 applications. Also available is the zero-width space &#x200B; which can be used to provide non-hyphenating line-break points inside long technical terms (this might be useful in chemical names, where a dash of any kind might be misleading) and in languages in which words are not delimited by spaces. For example, supercali&#x200B;fragalistic&x200B;expialladocious. Rick Jelliffe xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From liamquin at interlog.com Sun Aug 24 23:26:50 1997 From: liamquin at interlog.com (Liam Quin) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace rules (v2) In-Reply-To: <199708241518.BAA07768@jawa.chilli.net.au> Message-ID: On Mon, 25 Aug 1997, Rick Jelliffe wrote: > > From: Liam Quin > > The hyphen is 0255 octal (173 decimal). It is a hyphen, not a soft hyphen. > > There is no soft hyphen in Latin 1. > > I don't have the necessary copy of Unicode in front of me, > > In both Unicode 1.0 and Unicode 2.0 ­ is called "soft hyphen" > or "discretionary hyphen", so it is available, but perhaps not reliably > supported by 8859-1 applications. Not supported at all would be a fairer way to put it! At any rate not by _conforming_ 8859-1 applications, as far as I understand it... in the same way that most SGML applications don't treat &x; as a syntax error even when it's illegal in ISO C or FORTRAN :-) I don't have a copy of 8859 any more to check, but if the hyphen character is to be treated as a soft hyphen, there's no way to type a hard hyphen... > Also available is the zero-width space &#x200B; > For example, supercali&#x200B;fragalistic&x200B;expialladocious.
Perhaps, but to claim that this is more readable to humans than supercali&softhy;fragalistic&softhy;expialladocious. would be absurd. If you hadn't omitted the # in the 2nd reference, the length would have been the same too. Using &hy; is even better. You can always do Lee -- Liam Quin -- the barefoot typographer -- Toronto lq-text: freely available Unix text retrieval email address: liamquin, at host: interlog dot com xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From ricko at allette.com.au Mon Aug 25 00:48:32 1997 From: ricko at allette.com.au (Rick Jelliffe) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace rules (v2) Message-ID: <199708242256.IAA14336@jawa.chilli.net.au> > From: Liam Quin > I don't have a copy of 8859 any more to check, but if the hyphen character > is to be treated as a soft hyphen, there's no way to type a hard hyphen... Yes. But why is this a surprise? A "hard hyphen" is a dash (copying whatever kind of dash has been used by the application) followed by a hard return. > Perhaps, but to claim that this is more readable to humans than > supercali&softhy;fragalistic&softhy;expialladocious. > would be absurd. If you hadn't omitted the # in the 2nd reference, the > length would have been the same too. Using &hy; is even better. It might be more useful to include a hyphenation dictionary at the top of the document that can be fed into the typesetting application's hyphenation dictionary, rather than complicate the text with in-place soft hyphens. You can then also use any character you like to signal the soft hyphen, which may shorten things. over^blown, under^done > You can always do > > Yes. I think people use "­" for soft hyphen more than "&hy;".
Rick Jelliffe xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From murata at apsdc.ksp.fujixerox.co.jp Mon Aug 25 04:11:14 1997 From: murata at apsdc.ksp.fujixerox.co.jp (MURATA Makoto) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace In-Reply-To: <3.0.32.19970819160004.00917aa0@pop.intergate.bc.ca> Message-ID: <9708250211.AA01302@lute.apsdc.ksp.fujixerox.co.jp> Tim Bray writes: > >Hmm, let's say the GI is the null string. > >

<>Some text that is italicized<>.

Suppose that we have different kinds of tags for mixed-content elements (e.g, and ) and element-content elements (e.g, and ). Then, even non-validating parsers can tell element contents and mixed contents. Does this help? >Whitespace discussions cause brain damage. A fatal error. I can not, and should not recover... MURATA Makoto (FAMILY Given) Fuji Xerox Information Systems Tel: 044-812-7230 Fax: 044-812-7231 E-mail: murata@apsdc.ksp.fujixerox.co.jp xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From liamquin at interlog.com Mon Aug 25 06:30:29 1997 From: liamquin at interlog.com (Liam Quin) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace rules (v2) In-Reply-To: <199708242256.IAA14336@jawa.chilli.net.au> Message-ID: On Mon, 25 Aug 1997, Rick Jelliffe wrote: > > From: Liam Quin >> I don't have a copy of 8859 any more to check, but if the hyphen character >> is to be treated as a soft hyphen, there's no way to type a hard hyphen... > > Yes. But why is this a surprise? A "hard hyphen" is a dash (copying whatever > kind of dash has been used by the application) followed by a hard return. So I can't type "Forbes-Hamilton" with a hyphen? (I have used a minus sign here because I'm using 7-bit ASCII software right now!)
At any rate, unless hyphenation behaviour becomes part of XML-LANG, I don't see that this discussion is relevant, although by all means mail me privately if you want to prolong it :-) Lee -- Liam Quin -- the barefoot typographer -- Toronto lq-text: freely available Unix text retrieval email address: liamquin, at host: interlog dot com xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From mrc at allette.com.au Mon Aug 25 11:39:37 1997 From: mrc at allette.com.au (Marcus Carr) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace References: <9708250211.AA01302@lute.apsdc.ksp.fujixerox.co.jp> Message-ID: <340152AF.F224A51C@allette.com.au> Apologies in advance to all those who have thought and fought over this issue for a long time, but as a self-confessed critic of the claim that "XML is SGML", I feel compelled to throw my hat into the ring. As far as I can see, there are only two circumstances when whitespace is an issue - receiving an XML document or authoring one. Receiving, it doesn't matter if you have a DTD or not - the application can determine from a well formed document whether it should regard an element's content as MIXED or ELEMENT. It does involve parsing it, but only until it sees mixed content. If elements are assumed to be ELEMENT until proven otherwise, surely this wouldn't be a massive overhead. Authoring applications would be similar - the first time a tag contained mixed content, the application would reset the status of the element. The onus would from then on be on the application to assist the user in creating semantically correct documents, by such mechanisms as not allowing hard returns at element boundaries, in short, making significant whitespace look like significant whitespace. 
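A minimal sketch of the detection idea described above, under stated assumptions: every element type starts out as ELEMENT and is reset to MIXED the first time an instance of it directly contains non-whitespace character data. The function name `classify_content` and the sample document are hypothetical, and Python's standard `xml.etree.ElementTree` is used only as a convenient well-formedness parser. Whitespace-only text is treated as pretty-printing and does not count as evidence of mixed content, which is an assumption of this sketch, not something the XML spec prescribes.

```python
import xml.etree.ElementTree as ET


def classify_content(xml_text):
    """Return a dict mapping element names to 'ELEMENT' or 'MIXED'.

    An element type is assumed to be ELEMENT until some instance of it
    is seen to contain non-whitespace character data directly.
    """
    root = ET.fromstring(xml_text)
    status = {}
    for elem in root.iter():
        # Character data directly inside elem is elem.text plus the
        # tail text that follows each of its child elements.
        chunks = [elem.text or ""] + [child.tail or "" for child in elem]
        if any(chunk.strip() for chunk in chunks):
            status[elem.tag] = "MIXED"            # proven otherwise
        else:
            status.setdefault(elem.tag, "ELEMENT")  # default assumption
    return status


sample = (
    "<doc>\n"
    "  <sec>\n"
    "    <p>Some <i>italic</i> text</p>\n"
    "  </sec>\n"
    "</doc>"
)
print(classify_content(sample))
```

Note that this sketch reads the whole document before answering, so it says nothing about how much a streaming parser would have to buffer before it could make the same decision.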
MURATA Makoto wrote: > Suppose that we have different kinds of tags for mixed-content > elements (e.g, and ) and element-content > elements (e.g, and ). Then, even > non-validating parsers can tell element contents and mixed contents. > Does this help? It seems that the choices are either the current proposal that nobody seems to feel is entirely satisfactory, or suggestions such as the above, which would certainly work, but ultimately may involve as great an overhead as sending the DTD. It seems to me that we're throwing the baby out with the bathwater by ignoring a solution such as declaring at the start of the document how whitespace in elements should be handled. I would also like to see DTDs sent to non-validating parsers, just so they could determine how to apply whitespace rules without necessarily having to do any structural parsing. If need be, two new types of declared content could be added, ELEMENT and MIXED. They might behave the same way as ANY, or the DTD could be constructed even more loosely, where only MIXED elements were declared and everything else was defaulted to ELEMENT. This would result in a small DTD sent only for the sake of making the application aware of how to deal with whitespace. If desirable, no DTD need be sent, but the application's performance may suffer marginally for it. This is in keeping with the idea that an application need not know how to deal with a document as it comes in. As far as I can see, much of the functionality in XML (such as linking) relies on a DTD, so it's not going to be foreign to most XML applications anyway. The whitespace rules in SGML can be simplified - most people accept that they should. Because inclusions and exclusions aren't valid in XML anyway, the rules are already somewhat simpler. I would really like to see XML and SGML stay in synch - I think anything else would be to everyone's disadvantage.
There really isn't a lot of point in flaming me for this; the question is well intentioned and the current solution seems to have satisfied few. The concept of declaring things at the start is a tried and true methodology, yet we seem to be fleeing it in favor of something nobody's quite sure about. -- Regards Marcus Carr email: mrc@allette.com.au _______________________________________________________________ Allette Systems (Australia) email: info@allette.com.au Level 10, 91 York Street www: http://www.allette.com.au Sydney 2000 NSW Australia phone: +61 2 9262 4777 fax: +61 2 9262 4774 _______________________________________________________________ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Peter at ursus.demon.co.uk Mon Aug 25 13:33:45 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:18 2004 Subject: Whitespace rules (v2) Message-ID: <9623@ursus.demon.co.uk> I have been away for a few days so maybe it's a useful time to try to summarise the Whitespace debate and to ask a few questions. You don't need to read the rest of this unless you believe there is a problem to be addressed :-) In message dgd@cs.bu.edu (David G. Durand) writes: > I observed with dismay that the issue of whitespace has surfaced on this > list, after we finally gave it the wooden-stake-in-the-heart treatment on > the WG discussion lists. As a chief proponent of the current method, I'll :-) I am not sure what has been killed :-) > take a shot at explaining the rationale, as that is something that doesn't > really fit in a standard, but actually helps a great deal in understanding > one. 
I will take David's points first, because I *do* believe that many of those who were involved in the development of the spec feel that there is no scope for further discussion of this *IN THE SPEC*. I agree with this. Essentially the spec says: - This is a difficult problem. [Actually it doesn't say this, but it might help if it did in a footnote.] - We have taken a minimalist approach where we do not give any support to any whitespace philosophy [other than PRESERVE which passes everything and can be platform-dependent], but leave this to the community. DEFAULT is simply the absence of PRESERVE. I believe this solves one species of problem, where the authoring tool/system is closely coupled to the application. CDF might be such a system (e.g. I have never seen a native CDF file). *IF* this is the major use of XML - where there is a one-to-one communication of this sort - then there is no real problem. I do not believe this is the case, and I think there are at least two areas where XML will run into this general problem on numerous occasions: (A) There is a defined DTD (e.g. TEI, HTML) but a variety of authoring tools and a variety of applications from different providers. Traditionally these will come from the SGML community. I believe that there will certainly be initial problems where m'facturer X emits whitespace in a particular way which is incompatible with Y's tools for rendering/transforming it. It may also be platform dependent. We've seen this in the development of HTML systems although they are improving. Remember that most SGML systems are currently implemented within a single site (the tools are chosen to be compatible throughout the process). Very little SGML is delivered over the WWW to be consistent between different m'facturers. XML is specifically designed to be delivered over the WWW in (I assume) a platform and m'facturer-independent way. Do we expect to see 'this XML file best viewed with FOO software'??? If so, we might as well give up now.
IMO any developer needs to be able to say: (i) I support a wide range of XML DTDs. (ii) I can easily customise my software to support a range of commonly used DTDs (iii) Documents authored by my software should be readable by software from another m'facturer with whom I have had no formal discussions (iv) My system can support a range of applications which read documents produced by other m'facturers systems and with whom I have had no formal discussions. If all the manufacturers tell me this is a non-problem, I'll shut up (on this issue!) If each DTD defines its own use of whitespace (or worse, doesn't define it) they may have a lot of work. (B) There are generic XML applications. The XML community continues to discuss documents which 'contain information from more than one DTD' or 'are WF but not necessarily valid(atable)'. Examples of these are: (i) an XML document to which meta-data has been prepended. (ii) an XML document which includes chunks conforming to well-defined DTDs such as MathML. The possible combinations are indefinitely large. It is impossible to write bespoke software to process these documents, and we need generic mechanisms. Perhaps many will be dealt with by stylesheets, and maybe the WS issue is a question of developing appropriate conventions in stylesheets. In documents of this sort there have to be conventions and flags that indicate how to interpret the documents. The spec has indicated that it shouldn't be in the XML markup - no problem. Somehow conventions have to evolve, either conveyed implicitly or explicitly (e.g. through PIs). [Remember that there are - as yet - no agreed conventions as to what a PI can look like - you can put anything in after the target.] > [...] > >Axiomatic? Call me stubborn (you won't be the first), but I, for one, > >retain some hope. :-) > > We all did at first. 
The problem is really the last point -- _universal_ > and while I am tempted to agree with Peter, I do not, in fact, because I > think the current method actually does satisfy all four points -- but not > necessarily in the way that you would expect. Note; I am NOT trying to find a universal solution here. I am suggesting that we develop some common, useful approaches which will solve a reasonable number of problems. > > >>[Peter states in detail different policies on whitespace he might need in > >>different contexts.] > >> > >>What I am after here is a convention that I can state which instructs the > >>processor how to treat this whitespace. ***I do not wish to have to devise > >>a specific convention for CML***. I want to be able to indicate that that > >>the W/S after is irrelevant, and that the whitespace in the ATOMS > >content > >>is normalisable and used only as a delimiter of tokens. > > The problem with this is that there are a large number of ways that > whitespace can be used: the "tokens" form mentioned at the end, for > example, has never been proposed for XML. I agree there are a large number of ways. Some classification would be valuable and IMO the sort of thing that XML-DEV could usefully provide. [The WS-separated tokens are no different from 'words' in HTML and I would expect that a large number of people would welcome a convention on normalising whitespace between 'words'.] > > >>I expect that many other applications will use a similar approach, so I want > >>to share the effort with them. Examples of metadata in XML have often been > >>portrayed as prettyprinted and I expect that CML could use the same > >conventions. > > This sharing makes sense, only when the sharing of effort is not imposing > an unreasonable burden on others. The problem with whitespace is that the > different possible policies are all unneeded by many applications. Then the application needn't implement them :-) Applications have to do *something* about whitespace.
This can be: - ignore the problem (or use PRESERVE) - their own thing - a set of choices which is understood by the community - refuse to process the document. > > The typical browser/formatter may never need "token" style whitespace, and > may implement such things by passing data to applets or other external > processes that will handle them. > > In fact, the need to write xml->xml transducers (SGML has taught us that > this need never goes away), argues that it must be _possible_ to see all > whitespace at least _some_ of the time, regardless of document. That's one > reason that the current "pass all whitespace" model works. It 'works' in that it shifts the problem to the application developer. I like the idea of an XML->XML transducer - perhaps in front of the application, or callable within it. If David thinks that such tools could be built independently of applications that is exactly what I am suggesting :-) > > The other reason that it works, is that you can always ignore data that > you're not interested in (whitespace) but you can never get access to data > that is hidden from you -- therefore the convenience of "automatic > whitespace removal" is an inability to see that space without using > non-standard tools. It's clear that an application *must* have access to all whitespace if it wants it (this is made clear by, say, the requirement of XML-LINK to search on pseudoelements). However it should also be able to access a normalised form of the document. > > >>I think that we can aim for a set of options that could be used by a > >post-parser > >>processor. Different applications (**or document authors**) could choose > >between > >>them. Examples might be: > >> - normaliseCRLF (Neil's Rule 1) > >> - discardAllWS > >> - normaliseToSingleSpace > > I agree that this is the right place for such processing to happen (between > a parser and an application). I'm not yet sure whether these things are as > reusable as people think.
I do know that without the use of #FIXED > attributes (so I could avoid markup in the instance) I would _not_ use > these, but rather make sure that my application (or stylesheet language) > had the ability to apply these policies on request, as needed. But we do have #FIXED, right? In which case I generally agree. > [...] > This is the option that XML universally adopts. That means that any other > method can be implemented _by any processor that cares_. If one can imagine > destroying meaning of a document's content by the flattening of all > whitespace strings to a single space, then you may need more elements in > your content model, if you are not able to control the software that will > process the document. This is a good point. > > In other words the parser guarantees all WS will be visible to applications > -- this makes designing and implementing WS dependent processing easy -- > but since applications are _not_ constrained as folding or other WS > processing behaviour, document authors will have to be cautious in using > significant whitespace. If you can't assume that applications to process > your markup will do the right thing, then you should not play games with WS. Yes. But where is the rigour in authoring going to come from? This is where I believe that XML-DEV has a role. > > This actually is not much of an issue for CML, since it's a reasonable > assumption that any implementation of CML markup-display will have to do > lots of special things, of which whitespace is the least. No, the point was that CML wishes to re-use HTML and MathML as additional components in the document. And then meta-data, and ... So that the application will become bloated unless it can re-use the approaches from the rest of the community. > [...] > > > I think XML's agnostic position is the correct one for the language.
> Authors should probably assume (unless they anticipate absolutely no > re-use) that HTML-style draconian normalization might occur anywhere and > use markup rather than whitespace, or at least CDATA sections. This > position _may_ be moderated (a little) where a well-known DTD with > well-defined WS rules can be used (like the TEI or HTML). I agree on this. The point I have been trying to promote is that it should be possible to collate the requirements of such systems and offer them on a re-usable basis. I know from experience that it's extremely easy to go round in circles here. If this discussion is going to achieve something - and I think that a number of people would welcome this - then perhaps a revised set of the rules recently suggested, and addressed to HTML-like usage (with perhaps other common current DTDs as well) would be beneficial. An author could then say: - the content of FOO, BAR, FLIP can be expected to be treated by XML-DEV-HTML-like WS normalisation. - the content of BAZ, BLORT suffers WS stripping as described in XML-DEV-HTML-like-stripping. and that's about it. If we can get something along those lines, then I think a reasonable number of people would take note. It doesn't just have to apply to HTML DTDs. P.
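The per-element policies discussed in this thread (keepAllWS, discardAllWS, normaliseToSingleSpace) could be sketched as a post-parser step along the following lines. This is only an illustration under assumptions: the function names, the `apply_policies` helper and the element-to-policy table are hypothetical; discardAllWS is read literally as "remove every whitespace character"; unlisted elements fall back to keepAllWS instead of inheriting a mode from their context; and only the four XML whitespace characters (space, tab, CR, LF) are touched, so no-break and ideographic spaces are left alone.

```python
import re
import xml.etree.ElementTree as ET

# The four characters XML itself treats as white space.
XML_WS = re.compile(r"[ \t\r\n]+")


def keep_all_ws(text):
    """keepAllWS: pass character data through unchanged."""
    return text


def discard_all_ws(text):
    """discardAllWS (assumed reading): drop every whitespace character."""
    return XML_WS.sub("", text)


def normalise_to_single_space(text):
    """normaliseToSingleSpace: fold each whitespace run to one space."""
    return XML_WS.sub(" ", text)


def apply_policies(root, policies):
    """Rewrite character data in place, choosing a policy per element name."""
    for elem in root.iter():
        policy = policies.get(elem.tag, keep_all_ws)
        if elem.text is not None:
            elem.text = policy(elem.text)
        # A child's tail is character data belonging to the parent,
        # so it is governed by the parent's policy.
        for child in elem:
            if child.tail is not None:
                child.tail = policy(child.tail)
    return root


doc = ET.fromstring(
    "<doc>\n <FOO>a   b\n c</FOO>\n <BAZ> x y </BAZ>\n <pre> k  </pre></doc>"
)
apply_policies(doc, {
    "doc": discard_all_ws,
    "FOO": normalise_to_single_space,
    "BAZ": discard_all_ws,
})
print(ET.tostring(doc, encoding="unicode"))
```

Because the parser hands over all whitespace, a filter like this can sit between parser and application without any change to the XML markup itself.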
-- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Peter at ursus.demon.co.uk Mon Aug 25 14:02:00 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace Message-ID: <9626@ursus.demon.co.uk> Thanks Marcus, In message <340152AF.F224A51C@allette.com.au> Marcus Carr writes: > Apologies in advance to all those who have thought and fought over this > issue for a long time, but as a self-confessed critic of the claim that > "XML is SGML", I feel compelled to throw my hat into the ring. > > As far as I can see, there are only two circumstances when whitespace is > an issue - receiving an XML document or authoring one. Receiving, it > doesn't matter if you have a DTD or not - the application can determine > from a well formed document whether it should regard an element's > content as MIXED or ELEMENT. It does involve parsing it, but only until > it sees mixed content. If elements are assumed to be ELEMENT until I may have misunderstood this, but the problem seems to be that we cannot reliably determine this if authors use whitespace for pretty-printing. If what you mean is 'non-whitespace MIXED content' (i.e. content which has at least one non-WS character in) then I'm sympathetic. IOW it is possible to say 'treat anything with only WS content or element content as having element content'. This is exactly the sort of convention that I have been suggesting people might propose. Whether it's workable depends on the reaction you get :-) > proven otherwise, surely this wouldn't be a massive overhead.
Authoring > applications would be similar - the first time a tag contained mixed > content, the application would reset the status of the element. The onus > would from then on be on the application to assist the user in creating > semantically correct documents, by such mechanisms as not allowing hard > returns at element boundaries, in short, making significant whitespace > look like significant whitespace. > > MURATA Makoto wrote: > > > Suppose that we have different kinds of tags for mixed-content > > elements (e.g, and ) and element-content > > elements (e.g, and ). Then, even > > non-validating parsers can tell element contents and mixed contents. > > Does this help? I think this approach does help, but might be implementable through PIs (see below) > > It seems that the choices are either the current proposal that nobody ^^^^^^^^^^^^^^^^ I assume you mean the current XML spec. > seems to feel is entirely satisfactory, or suggestions such as the > above, which would certainly work, but ultimately may involve as great > an overhead as sending the DTD. It seems to me that we're throwing the > baby out with the bathwater by ignoring a solution such as declaring at > the start of the document how whitespace in elements should be handled. I think that this is exactly what some members of this list are striving for. The spec requires them to use one or more of: - a specific markup element (e.g. ) - a stylesheet - a PI > > I would also like to see DTDs sent to non-validating parsers, just so > they could determine how to apply whitespace rules without necessarily > having to do any structural parsing. If need be, two new types of It seems axiomatic that there are already documents that do not conform to any given DTD, so this isn't an option. It has been suggested that content could be defined on a per-element basis, but at present parsers are expected to use this to validate the whole document. > declared content could be added, ELEMENT and MIXED.
They might behave > the same way as ANY, or the DTD could be constructed even more loosely, > where only MIXED elements were declared and everything else was > defaulted to ELEMENT. This would result in a small DTD sent only for the > sake of making the application aware of how to deal with whitespace. If > desirable, no DTD need be sent, but the application's performance may > suffer marginally for it. This is in keeping with the idea that an > application need not know how to deal with a document as it comes in. As > far as I can see, much of the functionality in XML (such as linking) > relies on a DTD, so it's not going to be foreign to most XML > applications anyway. This seems possible, but it requires a change to the XML-spec. XML WG members read this list and if any of them think it's a good idea they might take it up. But my impression is that most take the view that David Durand has posted - the spec is not capable of further refinement at this stage. It may be possible to implement this through a PI. This could define which elements had which type of content, e.g. > > The whitespace rules in SGML can be simplified - most people accept that > they should. Because inclusions and exclusions aren't valid in XML > anyway, the rules are already somewhat simpler. I would really like to > see XML and SGML stay in synch - I think anything else would be to > everyones disadvantage. There really isn't a lot of point in flaming me > for this; the question is well intentioned and the current solution There are no flames on xml-dev :-) We are all trying to solve a difficult technical, perceptual and cultural problem. [The general standard of debate and courtesy within the SGML community is impressive.] P. 
-- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ From Peter at ursus.demon.co.uk Mon Aug 25 14:02:02 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:19 2004 Subject: XML developers' day Message-ID: <9627@ursus.demon.co.uk> Like many other readers of this list I was not able to attend the XML-developers' day. I would find it extremely useful if anyone was able to report on that, highlighting the main problems people face. Any indications as to how this list might serve the community would be valuable :-) P. -- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ From dgd at cs.bu.edu Mon Aug 25 17:01:47 1997 From: dgd at cs.bu.edu (David G. Durand) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace In-Reply-To: <340152AF.F224A51C@allette.com.au> References: <9708250211.AA01302@lute.apsdc.ksp.fujixerox.co.jp> Message-ID: At 4:38 AM -0500 8/25/97, Marcus Carr wrote: >Apologies in advance to all those who have thought and fought over this >issue for a long time, but as a self-confessed critic of the claim that >"XML is SGML", I feel compelled to throw my hat into the ring. I looked with interest for the criticism of the claim, since that would be useful information -- we've gone so far as to hold off critical features of XML in a few places to wait for the ISO to catch up in the current SGML revision.
One of the things they kindly agreed to update is the whitespace rules, so that the XML rules can be turned on in the SGML declaration. >As far as I can see, there are only two circumstances when whitespace is >an issue - receiving an XML document or authoring one. Receiving, it >doesn't matter if you have a DTD or not - the application can determine >from a well formed document whether it should regard an element's >content as MIXED or ELEMENT. Since XML must deal with well formed documents (no DTD) the traditional SGML whitespace rules _cannot_ be used, as element content and mixed content are not distinguished in instances by _any_ dependable cues. The limited DTD proposal pleased neither the DTD-haters, nor the DTD-lovers, though it was in a draft for a long time. > It does involve parsing it, but only until >it sees mixed content. If elements are assumed to be ELEMENT until >proven otherwise, surely this wouldn't be a massive overhead. It might involve buffering large amounts of whitespace across an arbitrary parser lookahead, since there is no limit on the size of an element, or where the non-space PCDATA might show up. One would have to buffer the entire document in the parser before one could decide whether to emit any whitespace in the root element. This might be a bit of a memory performance hit... > Authoring >applications would be similar - the first time a tag contained mixed >content, the application would reset the status of the element. The onus >would from then on be on the application to assist the user in creating >semantically correct documents, by such mechanisms as not allowing hard >returns at element boundaries, in short, making significant whitespace >look like significant whitespace. Many people have claimed that they use editors incapable of functioning without inserting linends (of their local flavor) every 200 characters or so.
I (personally) wasn't very sympathetic to this argument, but it stood in for the empirical observation that people are very loose with whitespace/linends, and that forcing tools not to emit whatever line-ending codes they want could be a problem. >MURATA Makoto wrote: > >> Suppose that we have different kinds of tags for mixed-content >> elements (e.g, and ) and element-content >> elements (e.g, and ). Then, even >> non-validating parsers can tell element contents and mixed contents. >> Does this help? > >It seems that the choices are either the current proposal that nobody >seems to feel is entirely satisfactory, or suggestions such as the >above, which would certainly work, but ultimately may involve as great >an overhead as sending the DTD. It seems to me that we're throwing the >baby out with the bathwater by ignoring a solution such as declaring at >the start of the document how whitespace in elements should be handled. The real problem is that there's an assumption that a generic processor can solve the "whitespace problem" -- and that is not really true. In a very real sense the meaning of whitespace is a product of the document _and_ the application. For instance, line breaks (as indicated by whitespace) might be critical in a typesetting application for poetry (but _only in elements). The same document, however, would be best processed with some form of whitespace-collapsing everywhere, when indexed by a full-text search engine. The same data may have different significance when processed differently. The fact is that whitespace should be controlled by the application. For typesetting and display, this means that practically, it's going to be part of the "stylesheet" or other processing mechanism. The advantage of "parser handled whitespace" would be the ability to create meaningful, error-free applications that can work on arbitrary markup _without a stylesheet or other processing specification_.
The only small problem with that convenience is that such processing is basically impossible, for many more reasons than telling where words end, or if CR; is a linend or just part of a CRLF sequence. > > ..... > As >far as I can see, much of the functionality in XML (such as linking) >relies on a DTD, so it's not going to be foreign to most XML >applications anyway. This is not necessarily the case. It's also harder to detect mixed content from DTD declarations, than simply to recognize #FIXED attributes. > >The whitespace rules in SGML can be simplified - most people accept that >they should. >I would really like to >see XML and SGML stay in synch - I think anything else would be to >everyones disadvantage. Yes, this is very true -- and this battle has been won by the compatibility camp -- they are in synch. SGML has a new "pass all whitespace" option for the declaration. This is not going to be a big problem for existing implementations, since it's incredibly easy for parsers to implement -- most have had to anyway, if they attempt to support SGML->SGML transformation tools. I think SP already can do the right thing. > There really isn't a lot of point in flaming me >for this; the question is well intentioned and the current solution >seems to have satisfied few. The concept of declaring things at the >start is a tried and true methodology, yet we seem to be fleeing it in >favor of something nobody's quite sure about. No flameage required. I agree with the intent -- just not your proposed solutions. We went through all these permutations -- any form of normalization _before_ the application causes some kind of problem. And since there is, in any case, no universal way to handle markup without an external processing spec (that can include whitespace among its many other factors) there's no reason to make the parser cause applications more problems than they will have to solve already.
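The poetry-versus-indexing argument in the message above can be made concrete with a small sketch. It is only an illustration, not anything proposed on the list: the element names `poem`, `title` and `stanza` and both function names are hypothetical, and Python's standard `xml.etree.ElementTree` stands in for "a parser that passes all whitespace through". The same document is handed to two applications, each applying its own whitespace policy.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are illustrative assumptions only.
DOC = "<poem><title>  A  Title </title><stanza>first line\nsecond   line\n</stanza></poem>"


def index_text(xml_text):
    """Full-text indexing view: collapse every whitespace run to one space."""
    root = ET.fromstring(xml_text)
    return re.sub(r"\s+", " ", "".join(root.itertext())).strip()


def typeset_lines(xml_text, verse_tag="stanza"):
    """Typesetting view: keep line breaks, but only inside verse elements."""
    root = ET.fromstring(xml_text)
    lines = []
    for verse in root.iter(verse_tag):
        text = "".join(verse.itertext())
        # Every newline in a verse element is treated as a significant break.
        lines.extend(line for line in text.split("\n") if line.strip())
    return lines


if __name__ == "__main__":
    print(index_text(DOC))
    print(typeset_lines(DOC))
```

Neither policy is declared in the document itself; each lives in the application, which is the position argued above. Whether that is desirable is exactly what the thread disputes.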
_________________________________________ David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com Boston University Computer Science \ Sr. Analyst http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams --------------------------------------------\ http://dynamicDiagrams.com/ MAPA: mapping for the WWW \__________________________ From dgd at cs.bu.edu Mon Aug 25 17:02:02 1997 From: dgd at cs.bu.edu (David G. Durand) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace rules (v2) In-Reply-To: <9623@ursus.demon.co.uk> Message-ID: At 6:36 AM -0500 8/25/97, Peter Murray-Rust wrote: >I have been away for a few days so maybe it's a useful time to try to >summarise >the Whitespace debate and to ask a few questions. You don't need to read the >rest of this unless you believe there is a problem to be addressed :-) Afraid that I have to chime in when I see a non-problem consuming valuable time... > >In message dgd@cs.bu.edu (David >G. Durand) writes: >> I observed with dismay that the issue of whitespace has surfaced on this >> list, after we finally gave it the wooden-stake-in-the-heart treatment on >> the WG discussion lists. As a chief proponent of the current method, I'll > >:-) I am not sure what has been killed :-) I hoped the discussion. Certainly I hoped the shibboleth of a parser "normalizing" whitespace on behalf of the application. >I will take David's points first, because I *do* believe that many of those >who were involved in the development of the spec feel that there is no scope >for further discussion of this *IN THE SPEC*. I agree with this. Actually, the only question remaining, in my mind, is how the XML stylesheet language should allow whitespace to be processed.
I disagree that there is any need for a non-stylesheet, non-application convention for whitespace. Note, that in some sense, the Document type _description_ (i.e. descriptive prose describing the intent of a DTD) and the "schema" notions are application specifications, and are entitled to declare whitespace handling rules. >Essentially the spec says: > - This is a difficult problem. [Actually it doesn't say this, but >it might help if it did in a footnote.] It's only difficult if you think that it's a parser problem. It's easy in XML, because all whitespace is visible. I can think of no _simpler_ rule that a _parser_ could implement. > - We have taken a minimalist approach where we do not give any support >to any whitespace philosophy [other than PRESERVE which passes everything and >can be platform-dependent], but leave this to the community. DEFAULT is simply >the absence of PRESERVE. Yes, since there is not a universal "whitespace philosophy" even for a single document (see my response to Marcus for an example), there's no reason to declare it in the instance. >I believe this solves one species of problem, where the authoring tool/system >is closely coupled to the application. CDF might be such a system (e.g. I have >never seen a native CDF file). No, it's a case where the "philosophy" is coupled to the application, not to the "document" in the abstract -- except insofar as it is defined by a "document type description" or "schema" -- which is essentially a set of ideal constraints that applications are expected to follow. >(A) There is a defined DTD (e.g. TEI, HTML) but a variety of authoring tools >and a variety of applications from different providers. Traditionally these >will come from the SGML community. I believe that there will certainly be >initial problems where m'facturer X emits whitespace in a particular way >which is incompatible with Y's tools for rendering/transforming it. It may >also be platform dependent.
We've seen this in the development of HTML >systems >although they are improving. TEI defines where whitespace is significant (almost nowhere if I remember correctly). >Remember that most SGML systems are currently implemented within a single site >(the tools are chosen to be compatible throughout the process). Very little >SGML is delivered over the WWW to be consistent between different m'facturers. >XML is specifically designed to be delivered over the WWW in (I assume) >a platform and m'facturer-independent way. Do we expect to see 'this XML >file best viewed with FOO software'??? If so, we might as well give up now. No, but every document will _have_ to either conform to a well-known DTD or schema of some sort, or be delivered with a stylesheet, and those are useful places that this behavior should be explained. >IMO any developer needs to be able to say: > (i) I support a wide range of XML DTDs. > (ii) I can easily customise my software to support a range of commonly >used DTDs > (iii) Documents authored by my software should be readable by software >from another m'facturer with whom I have had no formal discussions > (iv) My system can support a range of applications which read documents >produced by other m'facturers systems and with whom I have had no formal >discussions Nothing in a stylesheet based solution violates this to my mind. >If all the manufacturers tell me this is a non-problem, I'll shut up (on this >issue!) If each DTD defines its own use of whitespace (or worse, doesn't >define it) they may have a lot of work. > >(B) There are generic XML applications. The XML community continues to discuss >documents which 'contain information from more than one DTD' or 'are WF but >not necessarily valid(atable)'. Examples of these are: > (i) an XML document to which meta-data has been prepended.
I'm probably not the best person to address this, as I think that the mix-and-match proposals are ill-thought out, but since the data is supposed to be recognizable, presumably it is also to be ignored by all applications other than "meta-applications". So that's not a problem. > (ii) an XML document which includes chunks conforming to well-defined >DTDs such as MathML. In which case, they should have well-known stylesheets or descriptions that explain any whitespace conventions in use. > >The possible combinations are indefinitely large. But since each individual part must have defined behavior, this should not be a problem. >It is impossible to write bespoke software to process these documents, and we >need generic mechanisms. Perhaps many will be dealt with by stylesheets, and >maybe the WS issue is a question of developing appropriate conventions in >stylesheets. In documents of this sort there have to be conventions and flags >that indicate how to interpret the documents. The spec has indicated that it >shouldn't be in the XML markup - no problem. Somehow conventions have to >evolve, either conveyed implicitly or explicitly (e.g. through PIs). >[Remember that there are - as yet - no agreed conventions as to what a PI can >look like - you can put anything in after the target.] I used to think this might be useful, but I can't actually think of any application that could plausibly care about whitespace folding and also do meaningful processing without knowledge of the DTD. A text-indexer can work without a DTD, but also doesn't need any whitespace info (folding is always good enough) -- and it needs to see every byte, because it may have to track file offsets of hits. Can you think of any other useful examples of "DTD-blind" applications that might care about how the document _intended_ the whitespace to be processed? I confess that I can't. >Note; I am NOT trying to find a universal solution here.
I am suggesting that >we develop some common, useful approaches which will solve a reasonable >number of problems. But I don't actually see what problems we can solve with such solutions, that are not better addressed in either the stylesheet or DTD/schema problems. >> The problem with this is that there are a large number of ways that >> whitespace can be used: the "tokens" form mentioned at the end, for >> example, has never been proposed for XML. > >I agree there are a large number of ways. Some classification would be >valuable and IMO the sort of thing that XML-DEV could usefully provide. >[The WS-separated tokens are no different from 'words' in HTML and I would >expect that a large number of people would welcome a convention on >normalising whitespace between 'words'.] Enumerating these might have some pedagogical value, but I no longer see the practical value of declaring the behaviors. I used to think it might be useful, but I'm not so sure. >Then the application needn't implement them :-) Applications have to do >*something* about whitespace. This can be: > - ignore the problem (or use PRESERVE) > - their own thing > - a set of choices which is understood by the community > - refuse to process the document. Only 2 (their own thing) makes any sense -- and is typically driven by their knowledge of a DTD or possession and following of the dictates of a stylesheet. >It 'works' in that it shifts the problem to the application developer. I like >the idea of an XML->XML transducer - perhaps in front of the application, or >callable within it. If David thinks that such tools could be built >independently of applications that is exactly what I am suggesting :-) They are close to a _null_ application, and require _no_ whitespace normalization, since they need only pass any whitespace they see straight through. This was my original point. Only if you insist on "normalizing" do you _create_ problems with transduction.
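A rough sketch of the "null application" described above, assuming Python's standard `xml.parsers.expat` as the parser. The function name `passthrough` is hypothetical, and the sketch deliberately ignores comments, processing instructions, the prolog and empty-element syntax; it only illustrates that a transducer which re-emits character data untouched never has to make a whitespace decision.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape, quoteattr


def passthrough(xml_text):
    """Re-serialize elements and text, emitting all whitespace exactly as seen."""
    out = []

    def start(name, attrs):
        attr_text = "".join(f" {key}={quoteattr(value)}" for key, value in attrs.items())
        out.append(f"<{name}{attr_text}>")

    def end(name):
        out.append(f"</{name}>")

    def chars(data):
        # No folding, trimming or line-end guessing: copy the data through.
        out.append(escape(data))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return "".join(out)


if __name__ == "__main__":
    source = '<a>\n  <b x="1">t  t</b>\n</a>'
    print(passthrough(source) == source)
```

Any normalization step added to `chars` would be a policy choice made on behalf of every downstream application, which is the objection raised in the message.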
>it's clear that an application *must* have access to all whitespace if it >wants it (this is made clear by, say, the requirement of XMl_LINK to search >on pseudoelements). However it should also be able to access a normalised >form of the document. Why? I think I've argued effectively that this is not useful without a stylesheet or well-known DTD, and in those cases, it is not necessary (as the DTD or stylesheet should declare the conventions in use). >> This is the option that XML universally adopts. That means that any other >> method can be implemented _by any processor that cares_. If one can imagine >> destroying meaning of a document's content by the flattening of all >> whitespace strings to a single space, then you may need more elements in >> your content model, if you are not able to control the software that will >> process the document. > >This is a good point. > >> >> In other words the parser guarantees all WS will be visible to applications >> -- this makes designing and implementing WS dependent processing easy -- >> but since applications are _not_ constrained as to folding or other WS >> processing behaviour, document authors will have to be cautious in using >> significant whitespace. If you can't assume that applications to process >> your markup will do the right thing, then you should not play games with WS. > >Yes. But where is the rigour in authoring going to come from? This is where >I believe that XML-DEV has a role. I'm not sure what you mean here... If the application or DTD depend on whitespace critically (a bad idea, probably, but a permissible one) -- then it is the author's responsibility to use it properly (and select a tool that lets her). Since the generic dumb text-editor is such a tool, and it's widely available, I don't see a big problem here.
>> This actually is not much of an issue for CML, since it's a reasonable >> assumption that any implementation of CML markup-display will have to do >> lots of special things, of which whitespace is the least. > >No, the point was that CML wishes to re-use HTML and MathML as additional >components in the document. And then meta-data, and ... So that the >application will become bloated unless it can re-use the approaches from >the rest of the community. I'm afraid I don't see how you're going to share code with an HTML processor. Nor can I psych myself up to believe that whitespace folding code: while (isspace(c = getc())) ; outchar = ' '; is a big bloat problem in a program that can render organic chem reaction diagrams. >> I think XML's agnostic position is the correct one for the language. >> Authors should probably assume (unless they anticipate absolutely no >> re-use) that HTML-style draconian normalization might occur anywhere and >> use markup rather than whitespace, or at least CDATA sections. This >> position _may_ be moderated (a little) where a well-known DTD with >> well-defined WS rules can be used (like the TEI or HTML). > >I agree on this. The point I have been trying to promote is that it should >be possible to collate the requirements of such systems and offer them >on a re-usable basis. If it's useful, just list some policies and be done with it, I guess. In answering this mail I've found that I no longer believe that it's very important, because I don't see how to use it effectively anywhere. >An author could then say: > - the content of FOO, BAR, FLIP can be expected to be treated by >XML-DEV-HTML-like WS normalisation. > - the content of BAZ, BLORT suffers WS stripping as described in >XML-DEV-HTML-like-stripping. > >and that's about it. If we can get something along those lines, then >I think a reasonable number of people would take note. It doesn't just have >to apply to HTML DTDs. Why not.
Make a web page for the policies, create a notation declaration that points at it, and then use that notation as a prefix on a PI to declare these things. It can't do any harm other than maybe wasting time. -- David _________________________________________ David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com Boston University Computer Science \ Sr. Analyst http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams --------------------------------------------\ http://dynamicDiagrams.com/ MAPA: mapping for the WWW \__________________________ From digitome at iol.ie Mon Aug 25 17:47:50 1997 From: digitome at iol.ie (Sean Mc Grath) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace Message-ID: <199708251547.QAA26726@GPO.iol.ie> [David Durand] > >The fact is that whitespace should be controlled by the application. I disagree. Leaving it to the application lowers the level at which XML applications can achieve a "lock in effect" on XML documents to a level that I find worrying. User A : "What file format is that?" User B : "It's MicroScape XML." User A : "I better buy a copy of MicroScape so - otherwise the white space will get busted again". From liamquin at interlog.com Mon Aug 25 18:11:24 1997 From: liamquin at interlog.com (Liam Quin) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace In-Reply-To: <199708251547.QAA26726@GPO.iol.ie> Message-ID: On Mon, 25 Aug 1997, Sean Mc Grath wrote: > User A : "What file format is that?" > User B : "It's MicroScape XML."
> User A : "I better buy a copy of MicroScape so - otherwise the white space > will get busted again". If this happens, it will be time to standardise whitespace handling at the application level, perhaps. Right now, I find this argument totally bogus. You might as well point out that Microsoft Excel (say) interprets and in one way, and PrisonGlue interprets them differently. Whitespace treatment needs to be specified in the CML specification, for example, and then any conforming CML processor will do the right thing and there's no problem. Taking CML and passing it to a CDF processor will result in different whitespace treatment, I expect... and also different treatment of all the non-whitespace too! And that's fine. Lee -- Liam Quin -- the barefoot typographer -- Toronto lq-text: freely available Unix text retrieval email address: liamquin, at host: interlog dot com From digitome at iol.ie Mon Aug 25 20:34:38 1997 From: digitome at iol.ie (Sean Mc Grath) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace Message-ID: <199708251834.TAA03755@GPO.iol.ie> >On Mon, 25 Aug 1997, Sean Mc Grath wrote: >> User A : "What file format is that?" >> User B : "It's MicroScape XML." >> User A : "I better buy a copy of MicroScape so - otherwise the white space >> will get busted again". [Liam Quin] >If this happens, it will be time to standardise whitespace handling at the >application level, perhaps. Right now, I find this argument totally bogus. What are you saying? Let's wait and see if the horse bolts - if he does we will lock the barn door?
Sean Mc Grath sean@digitome.com Digitome Electronic Publishing http://www.digitome.com From mrc at allette.com.au Tue Aug 26 01:21:50 1997 From: mrc at allette.com.au (Marcus Carr) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace References: <9708250211.AA01302@lute.apsdc.ksp.fujixerox.co.jp> Message-ID: <3402129C.22871191@allette.com.au> David G. Durand wrote: > > It does involve parsing it, but only until > >it sees mixed content. If elements are assumed to be ELEMENT until > >proven otherwise, surely this wouldn't be a massive overhead. > > It might involve buffering large amounts of whitespace across an > arbitrary parser lookahead, since there is no limit on the size of an > element, or where the non-space PCDATA might show up. One would have > to buffer the entire document in the parser before one could decide > whether to emit any whitespace in the root element. This might be a > bit of a memory performance hit... Why would you need to buffer anything? Every element starts with a default value of 'element'. As they're shown to be otherwise, their status is revised. This involves tracking open elements, not picking up chunks and reviewing them. One linear pass of the document tells you all you need to know. > Many people have claimed that they use editors incapable of > functioning without inserting linends (of their local flavor) every 200 > characters or so. I (personally) wasn't very sympathetic to this > argument, but it stood in for the empirical observation that people > are very loose with whitespace/linends, and that forcing tools not to > emit whatever line-ending codes they want could be a problem.
This would still respect the limits set by the user in the same way an application would behave when you turn off hyphenation - the line might be shorter, but it's broken in a sensible place. > The real problem is that there's an assumption that a generic > processor can solve the "whitespace problem" -- and that is not really > true. In a very real sense the meaning of whitespace is a product of > the document _and_ the application. For instance, line breaks (as > indicated by whitespace) might be critical in a typesetting > application for poetry (but _only in elements). The same > document, however, would be best processed with some form of > whitespace-collapsing everywhere, when indexed by a full-text search > engine. The same data may have different significance when processed > differently. If line breaks are critical, they should be marked explicitly. If you gave a hand written poem to a data entry person with no knowledge of poetry, you may have to specify that you want the current line boundaries respected. Why should an application not be given the same info? > The fact is that whitespace should be controlled by the application. > For typesetting and display, this means that practically, it's going > to be part of the "stylesheet" or other processing mechanism. Whitespace is also a mechanism used to make data readable. In that sense, a space is a character in its own right, not just something that appears around words. Imagine the response if it wasn't whitespace that was being discussed, it was the letter 'x', and we were telling people 'x' may or may not appear in their data. > > As > >far as I can see, much of the functionality in XML (such as linking) > >relies on a DTD, so it's not going to be foreign to most XML > >applications anyway. > > This is not necessarily the case. It's also harder to detect mixed > content from DTD declarations, than simply to recognize #FIXED > attributes. It can't be that hard.
If parameter entities (I assume they're allowed?) have to be unravelled anyway, surely it's just a case of looking at the content model? If it starts with #PCDATA and contains anything else, it's mixed content. > >I would really like to > >see XML and SGML stay in synch - I think anything else would be to > >everyones disadvantage. > > Yes, this is very true -- and this battle has been won by the > compatibility camp -- they are in synch. SGML has a new "pass all > whitespace" option for the declaration. This is not going to be a big > problem for existing implementations, since it's incredibly easy for > parsers to implement -- most have had to anyway, if they attempt to > support SGML->SGML transformation tools. I think SP already can do the > right thing. "Pass all whitespace" will go some distance toward fixing the problem, but what else does it impact? Does it mean that inclusions and exclusions suddenly appear differently than they did in the 'old SGML'? > And since there is, in any case, no universal way to handle markup > without an external processing spec (that can include whitespace among > its many other factors) there's no reason to make the parser cause > applications more problems than they will have to solve already. My understanding is that one of the basic requirements of XML was that the applications had to be easy to write, so things could be allowed to happen quickly. As much as I do agree that this would have to be a good thing, (and as you pointed out, applications are coming out already) I would argue that maybe they should be more difficult to write, but should address this issue correctly.
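The "one linear pass" idea from this message can be sketched as follows. This is a minimal illustration, assuming Python's standard `xml.parsers.expat` and no DTD; the function name `classify_elements` and the labels are hypothetical, and here an element type counts as MIXED as soon as any instance of it directly contains non-whitespace character data.

```python
import xml.parsers.expat


def classify_elements(xml_text):
    """Return {element name: 'ELEMENT' or 'MIXED'} after one linear pass."""
    status = {}
    open_elements = []  # stack of currently open element names

    def start(name, attrs):
        status.setdefault(name, "ELEMENT")  # ELEMENT until proven otherwise
        open_elements.append(name)

    def end(name):
        open_elements.pop()

    def chars(data):
        # Non-whitespace text directly inside an element revises its status.
        if open_elements and data.strip():
            status[open_elements[-1]] = "MIXED"

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return status


if __name__ == "__main__":
    doc = "<doc>\n <sec>\n  <p>Some <em>mixed</em> text</p>\n </sec>\n</doc>"
    print(classify_elements(doc))
```

No buffering is needed to build the table. The earlier objection in the thread concerns a different moment: in a document such as `<doc>\n<a/>late</doc>`, the status of `doc` only changes after its leading whitespace has already gone past, so a consumer that must decide what to do with that whitespace as it streams by cannot rely on the final table.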
-- Regards Marcus Carr email: mrc@allette.com.au _______________________________________________________________ Allette Systems (Australia) email: info@allette.com.au Level 10, 91 York Street www: http://www.allette.com.au Sydney 2000 NSW Australia phone: +61 2 9262 4777 fax: +61 2 9262 4774 _______________________________________________________________ From mrc at allette.com.au Tue Aug 26 01:25:31 1997 From: mrc at allette.com.au (Marcus Carr) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace References: <199708251834.TAA03755@GPO.iol.ie> Message-ID: <340213FD.CF236CC2@allette.com.au> Sean Mc Grath wrote: > >On Mon, 25 Aug 1997, Sean Mc Grath wrote: > >> User A : "What file format is that?" > >> User B : "It's MicroScape XML." > >> User A : "I better buy a copy of MicroScape so - otherwise the > white space > >> will get busted again". > > [Liam Quin] > >If this happens, it will be time to standardise whitespace handling > at the > >application level, perhaps. Right now, I find this argument totally > bogus. > > What are you saying? Let's wait and see if the horse bolts - if he does > we will lock the barn door? By then, there will be far too many hands on the door to think about locking it; the best you can hope for is to kiss the horse goodbye on the way past.
-- Regards Marcus Carr email: mrc@allette.com.au _______________________________________________________________ Allette Systems (Australia) email: info@allette.com.au Level 10, 91 York Street www: http://www.allette.com.au Sydney 2000 NSW Australia phone: +61 2 9262 4777 fax: +61 2 9262 4774 _______________________________________________________________ From Jon.Bosak at eng.Sun.COM Tue Aug 26 07:36:27 1997 From: Jon.Bosak at eng.Sun.COM (Jon Bosak) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace In-Reply-To: <340152AF.F224A51C@allette.com.au> (message from Marcus Carr on Mon, 25 Aug 1997 19:38:56 +1000) Message-ID: <199708260532.WAA00995@boethius.eng.sun.com> It's not up to me to tell this group what to talk about, but I think that you should be aware that the WG discussed the issue of whitespace to the point of complete exhaustion during no less than three separate phases of the design process, and the chances of it being formally reconsidered in the XML 1.0 time frame are exactly zero. A discussion of conventions for specific classes of user agents (e.g., web browsers) is useful, but I feel that it's my obligation to point out to anyone mistakenly thinking that this issue might conceivably be reconsidered in the current XML specification that it is not going to happen.
Jon xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Peter at ursus.demon.co.uk Tue Aug 26 10:31:19 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace Message-ID: <9649@ursus.demon.co.uk> In message <199708260532.WAA00995@boethius.eng.sun.com> Jon.Bosak@eng.Sun.COM (Jon Bosak) writes: > It's not up to me to tell this group what to talk about, but I think > that you should be aware that the WG discussed the issue of whitespace > to the point of complete exhaustion during no less than three separate > phases of the design process, and the chances of it being formally > reconsidered in the XML 1.0 time frame are exactly zero. A discussion ^^^^^^^^^^^^ This is the position I have been taking - there is no suggestion that we should ask the WG for a change to the spec. My suggestions to this group were based on the assumption that there was a group of developers who were sufficiently interested in this problem that they could develop some protocols which might be helpful to the community. The following mechanisms are consistent with the current spec and do not require changes: 1. stylesheets. The authors can describe how they expect stylesheet processors to treat their documents. 2. PIs (e.g. 3. additional elements in the DTD (e.g. NEWLINE). 4. implicit conventions (i.e. 'always replace CR/LF with CR'). (Have I missed anything?) We are clear that this has been discussed at great length on the WG and are not seeking to re-open that discussion. My suggestion here is that we are trying to see how the WG's conclusion can be implemented. > of conventions for specific classes of user agents (e.g., web > browsers) is useful, but I feel that it's my obligation to point out ^^^^^^^^^ Some people think this is a waste of time. 
Perhaps it may turn out to be. Unlike the discussions on the spec, this group has no stated goals and exists to provide mutual support for those developing XML applications. If a number of people feel this is worth discussing, then see let's see if they can achieve anything. If *they* wish to spend the time trying to do this, it needn't waste other people's ... :-) My own feelings are that only mechanisms 1 and 2 above are likely to find favour. I think that PIs can be further explored in this discussion. (Perhaps I should not have used as this would (I think) require WG approval, so I would rephrase this as ) Given that, it seems possible to include PI statements within the document as the how the author intends the whitespace to be treated. It may be argued that this can be done better with stylesheets. Perhaps I'm conservative, but I see PIs embedded in a document as 'being part of the document' to a greater extent than stylesheets which are more likely to be changed by people other than the document's authors. > to anyone mistakenly thinking that this issue might conceivably be > reconsidered in the current XML specification that it is not going to > happen. > P. 
-- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From mrc at allette.com.au Tue Aug 26 10:37:13 1997 From: mrc at allette.com.au (Marcus Carr) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace References: <199708260532.WAA00995@boethius.eng.sun.com> Message-ID: <34029587.52568DFA@allette.com.au> Jon Bosak wrote: > It's not up to me to tell this group what to talk about, but I think > that you should be aware that the WG discussed the issue of whitespace > to the point of complete exhaustion during no less than three separate > phases of the design process, and the chances of it being formally > reconsidered in the XML 1.0 time frame are exactly zero. I did go out of my way in my mail yesterday to recognise the work that has been done on the standard, and I can appreciate how it must bore you to see all this re-hashed for the hundredth time, but not all of us have had the benefit/curse of the extensive exposure to this topic that you have. > A discussion of conventions for specific classes of user agents (e.g., > web browsers) is useful, but I feel that it's my obligation to point > out to anyone mistakenly thinking that this issue might conceivably be > reconsidered in the current XML specification that it is not going to > happen. I'm not asking for anything to happen, but I do believe these things should be allowed to be discussed. If people tire of the topic, they'll stop talking about it - knocking healthy (even if misguided) discussion on the head contributes nothing. 
-- Regards Marcus Carr email: mrc@allette.com.au _______________________________________________________________ Allette Systems (Australia) email: info@allette.com.au Level 10, 91 York Street www: http://www.allette.com.au Sydney 2000 NSW Australia phone: +61 2 9262 4777 fax: +61 2 9262 4774 _______________________________________________________________ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From neil at bradley.co.uk Tue Aug 26 10:39:12 1997 From: neil at bradley.co.uk (Neil Bradley) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace Message-ID: <199708260838.JAA23835@andromeda.ndirect.co.uk> >Sean Mc Grath > >On Mon, 25 Aug 1997, Sean Mc Grath wrote: > >> User A : "What file format is that?" > >> User B : "It's MicroScape XML." > >> User A : "I better buy a copy of MicroScape so - otherwise the white space > >> will get busted again". > > [Liam Quin] > >If this happens, it wlil be time to standardise whitespace handling at the > >applicaton level, perhaps. Right now, I fnd this argument totally bogus. > > What are you saying? Lets wait and see if the horse bolts - if he > does we will lock the barn door? > > Sean Mc Grath I agree with you totally. The horse will bolt, for certain. I want to be able to use XML editor A, and allow people to view the output on browser B and C, publish it on DTP system D, send the data to someone else using editor E, and let people search for pseude-elements using extended pointers in products E and F, and all without extra spaces appearing or vital spaces disappearing at any point. I cannot understand why some people think this will not be problem. We are getting extreme views here, from let the XML processor handle it, to let every application do its own thing. Neither position is acceptable. OK, lets rule out special cases. 
I can accept that CML and CDF etc will have their own strict rules, perhaps, but I am far more concerned with general document editing and publishing (the sort of things HTML and SGML have been primarily used for). Personally, I am happy to say this issue is beyond the XML processor, and should be handled by the application. Fine. But let all PUBLISHING RELATED applications adopt the same guidelines. Too many developers are going to miss problems which we could help avoid if we could arrive at even a partial setof guidelines. Personally, I think we can achieve more than this. Do we want XML to gain a reputation as an unreliable data exchange and publishing format? We should not have to burden document authors with processing codes, etc. People want the ease of use of HTML (and, dare I say it, SGML too, in this respect at least). I still think this is unnecessary. Others have recently proposed the style sheet as the answer, and I agree. My original proposal to base some of the rules on in-line/block definitions assumed this approach. It is more reliable than element content versus mixed content. I do not, however, think we need to go as far as waiting for the official DSSSL based style sheet to be completed. I for one do not believe all XML-aware applicaitons will use it, and certainly not in the short term. Any config file or style sheet will suffice. People are also proposing all kind of Unicode special characters to perform vital tasks. Let's remember here that few people even have the specification, let alone use this set extensively. I am sure its time will come, but let us be realistic. XML is going to be in widespread use first, and needs to be workable with 7-bit ASCII, if possible, and ISO 8859 if not. I did not expect the rules I (nervously and tentatively) proposed to be acceptable. But I did hope they could form the basis of detail discussion, from which a better set of rules would emerge. Unfortunately, we seem to be getting nowhere. 
I am trying not to depair. But it's hard. Neil. ----------------------------------------------- Neil Bradley - Author of The Concise SGML Companion. neil@bradley.co.uk www.bradley.co.uk xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From Peter at ursus.demon.co.uk Tue Aug 26 14:00:46 1997 From: Peter at ursus.demon.co.uk (Peter Murray-Rust) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace Message-ID: <9653@ursus.demon.co.uk> In message <199708260838.JAA23835@andromeda.ndirect.co.uk> "Neil Bradley" writes: > [...] > OK, lets rule out special cases. I can accept that CML and CDF etc > will have their own strict rules, perhaps, but I am far more Actually I would like to develop CML *without* its own set of rules as far as possible. OK, Only chemists want to know how to display , but there is just as much material of the form:

We took 23.03+e02 gram of water

and we want to know whether there is whitespace round the contained elements. As I have repeatedly said I would like to borrow a communal solution rather than invent yet another one. > concerned with general document editing and publishing (the sort of > things HTML and SGML have been primarily used for). > > Personally, I am happy to say this issue is beyond the XML processor, > and should be handled by the application. Fine. But let all > PUBLISHING RELATED applications adopt the same guidelines. Too many > developers are going to miss problems which we could help avoid if we > could arrive at even a partial setof guidelines. Personally, I think > we can achieve more than this. CML is actually aimed very much at the publishing process. I want to be able to combine text, images, vector graphics, maths, and chemistry and for a technically oriented published to be able to process it. I accept that some people think this merging of XML from different sources is unrealistic, but there are others who share the same vision - we'll find out soon enough whether it's a disaster! In any case, we can always mix and match using XML-LINK EMBED. > > Do we want XML to gain a reputation as an unreliable > data exchange and publishing format? > > We should not have to burden document authors with processing codes, > etc. People want the ease of use of HTML (and, dare I say it, SGML > too, in this respect at least). I still think this is unnecessary. > > Others have recently proposed the style sheet as the answer, and I > agree. My original proposal to base some of the rules on in-line/block definitions > assumed this approach. It is more reliable than > element content versus mixed content. I do not, however, think we > need to go as far as waiting for the official DSSSL based style sheet Could you expand this? It is intended to produce a single official style sheet that covers all of this? > to be completed. 
I for one do not believe all XML-aware applicaitons > will use it, and certainly not in the short term. Any config file or > style sheet will suffice. > > People are also proposing all kind of Unicode special characters to > perform vital tasks. Let's remember here that few people even have > the specification, let alone use this set extensively. I am sure its > time will come, but let us be realistic. XML is going to be in > widespread use first, and needs to be workable with 7-bit ASCII, if > possible, and ISO 8859 if not. I would strongly argue against Unicode characters at this stage. *I* wouldn't know where to get them from, and typing by hand could be a disaster. It will take a while before Unicode is natural to HTML authors. > > I did not expect the rules I (nervously and tentatively) proposed to be acceptable. > But I did hope they could form the basis of detail discussion, from > which a better set of rules would emerge. Unfortunately, we seem to > be getting nowhere. I am trying not to depair. But it's hard. ^^^^^^^^^^^^^^^ Don't despair. There seem to be a group of people on this list who think it's worth pursuing. Several ideas have been suggested. If nothing else it's probably worth summarising what they can do and where they fall down (seriously). If they can be encapsulated in a stylesheet, perhaps so much the better. The problem is probably knowing where to draw the boundary as to what these rules will accomplish. Solve part of the problem and see if it appeals to a sufficient number of people. P. 
-- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From digitome at iol.ie Tue Aug 26 15:09:13 1997 From: digitome at iol.ie (Sean Mc Grath) Date: Mon Jun 7 16:58:19 2004 Subject: Whitespace Message-ID: <199708261308.OAA25545@mail.iol.ie> > >>Sean Mc Grath >> >On Mon, 25 Aug 1997, Sean Mc Grath wrote: >> >> User A : "What file format is that?" >> >> User B : "It's MicroScape XML." >> >> User A : "I better buy a copy of MicroScape so - otherwise the white space >> >> will get busted again". >> >> [Liam Quin] >> >If this happens, it wlil be time to standardise whitespace handling at the >> >applicaton level, perhaps. Right now, I fnd this argument totally bogus. >> >> What are you saying? Lets wait and see if the horse bolts - if he >> does we will lock the barn door? >> >> Sean Mc Grath [Neil Bradley] >I agree with you totally. The horse will bolt, for certain. I want to >be able to use XML editor A, and allow people to view the >output on browser B and C, publish it on DTP system D, >send the data to someone else using editor E, >and let people search for pseude-elements using extended pointers >in products E and F, and all without extra spaces appearing or >vital spaces disappearing at any point. > [Lots of v. good points about WS elided] Is this a fair summary of the position then? :- 1) WS handling is an application convention - not part of the XML standard 2) Different applicatioms are free to have different conventions 3) There is a generally agreed need to work out some conventions/idioms because:- a) They will give app. 
developers a leg up on a potentially difficult topic
   b) They will hopefully contain the "distilled essence" of the WS
      intelligentsia
   c) They will give tools purchasers a stick with which to beat vendors
      IFF divagations from the conventions prove troublesome. I.e. "does
      your XML tool support the Bradley conventions for white space
      handling...?"

If so, let's go for it. How about we concentrate in the first instance on
inter-operability of straight XML editing tools?

From dgd at cs.bu.edu Tue Aug 26 15:52:37 1997
From: dgd at cs.bu.edu (David G. Durand)
Date: Mon Jun 7 16:58:19 2004
Subject: Whitespace
In-Reply-To: <199708251834.TAA03755@GPO.iol.ie>
Message-ID:

At 1:07 PM -0500 8/25/97, Sean Mc Grath wrote:
>[Liam Quin]
>>If this happens, it wlil be time to standardise whitespace handling at the
>>applicaton level, perhaps. Right now, I fnd this argument totally bogus.
>
>What are you saying? Lets wait and see if the horse bolts - if he
>does we will lock the barn door?

This is a lovely example of how quoting out of context can replace giving a
counter-argument. Here's the _substantive_ part of Liam's note:

>Whitespace treatment needs to be specified in the CML specification,
>for example, and then any conforming CML processor will do the right
>thing and there's no problem. Taking CML and passing it to a CDF processor
>will result in different whitespace treatment, I expect... and also different
>treatment of all the non-whitespace too! And that's fine.
The point is (and I also made this at length before) that there are few ways
to meaningfully process markup without knowing the DTD or having a
stylesheet, and there are even fewer ways to process such markup (without a
stylesheet or knowledge of the DTD) such that ignoring whitespace is
something that you _need_ to do.

XML already provides you with _all_ the whitespace (unlike original flavor
SGML) -- so there's no problem with the parser hiding significant
whitespace. The only question is whether we should be adding features to
note that some whitespace is _insignificant_.

In fact I believe that the small set of ways to process markup (that you
don't know the meaning of, without a processing spec), and where you _have
to_ collapse or otherwise mangle the whitespace, is the _null set_. As I
asked before, I'd like to see even one example of an application that needs
this.

  -- David

_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________

From dgd at cs.bu.edu Tue Aug 26 15:52:49 1997 From: dgd at cs.bu.edu (David G.
Durand) Date: Mon Jun 7 16:58:20 2004 Subject: Whitespace In-Reply-To: <199708260838.JAA23835@andromeda.ndirect.co.uk> Message-ID: At 5:51 PM -0500 8/25/97, Neil Bradley wrote: >I want to >be able to use XML editor A, and allow people to view the >output on browser B and C, publish it on DTP system D, >send the data to someone else using editor E, >and let people search for pseude-elements using extended pointers >in products E and F, and all without extra spaces appearing or >vital spaces disappearing at any point. Vital spaces will never disappear in _XML parsing_ because all whitespace is literally passed along. This means that the safe thing is just to leave it in, and define stylesheets so they can strip any excess space. They'll only be disappearing if applications have bugs (which can be dealt with app-by-app, or if XML processors start "doing favors" for applications by "pre-normalizing" the data. >I cannot understand why some people think this will not be problem. I don't understand how it _can_ be a problem (in general, rather than due to particular bugs). >We are getting extreme views here, from let the XML processor handle >it, to let every application do its own thing. Neither position is >acceptable. >OK, lets rule out special cases. I can accept that CML and CDF etc >will have their own strict rules, perhaps, but I am far more >concerned with general document editing and publishing (the sort of >things HTML and SGML have been primarily used for). In general document editing, you still have DTDs and will still have conventions for whitespace. In particular, any formatting application _must have_ a stylesheet or other formatting spec. That is the correct place for formatting information about whitespace collapse to be specified. >Do we want XML to gain a reputation as an unreliable >data exchange and publishing format? Then we'd better not start dropping data in the parser! >We should not have to burden document authors with processing codes, >etc. 
People want the ease of use of HTML (and, dare I say it, SGML
>too, in this respect at least). I still think this is unnecessary.

>Others have recently proposed the style sheet as the answer, and I
>agree. My original proposal to base some of the rules on in-line/block
>definitions
>assumed this approach. It is more reliable than
>element content versus mixed content. I do not, however, think we
>need to go as far as waiting for the official DSSSL based style sheet
>to be completed. I for one do not believe all XML-aware applicaitons
>will use it, and certainly not in the short term. Any config file or
>style sheet will suffice.

Personally, despite the slight nausea engendered by the thought, I expect
that some CSS variation will be the one in common use -- and that CSS will
usually fold space like HTML does now.

>People are also proposing all kind of Unicode special characters to
>perform vital tasks. Let's remember here that few people even have
>the specification, let alone use this set extensively. I am sure its
>time will come, but let us be realistic. XML is going to be in
>widespread use first, and needs to be workable with 7-bit ASCII, if
>possible, and ISO 8859 if not.

XML is _defined_ to be Unicode, and the only way to do simple 8-bit
processors is to use UTF-8 -- but of course, that just makes special
Unicode chars look like "escape sequences". Not so bad, really.

>I did not expect the rules I (nervously and tentatively) proposed to be
>acceptable.
>But I did hope they could form the basis of detail discussion, from
>which a better set of rules would emerge. Unfortunately, we seem to
>be getting nowhere. I am trying not to depair. But it's hard.

All I care about is that XML-dev not give the impression that generic XML
processors should start folding whitespace, since we explicitly removed
whitespace processing from XML to avoid the "vanishing space problem".
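The claim that a generic XML processor hands every whitespace character to
the application can be checked with a small sketch. This is only an
illustration using Python's standard xml.etree.ElementTree; the markup and
the element names P, VAR and UNIT are assumptions made up for the example
(the tags of the messages in this thread did not survive in the archive).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element names P, VAR and UNIT are assumptions
# chosen for illustration, not a reconstruction of anyone's document.
loose = "<P>We took\n<VAR>23.03+e02</VAR>\n<UNIT>gram</UNIT>\nof water\n</P>"
tight = "<P>We took <VAR>23.03+e02</VAR><UNIT>gram</UNIT> of water</P>"

root = ET.fromstring(loose)
var, unit = root  # the two child elements, in document order

# The parser reports every whitespace character it read:
# .text is the character data before the first child,
# .tail is the character data that follows an element's end tag.
print(repr(root.text))   # 'We took\n'
print(repr(var.tail))    # '\n'
print(repr(unit.tail))   # '\nof water\n'


def flat_text(elem):
    """Concatenate all character data inside elem, whitespace included."""
    return "".join(elem.itertext())


# Nothing is added and nothing is dropped: each source yields exactly the
# characters its author typed.
print(repr(flat_text(root)))
print(repr(flat_text(ET.fromstring(tight))))
```

Whether the newlines in the first form should later be folded or dropped is
then a decision for the application or stylesheet, not for the parser.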
If we can find any applications other than formatting, and that don't depend
on knowing the meanings of the tags, then we need to consider using PIs to
declare special whitespace folding in a document. I don't currently believe
that such applications exist -- because I can't come up with any. When I
thought they _might_ exist, I thought that this kind of spec. would be a
good idea. Now it just seems to add confusion where we had made simplicity.

I still think "all whitespace is significant" is the simplest rule we can
use that allows everything that we can do today.

  -- David

_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________

From dgd at cs.bu.edu Tue Aug 26 16:26:36 1997
From: dgd at cs.bu.edu (David G. Durand)
Date: Mon Jun 7 16:58:20 2004
Subject: Whitespace
In-Reply-To: <9653@ursus.demon.co.uk>
Message-ID:

At 7:39 AM -0500 8/26/97, Peter Murray-Rust wrote:
>Actually I would like to develop CML *without* its own set of rules as far
>as possible. OK, Only chemists want to know how to display , but
>there is just as much material of the form:
>

We took >23.03+e02 >gram >of water >

>and we want to know whether there is whitespace round the contained elements. >As I have repeatedly said I would like to borrow a communal solution rather >than invent yet another one. But there is whitespace around the contained elements. If you don't want it, don't put it in... XML passes all whitespace in the source to the application.

We took 23.03+e02gram of water

has no space. An Author can be told to enter either one, depending on what they want. If you want the effect of my markup with the source you gave, that's a CML convention, to the effect that VAR and UNIT "eat" adjacent whitespace... or a formatting convention. The worst problem I see with whitespace is one that can't be solved by a parser easily: if I have a document bit like: This is an end of paragraph.

And this is the start of another. There's no way to tell that there isn't a word "paragraph.And" in the document, without knowing the meaning of the tags. Of course there is only one word in: Large initial letter But this tends to bear out my view that whitespace handling is just the tip of an iceberg only soluble with a lot of semantic knowledge -- that it is the duty of stylesheet and DTD authors to determine. _________________________________________ David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com Boston University Computer Science \ Sr. Analyst http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams --------------------------------------------\ http://dynamicDiagrams.com/ MAPA: mapping for the WWW \__________________________ xml-dev: A list for W3C XML Developers Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ To unsubscribe, send to majordomo@ic.ac.uk the following message; unsubscribe xml-dev List coordinator, Henry Rzepa (rzepa@ic.ac.uk) From digitome at iol.ie Tue Aug 26 16:26:55 1997 From: digitome at iol.ie (Sean Mc Grath) Date: Mon Jun 7 16:58:20 2004 Subject: Whitespace Message-ID: <199708261426.PAA01594@mail.iol.ie> >At 1:07 PM -0500 8/25/97, Sean Mc Grath wrote: >>[Liam Quin] >>>If this happens, it wlil be time to standardise whitespace handling at the >>>applicaton level, perhaps. Right now, I fnd this argument totally bogus. >> >>What are you saying? Lets wait and see if the horse bolts - if he >>does we will lock the barn door? > >This is a lovely example of how quoting out of context can replace giving a >counter-argument. Here's the _substantive_ part of Liam's note: [David Durand] My counter argument (not reproduced above) *followed* the sentence you *have* reproduced. A lovely example of how quoting.... Here is a concrete scenario that either illustrates the problem or illustrates my ignorance. I want to know how two XML applications that apply different WS conventions can inter-operate losslessly. 
Specifically, why is this scenario wrong? :-

I wish to perform a null transformation across two editing tools,
App A and App B.

        foo.xml --> App A --> App B --> bar.xml

I want foo.xml == bar.xml

App A : reads foo.xml and treats WS according to APPA-WS-RULES
        writes temp.xml
App B : reads temp.xml and treats WS according to APPB-WS-RULES
        writes bar.xml

Result : foo.xml != bar.xml

From dgd at cs.bu.edu Tue Aug 26 16:27:24 1997
From: dgd at cs.bu.edu (David G. Durand)
Date: Mon Jun 7 16:58:20 2004
Subject: Whitespace
In-Reply-To: <3402129C.22871191@allette.com.au>
References: <9708250211.AA01302@lute.apsdc.ksp.fujixerox.co.jp>
Message-ID:

At 6:17 PM -0500 8/25/97, Marcus Carr wrote:
>David G. Durand wrote:
>
>> > It does involve parsing it, but only until
>> >it sees mixed content. If elements are assumed to be ELEMENT until
>> >proven otherwise, surely this wouldn't be a massive overhead.
>>
>> It might involve buffering large amounts for whitespace across an
>> arbitrary parser lookahead, since there is no limit on the size of an
>> element, or where the non-space PCDATA might show up. One would have
>> to buffer the entire document in the parser before one could decide
>> whether to emit any whitespace in the root element. This might be a
>> bit of a memory performance hit...
>
>Why would you need to buffer anything? Every element starts with a
>default value of 'element'. As they're shown to be otherwise, their
>status is revised. This involves tracking open elements, not picking up
>chunks and reviewing them. One linear pass of the document tells you all
>you need to know.

Well, you can't send any of the data within an element while you are
"tracking", since you don't know if whitespace is data or noise.
So you have to buffer element opens and closes, and any PCDATA, until the element is over (or you find non-WS PCDATA). The easy to see worst case has megabytes of doc with one lone string: "The end" in the content of the top-level element, that otherwise contains only elements and no data. >> Manye people have claimed that they use editors incapable of >> funtioning without inserting linends (of their local flavor) every 200 >> characters or so. I (personally) wasn't very sympathetic to this >> argument, but it stood in for the empirical observation that people >> are very loose with whitespace/linends, and that forcing tools not to >> emit whatever line-ending codes it wants could be a problem. > >This would still respect the limits set by the user in the same way an >application would behave when you turn off hyphenation - the line might >be shorter, but it's broken in a sensible place. Exactly. So as I said, this ptoential justification for whitespace mangling is a non-starter. Thanks for the support. > >> The real problem is that there's an assumption that a generic >> processor can solve the "whitespace problem" -- and that is not really >> true. In a very real sense the meaning of whitespace is a product of >> the document _and_ and he application. For instance, line breaks (as >> indicated by whitespace) might be critical in a typesetting >> application for poetry (but _only in elements). The same >> document, however, would be best processed with some form of >> whitespace-collapsing everywhere, when indexed by a full-text search >> engine. The same data may have different signficance when processed >> differently. > >If line breaks are critical, they should be marked explicitly. If you >gave a hand written poem to a data entry person with no knowledge of >poetry, you may have to specify that you want the current line >boundaries respected. Why should an application not be given the same >info? It should. In a stylesheet. 
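The point that the same data can call for different whitespace treatment
under different processing can be sketched roughly as follows. This is an
illustration only: the document, the element names DOC, P and POEM, and the
set standing in for a "stylesheet" are all hypothetical, and Python's
standard xml.etree.ElementTree is used merely as a convenient parser that
passes all whitespace through.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document: DOC, P and POEM are made-up element names.
doc = ET.fromstring(
    "<DOC><P>Some   ordinary\n   prose.</P>"
    "<POEM>First line\n  Second line</POEM></DOC>"
)


def collapse(text):
    """Fold each run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Stand-in for a formatting spec: elements whose whitespace the typesetter
# must keep exactly as written. An assumption for this sketch only.
TYPESETTER_PRESERVE = {"POEM"}


def typeset(elem):
    """Typesetting view: preserve whitespace only where the spec says so."""
    text = "".join(elem.itertext())
    return text if elem.tag in TYPESETTER_PRESERVE else collapse(text)


def index(elem):
    """Full-text indexing view: collapse whitespace everywhere."""
    return collapse("".join(elem.itertext()))


p, poem = doc
print(repr(typeset(p)))     # 'Some ordinary prose.'
print(repr(typeset(poem)))  # 'First line\n  Second line'
print(repr(index(poem)))    # 'First line Second line'
```

The parser output is identical in both cases; only the application-level
rule differs, which is why the rule sits outside the parser in this sketch.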
I've still not seen an example of a case where an application that doesn't know the DTD, and doesn't have a processing spec needs to _collapse_ whitespace. XML always passes all the whitespace, so it is never lost except by _explicit application action_. > >> The fact is that whitespace should be controlled by the application. >> For typesetting and display, this means that practically, it's going >> to be part of the "stylesheet" or other processing mechanism. > >Whitespace is also a mechanism used to make data readable. In that >sense, a space is a character in it's own right, not just something that >appears around words. Imagine the response if it wasn't whitespace that >was being discussed, it was the letter 'x', and we were telling people >'x' may or may not appear in their data. You are the one arguing that it must be possible to "turn off x's" when convenient. XML _Passes all whitespace_. The only kind of convention we can create is one that turns off some whitespace. I see that as dangerous, for the reasons you give. We may be in raging agreement! > >> > As >> >far as I can see, much of the functionality in XML (such as linking) >> >relies on a DTD, so it's not going to be foreign to most XML >> >applications anyway. >> >> This is not necessarily the case. It's also harder to detect mixed >> content from DTD declarations, than simply to recognized #FIXED >> attributes. > >It can't be that hard. If parameter entities (I assume they're allowed?) >have to be unravelled anyway, surely it's just a case of looking at the >content model? If it starts with #PCDATA and contains anything else, >it's mixed content. If you don't have a DTD, you don't have content models. Even if you do have the DTD, a minimal parse would involve "entity unravelling" -- a serious increment in complexity just to be able to ignore a few spaces. 
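The buffering argument made earlier in this reply (without a DTD, an element
can only be classified as mixed content once non-whitespace character data
has actually been seen) can be illustrated with a short sketch. The
document and the helper name has_mixed_content are hypothetical, and the
whole tree is loaded into memory here, which is exactly the cost a
streaming processor would be trying to avoid.

```python
import xml.etree.ElementTree as ET


def has_mixed_content(elem):
    """True if elem directly contains non-whitespace character data.

    Hypothetical helper: it inspects the text before the first child and
    the text after each child, so the answer may depend on the very last
    chunk of the element.
    """
    chunks = [elem.text or ""] + [child.tail or "" for child in elem]
    return any(chunk.strip() for chunk in chunks)


# Worst case in miniature: only whitespace between the children until a
# lone string appears just before the end tag of the top-level element.
doc = ET.fromstring("<DOC>\n  <A/>\n  <B/>\n  The end</DOC>")

chunks = [doc.text] + [child.tail for child in doc]
print([repr(c) for c in chunks])

# The first two chunks look ignorable; only the third shows that all
# three were data, so nothing could safely be discarded before then.
print(has_mixed_content(doc))
print(has_mixed_content(ET.fromstring("<DOC>\n  <A/>\n</DOC>")))
```

Scaled up to a document of megabytes, the decision for the root element is
only available at its end tag, unless all whitespace is simply passed on.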
In any case, XML has decided to eliminate SGML's arcane whitespace rules,
since the ISO has agreed to create an SGML declaration option that will
have the same effect.

>"Pass all whitespace" will go some distance toward fixing the problem,
>but what else does it impact? Does it mean that inclusions and
>exclusions suddenly appear differently than they did in the 'old SGML'?

There are no inclusions or exclusions in XML. If you are using the new
declaration in the new SGML you'll have to read the spec and find out, but
it's irrelevant to XML.

The XML authors worked through the consequences, but it wasn't very hard,
since most of the problematic features of SGML (inclusion exceptions,
shortrefs, minimization) were already gone, so the interactions were
simple. The one weird thing is that the distinction between whitespace
behavior for element and mixed content no longer exists. You see all
whitespace regardless. This was essentially required for DTDless and
DTDfull parsing to produce equivalent results.

So in "pure XML" whitespace is never "source-code formatting", but is
_always_ data.

>My understanding is that one of the basic requirements of XML was that
>the applications had to be easy to write, so things could be allowed to
>happen quickly. As much as I do agree that this would have to be a good
>thing, (and as you pointed out, applications are coming out already) I
>would argue that maybe they should be more difficult to write, but
>should address this issue correctly.

They do -- even when correctly means compatible with ISO SGML -- but we did
get ISO to simplify some of the hard bits of SGML.

  -- David
_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@ic.ac.uk)


From dgd at cs.bu.edu  Tue Aug 26 16:27:28 1997
From: dgd at cs.bu.edu (David G. Durand)
Date: Mon Jun  7 16:58:20 2004
Subject: Whitespace
In-Reply-To: <9649@ursus.demon.co.uk>
Message-ID:

At 4:01 AM -0500 8/26/97, Peter Murray-Rust wrote:
>The following mechanisms are consistent with the current spec and do not
>require changes:
>    1. stylesheets. The authors can describe how they expect stylesheet
>       processors to treat their documents.
>    2. PIs (e.g.
>    3. additional elements in the DTD (e.g. NEWLINE).
>    4. implicit conventions (i.e. 'always replace CR/LF with CR').
>
>(Have I missed anything?)

>> of conventions for specific classes of user agents (e.g., web
>> browsers) is useful, but I feel that it's my obligation to point out
>            ^^^^^^^^^
>Some people think this is a waste of time. Perhaps it may turn out to be.
>Unlike the discussions on the spec, this group has no stated goals and exists
>to provide mutual support for those developing XML applications. If a number
>of people feel this is worth discussing, then let's see if they can
>achieve anything. If *they* wish to spend the time trying to do this, it
>needn't waste other people's ... :-)

>My own feelings are that only mechanisms 1 and 2 above are likely to find
>favour. I think that PIs can be further explored in this discussion.
>(Perhaps I should not have used as this would (I think)
>require WG approval, so I would rephrase this as )

>Given that, it seems possible to include PI statements within the document as
>the how the author intends the whitespace to be treated.
I'm afraid that I must ask what these are to be used for. I used to think
that this was a problem, and now I don't see how we really need these
declarations. They only seem to be relevant for typesetting, and if
typesetting is the task, then you'll only get correct results with a
well-known DTD or stylesheet in any case -- so why have the declarations?

I'm not concerned about readers taking correct stylesheets and later
mucking them up -- that will always be possible.

>It may be argued that this can be done better with stylesheets. Perhaps I'm
>conservative, but I see PIs embedded in a document as 'being part of the
>document' to a greater extent than stylesheets which are more likely to be
>changed by people other than the document's authors.

My problem is that I'm no longer able to see why this information has to go
with the document... I can't think of a case where it's necessary, and tons
of other information about the meanings of tags, etc, is not also
necessary. Also, since XML passes all whitespace, the only case we can deal
with is one where it's essential to _ignore_ whitespace in the source
document.

  -- David
_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________


From dgd at cs.bu.edu  Tue Aug 26 16:46:17 1997
From: dgd at cs.bu.edu (David G. Durand)
Date: Mon Jun  7 16:58:20 2004
Subject: Whitespace
In-Reply-To: <199708261426.PAA01594@mail.iol.ie>
Message-ID:

At 9:26 AM -0500 8/26/97, Sean Mc Grath wrote:
>[David Durand]
>My counter argument (not reproduced above) *followed* the sentence you *have*
>reproduced. A lovely example of how quoting....

I knew that I should have held my fingers (but hit send too soon).
Apologies for implying that you are not trying for understanding.

>Here is a concrete scenario that either illustrates the problem or
>illustrates my ignorance.
>
>I want to know how two XML applications that apply different
>WS conventions can inter-operate losslessly. Specifically, why is this
>scenario wrong? :-
>
>I wish to perform a null transformation across two editing tools App A and
>App B.
>
>foo.xml --> App A --> App B --> bar.xml
>
>I want foo.xml == bar.xml
>
>App A : reads foo.xml and treats WS according to APPA-WS-RULES
>        writes temp.xml
>
>App B : reads temp.xml and treats WS according to APPB-WS-RULES
>        writes bar.xml
>
>Result : foo1.xml != bar.xml

Editing tools that change whitespace are not preserving the XML data stream
that would be returned by a parser on the document. A tool that works like
this is simply buggy, since it reads in data that would return one data
stream to applications, and produces output that would produce a different
stream.

On the current definition, even tools that normalize CRLF to LF are
potentially damaging the document. This last is the only point that worries
me much.

Editors are _not allowed_ to blindly apply application conventions, unless
they can _ensure_ that the document was created for, and will only be
processed by, that application.

The beauty of not having whitespace normalization is that it's easy to tell
if you've changed anything, because the only way not to change it is to
change nothing.
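
The test described above (does a tool's output return the same data stream
to an application as its input did?) can be checked mechanically. What
follows is only a sketch under stated assumptions: Python's standard
xml.etree.ElementTree stands in for "a parser", and the helper names
data_stream and is_lossless are made up for this illustration. Comments and
processing instructions are not compared.

```python
import xml.etree.ElementTree as ET

def data_stream(xml_text):
    """Flatten a document into the events an application would be handed.

    Hypothetical helper: element starts/ends and character data only.
    """
    events = []

    def walk(elem):
        events.append(("start", elem.tag, sorted(elem.attrib.items())))
        if elem.text:
            events.append(("text", elem.text))
        for child in elem:
            walk(child)
            if child.tail:
                events.append(("text", child.tail))
        events.append(("end", elem.tag))

    walk(ET.fromstring(xml_text))
    return events

def is_lossless(before, after):
    """True if both documents yield the same stream, whitespace included."""
    return data_stream(before) == data_stream(after)

original = "<doc>\n  <p>one  two</p>\n</doc>"
rewrapped = "<doc><p>one two</p></doc>"   # what a "tidying" editor might write

print(is_lossless(original, original))    # True
print(is_lossless(original, rewrapped))   # False
```

Any tool in the foo.xml --> App A --> App B --> bar.xml chain that fails
this comparison has changed the document, whatever its whitespace rules
were meant to achieve.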
The only safe rule for an editor is to preserve whitespace just as it is,
unless it knows something about the DTD, or stylesheet, or if the author
requests special handling because she knows something about these.

  --- David
_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________


From digitome at iol.ie  Tue Aug 26 18:45:21 1997
From: digitome at iol.ie (Sean Mc Grath)
Date: Mon Jun  7 16:58:20 2004
Subject: Whitespace
Message-ID: <199708261645.RAA22663@mail.iol.ie>

[David Durand]
>
>Editing tools that change whitespace are not preserving the XML data stream
>that would be returned by a parser on the document. a Tool that works like
>this is simply buggy, since it reads in data that would return one data
>stream to applications, and produces output that would produce a different
>stream.
>
>On the current definition, even tools that normalize CRLF to LF are
>potentially damaging the document. This last is the only point that worries
>me much.

It worries me too! Here is a concrete example of a CRLF bug that I hit
today.

I have just used an OffLine Browser called Snake to download a web site
authored in MS FrontPage. Some of the links have been correctly munged to
local links and some have not. By inspecting the HTML it emerged that
correctly munged links looked like this:-

whilst un-munged links looked like this:-

It is easy to see what has happened here. The s/w developers have
a pattern for matching AREA elements that does not countenance the presence
of a CRLF.
How should analogous problems in XML be addressed? Doing WS processing
makes pattern matching/state space handling easier but at the expense of
making it very difficult to re-produce the elided WS to ensure lossless
transformation.


From tbray at textuality.com  Tue Aug 26 19:04:03 1997
From: tbray at textuality.com (Tim Bray)
Date: Mon Jun  7 16:58:20 2004
Subject: Whitespace
Message-ID: <3.0.32.19970826100108.00ab7bb0@pop.intergate.bc.ca>

At 05:45 PM 26/08/97 +0100, Sean Mc Grath wrote:
>It is easy to see what has happened here. The s/w developers have
>a pattern for matching AREA elements that does not countenance the presence
>of a CRLF.

Gimme a break; the software developers in this case have screwed up;
there is a technical term to describe this behavior: "wrong".  There may
in fact be productive things to be said about particular application
profiles for whitespace handling, but this example is a complete
red herring.

>How should analogous problems in XML be addressed.

By writing software correctly. -T.


From Peter at ursus.demon.co.uk  Tue Aug 26 19:22:29 1997
From: Peter at ursus.demon.co.uk (Peter Murray-Rust)
Date: Mon Jun  7 16:58:20 2004
Subject: Whitespace
Message-ID: <9668@ursus.demon.co.uk>

There is clearly a wide spectrum of opinion on this - and everyone is being
very helpful and patient. I think I see where (at least some of) the
differences lie and hope this is helpful:

In message  dgd@cs.bu.edu (David G. Durand) writes:
>
> I'm afraid that I must ask what these are to be used for. I used to think
> that this was a problem, and now I don't see how we really need these
> declarations. They only seem to be relevant for typesetting, and if

I think this highlights that what we are doing is going through a learning
process and David (and others) have already been through this :-). It took
several months for XML-WG to arrive at the present position (there were
intermediate drafts which included munging of various sorts).

[It reminds me of a story of a very famous physicist (I forget whom) who,
when asked to justify an equation in a lecture, stated it was trivial, then
looked at it in silence for 15 mins, and then re-iterated 'Yes, it is
trivial'.]

The problem we have is not a technical one, but a variety of human
perceptions and preconceptions. We agree that:
	1. this is NOT a parser concern, and all whitespace is passed to
		the application.
	2. that it is always *possible* to create an XML document in which no
		non-significant whitespace appears.
	3. the XML-WG, in its wisdom, has found it useful to allow authors
		to pass the attribute XML-SPACE="DEFAULT" to the application.

I believe that (2) is David's position which is logical and consistent. If
(2) is universally applied then I can see no value in (3). It suggests that
there is value in passing non-significant whitespace to the application and
processing it in some application-dependent way. If we are processing
whitespace by stylesheet, then isn't DEFAULT
irrelevant? My problem is probably mainly because, after *much* debate, (3)
has been included in the spec and I don't see what it is for.

[David suggests that one reason to add whitespace is that it should appear
in the final typeset version - this makes it significant (though I suspect
that some people would prefer to pass explicit markup). Personally I do not
wish to do this.]

As David says, it is possible to produce an XML document with no line-ends
and no other non-significant whitespace. If additional whitespace (e.g.
for paragraphs) is to be included in the processed document, then it can
either be explicitly included as markup, or deduced from markup through
stylesheets or other methods.

The reasons I can see that non-significant whitespace is contained in XML
documents are:
	- the documents are produced to be human-readable
	- the authoring/editing tools used introduce non-significant whitespace
	- non-significant whitespace is required to allow various tools to
		process the documents
	- humans edit the XML documents

I can conceive of a time (perhaps 2 years hence) when there are a wide
variety of XML authoring tools and when the HTML community is educated about
XML. In that state, perhaps, documents will be always created without
non-significant whitespace. Then, perhaps, we shall have a non-problem.

At present we have (at least) the viewpoints:
	- whitespace matters and authors must define precisely what they want
		in a document. The SGML community can understand and manage
		whitespace. If newcomers find it difficult, they'll have to
		learn the rules, or use proper tools.
	- most of the people who will want to use XML will graduate from HTML.
		This has 'taught' them that whitespace is not significant and
		gets normalised somewhere. They will start creating XML by
		analogy with HTML. XML will not succeed unless we can
		offer some support for this transitional period.

As is fairly obvious, I take the second viewpoint. I am trying to 'sell' CML
to a community which has never heard of SGML, but knows about HTML. I cannot
sell them files which they can't read (because they have no line breaks) or
force them to understand where space conventions differ from HTML. Remember
that many XML files are going to be authored by people who never go near an
SGML tool - the molecular community will probably use C programs.
So - David asks for examples :-)

I want to be able to state that these 3 XML documents are to be interpreted
to give identical results:

and

Almost everyone who posts **examples** of XML files shows them prettyprinted
in some fashion. No-one posts 1000 character lines to this list, or to
XML-SIG - they wouldn't be popular! So the impression is probably universal
outside the XML experts that XML files can be prettyprinted ad lib. I would
like to preserve this prettyprinting - I suspect this is a major motive for
trying to see some way forward here.

A second example could be the one that I posted earlier:

We took 23.02+02 gram water

This clearly contains 'text' and my community is conditioned to reading
this in the same way as HTML (i.e. that the line-ends are normalised to a
single space.) It seems to me that this is likely to be valuable in many
applications and that interoperability and code re-use would be greatly
helped by giving it a label and a set of rules. As I have said more than
once I would like to avoid having to develop both my own rules and my own
code.

I have a fear (and I think it is shared by my community) that data within a
document can be changed by changing a stylesheet. The *meaning* of the
(HTML) file below differs according to whether the line-end is normalised
to a space or not:

I saw a black bird

Since stylesheets can be (and will be) imposed by people other than the
author (publishers, browsers, readers, etc.) there is a danger that
stylesheet imposed WS processing can change meaning. Of course you can
argue that the author above should have taken greater trouble to create an
unambiguous text, but this is the way that I expect many newcomers to XML
to approach it.

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/


From gannon at commerce.net  Tue Aug 26 19:32:54 1997
From: gannon at commerce.net (Patrick Gannon)
Date: Mon Jun  7 16:58:21 2004
Subject: Papers Comparing MCF, CDF, D-C & RDF?
Message-ID: <01BCB209.51EF0540@arrow-d29.sierra.net>

Does anyone know of any papers that discuss and compare/contrast the scope
of the following standards efforts:

MCF - Meta Content Framework (Apple/Netscape)
CDF - Channel Definition Format (Microsoft)
D-C - Dublin Core
RDF - Resource Description Framework (W3C)

I recognize that much of the discussion around these various topics
indicates they are in various stages of development and review.  What is
not clear is the precise scope of each of these endeavors.  What problem
sets are they trying to solve?

I have read most of the relevant documentation describing each of these
proposals (except RDF, since there is no public document describing the
scope of the W3C RDF WG that I could find).

So, is there any paper or soon-to-be-written paper that addresses the
relative scope of these efforts?

Patrick Gannon
-----------------------------------------
President & CEO
Internet Shopping Directory, Inc.
702-831-2251
702-831-3925 (Fax)
mailto://patrick@shoppingdirect.com
http://www.shoppingdirect.com


From tbray at textuality.com  Tue Aug 26 19:45:21 1997
From: tbray at textuality.com (Tim Bray)
Date: Mon Jun  7 16:58:21 2004
Subject: Whitespace
Message-ID: <3.0.32.19970826103807.00ab0e60@pop.intergate.bc.ca>

At 05:11 PM 26/08/97 GMT, Peter Murray-Rust wrote:
>	2. that it is always *possible* to create an XML document in which no
>		non-significant whitespace appears.
>	3. the XML-WG, in its wisdom, has found it useful to allow authors
>		to pass the attribute XML-SPACE="DEFAULT" to the application.
>
>I believe that (2) is David's position which is logical and consistent. If
>(2) is universally applied then I can see no value in (3). It suggests that
>there is value in passing non-significant whitespace to the application and
>processing it in some application-dependent way. If we are processing
>whitespace by stylesheet, then isn't DEFAULT
>irrelevant? My problem is probably mainly because, after *much* debate, (3)
>has been included in the spec and I don't see what it is for.

Well DEFAULT is 'irrelevant' in that it expresses no opinion about what
should be done with whitespace.  The PRESERVE value exists to support
constructs like HTML's
<PRE>.  Yes, putting XML-SPACE="PRESERVE" on
something with element content is at the least questionable; but the
fact that this can be used to do something stupid does not mean it
isn't useful.

>At present we have (at least) the viewpoints:
>	- whitespace matters and authors must define precisely what they want
>		in a document. The SGML community can understand and manage
>		whitespace. If newcomers find it difficult, they'll have to
>		learn the rules, or use proper tools.

Well, they only have to learn one rule: the whitespace you put in
the document is the whitespace that is in the document.  XML neither
addeth nor taketh away.

>	- most of the people who will want to use XML will graduate from HTML.
>		This has 'taught' them that whitespace is not significant and
>		gets normalised somewhere. They will start creating XML by 
>		analogy with HTML. XML will not succeed unless we can
>		offer some support for this transitional period.

Uh, if they are using it for browser applications, I am quite sure that
browsers, while doing XML, will duplicate the HTML whitespace semantics,
i.e. eat most of it, and people will just not notice the difference.
Another way to say this is that the "HTML" whitespace semantic should
probably be renamed the "browser" whitespace semantic.

It would be a good and useful thing to write down (precisely) what
that browser semantic is; it's a little subtler than you'd think.
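
As a starting point for writing that semantic down, here is a deliberately
naive first approximation in Python. It is a sketch, not a description of
what any real browser does: it assumes the whole rule is "collapse each run
of whitespace to one space, except inside elements named pre", and the
function name collapse and its preserve parameter are invented for the
example. The subtleties (leading and trailing space at block boundaries,
inline versus block elements) are exactly what it leaves out.

```python
import re
import xml.etree.ElementTree as ET

WS_RUN = re.compile(r"[ \t\r\n]+")

def collapse(elem, preserve=frozenset({"pre"})):
    """Collapse whitespace runs in place.

    Elements named in 'preserve' are left untouched, descendants included.
    """
    if elem.tag in preserve:
        return
    if elem.text:
        elem.text = WS_RUN.sub(" ", elem.text)
    for child in elem:
        collapse(child, preserve)
        # The tail belongs to the parent's content, so it is always collapsed.
        if child.tail:
            child.tail = WS_RUN.sub(" ", child.tail)

doc = ET.fromstring("<p>I saw a black\nbird <pre>a\n  b</pre>\n\n today</p>")
collapse(doc)
print(ET.tostring(doc, encoding="unicode"))
```

The parser hands over every character; the collapsing is an explicit
application step, which is the division of labour argued for above.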

When they get into more ambitious apps than just browsing, they will
be glad of XML's transparency.
 - T.



From tbray at textuality.com  Tue Aug 26 19:48:50 1997
From: tbray at textuality.com (Tim Bray)
Date: Mon Jun  7 16:58:21 2004
Subject: Papers Comparing MCF, CDF, D-C & RDF?
Message-ID: <3.0.32.19970826104550.00a72930@pop.intergate.bc.ca>

At 10:17 AM 26/08/97 -0700, Patrick Gannon wrote:
>Does anyone know of any papers that discuss and compare/contrast the scope 
>of the following standards efforts:

No such exist, to my knowledge.

>So, is there any paper or soon-to-be-written paper that addresses the 
>relative scope of these efforts?

Several of these are about to be rolled together into RDF.  There is 
a big RDF meeting tomorrow and Thursday in Seattle at which this
process gets going.

For those who are W3C members, check out
 http://www.w3.org/Metadata/RDF/Group/9708/27Agenda.html
 -Tim



From Peter at ursus.demon.co.uk  Tue Aug 26 20:16:40 1997
From: Peter at ursus.demon.co.uk (Peter Murray-Rust)
Date: Mon Jun  7 16:58:21 2004
Subject: Whitespace
Message-ID: <9686@ursus.demon.co.uk>

Thanks Tim - I think this helps (me) considerably :-)

In message <3.0.32.19970826103807.00ab0e60@pop.intergate.bc.ca> Tim Bray writes:
[...]
> 
> Well DEFAULT is 'irrelevant' in that it expresses no opinion about what
> should be done with whitespace.  the PRESERVE value exists to support

so when might it be used (in preference to a stylesheet, for example?)

> constructs like HTML's <PRE>.  Yes, putting XML-SPACE="PRESERVE" on

Since the whitespace is all passed, presumably a stylesheet is capable of
keeping it all?
 
> something with element content is at the least questionable; but the
> fact that this can be used to do something stupid does not mean it
> isn't useful.

It sounds as if there isn't really very much need for XML-SPACE, and maybe
that has distorted my viewpoint...

> 
> >At present we have (at least) the viewpoints:
> >	- whitespace matters and authors must define precisely what they want
> >		in a document. The SGML community can understand and manage
> >		whitespace. If newcomers find it difficult, they'll have to
> >		learn the rules, or use proper tools.
> 
> Well, they only have to learn one rule: the whitespace you put in
> the document is the whitespace that is in the document.  XML neither
> addeth nor taketh away.

Understood. It is also the whitespace that your authoring tool puts in :-)
 
> >	- most of the people who will want to use XML will graduate from HTML.
> >		This has 'taught' them that whitespace is not significant and
> >		gets normalised somewhere. They will start creating XML by 
> >		analogy with HTML. XML will not succeed unless we can
> >		offer some support for this transitional period.
> 
> Uh, if they are using it for browser applications, I am quite sure that
> browsers, while doing XML, will duplicate the HTML whitespace semantics,
> i.e. eat most of it, and people will just not notice the difference.
> Another way to say this is that the "HTML" whitespace semantic should
> probably be renamed the "browser" whitespace semantic.
> 
> It would be a good and useful thing to write down (precisely) what
> that browser semantic is; it's a little subtler than you'd think.

I think this is the key to much of this discussion. (I am under no illusions
that it may be subtler than I can think :-) It was certainly true that early
HTML browsers could display whitespace very differently and I imagine
that there are still differences.

So - with Tim's encouragement - this seems like a useful thing to aim for.
This semantic seems to be one of the things we are chasing.
 
	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/



From digitome at iol.ie  Tue Aug 26 22:07:06 1997
From: digitome at iol.ie (Sean Mc Grath)
Date: Mon Jun  7 16:58:21 2004
Subject: Whitespace
Message-ID: <199708262006.VAA13479@GPO.iol.ie>

>At 05:45 PM 26/08/97 +0100, Sean Mc Grath wrote:
>>It is easy to see what has happened here. The s/w developers have
>>a pattern for matching AREA elements that does not countenance the presence
>>of a CRLF.

[Tim Bray]
>Gimme a break; the software developers in this case have screwed up;
>there is a technical term to describe this behavior: "wrong".  There may
>in fact be productive things to be said about particular application
>profiles for whitespace handling, but this example is a complete
>red herring. 
>

I presented this "red herring" because it was *real*. I could have
contrived a more realistic one:-) This is an
example of a *real* programmer screwing up in a real application.

I am interested in avoiding screwups. WS is a screwup "happy hunting
ground" for us normal programmers who make mistakes day in day out.

At least I think it is. Perhaps (hopefully) I'm wrong.

I doubt if I will get this right but I will try and formulate the programming
problem as I see it. 

Here goes:-

XML processing applications that read/write XML have to faithfully
reproduce white space to avoid data loss. In the course of XML processing,
actions will regularly be triggered by context, i.e. "element X within
element Y", "first data content chunk below element X", etc.

Take a really simple context, "X followed by Y". In order to faithfully
reproduce WS on output, the simple pattern "XY" must be transformed into
(in rusty Perl)

"(w*)X(w*)Y(w*)"

Where "w" represents the pattern for White Space.

As the state spaces get more complex (i.e. realistic) doesn't this problem
escalate?

Could someone out there who reckons this is easy kindly put
me out of my misery by showing how it can be best handled?
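
To make the pattern above concrete, here is a small Python sketch. It
assumes that "w" means any whitespace character (written \s below) and that
X and Y are literal tokens; the function name rewrite and the replacement
tokens A and B are made up for the illustration. It shows one way to act on
"X followed by Y" while writing the intervening whitespace back out
unchanged. It does not settle whether the approach scales to richer state
spaces.

```python
import re

# The "(w*)X(w*)Y(w*)" pattern, with \s standing in for "w".
PATTERN = re.compile(r"(\s*)X(\s*)Y(\s*)")

def rewrite(text, new_x="A", new_y="B"):
    """Replace X..Y with new_x..new_y, re-emitting captured whitespace as is."""
    def emit(match):
        before, between, after = match.groups()
        return f"{before}{new_x}{between}{new_y}{after}"
    return PATTERN.sub(emit, text)

print(repr(rewrite("XY")))         # 'AB'
print(repr(rewrite("X\r\n  Y")))   # 'A\r\n  B'
```

Capturing the whitespace rather than merely tolerating it is what keeps the
transformation lossless: every captured group is written back untouched.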



Sean Mc Grath

sean@digitome.com
Digitome Electronic Publishing
http://www.digitome.com




From mrc at allette.com.au  Wed Aug 27 00:39:48 1997
From: mrc at allette.com.au (Marcus Carr)
Date: Mon Jun  7 16:58:21 2004
Subject: Whitespace
References: <199708260532.WAA00995@boethius.eng.sun.com> <34029587.52568DFA@allette.com.au>
Message-ID: <34035B04.FC63D9D6@allette.com.au>

Marcus Carr wrote:

> Jon Bosak wrote:
>
> > A discussion of conventions for specific classes of user agents (e.g.,
> > web browsers) is useful, but I feel that it's my obligation to point
> > out to anyone mistakenly thinking that this issue might conceivably be
> > reconsidered in the current XML specification that it is not going to
> > happen.
>
> I'm not asking for anything to happen, but I do believe these things
> should be allowed to be discussed. If people tire of the topic,
> they'll stop talking about it - knocking healthy (even if misguided)
> discussion on the head contributes nothing.

It has been pointed out to me in private mail that my answer could be
perceived as somewhat unfair criticism and that I may have
misinterpreted the tone of Jon's mail. I'll plead all the usual excuses
(it was late, my cat had been run over, etc), offer my apologies and
rephrase my point.

Given that there is nothing binding the suggestions of this group to the
standard, I feel that even the most radical suggestions should be
entertained as an exercise in lateral thinking (and perhaps tolerance).
As long as we all accept that this has no impact on the formation of the
standard, and we must, this list can act as a well of diversity,
tempered only by the delete keys of its readership.

Again, Jon, my apologies.


--
Regards

Marcus Carr                  email:  mrc@allette.com.au
_______________________________________________________________
Allette Systems (Australia)  email:  info@allette.com.au
Level 10, 91 York Street     www:    http://www.allette.com.au
Sydney 2000 NSW Australia    phone:  +61 2 9262 4777
                             fax:    +61 2 9262 4774
_______________________________________________________________





From JohnGo at asymetrix.com  Wed Aug 27 00:41:16 1997
From: JohnGo at asymetrix.com (John Gossman)
Date: Mon Jun  7 16:58:21 2004
Subject: Request for advice defining an XML based syntax
Message-ID: 

>
>
>
>    To make a long story short:  I have been developing a file format for
>data exchange between applications.  The essential purpose is to provide a
>format that objects can stream their persistent state to, for saving or
>exchanging of data.  Further I have a number of criteria for this format: 
>    1.  It must be simple 
>    2.  It must be robust--resistant to data loss 
>    3.  Flexible -- all sorts of data 
>    4.  Extensible -- developers and users can add their own data and
>datatypes 
>    5.  Human readable -- easy to understand 
>    6.  Support versioning easily 
>    7.  Support strong typing--no confusion 
>    I knew from my experience with Autodesk's DXF (Drawing eXchange Format)
>that these goals were achievable, and knew where DXF fell down.  My essential
>idea is data come in two forms--primitive fields and structured records.  For
>primitive fields I realized I needed to store 3 things--type, name, and value.
> The original format I came up with was quite simple, in fact I'll just give
>an example of a button object's data: 
>
>start button 
>    string caption="Click Here" 
>    int left = 50 
>    int right = 100 
>    int top = 80 
>    int bottom = 100 
>end 
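
A reader for the record format shown above fits in a few lines, which
supports the "easy to parse" claim. The sketch below is in Python and rests
on assumptions not stated in the message: one field per line in the form
"type name = value", only the string and int types that appear in the
example, and no nested records. The name parse_record is invented for the
illustration and is not part of the OXF spec.

```python
import re

# One field per line: TYPE NAME = VALUE (spaces around '=' are optional).
FIELD = re.compile(r"^(\w+)\s+(\w+)\s*=\s*(.*)$")

def parse_record(lines):
    """Parse 'start NAME ... end' into (name, {field: (type, value)})."""
    rows = iter(line.strip() for line in lines if line.strip())
    header = next(rows).split()
    if len(header) != 2 or header[0] != "start":
        raise ValueError("expected 'start NAME'")
    fields = {}
    for row in rows:
        if row == "end":
            return header[1], fields
        match = FIELD.match(row)
        if not match:
            raise ValueError(f"bad field line: {row!r}")
        ftype, fname, raw = match.groups()
        # Only the two types from the example are interpreted.
        value = int(raw) if ftype == "int" else raw.strip('"')
        fields[fname] = (ftype, value)
    raise ValueError("missing 'end'")

sample = '''start button
    string caption="Click Here"
    int left = 50
    int right = 100
    int top = 80
    int bottom = 100
end'''

name, fields = parse_record(sample.splitlines())
print(name, fields["caption"], fields["left"])
```

Keeping the type next to each value is what makes the "strong typing"
criterion cheap to satisfy: the reader never has to guess whether 50 is a
number or a string.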
>  
>
>    Easy to parse, easy to output, easy to read (helps if you are a
>programmer used to a typed language), and no special characters except the
>almost universally understood '='.  Several of my co-workers asked why I
>didn't use MCF or XML.  My answer was that these formats are too complex, but
>after further study of XML I realized I could make an XML-compliant version
>of the syntax quite easily.  After several iterations I arrived at this: 
>
> 
>
>    Last week in Montreal, Tim Bray confirmed my suspicion that XML did not
>allow the suppression of attribute names as a form of shorthand, which is
>going to necessitate one more change.  However, on further thought, I also
>wonder if I have violated something of the spirit of XML by including all the
>data in attributes--all structure no content.  Option 1 then is the
>following: 
>
> 
>
>    There is precedent for such a thing, in HTML's IMG tag for example, which
>is an empty tag with all the "data" in attributes.  My question then.  Is
>this better?:
>
> 
>
>    So, I am asking for the kind advice of those most familiar with XML.
>Opinions please, either here or by private e-mail (johngo@asymetrix.com), on
>this question or anything else that comes to mind.
>
>    Many thanks in advance, 
>
>    John Gossman 
>    Asymetrix 
>
>    P.S.  The format (which I call OXF for Open Exchange Format) is fully
>defined in a spec written here.  It includes the ability to create data
>schema and use inheritance to extend them, and is specifically designed to be
>non-validating (for robustness:  you don't want to throw away all the data
>because of a few problems).  I would rather not post the spec. until I have
>settled these last few issues, but I will provide a draft for the asking. 
>
>
>
>
>
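
The original (pre-XML) "start ... end" format described in the message above can be read with very little code. The following is only a minimal sketch, assuming one field per line in the form `type name = value`; the function name `parse_record` and the rule of skipping unparseable lines are illustrative assumptions, not part of the OXF spec, which is not shown in this thread.

```python
import re

# Assumed shape of one field line: <type> <name> = <value>
FIELD = re.compile(r'^(\w+)\s+(\w+)\s*=\s*(.+?)\s*$')

# The button example from the message above.
SAMPLE = '''start button
    string caption="Click Here"
    int left = 50
    int right = 100
    int top = 80
    int bottom = 100
end'''


def parse_record(text):
    """Parse one 'start NAME ... end' block into (name, [(type, name, value)])."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("start ") or lines[-1] != "end":
        raise ValueError("expected a 'start NAME' ... 'end' block")
    record_name = lines[0].split(None, 1)[1]
    fields = []
    for ln in lines[1:-1]:
        m = FIELD.match(ln)
        if m is None:
            # Hypothetical robustness rule: skip a bad line, keep the rest.
            continue
        ftype, fname, raw = m.groups()
        if ftype == "int":
            try:
                value = int(raw)
            except ValueError:
                value = raw  # keep the raw text rather than lose the field
        else:
            value = raw.strip('"')
        fields.append((ftype, fname, value))
    return record_name, fields


name, fields = parse_record(SAMPLE)
```

Each field comes back as a (type, name, value) triple, matching the three things the message says a primitive field needs to store.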



From ricko at allette.com.au  Wed Aug 27 01:49:40 1997
From: ricko at allette.com.au (Rick Jelliffe)
Date: Mon Jun  7 16:58:21 2004
Subject: Request for advice defining an XML based syntax
Message-ID: <199708262357.JAA00084@jawa.chilli.net.au>

 
> From: John Gossman 
 
> >Option 1 then is the
> >following: 
> >
> > 
> >
> >    There is precedent for such a thing, in HTML's IMG tag for example, which
> >is an empty tag with all the "data" in attributes.  My question then.  Is
> >this better?:
> >
> > 
 
According to your taste, you can weigh these general rationales and 
come to your own decision--

1) Attributes are really shorthand so that you don't need complex
content models, and to allow a measure of stronger typing in particular
for ID and IDREF attributes.   This suggests it doesn't matter which you
use: you don't have a complex content model and your value attribute
is just CDATA.

2) The content is the thing primarily described by the GI. So an empty
element with an attribute called "value" is always an over-elaborate 
design.  This suggests you should use Option 2.

3) The content of an element is the text that a dumb browser that is not
aware of your document type will display.   Therefore the text 
should be in the nature of an alternative string for guidance.  So
 should be content, and  etc should use attributes.

4) You may at some future stage want to extend how , 
, etc work.  So option 1 leaves you free to define a content
model later, for some other functionality.

5) Using a value attribute is more familiar to HTML people who
like the meta tag.
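
The two options weighed in points 1-5 can be compared side by side. This is only a sketch: the examples in the original message did not survive archiving, so the element names (`button`, `int`) and the `name` and `value` attributes below are hypothetical stand-ins. It uses Python's standard `xml.etree.ElementTree` to show that a reader can accept either style with a small helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the element and attribute names are assumptions.
# Option 1 style: empty element, data carried in a "value" attribute.
ATTR_STYLE = '<button><int name="left" value="50"/></button>'
# Option 2 style: the data is the element's content.
CONTENT_STYLE = '<button><int name="left">50</int></button>'


def field_value(field):
    """Return the value from a 'value' attribute or, failing that, the content."""
    if "value" in field.attrib:
        return field.attrib["value"]
    return (field.text or "").strip()


def read_fields(xml_text):
    """Map each field name to a (type, value) pair, where type is the tag name."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): (f.tag, field_value(f)) for f in root}


option1 = read_fields(ATTR_STYLE)
option2 = read_fields(CONTENT_STYLE)
```

Both calls produce the same mapping, which fits point 1: for a simple CDATA value the choice is mostly one of taste, while points 2-4 concern what each style implies for browsers and for future extension.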


You should also consider:

 

The XML element type declaration for this is:




I hope this is some help.

Rick Jelliffe



From ricko at allette.com.au  Wed Aug 27 02:38:05 1997
From: ricko at allette.com.au (Rick Jelliffe)
Date: Mon Jun  7 16:58:21 2004
Subject: Whitespace
Message-ID: <199708270046.KAA01200@jawa.chilli.net.au>


 
> From: Sean Mc Grath 
 
> Could someone out there who reckons this is easy kindly put
> me out of my misery by showing how it can be best handled?

Without addressing your dolorous (if not rubescent) herring, 
Knuth's comments in "The Errors of TeX" are useful:

"The stickiest issue in TeX has always been the treatment of 
blank spaces.  Users tend to insert spaces in their computer
files so that files look nice, but document processors must also
treat spaces as objects that appear in the final output...
I kept searching for rules that would be simple enough to be
easily learned, yet natural enough that they could be applied 
almost unconsciously.  I finally concluded that no such rules
existed, and I opted for the best compromise I could find."

Charles Goldfarb commented at the Barcelona WG8 meeting 
that whitespace handling was one of the design areas that 
he felt SGML got wrong (by which I think he did not mean
that the SGML86 rules are not a workable, justifiable and 
rational compromise -- given the constraint of having to work
with fixed-line-length text editors, which is the nub of the
design decision for SGML86 -- merely that perhaps the XML 
'solution' of making it someone else's problem would 
have deflected some consternation away from ISO 8879, and 
partitioned functionality more neatly).

The solution that I think XML *now* has is this:

1) There are ISO 10646 characters available for lots of different
kinds of spaces. These can be specified directly by numeric 
character references, or indirectly using the ISO public entities.
Some of these entities are already familiar to HTML people: in
particular &nbsp; is generated almost pathologically by some
versions of Netscape's HTML editor.  So if you want to force 
a break or space, these should be used.

2) If you want to force that normal spaces should not be collapsed,
then the attribute  XML-SPACE="preserve" should be specified on 
the containing element.

3) Otherwise, you should use spaces and newlines only when you
need them, and expect whitespace sequences to be collapsed.
XML generators that have access to the DTD should strip out
confusing whitespaces from element and mixed content.

4) SGML86 and XML have different whitespace rules. So you should
expect to have to process the files to add or remove space when
you convert between the two, unless you write your SGML DTD
without mixed content and/or impose some stricter discipline on 
document creation.

5) If you need to prettyprint your document text, then you are best
advised to use whitespace within tags, rather than between tags.
For example:

An element

Rather than

An element

If this looks strange to XML people, then remember that Bert Bos found it
natural to do (something like) this in a paper he wrote:

blah< /x> blurt< /x>

So I do not think that we should assume too much about how HTML people
naturally view tag integrity.  (In SGML and XML, Bert's experimental markup
would be invalid and not well-formed, despite its nice pretty-printing: ETAGO '

Thanks for the summary in points 1-5.  Those are exactly the sort of points I
am seeking clarity on.

The other option would be fine if I were defining a format with types like
"button".  But OXF is designed to describe generic data, the button was just
an example.  The DTD is strictly optional, perhaps even harmful in the case of
OXF, since the whole purpose is to make it so the reader can salvage even
partial or poorly formed files.

John Gossman
Asymetrix

>----------
>From: Rick Jelliffe[SMTP:ricko@allette.com.au]
>Sent: Tuesday, August 26, 1997 4:47 PM
>To: 'xml-dev@ic.ac.uk'
>Subject: Re: Request for advice defining an XML based syntax
>
>
>You should also consider:
>
>
>
>The XML element type declaration for this is:
>
>
>    left CDATA #REQUIRED
>    right CDATA #REQUIRED
>    top CDATA #REQUIRED
>    bottom CDATA #REQUIRED
>
>
>
>
>


From ricko at allette.com.au  Wed Aug 27 04:43:11 1997
From: ricko at allette.com.au (Rick Jelliffe)
Date: Mon Jun  7 16:58:21 2004
Subject: Whitespace
Message-ID: <199708270251.MAA05357@jawa.chilli.net.au>

> From: Peter Murray-Rust

> I would strongly argue against Unicode characters at this stage. *I* wouldn't
> know where to get them from, and typing by hand could be a disaster.

I have attached a table with how XML, by adopting ISO 10646, allows
developers to handle spaces, hyphenation and breaking.  I hope people find
it useful.
(I have previously sent around versions of the ISO public entity sets
converted for XML use: these are available on Robin Cover's website at the
Summer Institute of Linguistics.  The table has a copyright note against
printing because I have prepared it for my forthcoming book "The SGML
Cookbook", out soon.)

You can get more information from:
  * the Unicode 2.0 book, available in book stores
  * the ISO 10646 standard, available from your national standards body
  * an online listing of the characters at the Unicode consortium's
    website, an independent one on the SGML Oslo archive site, and
    the SPREAD public entity set
  * on NT, the keycaps viewer, which lets you see (printing) characters
    in Unicode fonts.

>It will take a while before Unicode is natural to HTML authors.

ISO 10646 provides a very rich set of characters to handle spaces and
newlines.  It is very important that XML developers understand and
implement them, because doing so simplifies what people need to do in
their XML scripts.  It moves spacing from being a "how to format this
element" issue to being a "how to render this character" issue, which is
neater.

If developers ignore these unambiguous characters, they then have to
overload space and -, with unpredictable results.  To get definite results
you need definite markup: developers should not confuse the visual
simplicity of the space and hyphen with the complexity of what must be
marked up to get them to work.

There is *no* natural way for HTML people to do most of the things that
ISO 10646 offers for control of spaces, hyphenation and breaking.  However,
it is more like what users of word processors will find natural.

Rick Jelliffe
-------------- next part --------------
A non-text attachment was scrubbed...
Name: space.htm
Type: application/octet-stream
Size: 2841 bytes
Desc: space.htm (Internet Document (HTML))
Url : http://mailman.ic.ac.uk/pipermail/xml-dev/attachments/19970827/bc40c921/space.obj


From ricko at allette.com.au  Wed Aug 27 04:57:42 1997
From: ricko at allette.com.au (Rick Jelliffe)
Date: Mon Jun  7 16:58:21 2004
Subject: Request for advice defining an XML based syntax
Message-ID: <199708270305.NAA05672@jawa.chilli.net.au>

> From: John Gossman

> The other option would be fine if I were defining a format with types
> like "button". But OXF is designed to describe generic data, the button
> was just an example. The DTD is strictly optional, perhaps even harmful
> in the case of OXF, since the whole purpose is to make it so the reader
> can salvage even partial or poorly formed files.

Declarations are also useful to describe what you want to get after
salvaging.  So they can be documentation for humans too.

Rick Jelliffe


From mike at datachannel.com  Wed Aug 27 18:07:31 1997
From: mike at datachannel.com (Mike Dierken)
Date: Mon Jun  7 16:58:21 2004
Subject: Request for advice defining an XML based syntax
Message-ID: <01BCB2C8.7AB050F0@NEMO>

I also have some (philosophical) questions about elements and attributes
in XML.

Rick J's point:
> 3) The content of an element is the text that a dumb browser that is not
> aware of your document type will display it.   Therefore the text
> should be in the nature of an alternative string for guidance.  So
>  should be content, and  etc should use attributes.
made a lot of sense for me.  I think, however, that John G's application of
XML is such that the properties of objects 'are' the content, and therefore
it's not required for other viewers to skip that information.
I would like to hear some pros and cons about the following four styles
(continuing John Gossman's example):

1 Attributes within element

2 Attributes as single specific sub-element

3 Attributes as several specific sub-elements
NOTE: the properties of the button are stored as in style 1 (i.e. within
the element) so other viewers can skip them.

4 Attributes as several generic sub-elements
NOTE: The properties of the button are stored as content, since the
document is intended to be storage for objects & their properties (i.e. the
properties 'are' the content).

In addition I have two questions about elements and attributes.

1. Generic tag with 'type' attribute
When should you use a generic versus a specific