Ah, the Record End issue raises its ugly head yet again!

Three issues are raised for the database world: (1) Is a given relational
database a compliant XML processor, from the standpoint of whitespace
handling? (2) If we are storing XML elements as "objects" in a relational
database, how do we know what the declared values of the elements'
attributes are? (3) Is it really necessary, as the examples given suggest,
to eliminate mixed content from XML documents in order to treat
them as "objects" in relational databases?

>[mark birbeck]
>
>   There are a number of simple ways of treating 'mixed content'. If we
>   have:
>
>           <TEXT>
>                   <QUOTE>
>                           Living in the
>                           <COUNTRY ISO="US">States</COUNTRY>
>                           must be great y'all
>                   </QUOTE>
>                   , said

>[peter flynn]
>
>This is going to cause problems if you ever want to get the data
>back out again and use it for typesetting, because you'll get
>
>   " [a]Living in the States must be a great y'all " , [b]said ..."

>Those intrusive spaces are what most browsers will do with the record
>ends.

[NOTE: I've added [a] and [b] above after the intrusive spaces, instead of
using "^^" indicators, for those like me whose idiotic bloatware mail
programs prevent them from using monospaced fonts unless they put the
whole mail into HTML.]

Leaving aside browsers, isn't it the case that with a conformant XML
processor, the whitespace will simply be passed through to the application?
(By XML 1.0 clause 2.10.)  Thus, spaces at [a] and [b] will and must
"intrude", and it's up to the composition program to deal with them.
However, with a conformant SGML parser -- correct me if I am wrong here,
Peter -- the spaces at [a] and [b] will be ignored, because they are all in
element content, not mixed content. (By the opaque clause 7.6.1 of ISO 8879.)
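
To make the XML half of that concrete, here is a minimal sketch using
Python's standard xml.etree.ElementTree as a stand-in for "an XML
processor". Mark's example is truncated in the quote, so the closing
</TEXT> tag and the exact indentation are my assumptions. The last two
lines show one thing a composition program might do with the whitespace;
that collapsing rule is likewise an assumption, not anything the XML
spec requires.

```python
import xml.etree.ElementTree as ET

# Mark's example, completed with a hypothetical closing </TEXT> so it parses.
doc = """<TEXT>
    <QUOTE>
        Living in the
        <COUNTRY ISO="US">States</COUNTRY>
        must be great y'all
    </QUOTE>
    , said
</TEXT>"""

root = ET.fromstring(doc)
quote = root.find("QUOTE")
country = quote.find("COUNTRY")

# The parser hands over every whitespace character, record ends included.
print(repr(root.text))     # whitespace before <QUOTE>
print(repr(quote.text))    # text before <COUNTRY>, with its line breaks
print(repr(country.tail))  # text after </COUNTRY>
print(repr(quote.tail))    # ", said" with the whitespace around it

# One possible application-level policy: collapse whitespace runs to one space.
flat = " ".join("".join(root.itertext()).split())
print(flat)
```

Under that policy the record end before ", said" survives as a space in
front of the comma, which is the intrusion Peter describes.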

So issue (1) for any database vendor/implementer:

Is the database a compliant XML processor? In particular, does it handle
whitespace according to XML 1.0's clause 2.10?

>[mark birbeck]
>           <TEXT>
>                   <QUOTE>
>                           <COUNTRY ISO="US" PRE="Living in the" POST="must
>   be great y'all">States</COUNTRY>
>                   </QUOTE>
>

>[peter flynn]

>Gag, Mark. But it resolves to the same stream.

<gag type="intense"/>Agreed!

But does it really resolve to the same stream?  (Again, the mailers may be
messing up the whitespace in the examples.)

Looks to me like the above resolves to "Living in the[a]States", since
there's no space at [a] either trailing the attribute value or leading the
content of the <COUNTRY> element.

(Assuming the declared value of the PRE attribute is CDATA, since
otherwise, by clause 3.3.3, the trailing space is normalized away.
Hmmm, guess we need the DTD after all, even in the relational world.)
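
A small sketch of why the declared value matters. The function below is
my own illustration of the last step of attribute-value normalization in
clause 3.3.3; it assumes the parser has already dealt with line breaks
and references, and the trailing space in the sample value is
hypothetical.

```python
def normalize_attribute(value: str, declared_type: str = "CDATA") -> str:
    """Final step of XML 1.0 attribute-value normalization (clause 3.3.3).

    Assumes line breaks and references were already handled by the parser.
    """
    if declared_type == "CDATA":
        return value  # spaces are kept exactly as written
    # Any other declared value: drop leading and trailing spaces and
    # collapse runs of spaces into a single space.
    return " ".join(part for part in value.split(" ") if part)


pre = "Living in the "  # hypothetical: PRE written with a trailing space
print(repr(normalize_attribute(pre, "CDATA")))     # 'Living in the '
print(repr(normalize_attribute(pre, "NMTOKENS")))  # 'Living in the'
```

Same characters in the document, two different values delivered to the
application, and only the DTD says which one is right.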

So issue (2) for the relational database implementer/vendor:

If we are storing XML elements as "objects" in a relational
database, how do we know what the declared values of the elements'
attributes are?

(Which we need to know, because otherwise we can't be assured of processing
the whitespace properly when it occurs in attribute values.)

How is the following content, <COUNTRY PRE="Living in the">, stored in
the database so that we know that a DTD sets the declared value of the
attribute PRE to CDATA?
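
One conceivable answer, offered only as a sketch: keep the DTD's
attribute declarations in a table of their own. The table name, the
column names, and the NMTOKEN declaration for ISO are all hypothetical;
nothing in this thread says any product does this.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical table: one row per attribute declared in the DTD.
db.execute("""CREATE TABLE attribute_decl (
    element_name   TEXT NOT NULL,
    attribute_name TEXT NOT NULL,
    declared_type  TEXT NOT NULL,   -- CDATA, ID, NMTOKEN, ...
    PRIMARY KEY (element_name, attribute_name))""")
db.executemany(
    "INSERT INTO attribute_decl VALUES (?, ?, ?)",
    [("COUNTRY", "PRE", "CDATA"), ("COUNTRY", "ISO", "NMTOKEN")])


def declared_type(element: str, attribute: str) -> str:
    row = db.execute(
        "SELECT declared_type FROM attribute_decl "
        "WHERE element_name = ? AND attribute_name = ?",
        (element, attribute)).fetchone()
    # With no declaration read, XML 1.0 treats the attribute as CDATA.
    return row[0] if row else "CDATA"


print(declared_type("COUNTRY", "PRE"))   # CDATA
print(declared_type("COUNTRY", "ISO"))   # NMTOKEN
print(declared_type("COUNTRY", "POST"))  # CDATA (undeclared)
```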

>[Sam Hunting]
>   > How do you do a join on XML data that looks like this:
>
>   > <element>This is #PCDATA<mixed>with mixed content</mixed>and an element
>   > mixed.</element>
>
>[Peter Flynn]
>The word "join" is out of context: it belongs in the vocabulary of
>database engineering, and XML is about text markup.

Possibly join is the wrong word -- this is a conversation about XML markup
and database engineering, and I come from the markup world, not the
database world, so perhaps my usage was not on point.

Mark Birbeck originally wrote:

    The attribute table has a join on the element table
    to say what element the attribute belongs to, whilst the element has
    joins to itself to say who the parent of an element is. This allows us
    to store an object-like tree structure, and so generate XML documents
    from any point in the tree.
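
For readers from the markup side, here is my reading of that description
as a runnable sketch in Python with sqlite3. Every table and column name
is hypothetical, and the "content" column for leaf character data is my
assumption; Mark's actual schema is not shown in the thread.

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE element (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES element(id),  -- self-join: the parent
    name      TEXT NOT NULL,
    content   TEXT                             -- leaf character data only
);
CREATE TABLE attribute (
    element_id INTEGER NOT NULL REFERENCES element(id),
    name       TEXT NOT NULL,
    value      TEXT NOT NULL
);
INSERT INTO element VALUES (1, NULL, 'TEXT', NULL),
                           (2, 1, 'QUOTE', NULL),
                           (3, 2, 'COUNTRY', 'States');
INSERT INTO attribute VALUES (3, 'ISO', 'US'),
                             (3, 'PRE', 'Living in the'),
                             (3, 'POST', 'must be great y''all');
""")


def to_xml(element_id: int) -> str:
    """Serialize the subtree rooted at element_id."""
    name, content = db.execute(
        "SELECT name, content FROM element WHERE id = ?",
        (element_id,)).fetchone()
    attrs = db.execute(
        "SELECT name, value FROM attribute WHERE element_id = ? ORDER BY rowid",
        (element_id,)).fetchall()
    children = db.execute(
        "SELECT id FROM element WHERE parent_id = ? ORDER BY id",
        (element_id,)).fetchall()
    attr_text = "".join(f" {n}={quoteattr(v)}" for n, v in attrs)
    body = escape(content or "") + "".join(to_xml(c) for (c,) in children)
    return f"<{name}{attr_text}>{body}</{name}>"


print(to_xml(3))  # start from COUNTRY
print(to_xml(1))  # or from the root
```

Note that an element row here holds either child elements or one run of
character data, never an interleaving of the two.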

I was concerned to know how this "join" approach handled mixed content.
The answer: it doesn't.

Only element content is allowed, either (a) through the gag-inducing
attribute approach used above, or (b) by having an element or pseudo-element
(say, <pcdata>) that contains only #PCDATA. Both approaches sound to me
like <ironic>optimizations</ironic> driven by the constraints of an
installed base, rather than being driven, as the XML specification is, by
the requirement that "XML documents should be human-legible and reasonably
clear".
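
To show what approach (b) amounts to when the pseudo-element lives only
in the tables rather than in the markup, here is a hedged sketch: each
run of character data becomes its own row, ordered by position among its
siblings. The single "node" table and all its column names are
hypothetical, and attributes are left out to keep it short.

```python
import sqlite3
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

db = sqlite3.connect(":memory:")
# Hypothetical layout: text runs are rows too, ordered by position.
# kind is 'element' (value holds the name) or 'text' (value holds the data).
db.execute("""CREATE TABLE node (
    id INTEGER PRIMARY KEY, parent_id INTEGER, position INTEGER,
    kind TEXT, value TEXT)""")


def add(parent_id, position, kind, value):
    cur = db.execute(
        "INSERT INTO node (parent_id, position, kind, value) "
        "VALUES (?, ?, ?, ?)", (parent_id, position, kind, value))
    return cur.lastrowid


def shred(elem, parent_id=None, position=0):
    """Store elem and its mixed content; return the new row id."""
    me = add(parent_id, position, "element", elem.tag)
    pos = 0
    if elem.text:
        add(me, pos, "text", elem.text)
        pos += 1
    for child in elem:
        shred(child, me, pos)
        pos += 1
        if child.tail:
            add(me, pos, "text", child.tail)
            pos += 1
    return me


def rebuild(node_id):
    """Serialize the stored subtree back to markup."""
    kind, value = db.execute(
        "SELECT kind, value FROM node WHERE id = ?", (node_id,)).fetchone()
    if kind == "text":
        return escape(value)
    kids = db.execute(
        "SELECT id FROM node WHERE parent_id = ? ORDER BY position",
        (node_id,)).fetchall()
    return f"<{value}>" + "".join(rebuild(k) for (k,) in kids) + f"</{value}>"


source = ("<element>This is #PCDATA<mixed>with mixed content</mixed>"
          "and an element mixed.</element>")
root_id = shred(ET.fromstring(source))
print(rebuild(root_id) == source)  # True
```

In this sketch the document itself keeps its mixed content; only the
storage layer sees the text rows.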

So issue (3) for the relational database vendor/implementer:

Is it really necessary, as the examples given suggest,
to eliminate mixed content from XML documents in order to treat
them as "objects" in relational databases?

Sam Hunting