Mark Birkbeck writes:
> There are a number of simple ways of treating 'mixed content'. If we
> Living in the
> must be great y'all
This is going to cause problems if you ever want to get the data
back out again and use it for typesetting, because you'll get
" Living in the States must be a great y'all " , said ..."
^ ^ ^
Those intrusive spaces are what most browsers will do with the record ends.
> <PCDATA>Living in the</PCDATA>
> <PCDATA>must be great y'all</PCDATA>
That's much better, except that you probably don't even want the comma
now, because it can be inferred from the rules of English grammar,
which say that reported speech in quotes followed by the verb
expressing who uttered it is delimited with one (hunt the referent :-).
> <COUNTRY ISO="US" PRE="Living in the" POST="must
> be great y'all">States</COUNTRY>
Gag, Mark. But it resolves to the same stream.
> The first solution feels more 'correct';
...apart from the record end problem, which is inherent to mixed
content: for correct rendering it must be encoded as
<TEXT><QUOTE>Living in the <COUNTRY
ISO="US">States</COUNTRY> must be great
avoiding the intrusive line-ends.
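
The record-end effect is easy to reproduce. The sketch below uses
Python's standard xml.etree.ElementTree. The one-record-per-line layout,
the closing </QUOTE></TEXT> tags, and the renderer that turns every line
end into a space are assumptions made for illustration; they are not
taken from the original markup or from any particular browser.

```python
import xml.etree.ElementTree as ET

# Assumed layout: one record (line) per piece of content, as a
# line-oriented editor or export might write it.
RECORD_PER_LINE = (
    "<TEXT><QUOTE>\n"
    "Living in the\n"
    '<COUNTRY ISO="US">States</COUNTRY>\n'
    "must be great y'all\n"
    "</QUOTE></TEXT>"
)

# The inline form, with real spaces around the COUNTRY element.
INLINE = (
    '<TEXT><QUOTE>Living in the <COUNTRY ISO="US">States</COUNTRY>'
    " must be great y'all</QUOTE></TEXT>"
)


def rendered_text(source):
    """Concatenate all character data, turning each line end into a space.

    This is a hypothetical stand-in for a renderer that treats a
    record end as an inter-word space.
    """
    root = ET.fromstring(source)
    return "".join(root.itertext()).replace("\n", " ")


print(repr(rendered_text(RECORD_PER_LINE)))
print(repr(rendered_text(INLINE)))
```

The first result carries a space just inside each end of the quoted
text, because the parser hands the line ends through as character data.
The second has none.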
> I haven't delved far enough
> into the XML definition but it may even be 'implied' by the definition,
> since untagged data is PCDATA.
I don't know where that idea comes from. All data must be enclosed in
some element: there is no such thing as "untagged" data.
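
A quick way to check this is to hand a parser some text that sits
outside any element. The sketch below uses Python's standard
xml.etree.ElementTree; the helper name is_well_formed and the sample
strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(source):
    """Return True if the parser accepts the string as an XML document."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# Character data inside the document element is accepted.
print(is_well_formed("<TEXT>Living in the States</TEXT>"))

# Character data before or after the document element is rejected.
print(is_well_formed("Living in the <COUNTRY>States</COUNTRY>"))
print(is_well_formed("<COUNTRY>States</COUNTRY> must be great"))
```

Only the first string parses. Text inside an element whose content
model allows it is #PCDATA, but it is still enclosed in that element.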
> The second, however, is slightly easier
> to implement in a user interface, and given that's where most of the
> problems lie, that's what we've done for now!
This is the approach the EuroMath DTD takes: there is no mixed
content. It's superficially attractive but more cumbersome to process,
and makes for a more complex DTD, as the element which holds the
undistinguished text usually has to occur at many levels.
> How do you do a join on XML data that looks like this:
> <element>This is #PCDATA<mixed>with mixed content</mixed>and an
The word "join" is out of context: it belongs in the vocabulary of
database engineering, and XML is about text markup. In any event, the
markup above is wrong and dangerously misleading, since it parses to
This is #PCDATAwith mixed contentand an element mixed.
which I am sure is not what the author intended. It ought to read
<element>This is #PCDATA <mixed>with mixed content</mixed> and an
(see the difference?).
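
The two versions can be compared by pulling the character data back out
with a parser. This is a minimal sketch using Python's standard
xml.etree.ElementTree; the closing </element> tag and the helper name
text_stream are assumptions added so that the fragments parse.

```python
import xml.etree.ElementTree as ET


def text_stream(source):
    """Return the character data of a document with all markup removed."""
    return "".join(ET.fromstring(source).itertext())


# No spaces next to the <mixed> tags: the words run together.
TIGHT = (
    "<element>This is #PCDATA<mixed>with mixed content</mixed>"
    "and an element mixed.</element>"
)

# Spaces kept in the character data on either side of <mixed>.
SPACED = (
    "<element>This is #PCDATA <mixed>with mixed content</mixed>"
    " and an element mixed.</element>"
)

print(text_stream(TIGHT))
print(text_stream(SPACED))
```

The tags contribute nothing to the text stream, so any space between
words has to be present in the character data itself.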