Peter Flynn wrote:

> Mark Birbeck writes:
>
> >  There are a number of simple ways of treating 'mixed content'. If we
> >  have:
> >
> >           <TEXT>
> >                   <QUOTE>
> >                           Living in the
> >                           <COUNTRY ISO="US">States</COUNTRY>
> >                           must be great y'all
> >                   </QUOTE>
> >                   , said
>
> This is going to cause problems if you ever want to get the data
> back out again and use it for typesetting, because you'll get
>
>   " Living in the States must be great y'all " , said ...
>    ^                                        ^ ^
> Those intrusive spaces are what most browsers will do with the record
> ends.

I was laying it out for legibility - but if we want to be pedantic I
thought it would actually be passed as:

                            Living in the
                            States
                            must be great y'all
                     , said

since white-space inside elements that do not have type 'element
content' is meant to be preserved, is it not?
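
A minimal sketch of that point, assuming Python's standard
xml.etree.ElementTree as a stand-in for whatever parser is in use (the
choice of parser is an assumption, not something from this thread). It
only shows that the line ends and indentation reach the application
untouched:

```python
import xml.etree.ElementTree as ET

# The example as laid out for legibility, line ends and indentation included.
SOURCE = """<TEXT>
    <QUOTE>
        Living in the
        <COUNTRY ISO="US">States</COUNTRY>
        must be great y'all
    </QUOTE>
    , said
</TEXT>"""

root = ET.fromstring(SOURCE)

# itertext() yields every run of character data in document order,
# as the parser delivered it, with no whitespace folding.
passed_through = "".join(root.itertext())

print(repr(passed_through))
```

Any folding of those line ends into single spaces would then be the
business of whatever renders the text, not of the parser.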

>           <TEXT>
>                   <QUOTE>
>                           <PCDATA>Living in the</PCDATA>
>                           <COUNTRY ISO="US">States</COUNTRY>
>                           <PCDATA>must be great y'all</PCDATA>
>                   </QUOTE>
>                   <PCDATA>, said</PCDATA>
>
> That's much better,

Thanks.

> except that you probably don't even want the comma
> now, because it can be inferred from the rules of English grammar,
> which say that reported speech in quotes followed by the verb
> expressing who uttered it is delimited with one (hunt the referent :-).

Mmm. I take your point, but feel a little uneasy about removing things
from the original text. I did it in my example with the quotes because I
thought that everyone might latch onto that instead of the wider point I
was trying to make. But now we have lost the original text. What of
novels such as Trainspotting by Irvine Welsh, or Paddy Clarke Ha Ha Ha,
which make use of layout devices such as:

        - Eat your dinner, said mum.
        - No I don't want it, I said.
        - Eat it now shouted dad.

Now if we mark that up with a <QUOTE> tag, and then remove the comma in
the first line, as per your rule, then when we render it back again, we
get:

        - "Eat your dinner", said mum.
        - "No I don't want it", I said.
        - "Eat it now", shouted dad.

Putting aside the speech marks, which is less of an issue, a comma has
been introduced in the third line that the author never wrote!
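
The same round trip as a rough sketch. The (quote, attribution) pairs
and the render() function are hypothetical, made up only to show what
happens when the delimiting comma is dropped on the way in and inferred
again on the way out:

```python
# What the author wrote.
ORIGINAL = [
    "- Eat your dinner, said mum.",
    "- No I don't want it, I said.",
    "- Eat it now shouted dad.",
]

# Hypothetical stored form: the quoted speech and what follows it,
# with the delimiting comma removed as "inferable".
MARKED_UP = [
    ("Eat your dinner", "said mum."),
    ("No I don't want it", "I said."),
    ("Eat it now", "shouted dad."),
]

def render(quote: str, attribution: str) -> str:
    """Re-insert speech marks and the comma the grammar rule infers."""
    return f'- "{quote}", {attribution}'

rendered = [render(quote, attribution) for quote, attribution in MARKED_UP]

# Ignore the speech marks and compare with what the author wrote.
for original, line in zip(ORIGINAL, rendered):
    status = "same" if line.replace('"', "") == original else "CHANGED"
    print(status, line)
```

The first two lines survive the trip once the speech marks are set
aside; the third does not.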

>   The first solution feels more 'correct';
>
> ...apart from the record end problem, which is inherent to mixed
> content: for correct rendering it must be encoded as
>
>            <TEXT><QUOTE>Living in the <COUNTRY
>            ISO="US">States</COUNTRY> must be great
>            y'all</QUOTE>, said
>
> avoiding the intrusive line-ends.

As I said before, isn't all white-space passed through in mixed content?
What you are driving at is something slightly different which is that
most browsers fold a lot of white-space into one space, and so if you
have a line break in the middle of a sentence you end up with an extra
space character. You haven't solved the problem with your layout,
because you have got round this problem by using a line-break to act as
a space. Surely not on, because you have lost some of the original data!
To illustrate, if you had this:

        Let's point to my name ->Mark for want of something better.

But stored it as:

        <TEXT>Let's point to my name ->
        <NAME>Mark</NAME> for want of something better.
        </TEXT>

You would get:

        Let's point to my name -> Mark for want of something better.
                                 ^
                                 |
* extra space --------------------


(sorry if you're not using a fixed width font!). In other words you have
not solved the problem that you thought you had. As it happens I don't
think it's the job of XML to worry about browser problems. XSL could
deal with it though.
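
To make the folding concrete, here is a small sketch. It assumes that
a browser-style renderer collapses each run of white-space to a single
space; the fold_whitespace() helper is hypothetical and stands in for
that behaviour, and Python's xml.etree.ElementTree is used only as a
convenient parser:

```python
import re
import xml.etree.ElementTree as ET

# The stored form from above, with a line break where the original had nothing.
STORED = """<TEXT>Let's point to my name ->
        <NAME>Mark</NAME> for want of something better.
        </TEXT>"""

def fold_whitespace(text: str) -> str:
    """Collapse each run of white-space to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# The parser hands over the line break and indentation unchanged.
character_data = "".join(ET.fromstring(STORED).itertext())

# The renderer's folding turns them into one space before "Mark".
rendered = fold_whitespace(character_data)
print(rendered)
```

The space before "Mark" comes from the line break in the stored form,
not from the original text.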

>   I haven't delved far enough
>   into the XML definition but it may even be 'implied' by the definition,
>   since untagged data is PCDATA.
>
> I don't know where that idea comes from. All data must be enclosed in
> some element: there is no such thing as "untagged" data.

Fair comment. I was trying to say that if you had a mixed-content
element, at the level of implementation in our database it was
equivalent to:

        <QUOTE>
            <PCDATA>Living in the</PCDATA>
            <COUNTRY ISO="US">States</COUNTRY>
            <PCDATA>must be great y'all</PCDATA>
        </QUOTE>

and that the introduction of an implied child of type PCDATA may not be
so far from the XML definition.
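
A minimal sketch of that equivalence, not our actual implementation:
with_explicit_pcdata() is a hypothetical helper, and Python's
xml.etree.ElementTree is assumed purely for illustration. It wraps each
text run of a mixed-content element in an explicit PCDATA child and
leaves elements with no child elements alone:

```python
import xml.etree.ElementTree as ET

def with_explicit_pcdata(element: ET.Element) -> ET.Element:
    """Return a copy of `element` with its text runs wrapped in PCDATA children."""
    copy = ET.Element(element.tag, dict(element.attrib))

    # An element with no child elements keeps its text directly.
    if len(element) == 0:
        copy.text = element.text
        return copy

    def add_run(text):
        # Only non-empty runs become PCDATA children.
        if text:
            ET.SubElement(copy, "PCDATA").text = text

    add_run(element.text)                # text before the first child
    for child in element:
        copy.append(with_explicit_pcdata(child))
        add_run(child.tail)              # text following this child
    return copy

quote = ET.fromstring(
    '<QUOTE>Living in the <COUNTRY ISO="US">States</COUNTRY>'
    " must be great y'all</QUOTE>"
)
explicit = with_explicit_pcdata(quote)
print(ET.tostring(explicit, encoding="unicode"))
```

The character data is unchanged by the transformation; only the
implied children have been made visible.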

>>   The second, however, is slightly easier
>>   to implement in a user interface, and given that's where most of the
>>   problems lie, that's what we've done for now!
>
>This is the approach the EuroMath DTD takes: there is no mixed
>content. It's superficially attractive but more cumbersome to process,
>and makes for a more complex DTD, as the element which holds the
>undistinguished text usually has to occur at many levels.

I'm not talking about changing the DTDs. I leave those intact. I was
merely using XML-style notation to illustrate how we store data in the
database (or our object-reflection). Each object in our database has by
default a pre- and post-text attribute, that is part of our system, not
the DTD. It's a kludge, so I don't want it cluttering up other stuff!
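
Very roughly, and only as an illustration: the class and field names
below are hypothetical, not our real schema, and which object owns a
given run of text is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StoredObject:
    """One element as held in the database (hypothetical layout)."""
    tag: str
    content: str = ""      # the element's own character data
    pre_text: str = ""     # text that preceded the element in the source
    post_text: str = ""    # text that followed the element in the source
    children: List["StoredObject"] = field(default_factory=list)

def original_text(obj: StoredObject) -> str:
    """Rebuild the character data from the pre-/post-text attributes."""
    inner = obj.content + "".join(original_text(child) for child in obj.children)
    return obj.pre_text + inner + obj.post_text

country = StoredObject(
    "COUNTRY",
    content="States",
    pre_text="Living in the ",
    post_text=" must be great y'all",
)
quote = StoredObject("QUOTE", children=[country], post_text=", said")

print(original_text(quote))
```

The DTD never sees pre_text or post_text; they exist only in the
stored objects.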

Regards,

Mark


Mark Birbeck
Managing Director
Intra Extra Digital Ltd.
39 Whitfield Street
London
W1P 5RE
w: http://www.iedigital.net/
t: 0171 681 4135
e: [log in to unmask]