
HTML-WG Archives


Subject: Re: Charsets: Problem statement/requirements?
From: Gavin Nicol
Date: Thu, 9 Feb 95 09:32:30 EST
Content-Type: text/plain



Bob Jung writes:

>I believe these should be treated as a byte value in the
>"charset=something-other-than-latin1" encoding.  If the content
>developer wants to specify a multibyte character, use something like:
>
>                &#nnn&#nnn

This is quite incorrect. SGML knows nothing at all about bytes.

Joe English asks:

>How should numeric character references (&#nnn;) be interpreted in
>text/html; charset=something-other-than-latin1 ? 

Well, if you have a copy of Goldfarb handy, have a look at page
161, section 4.5.2, where it says:

<quote>
If the function is wanted, a "named character reference"
incorporating the function name is used; otherwise a numeric character
reference is used, and the character is treated as data.
</quote>

and at page 357, section 9.5 note #2 which says:

<quote>
When a document is translated to a different document character set,
the character number of each numeric character reference must be
changed to the corresponding character number of the new set.
</quote>

Bottom line: the numeric character references must be mapped onto the
corresponding characters in the new document character set <emph>if
the document is being translated from one character set to
another</emph>.
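
As a rough illustration of the note in section 9.5, here is a minimal
Python sketch of that renumbering. It assumes single-byte character
sets, where the character number equals the byte value, and it borrows
Python's codec tables as the character set definitions. The name
translate_ncrs is made up for the example.

```python
import re

def translate_ncrs(text, src, dst):
    """Renumber every &#nnn; reference from character set src to dst.

    ASSUMPTION: src and dst are single-byte character sets, so the
    character number of a character is simply its byte value.
    """
    def renumber(match):
        old_number = int(match.group(1))
        # Character number in the old set -> the character itself.
        char = bytes([old_number]).decode(src)
        # The character itself -> its number in the new set.
        new_number = char.encode(dst)[0]
        return "&#%d;" % new_number

    return re.sub(r"&#(\d+);", renumber, text)

# e-acute is number 233 in Latin-1 and number 130 in code page 437.
print(translate_ncrs("caf&#233;", "latin-1", "cp437"))
```

If a character has no counterpart in the new set, the encode step
raises UnicodeEncodeError, so the sketch fails loudly rather than
guessing at a substitute.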

>What if the MIME charset= parameter specifies a multibyte encoding?

The <emph>encoding</emph> has nothing to do with it because the parser
is not concerned at all with that: it only knows about characters, and
uses the specified character set to map codes to them.
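
To make the split between encoding and character set concrete, here is
a small Python sketch. It assumes, purely for illustration, that the
document character set is Unicode, so that chr() can stand in for the
character set lookup. The name resolve_refs is hypothetical.

```python
import re

def resolve_refs(raw, encoding):
    # Encoding layer: turn the byte stream into characters. This is the
    # only place where bytes (single or multiple) are visible.
    text = raw.decode(encoding)
    # Parser layer: resolve &#nnn; as a character number.
    # ASSUMPTION: the document character set is Unicode, so chr() is
    # the mapping from character number to character.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

doc = "\u65e5\u672c&#35486;"   # two kanji, then a reference to U+8A9E
for encoding in ("shift_jis", "euc_jp", "utf-8"):
    # The bytes on the wire differ, the parsed characters do not.
    print(encoding, resolve_refs(doc.encode(encoding), encoding))
```

The same reference number yields the same character under all three
encodings, because the reference is resolved only after the bytes have
become characters.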

>Will this break the "Added Latin 1 for HTML" entity set, which uses
>numeric character references to define all the entities?

Probably, because the MIME charset=xxxx is specifying the document
character set, and the numeric character references will be resolved
using it (because we are not performing a translation). However, I
would assume that the HTML parser would be smart enough to map the
entities onto something reasonable. That is the benefit of using named
character references: changes are isolated to one spot, and in
something like HTML, where the parsers are hard coded, one could also
hard code the mappings for all supported character sets.

This assumes that the characters are available in the character set...
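
A hard-coded mapping of that kind might look roughly like the Python
sketch below. The table ENTITY_CHARS and the function entity_number are
invented for the example, only three entities are listed, and only
single-byte character sets are handled.

```python
# Hypothetical hard-coded table: entity name -> the character it means.
ENTITY_CHARS = {
    "eacute": "\u00e9",   # e with acute accent
    "Auml": "\u00c4",     # A with diaeresis
    "szlig": "\u00df",    # sharp s
}

def entity_number(name, charset):
    """Return the character number of a named entity in charset, or
    None when the character is not available in that character set.

    ASSUMPTION: charset is a single-byte character set known to Python.
    """
    char = ENTITY_CHARS[name]
    try:
        encoded = char.encode(charset)
    except UnicodeEncodeError:
        return None   # the character does not exist in this set
    return encoded[0] if len(encoded) == 1 else None

print(entity_number("eacute", "latin-1"))      # Latin-1
print(entity_number("eacute", "cp437"))        # IBM PC code page 437
print(entity_number("eacute", "iso-8859-7"))   # Latin/Greek, no e-acute
```

The None case is the caveat above: the parser can only remap an entity
when the target character set actually contains the character.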


