HTML-WG Archives

Subject: Re: Comments on: "Character Set" Considered Harmful
From: [log in to unmask] (Bob Jung)
Reply-To: [log in to unmask]
Date: Thu, 20 Apr 95 13:32:35 EDT

I second Amanda's appeal to address the pressing pragmatic issues at hand,
especially the labelling issue.

At 10:31 AM 4/17/95, Amanda Walker wrote:
>[This conversation is getting oddly neo-Platonic for an IETF working group :)]
>I am, rather, concerned with a small set
>of pressing pragmatic issues.  Principal among them is simply being able to
>determine unambiguously what characters are being represented in an HTML
>document so that I can display them.  This is mostly a labelling issue,
>The status quo in this regard is broken.  As anyone who has tried to implement
>Japanese support in their browser can confirm, there is a lot of content out
>there whose interpretation cannot be determined unambiguously by software.
>This is bad.

Yes!  Labelling is something we need to resolve ASAP.  In Netscape's upcoming
releases, our HTTP server can add the MIME charset parameter to the
HTTP Content-Type header:

        Content-Type: text/html; charset=iso-2022-jp

and our client will parse the charset parameter and do the corresponding
code conversions and font selections.  From the discussions in the http-wg,
this seems to be the direction the HTTP spec is heading.
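
To make the client side concrete, here is a minimal Python sketch of reading
the charset parameter and decoding the body with it.  It is not Netscape's
code; the function names are made up for illustration, and the fallback to
iso-8859-1 for unlabelled documents is my assumption.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, lowercased.

    The default for unlabelled documents is an assumption of this sketch.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() unquotes and lowercases the parameter,
    # and returns failobj when no charset parameter is present.
    return msg.get_content_charset(failobj=default)


def decode_body(body, header_value):
    """Convert the transport bytes to text using the labelled charset."""
    return body.decode(charset_from_content_type(header_value))


if __name__ == "__main__":
    header = "text/html; charset=iso-2022-jp"
    body = "<p>日本語</p>".encode("iso-2022-jp")
    print(charset_from_content_type(header))
    print(decode_body(body, header))
```

Font selection would then key off the same charset value; that part is left
out here.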

While this helps content providers get their documents rendered correctly,
we do not see it as a total solution.  We also need a way to label the
encoding within HTML itself, so that documents are self-labeling and content
developers can add this information easily.

>To give a concrete example, the Macintosh on which I am typing this message
>can handle multilingual text just fine.  At the moment, it has fonts & input
>methods installed for European, Russian, Hebrew, Arabic, and Japanese.

As some of you are probably aware, Mac files tag their data using a notion of
string runs.  Each run can be associated with style info such as the font
used.  We could consider a similar concept for HTML to solve the labelling
problem.  I'm open to discussion.
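
As a rough illustration of the string-run idea applied to encodings, the
sketch below tags byte ranges with a charset and decodes each run separately.
This is not the Mac file format and not a proposal for HTML syntax; the Run
type, its fields, and decode_runs are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One tagged span of the data (hypothetical structure)."""
    start: int    # byte offset where the run begins
    length: int   # number of bytes in the run
    charset: str  # encoding label that applies to this run


def decode_runs(data: bytes, runs: list[Run]) -> str:
    """Decode each run with its own charset and join the results."""
    return "".join(
        data[r.start:r.start + r.length].decode(r.charset) for r in runs
    )


if __name__ == "__main__":
    latin = "Hello ".encode("ascii")
    kanji = "日本語".encode("shift_jis")
    runs = [
        Run(0, len(latin), "ascii"),
        Run(len(latin), len(kanji), "shift_jis"),
    ]
    print(decode_runs(latin + kanji, runs))
```

The same run could carry font or style info alongside the charset, as the
Mac runs do.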

>There are HTML documents in existence that contain content in one or more
>of these.

Almost all HTML I've seen has been in a single encoding.

>All I want right now is some method for determining how to match them up.  So
>far, what we do is cheat.  ISO 2022 is easy to automatically detect even in
>mislabeled text, and is reasonably popular, so we've started with Japanese.
>There's only so far we can go with clever inferences, though.

And none of these clever techniques is 100% deterministic...
And unfortunately, more and more Japanese Web data is in SJIS, which has no
escape sequences to key on...
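
For what it's worth, the ISO 2022 inference Amanda mentions can be sketched
in a few lines of Python: look for the escape sequences that ISO-2022-JP uses
to switch character sets.  The function name is made up, and this is a
heuristic under that one assumption, not a reliable detector.

```python
# Escape sequences that designate the Japanese sets in ISO-2022-JP:
#   ESC $ @  JIS C 6226-1978
#   ESC $ B  JIS X 0208-1983
#   ESC ( J  JIS X 0201 Roman
ISO_2022_JP_ESCAPES = (b"\x1b$@", b"\x1b$B", b"\x1b(J")


def looks_like_iso_2022_jp(data: bytes) -> bool:
    """Heuristic: True if any ISO-2022-JP designator appears in the data."""
    return any(esc in data for esc in ISO_2022_JP_ESCAPES)


if __name__ == "__main__":
    print(looks_like_iso_2022_jp("日本語".encode("iso-2022-jp")))
    print(looks_like_iso_2022_jp("日本語".encode("shift_jis")))
```

An SJIS document gives this check nothing to find, so it falls through to
whatever default the client assumes.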

>I don't mind translating between the transport representation and IS 10646, so
>that the SGML layer only sees a sequence of IS 10646 code points.  That's
>simple.  What I do mind is endless discussion about the distinctions between
>characters, glyphs, codes, and the essential nature of reality, even though in
>other contexts I may care greatly about such issues.  They simply do not
>address the issue at hand (which Gavin's proposal does, as I see it).
>I'm not trying to squelch anyone, I just think we're getting a bit far afield.
>Amanda Walker
>InterCon Systems Corporation


Bob Jung        [log in to unmask]       +1 415 528-2688, fax +1 415 528-4122
Netscape Communications Corp.   501 E. Middlefield      Mtn View, CA   94041
