At 2004-05-31 16:58 -0700, Juan Chu Chow wrote:
>The code-points I get seem a bit weird, they are:
>0x65E5 0x672C 0x8A9E.
>I checked the first 3 ideographs with the "CJK Unified Ideographs
>Range: 4E00-9FAF" chart at: http://www.unicode.org/charts/PDF/U4E00.pdf,
>and they do match the ideographs I expect,
Then they shouldn't be considered weird.
>though I'm not sure why I get the 0xFF11 in place of the number "1".
Looking at the Unicode Character Database, it does make sense: U+FF11 is
the fullwidth form of the digit "1", a <wide> compatibility variant of U+0031:
Z:\data\docs\Unicode>grep FF11 UnicodeData.txt
FF11;FULLWIDTH DIGIT ONE;Nd;0;EN;<wide> 0031;1;1;1;N;;;;;
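The same lookup can be sketched without a local copy of UnicodeData.txt, assuming Python's standard unicodedata module is available. The helper name describe is only illustrative:

```python
import unicodedata

def describe(code_point: int) -> dict:
    """Return a few Unicode Character Database properties for one code point."""
    ch = chr(code_point)
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),            # general category
        "decomposition": unicodedata.decomposition(ch),  # compatibility mapping
        "nfkc": unicodedata.normalize("NFKC", ch),       # compatibility-folded form
    }

info = describe(0xFF11)
print(info)
```

NFKC normalization folds the fullwidth digit to the ASCII "1", which may or may not be what you want for your data.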
>something strange maybe occurring in the transcoding. I'm using ICU to
>go from Shift-JIS to UTF-16.
I'm unfamiliar with that tool ... but have you tried round-tripping your
data to see if you get back what you started with?
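A round-trip check along these lines might look like the sketch below. It uses Python's built-in shift_jis codec rather than ICU (an assumption; ICU's conversion tables may differ in edge cases), and the sample string is hypothetical, built from the code points mentioned in this thread:

```python
def round_trips(sjis_bytes: bytes) -> bool:
    """Shift-JIS -> UTF-16 -> Shift-JIS, then compare with the original bytes."""
    text = sjis_bytes.decode("shift_jis")
    utf16 = text.encode("utf-16-be")
    back = utf16.decode("utf-16-be").encode("shift_jis")
    return back == sjis_bytes

# Hypothetical sample: the three ideographs plus a fullwidth digit one.
sample_text = "\u65e5\u672c\u8a9e\uff11"
sample = sample_text.encode("shift_jis")

print(round_trips(sample))
print([hex(ord(c)) for c in sample.decode("shift_jis")])
# An ASCII "1" in Shift-JIS is a single byte and decodes to U+0031, not U+FF11.
print(hex(ord(b"1".decode("shift_jis"))))
```

If the source file holds a double-byte fullwidth "1" rather than an ASCII "1", getting U+FF11 back is the expected result of a faithful conversion.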
>The files I attempted to attached earlier can now be retrieved at:
Thank you ... my Python implementation is reporting a UTF-16 encoding error
(though not telling me where) ... but Windows command window and Notepad
and Word are not giving me an error and I am seeing ideographic glyphs
(though I have no idea if they are correct or not). I can reproduce your
error with Internet Explorer.
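To find out where a UTF-16 decoding error sits, one option is to decode the raw bytes directly and inspect the exception. This is only a sketch using Python's own codec, not the XML tool mentioned above, and find_utf16_error is a hypothetical helper name:

```python
def find_utf16_error(data: bytes, encoding: str = "utf-16-be"):
    """Return (start, end, reason) for the first decoding error, or None if clean."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start, err.end, err.reason
    return None

good = "\u65e5\uff11".encode("utf-16-be")
bad = b"\x65\xe5\x11"  # second character cut down to a single byte

print(find_utf16_error(good))
print(find_utf16_error(bad))
```

The byte offset in the result can then be compared against a hex dump of the file.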
Looking at your characters 0x65E5 0x672C 0x8A9E:
(1) XML 1.0 (3rd Edition) covers all these in production [86] Ideographic,
which is referenced in [84] Letter, which is allowed in [5] Name and [4]
NameChar as characters of a name
(2) XML 1.1 covers all these in production [4] NameStartChar, which is
allowed in [5] Name and [4a] NameChar as characters of a name
Looking at the character 0xFF11:
(1) it is *not* a letter (or any other name character) in XML 1.0
(2) it *is* a name start character in XML 1.1
So, my read of the Recommendations is that it is not well-formed for XML
1.0 but is well-formed for XML 1.1.
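The range checks behind that reading can be sketched as follows. The XML 1.1 table is the NameStartChar production; for XML 1.0 only the Ideographic production is included, and the much longer BaseChar list is left out for brevity, so the XML 1.0 test here is partial by design. The names in_ranges, XML10_IDEOGRAPHIC and XML11_NAME_START_CHAR are illustrative:

```python
# XML 1.0 (3rd Edition) production [86] Ideographic.
XML10_IDEOGRAPHIC = [(0x4E00, 0x9FA5), (0x3007, 0x3007), (0x3021, 0x3029)]

# XML 1.1 production [4] NameStartChar.
XML11_NAME_START_CHAR = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

def in_ranges(code_point: int, ranges) -> bool:
    """True if code_point falls inside any inclusive (low, high) range."""
    return any(low <= code_point <= high for low, high in ranges)

for cp in (0x65E5, 0x672C, 0x8A9E, 0xFF11):
    print(hex(cp),
          "XML 1.0 Ideographic:", in_ranges(cp, XML10_IDEOGRAPHIC),
          "XML 1.1 NameStartChar:", in_ranges(cp, XML11_NAME_START_CHAR))
```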
I downloaded Xerces-J 2.6.2 that supports XML 1.1 but that is reporting
"ByteToCharUnicode" errors in the file.
So ... I removed all occurrences of the 0xff11 character and it worked just
fine.
That leads me to suspect that the UTF-16 encoding for 0xff11 is two bytes,
not just one, and having it as only one is causing the error ... so it
could be your JIS->UTF-16 conversion tool that is at fault. I've just
spent a long time trying to find the UTF-16 representation algorithm to
find out what the UTF-16 expression for 0xff11 is, but I cannot find it and
I've run out of time this evening to work on this.
Does anyone know where to find the UTF-16 algorithm for encoding Unicode
characters? I cannot find it in the http://www.unicode.org technical reports.
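For reference, the UTF-16 encoding form is described in the Unicode Standard (chapter 3, on encoding forms) and in RFC 2781. A minimal sketch of it, with hypothetical helper names utf16_code_units and utf16_bytes, could be:

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit code unit(s) that represent one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % code_point)
    if code_point < 0x10000:
        # Basic Multilingual Plane: the code unit is the code point itself.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]

def utf16_bytes(code_point: int, big_endian: bool = True) -> bytes:
    """Serialize the code units as bytes; every code unit takes two bytes."""
    order = "big" if big_endian else "little"
    return b"".join(u.to_bytes(2, order) for u in utf16_code_units(code_point))

print(utf16_bytes(0xFF11).hex())                    # big-endian
print(utf16_bytes(0xFF11, big_endian=False).hex())  # little-endian
```

Under this scheme 0xff11 is a single code unit serialized as two bytes (FF 11 big-endian, 11 FF little-endian), so a file in which it occupies only one byte would not be valid UTF-16.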
If you can get your round-tripping to work with your conversion tool, you
will probably solve the problem. I would be curious to hear what you
accomplish with that.
I hope this helps.
G. Ken Holman
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/l/
Box 266, Kars, Ontario CANADA K0A-2E0 +1(613)489-0999 (F:-0995)