Mail Archives: cygwin/2009/04/03/14:21:16

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Fri, 3 Apr 2009 21:20:48 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Support for CJK Character Sets
Message-ID: <20090403192048.GC852@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <20090403173212 DOT 51916 DOT qmail AT web4102 DOT mail DOT ogk DOT yahoo DOT co DOT jp>
MIME-Version: 1.0
In-Reply-To: <20090403173212.51916.qmail@web4102.mail.ogk.yahoo.co.jp>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Apr  4 02:32, neomjp wrote:
> I used Corinna's tiny program
> (http://sourceware.org/ml/cygwin/2009-04/msg00053.html )
> to create a file with a name containing a CJK character and tested
> how setting LANG works.
> 
> I changed 0x20ac to 0x4e00 (<CJK Ideograph, First>). This is one of the
> characters used in all three languages. It is 0xe4 0xb8 0x80 in
> hexadecimal UTF-8. So, without setting LANG, the file name should look
> like "qq\016\344\270\200".
> [...]
> But it failed for JIS/ISO-2022-JP and eucJP. (It was represented as
> ASCII SO(0x0e)/UTF-8 sequence).
> 
> What is going wrong here? What makes the file name conversion from
> UTF-16 to these character sets fail? Or, what am I doing wrong?
> [...]
> LANG=en_US.ISO-2022-JP
> 0000000  71  71  0e  e4  b8  80  0a
>           q   q  so   d   8 nul  nl
> 0000007
> This must be identical to:
> 0000000  71  71  1b  24  42  30  6c  1b  28  42  0a
>           q   q esc   $   B   0   l esc   (   B  nl
> 0000013

Esc?  Uh oh.  Is that really correct?

[...time passes reading http://en.wikipedia.org/wiki/ISO_2022...]

Oh well, this will not work right now.  I hadn't looked into this
before, and I actually thought that JIS was a double-byte charset.
It switches character sets with escape sequences instead, so its
properties rule out the handcrafted double-byte charset function I
created for Cygwin.
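
To illustrate the byte sequences discussed above, here is a minimal
Python sketch. It is not Cygwin code, and the codec names (iso2022_jp,
euc_jp) are Python's, used here only as a stand-in for the conversion
the file name should go through. It encodes the test name with U+4E00
and shows that the same two bytes mean different things depending on
the preceding escape sequence.

```python
# The test file name from the thread: "qq", U+4E00, newline.
name = "qq\u4e00\n"

utf8 = name.encode("utf-8")
jis = name.encode("iso2022_jp")   # Python's ISO-2022-JP codec
euc = name.encode("euc_jp")       # Python's EUC-JP codec

print(utf8.hex(" "))  # 71 71 e4 b8 80 0a
print(jis.hex(" "))   # 71 71 1b 24 42 30 6c 1b 28 42 0a
print(euc.hex(" "))   # 71 71 b0 ec 0a

# ISO-2022-JP is stateful: the bytes 30 6c are "0l" in ASCII mode,
# but U+4E00 once ESC $ B has switched to JIS X 0208.
plain = b"0l".decode("iso2022_jp")
shifted = b"\x1b$B0l\x1b(B".decode("iso2022_jp")
print(plain, shifted == "\u4e00")  # 0l True
```

A converter that only checks whether a byte is a lead byte cannot
handle this, because the meaning of each byte depends on the state set
by the last escape sequence.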

> LANG=en_US.eucJP
> 0000000  71  71  0e  e4  b8  80  0a
>           q   q  so   d   8 nul  nl
> 0000007
> This must be identical to:
> 0000000  71  71  b0  ec  0a
>           q   q   0   l  nl
> 0000005

Same here, since eucJP characters can apparently be up to three bytes
long.  I will have to either rework the doublebyte function or create
a special multibyte function for these charsets.
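
The varying character lengths in eucJP can be seen with a short Python
sketch. This again uses Python's euc_jp codec as an assumption, not
anything from Cygwin, and the helper name euc_jp_lengths is made up
for the example. The sample characters are an ASCII letter, a
half-width katakana, U+4E00 from the test, and U+4E02, which is
assumed here to be covered only by JIS X 0212.

```python
def euc_jp_lengths(text):
    """Map each character to the number of bytes it needs in EUC-JP."""
    return {ch: len(ch.encode("euc_jp")) for ch in text}

# ASCII 'A', half-width katakana U+FF71, U+4E00, and U+4E02.
samples = "A\uff71\u4e00\u4e02"
lengths = euc_jp_lengths(samples)

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")

# JIS X 0212 characters are introduced by the single-shift byte 0x8f,
# which is what makes them three bytes long.
three = "\u4e02".encode("euc_jp")
print(three.hex(" "))
```

A conversion routine for eucJP therefore has to accept one-, two-, and
three-byte sequences rather than assuming at most two bytes.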

Thanks for the test.  I will look into that in the next couple of days.
Stay tuned.


Corinna


P.S.: I'm not familiar with the Japanese charsets and codepages used on
Windows.  http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
lists all supported codepages/charsets.  If you look for
the codepages 50220-50222, you'll see they are all called ISO 2022
Japanese.  In Cygwin I'm using 50220 for JIS.  Is that correct?
Or should I rather use 50221 or 50222?

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

