Date: Fri, 3 Apr 2009 21:20:48 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Support for CJK Character Sets
Message-ID: <20090403192048.GC852@calimero.vinschen.de>
In-Reply-To: <20090403173212.51916.qmail@web4102.mail.ogk.yahoo.co.jp>

On Apr 4 02:32, neomjp wrote:
> I used this Corinna's tiny program
> (http://sourceware.org/ml/cygwin/2009-04/msg00053.html )
> to create a file with a name containing a CJK character and tested
> how setting LANG works.
>
> I changed 0x20ac to 0x4e00 (一). This is one of the
> characters used in all three languages. It is 0xe4 0xb8 0x80 in
> hexadecimal UTF-8. So, without setting LANG, the file name should look
> like "qq\016\344\270\200".
> [...]
> But it failed for JIS/ISO-2022-JP and eucJP. (It was represented as
> ASCII SO(0x0e)/UTF-8 sequence).
>
> What is going wrong here? What makes the file name conversion from
> UTF-16 to these character sets to fail? Or, what am I doing wrong?
> [...]
> LANG=en_US.ISO-2022-JP
> 0000000 71 71 0e e4 b8 80 0a
> q q so d 8 nul nl
> 0000007
> This must be identical to:
> 0000000 71 71 1b 24 42 30 6c 1b 28 42 0a
> q q esc $ B 0 l esc ( B nl
> 0000013

Esc?  Uh oh.  That is really correct?

[...time passes reading http://en.wikipedia.org/wiki/ISO_2022...]

Oh well, this will not work right now.
I haven't looked into this before and I actually thought that JIS is a
double byte charset.  The properties of this charset don't allow me to
use the handcrafted doublebyte charset function I created for Cygwin.

> LANG=en_US.eucJP
> 0000000 71 71 0e e4 b8 80 0a
> q q so d 8 nul nl
> 0000007
> This must be identical to:
> 0000000 71 71 b0 ec 0a
> q q 0 l nl
> 0000005

Same here, since eucJP characters can apparently contain three bytes.
I will have to rework the doublebyte function, or I will have to create
a special multibyte function for these charsets.

Thanks for the test.  I will look into that in the next couple of days.
Stay tuned.


Corinna


P.S.: I'm not fluent in the Japanese charsets and codepages used on
Windows.  http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
lists all supported codepages/charsets.  If you look for the codepages
50220-50222, you'll see they are all called ISO 2022 Japanese.  In
Cygwin I'm using 50220 for JIS.  Is that correct?  Or should I rather
use one of 50221 or 50222?

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
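
For reference, the byte sequences discussed in this thread can be
reproduced outside of Cygwin.  The following is only a rough sketch: it
uses Python's built-in "iso2022_jp" and "euc_jp" codecs as stand-ins for
the conversion Cygwin would have to perform, and the helper names
(hexdump, encode_name, max_char_len) are made up for illustration.  It
encodes the test file name "qq" + U+4E00 plus a trailing newline, and
then scans the BMP to see how long a single eucJP character can get.

```python
# Sketch only: Python's codecs stand in for Cygwin's own conversion code.
NAME = "qq\u4e00\n"  # test file name from the thread, plus a trailing newline


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated hex pairs, like the od dumps above."""
    return " ".join(f"{b:02x}" for b in data)


def encode_name(codec: str) -> str:
    """Encode the test name with the given Python codec and hex-dump it."""
    return hexdump(NAME.encode(codec))


def max_char_len(codec: str) -> int:
    """Longest encoding of any single non-ASCII BMP character in `codec`."""
    longest = 0
    for cp in range(0x80, 0x10000):
        try:
            longest = max(longest, len(chr(cp).encode(codec)))
        except UnicodeEncodeError:
            continue  # character not representable in this charset
    return longest


if __name__ == "__main__":
    for codec in ("utf-8", "iso2022_jp", "euc_jp"):
        print(f"{codec:11s} {encode_name(codec)}")
    print("longest eucJP character:", max_char_len("euc_jp"), "bytes")
```

The ISO-2022-JP line shows the ESC $ B ... ESC ( B shift sequences around
the two-byte character, which is why the charset is stateful and does
not fit a plain double-byte conversion.  The scan illustrates the eucJP
point: some characters need three bytes.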