From: neomjp
Date: Sat, 4 Apr 2009 02:32:11 +0900 (JST)
Subject: [1.7] Support for CJK Character Sets
To: cygwin AT cygwin DOT com
Message-ID: <20090403173212.51916.qmail@web4102.mail.ogk.yahoo.co.jp>

On 2009/04/02 22:46, Corinna Vinschen wrote:
> > Btw., it's really not tricky to create a filename with special
> > characters:

I used Corinna's tiny program
(http://sourceware.org/ml/cygwin/2009-04/msg00053.html) to create a file
whose name contains a CJK character, and tested how setting LANG affects
the way the name is reported.

I changed 0x20ac to 0x4e00 (一). This character is used in all three
languages (Chinese, Japanese, and Korean). Its UTF-8 encoding is
0xe4 0xb8 0x80. So, without setting LANG, the file name should look like
"qq\016\344\270\200". (The \016 is ASCII SO; it shows that Cygwin could
not convert the following character to the current character set.)

I then checked how the file name changes when LANG is set to each
character set. A list of supported character sets is found at
http://cygwin.com/1.7/cygwin-ug-net/setup-locale.html .

The result (see below) was that the file name was correctly converted to
UTF-8, SJIS, GBK, Big5, and eucKR. In each of these cases the bytes matched
the name converted with iconv. But the conversion failed for
JIS/ISO-2022-JP and eucJP: the name was represented as ASCII SO (0x0e)
followed by the UTF-8 sequence.

What is going wrong here? What makes the file name conversion from UTF-16
to these character sets fail? Or, what am I doing wrong? Any hints?
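As a small sketch of the byte values discussed above (not part of the original test), the following prints the UTF-8 encoding of U+4E00 and the expected fallback name as hex. The helper names hex_bytes, utf8_4e00, and fallback_name are hypothetical and chosen only for illustration; the sketch assumes a POSIX sh with od available.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of Cygwin.

# Print the bytes read from stdin as space-separated hex pairs.
# The unquoted $(...) is deliberate: word splitting collapses od's padding.
hex_bytes() {
    echo $(od -An -v -t x1)
}

# U+4E00 encoded as UTF-8, written with POSIX printf octal escapes.
utf8_4e00() {
    printf '\344\270\200'
}

# The fallback name: "qq", ASCII SO (octal 016), then the UTF-8 bytes.
fallback_name() {
    printf 'qq\016\344\270\200'
}

utf8_4e00 | hex_bytes        # prints: e4 b8 80
fallback_name | hex_bytes    # prints: 71 71 0e e4 b8 80
```

The octal escapes \344 \270 \200 are the same three bytes as hex e4 b8 80, which is why the two notations in the message describe the same name.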
-- neomjp

for lang in UTF-8 SJIS GBK Big5 ISO-2022-JP eucJP eucKR ; do
    export LANG="en_US.${lang}"
    echo
    echo LANG=${LANG}
    ls q* | od -t x1 -t a
    export LANG="en_US.UTF-8"
    echo "This must be identical to:"
    ls q* | iconv -f UTF-8 -t ${lang} | od -t x1 -t a
    unset LANG
done

LANG=en_US.UTF-8
0000000  71  71  e4  b8  80  0a
          q   q   d   8 nul  nl
0000006
This must be identical to:
0000000  71  71  e4  b8  80  0a
          q   q   d   8 nul  nl
0000006

LANG=en_US.SJIS
0000000  71  71  88  ea  0a
          q   q  bs   j  nl
0000005
This must be identical to:
0000000  71  71  88  ea  0a
          q   q  bs   j  nl
0000005

LANG=en_US.GBK
0000000  71  71  d2  bb  0a
          q   q   R   ;  nl
0000005
This must be identical to:
0000000  71  71  d2  bb  0a
          q   q   R   ;  nl
0000005

LANG=en_US.Big5
0000000  71  71  a4  40  0a
          q   q   $   @  nl
0000005
This must be identical to:
0000000  71  71  a4  40  0a
          q   q   $   @  nl
0000005

LANG=en_US.ISO-2022-JP
0000000  71  71  0e  e4  b8  80  0a
          q   q  so   d   8 nul  nl
0000007
This must be identical to:
0000000  71  71  1b  24  42  30  6c  1b  28  42  0a
          q   q esc   $   B   0   l esc   (   B  nl
0000013

LANG=en_US.eucJP
0000000  71  71  0e  e4  b8  80  0a
          q   q  so   d   8 nul  nl
0000007
This must be identical to:
0000000  71  71  b0  ec  0a
          q   q   0   l  nl
0000005

LANG=en_US.eucKR
0000000  71  71  ec  e9  0a
          q   q   l   i  nl
0000005
This must be identical to:
0000000  71  71  ec  e9  0a
          q   q   l   i  nl
0000005
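The failing cases in the dumps above can be spotted mechanically by looking for the SO byte. The following is a minimal sketch under stated assumptions: uses_so_fallback is a hypothetical helper name, and the check is only a heuristic, since an encoding that uses shift characters could legitimately contain 0x0e. The sample byte sequences are taken from the dumps above.

```shell
#!/bin/sh
# Sketch only: uses_so_fallback is a hypothetical helper name.

# Succeed (exit 0) if the bytes on stdin contain ASCII SO (0x0e),
# the marker seen in the unconverted names above.
# grep -w matches "0e" only as a whole hex pair in od's output.
uses_so_fallback() {
    od -An -v -t x1 | grep -qw '0e'
}

# Name as reported under LANG=en_US.eucJP (SO followed by UTF-8 bytes).
if printf 'qq\016\344\270\200\n' | uses_so_fallback; then
    echo "fallback"
fi

# Name as produced by iconv -t eucJP (bytes b0 ec).
if ! printf 'qq\260\354\n' | uses_so_fallback; then
    echo "converted"
fi
```

In the loop above, the same function could take the place of od after "ls q*" to print one verdict per character set instead of a full dump.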