Date: Sat, 26 Sep 2009 11:15:04 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: The C locale
Message-ID: <20090926091504.GA7275@calimero.vinschen.de>
In-Reply-To: <20090924100006.GD30851@calimero.vinschen.de>
References: <416096c60909071308qc5ff057sbe9cb1dbc270554f AT mail DOT gmail DOT com>
 <20090908193456 DOT GC17515 AT calimero DOT vinschen DOT de>
 <416096c60909081449r1fe024dbm7b82a3719be05e9e AT mail DOT gmail DOT com>
 <20090921103758 DOT GE20981 AT calimero DOT vinschen DOT de>
 <416096c60909211420g4ac8ea93l80fc1f00dcd5c0f3 AT mail DOT gmail DOT com>
 <3f0ad08d0909240003j435818e7h6f7cde2e26188f7e AT mail DOT gmail DOT com>
 <20090924073441 DOT GA30267 AT calimero DOT vinschen DOT de>
 <3f0ad08d0909240237s518de248jee409b731711404a AT mail DOT gmail DOT com>
 <20090924095701 DOT GC30851 AT calimero DOT vinschen DOT de>
 <20090924100006 DOT GD30851 AT calimero DOT vinschen DOT de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
Sender: cygwin-owner AT cygwin DOT com

On Sep 24 12:00, Corinna Vinschen wrote:
> On Sep 24 11:57, Corinna Vinschen wrote:
> > On Sep 24 18:37, IWAMURO Motonori wrote:
> > > - CP932 (Shift_JIS) has 1byte character and 2bytes character.
> > >
> > > - The range of 1byte character is 0x00-0x7F and 0xA0-0xDF.
> > >
> > > - The range of first byte of 2byte character is 0x80-0x9F and 0xE0-0xFC.
> > >
> > > - The range of second byte of 2byte character is 0x40-7E and 0x80-0xFC.
> > >   This includes "[", "\", "]", "^", "`", "{", "|", "}".
> >
> > Ok, thanks for your examples, they show neatly where the problem is.
> >
> > As you might know, the codepage 20932 (EUC-JP) is also not the same
> > as the UNIX EUC_JP implementation.  The JIS-X-0212 three byte codes
> > are folded into two-byte sequences as described in a comment in
> > strfuncs.cc:
> >
> >   /* Unfortunately, the Windows eucJP codepage 20932 is not really 100%
> >      compatible to eucJP.  It's a cute approximation which makes it a
> >      doublebyte codepage.
> >      The JIS-X-0212 three byte codes (0x8f,0xa1-0xfe,0xa1-0xfe) are folded
> >      into two byte codes as follows: The 0x8f is stripped, the next byte is
> >      taken as is, the third byte is mapped into the lower 7-bit area by
> >      masking it with 0x7f.  So, for instance, the eucJP code 0x8f,0xdd,0xf8
> >      becomes 0xdd,0x78 in CP 20932.
> >
> >      To be really eucJP compatible, we have to map the JIS-X-0212 characters
> >      between CP 20932 and eucJP ourselves. */
> >
> > My question is this:  Is the S-JIS implementation on UNIX systems
> > also using a different implementation to avoid using characters
> > from the ASCII range?  If so, can't we change the __sjis_wctomb
> > and __sjis_mbtowc functions in the same manner as the __eucjp_wctomb
> > and __eucjp_mbtowc functions to get a safer implementation?
>
> Hmm, as far as I can see from wikipedia, S-JIS is simply defined
> that way.  Bah.

This leads me to another question to you and other users working with
Japanese systems.

As far as I understood this, the default ANSI and OEM codepage on
Japanese Windows systems is 932/SJIS, right?  And your examples show
nicely how bad codepage 932/SJIS is from a usability perspective.

Right now, if you specify a locale like "ja_JP" on your machine, that
is, without specifying the charset, Cygwin will fetch the ANSI codepage
from Windows and use that as your charset.  That means, LANG="ja_JP"
will result in using the charset SJIS.
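
To make the byte-range problem concrete, here is a minimal Python
sketch.  It is not Cygwin code and says nothing about how
__sjis_mbtowc is actually written.  It assumes the CP932 ranges
exactly as given in the quoted list and the folding rule from the
strfuncs.cc comment; the function names are made up for illustration.

```python
# Illustrative sketch only.  The ranges are copied from the quoted
# description of CP932; all function names here are hypothetical.

def is_lead_byte(b):
    """First byte of a two-byte character, per the quoted ranges."""
    return 0x80 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_trail_byte(b):
    """Second byte of a two-byte character, per the quoted ranges."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def ascii_trail_bytes(data):
    """Return (offset, char) for every trail byte that is also ASCII.

    These are the bytes a charset-unaware tool would misread as
    "[", "\\", "]" and so on.
    """
    hits = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]) and i + 1 < len(data):
            trail = data[i + 1]
            if is_trail_byte(trail) and trail < 0x80:
                hits.append((i + 1, chr(trail)))
            i += 2  # skip the whole two-byte character
        else:
            i += 1  # single-byte character
    return hits


def fold_jisx0212(euc):
    """Fold a three-byte eucJP JIS-X-0212 code the way the quoted
    comment says CP 20932 does: strip 0x8f, keep the second byte,
    mask the third byte with 0x7f."""
    if len(euc) != 3 or euc[0] != 0x8F:
        raise ValueError("expected a three-byte 0x8f sequence")
    return bytes([euc[1], euc[2] & 0x7F])


if __name__ == "__main__":
    so = "\u30bd".encode("cp932")  # KATAKANA LETTER SO
    print(so.hex(), ascii_trail_bytes(so))
    print(fold_jisx0212(bytes([0x8F, 0xDD, 0xF8])).hex())
```

Katakana SO is the classic case: its second byte in CP932 is 0x5C,
the ASCII backslash, so any code that scans for path separators or
escape characters byte by byte will trip over it.  The last call
reproduces the 0x8f,0xdd,0xf8 example from the comment.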

The question is this:  Wouldn't it be better from a usability
perspective to avoid SJIS in this case, and to switch Cygwin to EUCJP
instead?  So, for a Japanese user:

  LANG="C"           -> UTF-8
  LANG="ja"          -> EUCJP
  LANG="ja_JP"       -> EUCJP
  LANG="ja_JP.SJIS"  -> SJIS

That would mean, *only* when specifying SJIS explicitly, Cygwin actually
uses SJIS.  Is that a feasible approach?


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat