Mail Archives: cygwin/2009/09/26/05:15:33

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Sat, 26 Sep 2009 11:15:04 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: The C locale
Message-ID: <20090926091504.GA7275@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <416096c60909071308qc5ff057sbe9cb1dbc270554f AT mail DOT gmail DOT com> <20090908193456 DOT GC17515 AT calimero DOT vinschen DOT de> <416096c60909081449r1fe024dbm7b82a3719be05e9e AT mail DOT gmail DOT com> <20090921103758 DOT GE20981 AT calimero DOT vinschen DOT de> <416096c60909211420g4ac8ea93l80fc1f00dcd5c0f3 AT mail DOT gmail DOT com> <3f0ad08d0909240003j435818e7h6f7cde2e26188f7e AT mail DOT gmail DOT com> <20090924073441 DOT GA30267 AT calimero DOT vinschen DOT de> <3f0ad08d0909240237s518de248jee409b731711404a AT mail DOT gmail DOT com> <20090924095701 DOT GC30851 AT calimero DOT vinschen DOT de> <20090924100006 DOT GD30851 AT calimero DOT vinschen DOT de>
MIME-Version: 1.0
In-Reply-To: <20090924100006.GD30851@calimero.vinschen.de>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Sep 24 12:00, Corinna Vinschen wrote:
> On Sep 24 11:57, Corinna Vinschen wrote:
> > On Sep 24 18:37, IWAMURO Motonori wrote:
> > > - CP932 (Shift_JIS) has 1-byte and 2-byte characters.
> > > 
> > > - The range of 1-byte characters is 0x00-0x7F and 0xA0-0xDF.
> > > 
> > > - The range of the first byte of a 2-byte character is 0x80-0x9F and 0xE0-0xFC.
> > > 
> > > - The range of the second byte of a 2-byte character is 0x40-0x7E and 0x80-0xFC.
> > >   This includes "[", "\", "]", "^", "`", "{", "|", "}".
> > 
> > Ok, thanks for your examples, they show neatly where the problem is.
> > 
> > As you might know, the codepage 20932 (EUC-JP) is also not the same
> > as the UNIX EUC_JP implementation.  The JIS-X-0212 three byte codes
> > are folded into two-byte sequences as described in a comment in
> > strfuncs.cc:
> > 
> >   /* Unfortunately, the Windows eucJP codepage 20932 is not really 100%
> >      compatible to eucJP.  It's a cute approximation which makes it a
> >      doublebyte codepage.
> >      The JIS-X-0212 three byte codes (0x8f,0xa1-0xfe,0xa1-0xfe) are folded
> >      into two byte codes as follows: The 0x8f is stripped, the next byte is
> >      taken as is, the third byte is mapped into the lower 7-bit area by
> >      masking it with 0x7f.  So, for instance, the eucJP code 0x8f,0xdd,0xf8
> >      becomes 0xdd,0x78 in CP 20932.
> > 
> >      To be really eucJP compatible, we have to map the JIS-X-0212 characters
> >      between CP 20932 and eucJP ourselves. */
> > 
> > My question is this:  Is the S-JIS implementation on UNIX systems
> > also using a different implementation to avoid using characters
> > from the ASCII range?  If so, can't we change the __sjis_wctomb
> > and __sjis_mbtowc functions in the same manner as the __eucjp_wctomb
> > and __eucjp_mbtowc functions to get a safer implementation?
> 
> Hmm, as far as I can see from wikipedia, S-JIS is simply defined
> that way.  Bah.
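
To make the folding described in the quoted strfuncs.cc comment concrete, here is a minimal Python sketch.  It is not the Cygwin code: the function names are made up for illustration, and the reverse direction (treating any two-byte code with a 7-bit second byte as JIS-X-0212) is an assumption inferred from the comment.

```python
def eucjp_to_cp20932(seq: bytes) -> bytes:
    """Fold a JIS-X-0212 three-byte eucJP code into the CP 20932 form."""
    if len(seq) == 3 and seq[0] == 0x8F:
        # Strip the 0x8f, keep the next byte, mask the third with 0x7f.
        return bytes([seq[1], seq[2] & 0x7F])
    return seq  # everything else is passed through unchanged


def cp20932_to_eucjp(seq: bytes) -> bytes:
    """Unfold a CP 20932 two-byte code back into three-byte eucJP.

    Assumption: a lead byte in 0xa1-0xfe followed by a 7-bit byte in
    0x21-0x7e marks a folded JIS-X-0212 character.
    """
    if len(seq) == 2 and 0xA1 <= seq[0] <= 0xFE and 0x21 <= seq[1] <= 0x7E:
        return bytes([0x8F, seq[0], seq[1] | 0x80])
    return seq


# The example from the comment: 0x8f,0xdd,0xf8 becomes 0xdd,0x78.
print(eucjp_to_cp20932(b"\x8f\xdd\xf8").hex())  # dd78
```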

This leads me to another question to you and other users working with
Japanese systems.

As far as I understand, the default ANSI and OEM codepage on
Japanese Windows systems is 932/SJIS, right?  And your examples show
nicely how bad codepage 932/SJIS is from a usability perspective.
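
The usability problem can be reproduced with a short Python sketch.  It relies on Python's own "cp932" and "euc_jp" codecs rather than on anything in Cygwin, and the helper name ascii_collisions is hypothetical.  It lists characters whose second byte in CP932 is one of the ASCII characters named above, and shows that the same text in EUC-JP contains no bytes from the ASCII range at all.

```python
# ASCII characters that can show up as the second byte of a CP932 character.
ASCII_SPECIALS = b"[\\]^`{|}"


def ascii_collisions(text):
    """Return (char, cp932 hex, colliding ASCII char) for each collision."""
    hits = []
    for ch in text:
        encoded = ch.encode("cp932")
        if len(encoded) == 2 and encoded[1] in ASCII_SPECIALS:
            hits.append((ch, encoded.hex(), chr(encoded[1])))
    return hits


sample = "ソ表"
for ch, hexcode, ascii_ch in ascii_collisions(sample):
    print(ch, hexcode, "second byte is", repr(ascii_ch))

# In EUC-JP every byte of these characters has the high bit set.
print(all(b >= 0x80 for b in sample.encode("euc_jp")))
```

Both sample characters end in 0x5C, the backslash, which is why a tool that scans bytes for path separators or escape characters can misread them.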

Right now, if you specify a locale like "ja_JP" on your machine, that
is, without specifying the charset, Cygwin will fetch the ANSI codepage
from Windows and use that as your charset.  That means, LANG="ja_JP"
will result in using the charset SJIS.

The question is this:  Wouldn't it be better from a usability perspective
to avoid SJIS in this case, and to switch Cygwin to EUCJP instead?

So, for a Japanese user:

  LANG="C"          -> UTF-8
  LANG="ja"         -> EUCJP
  LANG="ja_JP"      -> EUCJP
  LANG="ja_JP.SJIS" -> SJIS

That would mean Cygwin actually uses SJIS *only* when SJIS is
specified explicitly.
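
As a sketch only, the proposed selection rule could be written down like this in Python.  The function default_charset and its ansi_codepage parameter are hypothetical and do not correspond to any existing Cygwin function; the "CP<n>" fallback for other codepages is an assumption added just to make the example complete.

```python
def default_charset(lang, ansi_codepage=932):
    """Pick the charset for a LANG value under the proposed rule."""
    if "." in lang:
        # An explicitly given charset always wins, e.g. "ja_JP.SJIS".
        return lang.split(".", 1)[1]
    if lang == "C":
        return "UTF-8"
    if ansi_codepage == 932:
        # Proposed change: avoid SJIS unless it was asked for explicitly.
        return "EUCJP"
    # Assumption: otherwise fall back to the Windows ANSI codepage.
    return "CP%d" % ansi_codepage


for lang in ("C", "ja", "ja_JP", "ja_JP.SJIS"):
    print("LANG=%-12s -> %s" % ('"%s"' % lang, default_charset(lang)))
```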

Is that a feasible approach?


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

