Mail Archives: cygwin/2009/09/29/08:12:29

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Tue, 29 Sep 2009 14:12:03 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: The C locale
Message-ID: <20090929121203.GA19012@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <3f0ad08d0909240003j435818e7h6f7cde2e26188f7e AT mail DOT gmail DOT com> <20090924073441 DOT GA30267 AT calimero DOT vinschen DOT de> <3f0ad08d0909240237s518de248jee409b731711404a AT mail DOT gmail DOT com> <20090924095701 DOT GC30851 AT calimero DOT vinschen DOT de> <20090924100006 DOT GD30851 AT calimero DOT vinschen DOT de> <20090926091504 DOT GA7275 AT calimero DOT vinschen DOT de> <3f0ad08d0909262021u5fe79873r65850865166ce40f AT mail DOT gmail DOT com> <3f0ad08d0909280903t5caaf611ie4049a73beb93f06 AT mail DOT gmail DOT com> <20090928161626 DOT GC8378 AT calimero DOT vinschen DOT de> <4AC1EA0F DOT 5040603 AT towo DOT net>
MIME-Version: 1.0
In-Reply-To: <4AC1EA0F.5040603@towo.net>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Sep 29 13:05, Thomas Wolff wrote:
> Corinna Vinschen wrote:
>> In theory this sounds like a good idea to be used for all locales which
>> don't specify the charset explicitly, because that results in using the
>> same charset, "UTF-8", for all such locales.  "C", "ja" or "en_US"
>> would all default to UTF-8.
>>   
> The keyword here again should be compatibility. That means,  
> unfortunately, that I do not think this is a good idea.
> A number of locales have been established on common systems that do not  
> specify their encoding explicitly (i.e. in their name).
> Since there is now more or less a common set of such locales among  
> various Linux and Unix systems, this seems to be
> a de-facto standard although I am not aware of any more formal  
> definition/listing/description of this.
> On a modern Linux system, use the following command to get a list (not  
> sure if it's appropriate to attach it here):
>    for l in `locale -a`
>    do      echo "$l        `LC_ALL=$l locale charmap`"
>    done
>
> I have also tried to incorporate a best guess assembly of mappings from  
> modern systems in my editor mined so it can
> derive the encoding from the locale name, so you could also take a  
> working list from there.
>
> I think this list should be used for reference to define the  
> locale/encoding mapping; other choices may be more attractive
> but only raise problems.

This isn't feasible for now.  As I described in the documentation, the
actual content of the language and territory parts is not evaluated
yet.  *Only* the charset part (and the cjknarrow modifier, FWIW) has any
meaning for newlib/Cygwin so far.  What happens currently is that
Cygwin calls a function which fetches the ANSI codepage and derives
the charset from it.  So this is what happens:

   LANG="C"               -> UTF-8
   LANG="xx"              -> charset equivalent to ANSI codepage
   LANG="xx_XX"           -> ditto
   LANG="xx_XX.CHARSET"   -> Use charset CHARSET
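
The four cases above can be sketched in C roughly as follows.  This is
only an illustration, not newlib's actual code: resolve_charset() and
its parameters are hypothetical names, the ansi_charset argument stands
in for whatever the ANSI codepage lookup returns, and cutting the
charset off at an "@" modifier is an assumption about how a modifier
such as cjknarrow would be attached.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not newlib's actual code: pick a charset name for
   a LANG value following the four cases listed above.  `ansi_charset`
   stands in for the result of the ANSI codepage lookup. */
static const char *
resolve_charset (const char *lang, const char *ansi_charset,
                 char *buf, size_t buflen)
{
  assert (lang != NULL && ansi_charset != NULL && buf != NULL && buflen > 0);

  /* LANG="C" -> UTF-8 */
  if (strcmp (lang, "C") == 0)
    return "UTF-8";

  /* LANG="xx_XX.CHARSET" -> CHARSET (assumed to end at an optional
     "@modifier") */
  const char *dot = strchr (lang, '.');
  if (dot != NULL)
    {
      size_t len = strcspn (dot + 1, "@");
      if (len >= buflen)
        len = buflen - 1;       /* truncate rather than overflow */
      memcpy (buf, dot + 1, len);
      buf[len] = '\0';
      return buf;
    }

  /* LANG="xx" or LANG="xx_XX" -> charset derived from the ANSI codepage.
     The language and territory parts themselves are not evaluated. */
  return ansi_charset;
}
```

Under these assumptions, resolve_charset ("ja_JP", "SJIS", buf, sizeof buf)
would hand back the codepage-derived name unchanged, while
"ja_JP.EUCJP" would yield the explicitly requested charset.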

We won't add extra functionality.  In the long run it would be nice to
change the setlocale functionality to use actual locale files in every
respect, but that's wishful thinking for now.

To return to the original problem which started this request:

I asked if the default charset for the Japanese language should be set
to EUCJP rather than SJIS.  The actual implementation would have been
like this:

  if (lang == "xx" or lang == "xx_XX", with x in [a-z] and X in [A-Z]?)
    set_charset_from_codepage()

  set_charset_from_codepage()
  {
    switch (GetANSI ())
    [...]
    case 932:
      charset = "EUCJP"    <-- Instead of the current charset = "SJIS"
    [...]
  }
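
For readers who prefer compilable code, the pseudocode above might look
something like this in C.  It is a sketch under stated assumptions, not
the real implementation: is_bare_locale(), charset_from_codepage() and
the prefer_eucjp flag are hypothetical names, and the codepage is
passed in as a plain number so the example builds outside Windows (on
Windows it would presumably come from a call such as GetACP()).  Only
codepage 932 is handled, because it is the only case spelled out above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if lang has the shape "xx" or "xx_XX"
   (x in [a-z], X in [A-Z]), 0 otherwise. */
static int
is_bare_locale (const char *lang)
{
  assert (lang != NULL);
  size_t n = strlen (lang);
  if (n != 2 && n != 5)
    return 0;
  if (lang[0] < 'a' || lang[0] > 'z' || lang[1] < 'a' || lang[1] > 'z')
    return 0;
  if (n == 2)
    return 1;
  return lang[2] == '_'
         && lang[3] >= 'A' && lang[3] <= 'Z'
         && lang[4] >= 'A' && lang[4] <= 'Z';
}

/* Hypothetical stand-in for set_charset_from_codepage().  Every codepage
   other than 932 yields NULL here, because the remaining cases are
   elided in the pseudocode above. */
static const char *
charset_from_codepage (unsigned codepage, int prefer_eucjp)
{
  switch (codepage)
    {
    case 932:
      /* The proposed change: EUCJP instead of the current SJIS. */
      return prefer_eucjp ? "EUCJP" : "SJIS";
    default:
      return NULL;
    }
}
```

A caller would first check is_bare_locale (lang) and only then consult
charset_from_codepage(); the flag merely makes the current and the
proposed behaviour easy to compare side by side.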

Everything going beyond this in complexity is out of the question for now.
  

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
