Mail Archives: cygwin/2010/12/04/16:08:38

In-Reply-To: <20101204150642.GA26471@calimero.vinschen.de>
References: <4CF96F70 DOT 3090507 AT veritech DOT com> <AANLkTikQJEJ6kHKZdzzA_YB_DHgZBevCLDKtAEm6ZgBg AT mail DOT gmail DOT com> <4CF9BA08 DOT 8060703 AT redhat DOT com> <AANLkTi=pSXnqvF5OsQbaP8nE6sGHsL6crOG3z9D6SzWs AT mail DOT gmail DOT com> <20101204150642 DOT GA26471 AT calimero DOT vinschen DOT de>
Date: Sat, 4 Dec 2010 17:08:25 -0400
Message-ID: <AANLkTimSO5T+G-6jUSpJEHqmsuB0N6yhbwGkZrzdmYSh@mail.gmail.com>
Subject: Re: Problem with Bash regex test case sensitivity
From: Lee <ler762 AT gmail DOT com>
To: cygwin AT cygwin DOT com

On 12/4/10, Corinna Vinschen <corinna-cygwin > wrote:
> On Dec  4 10:05, Lee wrote:
>> On 12/3/10, Eric Blake <eblake@ > wrote:
>> > Read the FAQ.  http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.
>>
>> Which says the en_US locale collates the upper and lower case letters like
>> this:
>> 	AaBb...Zz
>>
>> I got that much :)  What I don't get is why someone would _want_ the
>> collating sequence to be AaBb... or why that sequence was picked for
>> en_US instead of using the natural order of A-Za-z.
>
> It's not the "natural" order, it's an arbitrary order which was
> chosen back in 1963 when the ASCII code was defined.  It's not used
> as "natural" order outside of computer systems and it's not even the
> natural order on some computer systems (see EBCDIC).

My idea of "natural order" is treating each character as an unsigned
integer.  So even though ASCII has a different collating sequence than
EBCDIC, the characters are still treated as unsigned integers when
sorting them.  Setting LANG to something other than C seems to break
that model.
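
A minimal sketch of that "unsigned integer" model, assuming a POSIX shell
with sort(1) available; the word list is made up for illustration.  With
LC_ALL=C, sort compares raw byte values, so every uppercase letter lands
before every lowercase one:

```shell
# Sort by raw byte value: LC_ALL=C forces the C (POSIX) collating order.
sorted_c=$(printf '%s\n' god God hopper Hopper | LC_ALL=C sort)
printf '%s\n' "$sorted_c"
# prints:
# God
# Hopper
# god
# hopper
```

Under a natural-language locale such as en_US.UTF-8 the same input may come
out in a different order, depending on the locale data installed.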

> If you take a look into a hardcopy encyclopedia written in English,
> you'll probably be quite comfortable that the words are ordered
> lexicographically instead of in ASCII code order.

I never paid all that much attention to how the words were ordered,
but now that I have... they're backwards!  "god" comes before "God",
"hopper" before "Hopper", etc.

>  Needless to say, the ordering
> criteria for non-English languages may contain more characters in the
> sequence, in German for instance
>
>   "AaäBb...Ooö...Ssß...Uuü...Zz"
>
> So, let's reiterate:
>
> - If I need the order for the computer language, I say so:
>
>    LC_COLLATE=C.UTF-8
>
> - Otherwise, if I need the order for the natural language, I say so:
>
>    LC_COLLATE=en_US.UTF-8
>    LC_COLLATE=de_DE.UTF-8
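
A small sketch of how the collating order affects a bracket range,
assuming a POSIX-style shell.  The helper name in_upper_range is
hypothetical, and a case pattern is used here for portability; the [A-Z]
range inside bash's [[ =~ ]] test is likewise locale-dependent.  Only the
C-locale behaviour is shown, since that is the one that is predictable:

```shell
# In the C locale a bracket range is compared by byte value,
# so A-Z covers only bytes 65..90 (the uppercase ASCII letters).
LC_ALL=C
export LC_ALL

# Hypothetical helper: succeeds when the single character $1 is in A-Z.
in_upper_range() {
  case $1 in
    [A-Z]) return 0 ;;
    *)     return 1 ;;
  esac
}

if in_upper_range b; then range_b=match; else range_b=nomatch; fi
if in_upper_range B; then range_B=match; else range_B=nomatch; fi
echo "b: $range_b  B: $range_B"
# prints "b: nomatch  B: match"
```

With a locale that collates as AaBb...Zz, the lowercase letter may fall
inside the range on some systems.  A character class such as [[:upper:]]
is not a range, so it does not depend on the collating order.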

You're quite good at explaining this... I think I'm actually beginning
to understand it :)
So... is setting LANG just a shorthand for setting all the LC_xxx
environment variables?
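
One way to check this, assuming the usual POSIX precedence (LC_ALL first,
then the individual LC_xxx variable, then LANG as the default for anything
left unset).  This is only a sketch; the input letters are arbitrary:

```shell
# LANG supplies the default for every category; LC_COLLATE overrides it
# for collation only.  LC_ALL would override both, so it is unset here
# inside a subshell to keep the experiment clean.
override_sorted=$(
  unset LC_ALL
  printf '%s\n' b A a B | LANG=en_US.UTF-8 LC_COLLATE=C sort
)
printf '%s\n' "$override_sorted"
# prints:
# A
# B
# a
# b
```

Even though LANG names en_US.UTF-8, the sort follows byte order because
LC_COLLATE=C takes precedence for that one category.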

Thanks,
Lee
