Message-ID: <4C977AB8.90702@redhat.com>
Date: Mon, 20 Sep 2010 09:16:08 -0600
From: Eric Blake <eblake@redhat.com>
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Mnenhy/0.8.3 Thunderbird/3.1.3
MIME-Version: 1.0
To: cygwin@cygwin.com
Subject: Re: awk gsub problem

On 09/19/2010 02:33 PM, Lee wrote:
>> If LANG is "en_US" or "en_US.utf8", then the regular expression "[a-z]"
>> does *not* correspond anymore to the ASCII codes.  Rather it corresponds
>> to something like "[aAbBcCdD...zZ]", independent of the actual character
>> encoding ISO-8859-1 or UTF-8.

In glibc, [a-z] is interpreted according to the locale's collation
order.  If A collates before a, the range maps to [aBbCc...Zz]; if A
collates after a, it maps to [aAbB...yYz].  Notice that in either case
one capital letter is omitted (A in the first, Z in the second), so the
range is NOT the same as all 26 letters in both upper and lower case.
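
To make the two cases concrete, here is a rough sketch in Python.  It
does not call glibc or use any real locale data; the three collation
sequences and the helper expand_range() are made up for illustration,
and a range is simply taken as everything between its endpoints in the
given sequence.

```python
import string

def expand_range(order, lo, hi):
    """Return every character of `order` from `lo` through `hi`, inclusive.

    `order` is a hypothetical collation sequence given as a plain string.
    """
    start, end = order.index(lo), order.index(hi)
    return order[start:end + 1]

upper = string.ascii_uppercase
lower = string.ascii_lowercase

# Assumed collation sequences (illustrative only, not read from a locale):
upper_first = "".join(u + l for u, l in zip(upper, lower))  # AaBb...Zz
lower_first = "".join(l + u for l, u in zip(lower, upper))  # aAbB...zZ
codepoint = upper + lower                                   # ASCII order, as in the C locale

for name, order in [("A before a", upper_first),
                    ("A after a", lower_first),
                    ("C locale", codepoint)]:
    matched = expand_range(order, "a", "z")
    missing = sorted(set(upper) - set(matched))
    print(f"{name}: {len(matched)} chars, capitals not matched: {''.join(missing)}")
```

With these assumed orderings the first case leaves out only A, the
second leaves out only Z, and the codepoint ordering yields just the
26 lowercase letters.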

This has been a MUCH complained-about feature of glibc, which has in 
turn been copied by bash, awk, grep, etc.

Note that POSIX explicitly states that [a-z] has unspecified results in 
any locale except the C (POSIX) locale.  So the glibc behavior is 
permitted, but so is the traditional behavior of matching just the 26 
lowercase letters.

If you can convince the glibc folks that [a-z] should have the 
traditional behavior, more power to you.

http://lists.gnu.org/archive/html/bug-grep/2010-09/msg00030.html

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

