Mail Archives: cygwin/2011/10/11/15:51:23

Message-ID: <4E949E1F.5020403@redhat.com>
Date: Tue, 11 Oct 2011 13:50:55 -0600
From: Eric Blake <eblake AT redhat DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: LC_COLLATE vs. egrep -- bug or (non-)feature?
References: <f5by5wrjpxo DOT fsf AT calexico DOT inf DOT ed DOT ac DOT uk>
In-Reply-To: <f5by5wrjpxo.fsf@calexico.inf.ed.ac.uk>

On 10/11/2011 01:20 PM, Henry S. Thompson wrote:
> Is this a feature, or a bug associated with the current ongoing
> discussion about locales:

(mis)-feature, and not necessarily a cygwin bug.  Historically, POSIX 
1992 _required_ that regular expression ranges expand out to all 
characters in Collation Element Order, between the two end points.  The 
intent there was to allow accented characters common in some languages 
to automatically be picked up, so that [a-z] would also pick up accented 
vowels.  But it backfired with several unintended consequences: 1) in 
locales that collate case-insensitively, you are collating via 
aAbBcC... or AaBbCc..., so that [a-b] now means [aAb] or [aBb], which 
adds unwanted capital letters into your range.  And although you can 
write a locale definition where collation element order is sane (all 
lowercase, followed by all uppercase, followed by collation rules that 
merge the two sets), it is not as easy to do (the naive locale 
definition writes the collation rules first, intermixing upper and lower 
case). 2) even if you write the locale definition in a sane collation 
element order, do you put the accents first or last?  That is, [a-e] is 
liable to pick up all accented a's but no accented e's, even though 
[a-z] picks up all accented lower case vowels.
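To make the two failure modes concrete, here is a small sketch that mimics the POSIX 1992 rule. It is not how any regex library implements ranges; expand_range is a hypothetical helper, and the ORDER strings are made-up stand-ins for a locale's collation element order. It assumes FIRST and LAST are distinct and that FIRST comes before LAST in ORDER.

```shell
# expand_range ORDER FIRST LAST
# Print every character of ORDER from FIRST through LAST, inclusive.
# ORDER is an illustrative stand-in for a collation element order.
expand_range() {
    order=$1 first=$2 last=$3
    rest=${order#*"$first"}     # everything after FIRST
    middle=${rest%%"$last"*}    # ...up to, but not including, LAST
    printf '%s%s%s\n' "$first" "$middle" "$last"
}

expand_range 'aAbBcC' a b     # prints aAb: an unwanted capital A
expand_range 'AaBbCc' a b     # prints aBb: an unwanted capital B
expand_range 'abcABC' a b     # prints ab: the "sane" order
expand_range 'aàbcdeè' a e    # prints aàbcde: accented a, but no accented e
```

The last call shows the second problem: with accents sorted after their base letter, the end point of the range cuts off before the accented e.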

POSIX 2001 and 2008 "fixed" things by saying that the use of range 
expressions in regular expressions is undefined in all but the C locale, 
but the cat is already out of the bag, and you are stuck with existing 
behavior.  glibc refuses to change their regex library, preferring to 
stick to POSIX 1992 behavior, and claiming that the "bug" instead lies 
with any locale definition that still uses naive ordering.  Cygwin could 
behave differently than glibc here and still comply with POSIX, but then 
we'd get bug reports for "why does cygwin not emulate Linux".

Meanwhile, several GNU apps are sick of bug reports about the 
unintuitive nature of ranges, and are introducing what is called native 
ordering, where range expressions _always_ mean the C locale expansion, 
even when not in the C locale; but given glibc behavior, this means 
adding code on top of glibc, for all programs that understand regex 
(awk, bash, sed, grep, m4, etc.).  So don't expect that to save you any 
time soon; likewise, that only helps you on GNU systems (Solaris will 
still continue to suffer from the confusion).

So, your only safe way to work around it is to request LC_COLLATE=C up 
front.

>
> > LC_ALL= egrep '^[a-b]l[dl]e.n$' /usr/share/dict/words
> aldern
> > LC_COLLATE= egrep '^[a-b]l[dl]e.n$' /usr/share/dict/words
> aldern
> Alleen
> Alleyn
>
> If it's a feature, how do I set LC_COLLATE w/o changing the other
> aspects of my locale?

LANG=preferred LC_COLLATE=C

and don't set LC_ALL.
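
As a rough sketch of that workaround applied to the command from the report: the word list below is a three-word stand-in for /usr/share/dict/words (an assumption, so the example does not depend on that file), and LANG is left at whatever you already use. Only LC_COLLATE is forced to C, and LC_ALL is unset because it would otherwise override LC_COLLATE.

```shell
# Stand-in for /usr/share/dict/words, holding the three words from the report.
words='aldern
Alleen
Alleyn'

# Run in a subshell so the caller's environment is left untouched.
matches=$(
    unset LC_ALL          # LC_ALL, if set, would override LC_COLLATE
    printf '%s\n' "$words" | LC_COLLATE=C grep -E '^[a-b]l[dl]e.n$'
)
printf '%s\n' "$matches"
```

With C collation, [a-b] covers only the lowercase letters a and b, so of the three words only aldern should match.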

-- 
Eric Blake   eblake AT redhat DOT com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

