Subject: Re: Bug in collation functions?
To: cygwin@cygwin.com
References: <20151029075050.GE5319@calimero.vinschen.de> <20151029083057.GH5319@calimero.vinschen.de> <56321815.7000203@cornell.edu> <20151029153516.GJ5319@calimero.vinschen.de> <56323F2E.4030807@cornell.edu> <56324598.9060604@cornell.edu> <56324E82.7000402@redhat.com> <563268A4.6000005@cornell.edu> <56329462.2090206@cornell.edu> <56329BE8.808@cornell.edu> <20151030120320.GO5319@calimero.vinschen.de> <56337996.2000400@cornell.edu>
From: Ken Brown <kbrown@cornell.edu>
Message-ID: <5634F6BA.7070301@cornell.edu>
Date: Sat, 31 Oct 2015 13:13:30 -0400

On 10/30/2015 10:07 AM, Ken Brown wrote:
> Hi Corinna,
>
> On 10/30/2015 8:03 AM, Corinna Vinschen wrote:
>> On Oct 29 18:21, Ken Brown wrote:
>>> The fallback I had in mind is to return the shorter string if they have
>>> different lengths and otherwise to revert to wcscmp.
>>
>> I had a longer look into this suggestion and the below code and it took
>> me some time to find out what bugged me with it:
>>
>> What about str/wcsxfrm?
>>
>> Per POSIX, calling strcmp on the result of strxfrm is equivalent to
>> calling strcoll (analogously for wcs*).  If you extend *coll to perform an
>> extra check on the length, you will have cases in which the above rule
>> fails.  You can't perform the length test on the result of *xfrm and
>> expect the same result as in *coll.
>>
>> In fact, when calling LCMapStringW with NORM_IGNORESYMBOLS (you would
>> have to do this anyway if we add this flag in *coll), the resulting
>> transformed strings created from the input strings "11" and "1.1" would
>> be identical, so a length test on the xfrm string is not meaningful at
>> all.
>>
>> The bottom line is, afaics, we must make sure that CompareStringW and
>> LCMapStringW are called the same way, and their result/output has to be
>> returned to the caller.  Performing an extra check in *coll which can't
>> be reliably performed in *xfrm is not feasible.
>>
>> Does that make sense?
>
> Yes, I see the problem, and I don't see a good way around it.  So I
> think we probably have to leave things as they are and live with the
> fact that we can't do comparisons that ignore whitespace and punctuation.
>
> The alternative of allowing str/wcscoll to return 0 on unequal strings
> doesn't seem feasible in view of Eric's comments.
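
For reference, the POSIX rule quoted above (strcmp on the results of 
strxfrm must order two strings the same way strcoll does) can be checked 
with a few lines of C.  This is only a minimal sketch: the helper names 
sign, xfrm_compare and coll_matches_xfrm are made up for illustration, 
and nothing here is Cygwin-specific.

```c
#include <stdlib.h>
#include <string.h>

/* Sign of an integer: -1, 0 or 1. */
static int sign(int v) { return (v > 0) - (v < 0); }

/* Compare a and b by transforming both with strxfrm and then calling
   strcmp on the results.  Returns the sign of the comparison, or 2 if
   memory allocation fails. */
static int xfrm_compare(const char *a, const char *b)
{
    /* strxfrm with a zero-size buffer only reports the length needed. */
    size_t na = strxfrm(NULL, a, 0) + 1;
    size_t nb = strxfrm(NULL, b, 0) + 1;
    char *ta = malloc(na);
    char *tb = malloc(nb);
    int result = 2;

    if (ta && tb) {
        strxfrm(ta, a, na);
        strxfrm(tb, b, nb);
        result = sign(strcmp(ta, tb));
    }
    free(ta);
    free(tb);
    return result;
}

/* Returns 1 if strcoll and strxfrm+strcmp agree on the ordering of a
   and b in the current locale, 0 otherwise. */
static int coll_matches_xfrm(const char *a, const char *b)
{
    return sign(strcoll(a, b)) == xfrm_compare(a, b);
}
```

Any extra tie-breaking step added to strcoll alone, such as a length 
check on the original strings, would make coll_matches_xfrm return 0 for 
pairs like "11" and "1.1" whose transformed forms are identical.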

I have one other idea.  What would you think of defining a function 
cygwin_strcoll that's like strcoll but with an extra bool parameter 
'ignoresymbols'?  If ignoresymbols = false, this would be the same as 
strcoll.  If ignoresymbols = true, this would use NORM_IGNORESYMBOLS 
with the fallback I suggested.

That way applications that prefer to be more glibc-compatible and don't 
need strxfrm could do something like

   #define strcoll(A,B) cygwin_strcoll ((A), (B), true)
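
A rough, portable sketch of the idea follows.  It is not a proposed 
patch: cygwin_strcoll_sketch and strip_symbols are hypothetical names, 
and dropping whitespace and punctuation by hand is only a stand-in for 
what passing NORM_IGNORESYMBOLS to CompareStringW would do.  The 
fallback is the one described earlier: if the symbol-ignoring comparison 
reports equality, the shorter string sorts first, and strings of equal 
length are ordered by a plain code-unit comparison (strcmp here, wcscmp 
in the wide-character case).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Copy s, dropping whitespace and punctuation.  Assumption: this only
   approximates NORM_IGNORESYMBOLS; the real flag is handled inside
   CompareStringW.  Returns NULL if allocation fails. */
static char *strip_symbols(const char *s)
{
    char *out = malloc(strlen(s) + 1);
    char *p = out;

    if (!out)
        return NULL;
    for (; *s; s++) {
        unsigned char c = (unsigned char) *s;
        if (!isspace(c) && !ispunct(c))
            *p++ = *s;
    }
    *p = '\0';
    return out;
}

/* Hypothetical function, not an existing Cygwin API. */
int cygwin_strcoll_sketch(const char *a, const char *b, bool ignoresymbols)
{
    if (!ignoresymbols)
        return strcoll(a, b);

    char *sa = strip_symbols(a);
    char *sb = strip_symbols(b);
    int res;

    if (!sa || !sb) {
        /* Out of memory: fall back to the plain comparison. */
        res = strcoll(a, b);
    } else {
        res = strcoll(sa, sb);
        if (res == 0) {
            /* Fallback: never report unequal strings as equal. */
            size_t la = strlen(a), lb = strlen(b);
            if (la != lb)
                res = la < lb ? -1 : 1;
            else
                res = strcmp(a, b);
        }
    }
    free(sa);
    free(sb);
    return res;
}
```

With this sketch, "11" sorts before "1.1" when ignoresymbols is true, 
and the function returns 0 only for identical strings.  There is 
deliberately no matching strxfrm variant, which is the limitation 
discussed above.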

If you think this is reasonable, I'll submit a patch.  If not, no problem.

Ken


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

