Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com
Message-Id: <1495620878.1850033.986938960.3CBADAA7@webmail.messagingengine.com>
From: Ronald Fischer
To: cygwin AT cygwin DOT com
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="utf-8"
In-Reply-To: <1221683706.20170524130359@yandex.ru>
References: <1495612367 DOT 2760331 DOT 986814392 DOT 79C77EB2 AT webmail DOT messagingengine DOT com> <1221683706 DOT 20170524130359 AT yandex DOT ru>
Subject: Re: Bug:
grep behaves incorrectly under the locale C.UTF-8, if a file contains Umlaut characters
Date: Wed, 24 May 2017 12:14:38 +0200

> > If I grep the file using, say,
>
> > $ grep . X >Y
>
> > (i.e. select every non-empty line and write the result to Y), this works
> > fine, if LANG is set to one of: UTF-8, C, C.de_DE, C.en_EN, en_EN,
> > de_DE.
>
> > However, if LANG is set to C.UTF-8, two things happen:
>
> > - grep classifies the file as binary file and produces the error message
> > "Binary file X matches"
>
> This is an intended behavior, upstream decision since mid-2015, I recall.

Might be, but this still does not explain issues 1., 2. and 3., which I
laid out in detail below. Note that I never said that the fact that grep
classifies certain characters as binary would by itself be a bug.

Or is the intended behaviour that with C.UTF-8 (and *only* with this
setting), the resulting standard output of grep is interspersed with
"Binary file matches" lines? If this is the case, I really would like to
see a justification for this decision.

>
> > - Both the grepped lines (i.e. in our example the non-empty lines) AND
> > the error message end up in the standard output (i.e. in file Y).
>
> > IMO, there are several problems with this:
>
> > 1. It's hard to see, why an umlaut character makes the file X binary
> > under encoding C.UTF-8, but not under encoding UTF-8 or C.en_EN
>
> > 2. If grep classifies a file as binary, I think the desired behaviour
> > would be to NOT produce any output, unless the -a flag has been
> > supplied.
>
> > 3. If grep writes a message "Binary file ... matches", this message
> > should go to stderr, not stdout. The stdout is supposed to contain only
> > a subset of the input lines.
> > Ronald

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
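
The behaviour discussed above can be checked with a small script. This is only a sketch under assumptions that the message does not confirm: the umlaut in X is taken to be a single Latin-1 byte (0xE4), which is not a valid UTF-8 sequence, and the C.UTF-8 locale is taken to be installed. The sample words and the file names Y.err and Z are made up for illustration. No output is claimed for the plain grep call, since it depends on the grep version and on the locale setup.

```shell
#!/bin/sh
# Hypothetical reproduction; X and Y are the file names used in the message.
# Assumption: the umlaut is stored as one Latin-1 byte (octal 344 = 0xE4).
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'Hello\nM\344dchen\n\nWorld\n' > X

# Plain grep under C.UTF-8: the "Binary file X matches" message, if grep
# emits one, lands either in Y (stdout) or in Y.err (stderr), so both
# streams are captured separately for inspection.
LC_ALL=C.UTF-8 grep . X > Y 2> Y.err

# -a forces grep to treat X as text; every non-empty line here contains at
# least one ASCII character, so all three are selected in any locale.
LC_ALL=C.UTF-8 grep -a . X > Z

# Count the lines that -a selected.
wc -l < Z | tr -d ' '
```

Comparing Y, Y.err and Z afterwards shows whether the message was mixed into the selected lines (point 3 above) and what -a changes (point 2).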