Mail Archives: cygwin/2017/05/24/08:19:46

Subject: Re: Bug: grep behaves incorrectly under the locale C.UTF-8, if a file contains Umlaut characters
To: cygwin AT cygwin DOT com, ynnor AT mm DOT st
References: <1495612367 DOT 2760331 DOT 986814392 DOT 79C77EB2 AT webmail DOT messagingengine DOT com>
From: Eric Blake <eblake AT redhat DOT com>
Openpgp: url=http://people.redhat.com/eblake/eblake.gpg
Message-ID: <3c344ecb-6ef3-9d54-a627-4714382d4d84@redhat.com>
Date: Wed, 24 May 2017 07:18:33 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.0
MIME-Version: 1.0
In-Reply-To: <1495612367.2760331.986814392.79C77EB2@webmail.messagingengine.com>
X-IsSubscribed: yes

--wFS364UccguVH58rEwtXcAeNNAmT3neio
Content-Type: multipart/mixed; boundary="ch42whOwlnlTecg2IkcPbo3jmO4MkXDof";
 protected-headers="v1"


--ch42whOwlnlTecg2IkcPbo3jmO4MkXDof
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: quoted-printable

On 05/24/2017 02:52 AM, Ronald Fischer wrote:
> I have a file X which contains ASCII text, but also in some lines German
> umlaut characters. The file is classified as:
>
>      $ file X
>      X: ISO-8859 text, with CRLF line terminators

In ISO-8859, a German umlaut occupies one byte with the high bit set.

>
> However, if LANG is set to C.UTF-8, two things happen:
>=20
> - grep classifies the file as binary file and produces the error message
> "Binary file X matches"

In UTF-8, any one-byte sequence with the high bit set in isolation is an
encoding error (all high-bit bytes in UTF-8 occur in 2-or-more byte
sequences).  According to POSIX, grep is only required to operate on
text files, and the definition of a text file includes a requirement
that ALL bytes in the file form valid encodings of characters in the
current locale.  Yes, this means that there are files that are valid
text files in some locales and invalid in others (such as your file
here).  Once you violate the POSIX constraint of passing a non-text file
to grep, all bets are off, and grep can do whatever it wants, including
telling you that a binary file matches.
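
As a rough illustration of the encoding rule only (a Python sketch, not
how grep itself is implemented), a strict UTF-8 decode fails on a lone
high-bit byte. The helper name is_valid_utf8 and the sample word are
made up for this example:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte of data belongs to a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict mode raises on any encoding error
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample line: the same text in two encodings.
latin1_line = "M\u00e4dchen\r\n".encode("iso-8859-1")  # umlaut is the single byte 0xE4
utf8_line = "M\u00e4dchen\r\n".encode("utf-8")         # umlaut is the two bytes 0xC3 0xA4

print(is_valid_utf8(latin1_line))  # False
print(is_valid_utf8(utf8_line))    # True
```

The same bytes are a text file in one locale's encoding and not in the
other's, which is the situation described above.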

>
> - Both the grepped lines (i.e. in our example the non-empty lines) AND
> the error message end up in the standard output (i.e. in file Y).

Yes, that's the current intended behavior in upstream grep. It's not
unique to Cygwin, so complaining here won't change it.

>
> IMO, there are several problems with this:
>=20
> 1. It's hard to see, why an umlaut character makes the file X binary
> under encoding C.UTF-8,

Because it's not a valid UTF-8 encoding. Use iconv to convert your file
from ISO-8859 to UTF-8 if you want to grep it under C.UTF-8.
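
On the command line that would be something along the lines of
"iconv -f ISO-8859-1 -t UTF-8 X > X.utf8" (the output name X.utf8 is
just a placeholder, and ISO-8859-1 is an assumption, since file only
reported "ISO-8859"). A minimal Python sketch of the same conversion,
with a made-up helper name and sample line:

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes from ISO-8859-1 to UTF-8.

    Assumes the input really is ISO-8859-1; other ISO-8859 parts
    would need a different source encoding name.
    """
    return data.decode("iso-8859-1").encode("utf-8")


original = b"M\xe4dchen\r\n"          # hypothetical line with one umlaut
converted = latin1_to_utf8(original)

print(converted)  # b'M\xc3\xa4dchen\r\n'
```

The CRLF line terminators pass through unchanged; only the high-bit
byte is expanded to a two-byte sequence.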

> but not under encoding UTF-8 or C.en_EN

Those aren't valid locale names.

But if you mean that it does what you want under LC_ALL=C, that's
because in the straight C locale, there are no multi-byte characters,
and therefore no encoding errors are possible, and therefore you can't
get a binary file in that locale due merely to an encoding error.
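
Python has no direct equivalent of the C locale, so the following is
only an analogy: ISO-8859-1 stands in here for "every byte is one
character", and it shows that a single-byte decoding cannot fail while
a UTF-8 decoding of the same bytes can.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Single-byte encoding (stand-in for a locale without multi-byte
# characters): each byte maps to exactly one character, so this never raises.
text = all_bytes.decode("iso-8859-1")

# Multi-byte encoding: the first lone high-bit byte is an encoding error.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(len(text), utf8_ok)  # 256 False
```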

>
> 2. If grep classifies a file as binary, I think the desired behaviour
> would be to NOT produce any output, unless the -a flag has been
> supplied.

Once behavior is in the realm of the undefined, it's hard to say what
the desired behavior should be. But again, if you want the current
behavior changed, it's an upstream issue to complain about on bug-grep,
and not something that I'm going to change for Cygwin in isolation.

>
> 3. If grep writes a message "Binary file ... matches", this message
> should go to stderr, not stdout. The stdout is supposed to contain only
> a subset of the input lines.

The message "Binary file ... matches" has always gone to stdout, even
before upstream was tightened to flag more encoding errors as binary
files.  Whether the behavior of mixing it with actual output is
desirable is a question for upstream.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


--ch42whOwlnlTecg2IkcPbo3jmO4MkXDof--

--wFS364UccguVH58rEwtXcAeNNAmT3neio--
