Mail Archives: cygwin/2015/05/14/12:43:29

Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com
Message-ID: <5554D09B.3030209@redhat.com>
Date: Thu, 14 May 2015 10:43:07 -0600
From: Eric Blake <eblake AT redhat DOT com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: cygwin AT cygwin DOT com
Subject: Re: Grepping Unicode files?
References: <3C280897-291A-4A8C-8C3F-46D1D9BEFCFE AT solidrocksystems DOT com> <746170827 DOT 20150514185648 AT yandex DOT ru> <313678DD-A000-4F82-A015-836B882C09FC AT solidrocksystems DOT com>
In-Reply-To: <313678DD-A000-4F82-A015-836B882C09FC@solidrocksystems.com>
OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg
X-IsSubscribed: yes

--3KO4tAXxTwUhdAe1P2CjEQWUVBHIg6oe9
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

On 05/14/2015 10:32 AM, Vince Rice wrote:

> locale run from a cmd.exe session says that everything is “C.UTF-8”,
> while locale run from mintty says that everything is en_US.UTF-8. A
> “which” in both cases shows that the locale being run is cygwin’s, so
> I assume mintty does something slightly differently than the normal
> console? I don’t even know if there’s a difference. (Have I mentioned
> I don’t know anything about all of this?)
>
> From cmd.exe:
> LANG=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_ALL=

That's because all programs default to the C locale (which cygwin
reports as C.UTF-8) unless told otherwise; when started from cmd,
nothing in the environment says otherwise, since each cygwin command is
the first process in its own tree of processes.
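
To see what a process gets when no locale variables are inherited, you
can clear them in a subshell before running locale. This is only a
sketch: the function name show_default_locale is made up for
illustration, and the exact category values printed depend on the
platform.

```shell
# Hypothetical helper: run `locale` with the common locale variables
# removed from the environment, roughly what a command started from
# cmd.exe sees. The parentheses keep the unset local to a subshell.
show_default_locale() {
    (
        unset LANG LC_ALL LC_CTYPE LC_NUMERIC LC_TIME \
              LC_COLLATE LC_MONETARY LC_MESSAGES
        locale
    )
}

show_default_locale
```

The LANG= line comes back empty, and the LC_* lines show whatever the
platform uses as its built-in default.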

>
> From mintty
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=

mintty is a cygwin process, AND it sets your locale variables to match
your Windows locale; all other processes are then children of mintty
and inherit the preferred locale settings by default.  Of course, if you
don't like mintty's defaults, you can set up your shell initialization
scripts to change them to your preference.
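
As a rough illustration of such an override, lines like the following
could go in a shell startup file such as ~/.bashrc. The choice of
en_GB.UTF-8 is an arbitrary assumption; substitute whatever locale you
actually want.

```shell
# Hypothetical ~/.bashrc fragment: override the locale inherited from
# the terminal. en_GB.UTF-8 is only an example value.
export LANG=en_GB.UTF-8

# LC_ALL takes precedence over LANG and the individual LC_* variables,
# so make sure a stale value does not mask the setting above.
unset LC_ALL
```

Every process started from that shell then inherits the new LANG.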

>
> Now, pardon my continued ignorance, but which of those variables needs
> to be set to UTF16 in order for grep to work? And I assume it (they?)
> should be set to en_US.UTF-16?

None.  UTF16 is not a valid locale.  It is a valid encoding (wide
character), but locales must operate on multi-byte sequences, not wide
characters.  So you HAVE to convert from wide character to multi-byte
before you can do anything that requires a locale to work correctly.
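
A quick way to see the difference is to dump the bytes of the same text
in both encodings. This is a sketch that assumes iconv and od are on
your PATH; the variable names are made up for illustration.

```shell
# Encode the two ASCII characters "hi" as UTF-16LE and dump the raw bytes.
utf16_bytes=$(printf 'hi' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1)
echo "UTF-16LE:$utf16_bytes"    # the bytes are 68 00 69 00

# The same text in UTF-8 (identical to ASCII for these characters).
utf8_bytes=$(printf 'hi' | od -An -tx1)
echo "UTF-8:   $utf8_bytes"     # the bytes are 68 69
```

Every other byte of the UTF-16 form is NUL, which is why byte-oriented
tools tend to treat such a file as binary data rather than text.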

>
> Thanks to everyone for your help. I think you’ve all confirmed this
> isn’t cygwin-specific, but I couldn’t find anything even searching
> generically (“grep unicode” and now “grep utf16”). I did finally find
> an external reference to iconv, but if grep is supposed to handle this
> natively, I haven’t been able to find much on how to do it.

grep cannot handle UTF16 natively.  iconv exists to do encoding
transformations, so that the rest of the system can live in a multi-byte
world instead of worrying about wide-character encodings.
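
In practice that means piping the file through iconv before grep sees
it. A minimal sketch, assuming the input is UTF-16 with a byte order
mark (as many Windows tools write it); the sample text and variable
names are made up for illustration.

```shell
# Build a small UTF-16 sample file to stand in for the real one.
sample=$(mktemp)
printf 'first line\nneedle here\nlast line\n' |
    iconv -f UTF-8 -t UTF-16 > "$sample"

# Convert to UTF-8 on the fly, then search the converted text.
# "-f UTF-16" lets iconv use the byte order mark to pick the byte order.
matches=$(iconv -f UTF-16 -t UTF-8 "$sample" | grep 'needle')
printf '%s\n' "$matches"

rm -f "$sample"
```

If the file has no byte order mark, you would need to name the byte
order explicitly (UTF-16LE or UTF-16BE) instead.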

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


--3KO4tAXxTwUhdAe1P2CjEQWUVBHIg6oe9--
