Message-ID: <5554D09B.3030209@redhat.com>
Date: Thu, 14 May 2015 10:43:07 -0600
From: Eric Blake
To: cygwin AT cygwin DOT com
Subject: Re: Grepping Unicode files?
In-Reply-To: <313678DD-A000-4F82-A015-836B882C09FC@solidrocksystems.com>

On 05/14/2015 10:32 AM, Vince Rice wrote:

> locale run from a cmd.exe session says that everything is “C.UTF-8”,
> while locale run from mintty says that everything is en_US.UTF-8. A
> “which” in both cases shows that the locale being run is cygwin’s, so
> I assume mintty does something slightly differently than the normal
> console? I don’t even know if there’s a difference. (Have I mentioned
> I don’t know anything about all of this?)
>
> From cmd.exe:
> LANG=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_ALL=

That's because all programs default to C unless told otherwise; from
cmd, there is nothing stating otherwise, as each cygwin command is the
first process in its own tree of processes.

>
> From mintty
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_ALL=

mintty is a cygwin process, and it sets your locale variables to match
your Windows locale; all other processes are then children of mintty
and inherit the preferred locale settings by default.
Of course, if you don't like mintty's defaults, you can set up your
shell initialization scripts to change them to your preference.

>
> Now, pardon my continued ignorance, but which of those variables needs
> to be set to UTF16 in order for grep to work? And I assume it (they?)
> should be set to en_US.UTF-16?

None. UTF16 is not a valid locale. It is a valid encoding (wide
character), but locales must operate on multi-byte sequences, not wide
characters. So you HAVE to convert from wide character to multi-byte
before you can do anything that requires a locale to work correctly.

>
> Thanks to everyone for your help. I think you’ve all confirmed this
> isn’t cygwin-specific, but I couldn’t find anything even searching
> generically (“grep unicode” and now “grep utf16”). I did finally find
> an external reference to iconv, but if grep is supposed to handle this
> natively, I haven’t been able to find much on how to do it.

grep cannot handle UTF16 natively. iconv exists to do encoding
transformations, so that the rest of the system can live in a
multi-byte world instead of worrying about wide-character encodings.
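
As a minimal sketch of that conversion step: the file name
sample-utf16.txt and the search pattern below are made up for
illustration, and this assumes the file starts with a byte-order mark
so that "-f UTF-16" can detect the endianness (otherwise you would
likely need to name UTF-16LE or UTF-16BE explicitly).

```shell
# Hypothetical sample: build a small UTF-16 file to search.
# (In practice this would be the file you already have.)
printf 'first line\nneedle here\nlast line\n' \
  | iconv -f UTF-8 -t UTF-16 > sample-utf16.txt

# Convert the wide-character encoding to multi-byte UTF-8 on the fly,
# then let grep work on the converted stream.
matches=$(iconv -f UTF-16 -t UTF-8 sample-utf16.txt | grep 'needle')
printf '%s\n' "$matches"
```

The pipeline leaves the original file untouched; redirect the iconv
output to a new file instead if you plan to search it repeatedly.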

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org