Message-ID: <5554FCEB.9070307@redhat.com>
Date: Thu, 14 May 2015 13:52:11 -0600
From: Eric Blake <eblake AT redhat DOT com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: cygwin AT cygwin DOT com
Subject: Re: Grepping Unicode files?
References: <3C280897-291A-4A8C-8C3F-46D1D9BEFCFE AT solidrocksystems DOT com> <746170827 DOT 20150514185648 AT yandex DOT ru> <313678DD-A000-4F82-A015-836B882C09FC AT solidrocksystems DOT com> <5554D09B DOT 3030209 AT redhat DOT com> <47AFF066-46C5-41FA-A99B-F53C680DF09A AT solidrocksystems DOT com>
In-Reply-To: <47AFF066-46C5-41FA-A99B-F53C680DF09A@solidrocksystems.com>
OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg
Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="cks6M4nf1K5xRmkKmGakcdoAPnm5cioj0"

--cks6M4nf1K5xRmkKmGakcdoAPnm5cioj0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

On 05/14/2015 11:14 AM, Vince Rice wrote:

Your mails are hard to read:
https://cygwin.com/acronyms/#PCYMTWLL

>>
>> None.  UTF16 is not a valid locale.  It is a valid encoding (wide
>> character), but locales must operate on multi-byte sequences, not wide
>> characters.  So you HAVE to convert from wide character to multi-byte
>> before you can do anything that requires a locale to work correctly.
>
> Oh my, the rabbit-hole gets deeper. I don’t know the difference between
> wide character and multi-byte. A little searching appears to indicate
> that Unicode is a type of wide-character, while multi-byte is … well, I
> still don’t know what multi-byte is. :) But, we’re definitely out in the
> weeds of non-cygwinness here, and my file is UTF16, so I can learn what
> multi-byte is and the difference later.

First, you need to learn the difference between a character (which has a
name, a glyph when represented in a font, and a code point giving its
position within a character set) and an encoding (which describes how
many bytes, and which byte values, represent each code point).  An
encoding should map back to the character set, but some byte values may
have no assigned character, and some characters may require more than
one byte.  A single character set can have more than one encoding, and a
character can exist in more than one character set.

Unicode is a definition of a character set (it covers the range u+00000
to u+10ffff, although not all of those values have a character assigned).
It is a superset of most other character sets (ASCII being a common one;
other names you might have heard are Latin-1 and ISO-8859-15, officially
Latin-9, which I loosely call Latin-15 below).  In fact, it aims to
someday be a character set that IS a superset of all others (but it is
constantly being amended and more characters defined, as people point
out useful? characters that have not yet been incorporated).  Conversely,
for any other character set out there, there is a character that is
defined in Unicode but not defined in the weaker set.

Unicode has multiple encodings; among them, the more popular encodings
are UTF-32 (also called UCS-4) (every character occupies exactly 4
bytes), UTF-16 (most characters occupy 2 bytes each, but some characters
require 4 bytes because they are represented as surrogate pairs), UTF-8
(characters occupy a variable number of bytes, where ASCII characters
are 1 byte, and the maximum space required is 4 bytes), and the Java
variant of UTF-8 (often called Modified UTF-8; like UTF-8, except that
u+0000 is encoded as the 2-byte sequence '\xc0\x80' and each half of a
surrogate pair is encoded literally, requiring 6 bytes rather than 4 for
characters above u+0ffff).  Other encodings are also mentioned here:
https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
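
To make the size differences concrete, here is a minimal sketch in
Python (my choice of language for illustration only).  The helper names
utf16_code_units, java_modified_utf8 and encoded_sizes are made up for
this example, and the Java variant is approximated by encoding each
UTF-16 code unit separately, which is an assumption about how to mimic
it rather than a call into any Java library.

```python
def utf16_code_units(text):
    """Split text into 16-bit UTF-16 code units (surrogate pairs stay split)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def java_modified_utf8(text):
    """Hypothetical helper: approximate the Java variant of UTF-8."""
    out = bytearray()
    for unit in utf16_code_units(text):
        if unit == 0:
            # u+0000 gets the special 2-byte form instead of a NUL byte.
            out += b"\xc0\x80"
        else:
            # "surrogatepass" lets a lone surrogate be encoded as 3 bytes.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def encoded_sizes(ch):
    """Return the number of bytes ch occupies in each encoding."""
    return {
        "UTF-32": len(ch.encode("utf-32-be")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-8": len(ch.encode("utf-8")),
        "Java": len(java_modified_utf8(ch)),
    }


for ch in ("a", "\u20ac", "\U0001f4a9"):
    print(f"u+{ord(ch):05x}", encoded_sizes(ch))
```

The explicit "-be" codec names avoid the byte-order mark that Python's
plain "utf-16" and "utf-32" codecs prepend when encoding.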

Meanwhile, a single-byte encoding is one that has at most 256
characters; many older character sets meet this property (ASCII,
Latin-1, etc).  And there are more character sets than Unicode that
require multi-byte encodings (such as Shift-JIS, Big5), but as they
encode fewer characters than Unicode, they tend to be not as popular
today.  Which means the character set of choice if you need to
communicate internationally is Unicode.

More concretely, consider these examples (assuming your email client is
set to read UTF-8 email, because that's what I'm sending):

'a' (the character named "lowercase a"): defined in ASCII (code point
0x61, single-byte encoding '\x61'), defined in Latin-1 (code point 0x61,
single-byte encoding '\x61'), defined in Latin-15 (code point 0x61,
single-byte encoding '\x61'), defined in Unicode (code point u+00061,
single-byte UTF-8 encoding '\x61', single-byte Java encoding '\x61',
2-byte UTF-16 encoding '\x00\x61', four-byte UTF-32 encoding
'\x00\x00\x00\x61'; the wide encodings in these examples are shown in
big-endian byte order)

'€' (the character named "euro sign"): not defined in ASCII, not defined
in Latin-1, defined in Latin-15 (code point 0xa4, single-byte encoding
'\xa4'), defined in Unicode (code point u+020ac, 3-byte UTF-8 encoding
'\xe2\x82\xac', 3-byte Java encoding '\xe2\x82\xac', 2-byte UTF-16
encoding '\x20\xac', 4-byte UTF-32 encoding '\x00\x00\x20\xac')

and my favorite, from
http://www.fileformat.info/info/unicode/char/1F4A9/index.htm

'💩' (the character named "pile of poo") (if your system font has a
rendering for this character, consider yourself lucky! - or is that
cursed?): not defined in ASCII, not defined in Latin-1, not defined in
Latin-15, defined in Unicode (code point u+1f4a9, 4-byte UTF-8 encoding
'\xf0\x9f\x92\xa9', 6-byte Java encoding '\xed\xa0\xbd\xed\xb2\xa9',
4-byte UTF-16 encoding '\xd8\x3d\xdc\xa9', 4-byte UTF-32 encoding
'\x00\x01\xf4\xa9').
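
If you want to check those byte values yourself, something like the
following Python sketch should reproduce them.  The function name
encodings_of and the CODECS list are made up for this example;
"iso8859-15" is Python's codec name for what I called Latin-15, and the
"-be" suffixes pick the big-endian byte order used in the listings above.

```python
# Codec names as Python spells them; None in the result means "not defined".
CODECS = ["ascii", "latin-1", "iso8859-15", "utf-8", "utf-16-be", "utf-32-be"]


def encodings_of(ch):
    """Map each codec name to the bytes for ch, or None if ch is not in that set."""
    result = {}
    for codec in CODECS:
        try:
            result[codec] = ch.encode(codec)
        except UnicodeEncodeError:
            # The character set behind this codec has no such character.
            result[codec] = None
    return result


def show(data):
    """Render bytes as space-separated hex, or a note when undefined."""
    if data is None:
        return "not defined"
    return " ".join(f"{byte:02x}" for byte in data)


for ch in ("a", "\u20ac", "\U0001f4a9"):
    print(f"u+{ord(ch):05x}")
    for codec, data in encodings_of(ch).items():
        print(f"  {codec:10} {show(data)}")
```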

One more piece of information: on Cygwin, wchar_t is 2 bytes (for
compatibility with Windows); that means that Cygwin prefers wide
operations in UTF-16, and has to use surrogate pairs for characters over
u+ffff. On Linux, glibc sets wchar_t to 4 bytes, and prefers wide
operations in UCS-4.
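
For the curious, the surrogate-pair arithmetic that a 2-byte wchar_t
forces on you can be sketched in a few lines of Python.  The function
name to_surrogate_pair is made up for this example; the constants are
the standard UTF-16 ones.

```python
def to_surrogate_pair(code_point):
    """Split a code point above u+ffff into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above u+ffff need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F4A9)
print(f"{high:04x} {low:04x}")
```

For u+1f4a9 this yields the two 16-bit units d83d and dca9, matching the
4-byte UTF-16 encoding listed earlier.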

>> grep cannot handle UTF16 natively.  iconv exists to do encoding
>> transformations, so that the rest of the system can live in multi-byte
>> world instead of worrying about wide-character encodings.
>
> … grep can’t handle unicode files. Good to know. iconv it is.

No, grep can't handle UTF-16 or any other wide-character format.  But it
CAN handle Unicode files, provided those files are encoded in multibyte
UTF-8.
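
The same convert-then-search idea can be sketched in Python, if only to
show why the conversion step matters.  This is an illustration, not a
replacement for iconv and grep; the function name grep_utf16 is made up,
and it assumes the file starts with a byte-order mark so that Python's
"utf-16" codec can pick the right byte order.

```python
import re


def grep_utf16(path, pattern):
    """Return the lines of a UTF-16 file that match pattern, as ordinary strings."""
    regex = re.compile(pattern)
    # Decoding on read plays the role of the iconv step.
    with open(path, encoding="utf-16") as handle:
        return [line.rstrip("\n") for line in handle if regex.search(line)]


# Why a byte-oriented search fails on the raw file: every ASCII character
# is followed by a NUL byte in little-endian UTF-16, so the plain byte
# string never appears in the encoded data.
raw = "needle".encode("utf-16-le")
print(b"needle" in raw)
print(b"n\x00e\x00e\x00d\x00l\x00e\x00" in raw)
```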

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


--cks6M4nf1K5xRmkKmGakcdoAPnm5cioj0--