From: Eric Blake
Date: Thu, 14 May 2015 13:52:11 -0600
To: cygwin AT cygwin DOT com
Subject: Re: Grepping Unicode files?
Message-ID: <5554FCEB.9070307@redhat.com>

On 05/14/2015 11:14 AM, Vince Rice wrote:

Your mails are hard to read:
https://cygwin.com/acronyms/#PCYMTWLL

>>
>> None. UTF16 is not a valid locale. It is a valid encoding (wide
>> character), but locales must operate on multi-byte sequences, not wide
>> characters. So you HAVE to convert from wide character to multi-byte
>> before you can do anything that requires a locale to work correctly.
>
> Oh my, the rabbit-hole gets deeper. I don’t know the difference between wide character and multi-byte. A little searching appears to indicate that Unicode is a type of wide-character, while multi-byte is … well, I still don’t know what multi-byte is. :) But, we’re definitely out in the weeds of non-cygwinness here, and my file is UTF16, so I can learn what multi-byte is and the difference later.

First, you need to learn the difference between a character (which has a
name, a glyph when represented in a font, and a code point giving its
position when the characters of a set are listed in order) and an
encoding (which describes how many bytes represent a code point, and
what the values of those bytes are).
An encoding should have a mapping back to the character set, but it is
possible for some byte values to not have an assigned character; it is
also possible to require more than one byte to represent a character. A
single character set can have more than one encoding, and a character
can exist in more than one character set.

Unicode is a definition of a character set (it covers the range u+00000
to u+10ffff, although not all of those values have a character
assigned). It is a superset of most other character definitions (ASCII
being a common one; other names you might have heard are Latin-1 and
Latin-15, the latter being ISO-8859-15, formally named Latin-9). In
fact, it aims to someday be a character set that IS a superset of all
others (but it is constantly being amended and more characters defined,
as people point out useful? characters that have not yet been
incorporated). Conversely, for any other character set out there, there
is a character that is defined in Unicode but not defined in the weaker
set.

Unicode has multiple encodings; among them, the more popular encodings
are UTF-32 (also called UCS-4; every character occupies exactly 4
bytes), UTF-16 (most characters occupy 2 bytes each, but some characters
require 4 bytes because they are represented as surrogate pairs), UTF-8
(characters occupy a variable number of bytes, where ASCII characters
are 1 byte, and the maximum space required is 4 bytes), and the Java
variant of UTF-8 (like UTF-8, except that u+00000 is encoded as the two
bytes '\xc0\x80' rather than a single NUL byte, and surrogate pairs are
encoded literally, requiring 6 bytes rather than 4 for characters above
u+0ffff). Other encodings are also mentioned here:
https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Meanwhile, a single-byte encoding is one that has at most 256
characters; many older character sets meet this property (ASCII,
Latin-1, etc).
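
To make the Java variant of UTF-8 described above a bit more concrete,
here is a minimal Python sketch of it. The function names
(to_surrogates, java_modified_utf8) are made up for this illustration
and are not part of any library; the sketch only covers the two
differences from plain UTF-8 mentioned above.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def java_modified_utf8(text):
    """Encode text the way the Java variant of UTF-8 does (illustrative only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL is written as an overlong two-byte sequence.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Each surrogate half is encoded on its own as 3 bytes.
            for half in to_surrogates(cp):
                out += chr(half).encode("utf-8", "surrogatepass")
        else:
            # Everything else matches ordinary UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    for sample in ("a", "\u20ac", "\U0001F4A9", "\x00"):
        print(f"u+{ord(sample):05x}", java_modified_utf8(sample).hex())
```

Characters at or below u+0ffff come out identical to regular UTF-8, so
only NUL and characters needing surrogate pairs differ.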
There are also character sets other than Unicode that require
multi-byte encodings (such as Shift-JIS and Big5), but as they encode
fewer characters than Unicode, they tend to be less popular today. That
makes Unicode the character set of choice if you need to communicate
internationally.

More concretely, consider these examples (assuming your email client is
set to read UTF-8 email, because that's what I'm sending; the UTF-16 and
UTF-32 byte sequences are shown in big-endian order):

'a' (the character named "lowercase a"): defined in ASCII (code point
0x61, single-byte encoding '\x61'), defined in Latin-1 (code point 0x61,
single-byte encoding '\x61'), defined in Latin-15 (code point 0x61,
single-byte encoding '\x61'), defined in Unicode (code point u+00061,
single-byte UTF-8 encoding '\x61', single-byte Java encoding '\x61',
2-byte UTF-16 encoding '\x00\x61', 4-byte UTF-32 encoding
'\x00\x00\x00\x61')

'€' (the character named "euro sign"): not defined in ASCII, not defined
in Latin-1, defined in Latin-15 (code point 0xa4, single-byte encoding
'\xa4'), defined in Unicode (code point u+020ac, 3-byte UTF-8 encoding
'\xe2\x82\xac', 3-byte Java encoding '\xe2\x82\xac', 2-byte UTF-16
encoding '\x20\xac', 4-byte UTF-32 encoding '\x00\x00\x20\xac')

and my favorite, from
http://www.fileformat.info/info/unicode/char/1F4A9/index.htm

'💩' (the character named "pile of poo") (if your system font has a
glyph for this character, consider yourself lucky! - or is that
cursed?): not defined in ASCII, not defined in Latin-1, not defined in
Latin-15, defined in Unicode (code point u+1f4a9, 4-byte UTF-8 encoding
'\xf0\x9f\x92\xa9', 6-byte Java encoding '\xed\xa0\xbd\xed\xb2\xa9',
4-byte UTF-16 encoding '\xd8\x3d\xdc\xa9', 4-byte UTF-32 encoding
'\x00\x01\xf4\xa9').

One more piece of information: on Cygwin, wchar_t is 2 bytes (for
compatibility with Windows); that means that Cygwin prefers wide
operations in UTF-16, and has to use surrogate pairs for characters over
u+ffff.
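
If you want to see the wchar_t width and the surrogate-pair effect for
yourself, a small Python sketch along these lines should do. It assumes
that ctypes.c_wchar reflects the platform's wchar_t (which is how ctypes
documents it); utf16_units is a helper name invented for this example.

```python
import ctypes


def utf16_units(ch):
    """Number of 16-bit code units needed to store ch in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2


if __name__ == "__main__":
    # Width of the platform's wchar_t as seen through ctypes.
    print("wchar_t bytes:", ctypes.sizeof(ctypes.c_wchar))

    # 'a', euro sign, pile of poo: only the last needs a surrogate pair.
    for ch in ("a", "\u20ac", "\U0001F4A9"):
        encoded = ch.encode("utf-16-be")
        print(f"u+{ord(ch):05x}", encoded.hex(), utf16_units(ch), "unit(s)")
```

A character that needs two units is one that a 2-byte wchar_t cannot
hold in a single element.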
On Linux, glibc sets wchar_t to 4 bytes, and prefers wide operations in
UCS-4.

>> grep cannot handle UTF16 natively. iconv exists to do encoding
>> transformations, so that the rest of the system can live in multi-byte
>> world instead of worrying about wide-character encodings.
>
> … grep can’t handle unicode files. Good to know. iconv it is.

No, grep can't handle UTF-16 or any other wide-character format. But it
CAN handle Unicode files, provided those files are encoded in multibyte
UTF-8.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org