Mail Archives: cygwin/2015/05/14/13:14:50

Subject: Re: Grepping Unicode files?
From: Vince Rice <vrice AT solidrocksystems DOT com>
In-Reply-To: <5554D09B.3030209@redhat.com>
Date: Thu, 14 May 2015 12:14:20 -0500
Message-Id: <47AFF066-46C5-41FA-A99B-F53C680DF09A@solidrocksystems.com>
References: <3C280897-291A-4A8C-8C3F-46D1D9BEFCFE AT solidrocksystems DOT com> <746170827 DOT 20150514185648 AT yandex DOT ru> <313678DD-A000-4F82-A015-836B882C09FC AT solidrocksystems DOT com> <5554D09B DOT 3030209 AT redhat DOT com>
To: cygwin AT cygwin DOT com

> On May 14, 2015, at 11:43 AM, Eric Blake <eblake AT redhat DOT com> wrote:
> 
> On 05/14/2015 10:32 AM, Vince Rice wrote:
> 
> …
>> 
>> Now, pardon my continued ignorance, but which of those variables needs to be set to UTF16 in order for grep to work? And I assume it (they?) should be set to en_US.UTF-16?
> 
> None.  UTF16 is not a valid locale.  It is a valid encoding (wide
> character), but locales must operate on multi-byte sequences, not wide
> characters.  So you HAVE to convert from wide character to multi-byte
> before you can do anything that requires a locale to work correctly.

Oh my, the rabbit hole gets deeper. I don’t know the difference between wide-character and multi-byte encodings. A little searching suggests that UTF-16 is a wide-character encoding, while multi-byte is … well, I still don’t know what multi-byte is. :) But we’re definitely out in the weeds of non-Cygwin territory here, and my file is UTF-16, so I can learn what multi-byte is, and how the two differ, later.
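
For anyone following along, here is a minimal sketch of the distinction, in Python rather than anything from this thread. It assumes little-endian UTF-16 without a byte-order mark, and the sample word is only an illustration. It shows why a byte-oriented search for an ASCII pattern can succeed on UTF-8 data but fail on UTF-16 data.

```python
# Compare how the same text is stored in UTF-16 (16-bit code units)
# and in UTF-8 (a multi-byte encoding that is ASCII-compatible).
text = "grep"  # illustrative sample word, not from the original thread

utf16 = text.encode("utf-16-le")  # little-endian, no byte-order mark
utf8 = text.encode("utf-8")

print(utf16)  # b'g\x00r\x00e\x00p\x00'
print(utf8)   # b'grep'

# A byte-oriented search for the ASCII pattern finds it in the UTF-8
# bytes but not in the UTF-16 bytes, where a NUL byte follows each letter.
pattern = b"grep"
print(pattern in utf8)   # True
print(pattern in utf16)  # False
```

Every ASCII character takes two bytes in UTF-16, one of them NUL, so tools that compare byte sequences do not see the pattern as contiguous.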

Bottom line…

>> 
>> Thanks to everyone for your help. I think you’ve all confirmed this isn’t cygwin-specific, but I couldn’t find anything even searching generically (“grep unicode” and now “grep utf16”). I did finally find an external reference to iconv, but if grep is supposed to handle this natively, I haven’t been able to find much on how to do it.
> 
> grep cannot handle UTF16 natively.  iconv exists to do encoding
> transformations, so that the rest of the system can live in multi-byte
> world instead of worrying about wide-character encodings.

… grep can’t handle UTF-16 files natively. Good to know. iconv it is.
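
The convert-then-search idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: grep_utf16 is a hypothetical helper name, the sample text is made up, and the input is assumed to be UTF-16 with a byte-order mark. It is not how grep or iconv work internally.

```python
import re


def grep_utf16(data, pattern):
    """Return the lines of UTF-16 encoded bytes that match a regex.

    Hypothetical helper for illustration only.
    """
    # The "utf-16" codec reads the byte-order mark, if present,
    # to pick the byte order, then yields ordinary text.
    text = data.decode("utf-16")
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]


# Made-up sample; encoding with "utf-16" prepends a byte-order mark.
sample = "first line\nsecond line with needle\nthird line\n".encode("utf-16")
print(grep_utf16(sample, "needle"))  # ['second line with needle']
```

On the command line, the equivalent would be along the lines of iconv -f UTF-16 -t UTF-8 somefile.txt | grep pattern, where somefile.txt and pattern are placeholders.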

Thanks again!

