Mail Archives: cygwin/2012/07/19/08:36:21

In-Reply-To: <20120719113927.GH31055@calimero.vinschen.de>
References: <loom DOT 20120719T103849-659 AT post DOT gmane DOT org> <20120719092024 DOT GA31055 AT calimero DOT vinschen DOT de> <loom DOT 20120719T131247-62 AT post DOT gmane DOT org> <20120719113927 DOT GH31055 AT calimero DOT vinschen DOT de>
Date: Thu, 19 Jul 2012 14:35:56 +0200
Message-ID: <CAEhDDbCJyHY-MWPCZ5=OQJFyvohuUU4AFsoPDzFudLQgfb-8Jw@mail.gmail.com>
Subject: Re: length in gawk returns wrong value
From: Csaba Raduly <rcsaba AT gmail DOT com>
To: cygwin AT cygwin DOT com

On Thu, Jul 19, 2012 at 1:39 PM, Corinna Vinschen wrote:
> On Jul 19 11:27, Ralf wrote:
>> Corinna Vinschen <corinna-cygwin <at> cygwin.com> writes:
>>
>> >
>> > Uh oh.  1.7.9 is old.  Please update.
>> >
>> > > 0000000   R 374   c   k   e   n  \r  \n
>> > > 0000010
>> > > Length: 1
>> > >
>> > > What can I do to get the correct length in gawk without changing
>> > > ttt.txt?
>> >
>> > Dunno.  This is not what I see.  What did you have $LANG and $LC_CTYPE
>> > set to?  Here's what I see:
>> >
>> >   $ uname -a
>> >   CYGWIN_NT-6.1 vmbert7 1.7.16(0.261/5/3) 2012-07-09 14:51 i686 Cygwin
>> >
>> >   $ echo $LANG
>> >   C.UTF-8
>> >
>> >   $ echo "Rücken" > ttt.txt
>> >   $ od -c ttt.txt
>> >   0000000   R 303 274   c   k   e   n  \n
>> >   0000010
>> >
>> >   $ gawk '{print "Length: " length($0)}' ttt.txt
>> >   Length: 6
>> >
>> >   $ gawk --version | head -1
>> >   GNU Awk 4.0.1
>> >
>> > Corinna
>> >
>>
>> After updating I added following lines on top of my script:
>>  export LANG=C.UTF-8
>>  echo LANG: $LANG
>>  echo LC_CTYPE: $LC_CTYPE
>>  c:/unix/bin/gawk --version | head -1
>>
>> And this is my output:
>>  LANG: C.UTF-8
>>  LC_CTYPE:
>>  GNU Awk 4.0.1
>>  CYGWIN_NT-6.0-WOW64 WIESWEG 1.7.15(0.260/5/3) 2012-05-09 10:25 i686 Cygwin
>>  0000000   R 374   c   k   e   n  \r  \n
>>  0000010
>>  Length: 5
>>
>> Very strange!
>
> Not at all.  The file contains an invalid character.  0374 is the
> umlaut-u in the ISO-8859-1 or ISO-8859-15 codesets.  Try this:
>
>   $ LC_ALL=de_DE gawk '{print "Length: " length($0)}' ttt.txt
>   Length: 6
>
> When you create the file under the UTF-8 codeset, you'll get:
>
>   0000000   R 303 274   c   k   e   n  \n
>

Proving, once again, that "There Ain't No Such Thing as Plain Text"
http://www.joelonsoftware.com/articles/Unicode.html
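
To make the byte-level difference concrete, here is a minimal sketch that is not taken from the thread. It writes "Rücken" once as ISO-8859-1 and once as UTF-8, then counts bytes with awk under LC_ALL=C so that no multibyte decoding takes place. The temporary directory and the file names latin1.txt and utf8.txt are made up for illustration, and the \r from the original ttt.txt is left out to keep the counts simple.

```shell
# Hypothetical reproduction; directory and file names are invented for this sketch.
tmp=$(mktemp -d)

# "Rücken" in ISO-8859-1: ü is the single byte 0374 (octal).
printf 'R\374cken\n' > "$tmp/latin1.txt"

# "Rücken" in UTF-8: ü is the two-byte sequence 0303 0274 (octal).
printf 'R\303\274cken\n' > "$tmp/utf8.txt"

# With LC_ALL=C, awk treats every byte as one character,
# so length() reports the byte count of each line.
latin1_len=$(LC_ALL=C awk '{ print length($0) }' "$tmp/latin1.txt")
utf8_len=$(LC_ALL=C awk '{ print length($0) }' "$tmp/utf8.txt")

echo "ISO-8859-1 bytes: $latin1_len"
echo "UTF-8 bytes:      $utf8_len"

rm -rf "$tmp"
```

This should report 6 bytes for the ISO-8859-1 line and 7 for the UTF-8 line. Only when the locale's codeset matches the file's encoding does length() equal the number of letters, which is why the UTF-8 file gives 6 under C.UTF-8 in Corinna's session.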


Csaba
-- 
GCS a+ e++ d- C++ ULS$ L+$ !E- W++ P+++$ w++$ tv+ b++ DI D++ 5++
The Tao of math: The numbers you can count are not the real numbers.
Life is complex, with real and imaginary parts.
"Ok, it boots. Which means it must be bug-free and perfect. " -- Linus Torvalds
"People disagree with me. I just ignore them." -- Linus Torvalds


