Mail Archives: cygwin/2010/05/20/12:13:39

Message-ID: <4BF55F87.4060407@towo.net>
Date: Thu, 20 May 2010 18:12:55 +0200
From: Thomas Wolff <towo AT towo DOT net>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4
MIME-Version: 1.0
To: cygwin AT cygwin DOT com
Subject: Re: sed doesn't like LANG= anymore
References: <20100520123926 DOT GA1432 AT onderneming10 DOT xs4all DOT nl> <AANLkTilpbuyiJIswTZGQN5jsHsK793ITUP9pcB95Hf1l AT mail DOT gmail DOT com>
In-Reply-To: <AANLkTilpbuyiJIswTZGQN5jsHsK793ITUP9pcB95Hf1l@mail.gmail.com>

On 20.05.2010 18:05, Andy Koppe wrote:
> On Thursday, May 20, 2010, Jurriaan wrote:
>    
>> A very long sed script that's been working for ages (back from the 1.5
>> age) here has stopped working.
>>
>> It turned out sed doesn't like some strings anymore when environment
>> variable LANG is empty. With LANG=ASCII, there are no problems.
>>
>> The actual text in the SED command is shown below as spaces, but it's a
>> Swedish a with a small o on top of it, like this:
>>
>> sed -e"s/@a/ a/g;"
>>
>> where a is character 0xe5.
>>
>> Running with LANG=ASCII works, with LANG empty I get 'unterminated `s'
>> command' from sed (which confused me for a while).
>>      
> With empty LANG you're using the default UTF-8 encoding, where that
> 0xe5 byte constitutes an incomplete character. You need to either run
> with a LANG setting that fits your script, e.g. C.ISO-8859-1, or
> convert your script to UTF-8. I'm puzzled as to why LANG=ASCII would
> have worked, since that's not a valid setting.
>    
With LANG=anything-unknown, the charmap falls back to ASCII, so the
script works (there are at least no multibyte characters then).
Considering the described effect, I doubt that a UTF-8 decoder should
swallow an ASCII byte that follows an incomplete UTF-8 sequence;
it should rather stop at the last byte that belongs to the sequence and
treat any subsequent UTF-8 lead byte or ASCII byte as the start of a new
character.
I guess the script would still work on Linux (can't try right now,
sorry) even in a "wrong" locale, so I think something should be fixed in
the newlib conversion functions here.
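
To illustrate the decoding policy described above, here is a minimal
sketch in Python. It is not newlib's code and makes no claim about what
newlib actually does; the function name decode_utf8_resync and the sample
byte string (the sed expression with byte 0xe5 in place of the accented
letter) are assumptions for illustration only. An incomplete sequence is
replaced by U+FFFD, and the byte that broke the sequence is then decoded
on its own, so the "/" delimiter after 0xe5 survives.

```python
def decode_utf8_resync(data: bytes) -> str:
    """Simplified UTF-8 decoder (illustrative, hypothetical helper).

    A broken or incomplete sequence becomes U+FFFD, and the byte that
    interrupted it is NOT consumed: decoding resumes at that byte.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # plain ASCII byte
            out.append(chr(b))
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1                      # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2                      # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3                      # lead byte of a 4-byte sequence
        else:
            out.append("\ufffd")          # stray continuation or invalid byte
            i += 1
            continue
        # Consume at most `need` continuation bytes (0x80..0xBF).
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        if j - i == need + 1:
            try:
                out.append(data[i:j].decode("utf-8"))
            except UnicodeDecodeError:    # e.g. overlong form or surrogate
                out.append("\ufffd")
        else:
            out.append("\ufffd")          # sequence was cut short
        i = j                             # resume AT the interrupting byte
    return "".join(out)


# The sed expression as raw bytes, with 0xe5 as in an ISO-8859-1 script.
script = b"s/@\xe5/ \xe5/g;"

resynced = decode_utf8_resync(script)
builtin = script.decode("utf-8", errors="replace")
as_latin1 = script.decode("iso-8859-1")

print(resynced.count("/"))          # all three "/" delimiters are still there
print(resynced == builtin)          # Python's own decoder resyncs the same way
print(as_latin1.encode("utf-8"))    # the script converted to UTF-8
```

With this policy the three "/" delimiters of the s command remain
visible to the parser even though 0xe5 itself is not valid UTF-8 here.
The last line shows the other route Andy suggested: reading the script
as ISO-8859-1 and re-encoding it as UTF-8.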
------
Thomas

