Mail Archives: cygwin/2010/05/20/12:13:39
On 20.05.2010 18:05, Andy Koppe wrote:
> On Thursday, May 20, 2010, Jurriaan wrote:
>
>> A very long sed script that has been working for ages (since the 1.5
>> days) has stopped working here.
>>
>> It turned out sed doesn't like some strings anymore when environment
>> variable LANG is empty. With LANG=ASCII, there are no problems.
>>
>> The actual text in the SED command is shown below as spaces, but it's a
>> Swedish a with a small o on top of it, like this:
>>
>> sed -e"s/@a/ a/g;"
>>
>> where a is character 0xe5.
>>
>> Running with LANG=ASCII works, with LANG empty I get 'unterminated `s'
>> command' from sed (which confused me for a while).
>>
> With empty LANG you're using the default UTF-8 encoding, where that
> 0xe5 byte constitutes an incomplete character. You need to either run
> with a LANG setting that fits your script, e.g. C.ISO-8859-1, or
> convert your script to UTF-8. I'm puzzled as to why LANG=ASCII would
> have worked, since that's not a valid setting.
>
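To make the quoted explanation concrete, here is a small Python sketch. It is not from the original report: the script bytes are reconstructed from the description, assuming the Swedish character is stored as the single byte 0xe5, and the variable names are made up for illustration. It checks that 0xe5 is a UTF-8 lead byte that expects two continuation bytes, that the script is therefore not valid UTF-8, and that reading it as ISO-8859-1 and re-encoding gives a UTF-8 version.

```python
# Bytes of the sed script as described in the report (assumption: the
# Swedish character is the single byte 0xe5, i.e. ISO-8859-1 encoded).
script = b"s/@\xe5/ \xe5/g;"

# In UTF-8, 0xe5 (binary 1110 0101) matches the 1110xxxx pattern, so it is
# the lead byte of a 3-byte sequence and needs two continuation bytes.
is_three_byte_lead = (0xE5 & 0xF0) == 0xE0

# A strict UTF-8 decoder rejects the script, because "/" (0x2f) follows
# the lead byte instead of a continuation byte (10xxxxxx).
try:
    script.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Read as ISO-8859-1 the same bytes are fine, and re-encoding the text
# yields the script converted to UTF-8.
utf8_script = script.decode("iso-8859-1").encode("utf-8")

print(is_three_byte_lead, decodes_as_utf8)
print(utf8_script)
```

In the converted script the character becomes the two bytes 0xc3 0xa5, which is a complete UTF-8 sequence.
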
With LANG set to any unknown value, the charmap falls back to ASCII, so
it works (as there are at least no multibyte characters then).
Considering the described effect, I doubt that a UTF-8 decoder should
swallow an ASCII byte that follows an incomplete UTF-8 sequence;
it should rather stop at the last byte that belongs to the sequence, and
treat any subsequent UTF-8 lead byte or ASCII byte as the start of a new
character.
I guess the script would still work on Linux (can't try right now,
sorry) even in a "wrong" locale, so I think something should be fixed in
the newlib conversion functions here.
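
As a rough illustration of the behaviour I mean, here is a minimal Python sketch of a decoder that resynchronizes after an incomplete sequence. It is not the newlib code and not a proposed patch; the function name decode_resync is hypothetical, and the test input assumes the script bytes from the report with the character stored as 0xe5.

```python
def decode_resync(data: bytes) -> str:
    """Decode UTF-8, replacing each malformed sequence with U+FFFD.

    A byte that is not a continuation byte is never consumed as part of
    a preceding incomplete sequence; it always starts a new character.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1                      # 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2                      # 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3                      # 4-byte sequence
        else:                             # stray continuation or invalid lead
            out.append("\ufffd")
            i += 1
            continue
        # Collect at most `need` continuation bytes (10xxxxxx) and stop
        # at the first byte that is not one.
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        if j - i == need + 1:
            try:
                out.append(data[i:j].decode("utf-8"))
            except UnicodeDecodeError:    # overlong form, surrogate, etc.
                out.append("\ufffd")
        else:                             # incomplete sequence
            out.append("\ufffd")
        i = j                             # resume at the first unconsumed byte
    return "".join(out)


decoded = decode_resync(b"s/@\xe5/ \xe5/g;")
print(ascii(decoded))
print(decoded.count("/"))
```

With this approach all three "/" delimiters of the s command survive decoding, so only the 0xe5 bytes themselves are affected. Python's built-in decoder with errors="replace" gives the same result for this input.
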
------
Thomas