From: Thomas Wolff
Date: Thu, 20 May 2010 18:12:55 +0200
To: cygwin AT cygwin DOT com
Subject: Re: sed doesn't like LANG= anymore

On 20.05.2010 18:05, Andy Koppe wrote:
> On Thursday, May 20, 2010, Jurriaan wrote:
>
>> A very long sed script that's been working for ages (back from the 1.5
>> age) here has stopped working.
>>
>> It turned out sed doesn't like some strings anymore when environment
>> variable LANG is empty. With LANG=ASCII, there are no problems.
>>
>> The actual text in the SED command is shown below as spaces, but it's a
>> Swedish a with a small o on top of it, like this:
>>
>> sed -e"s/@a/ a/g;"
>>
>> where a is character 0xe5.
>>
>> Running with LANG=ASCII works, with LANG empty I get 'unterminated `s'
>> command' from sed (which confused me for a while).
>>
> With empty LANG you're using the default UTF-8 encoding, where that
> 0xe5 byte constitutes an incomplete character. You need to either run
> with a LANG setting that fits your script, e.g. C.ISO-8859-1, or
> convert your script to UTF-8. I'm puzzled as to why LANG=ASCII would
> have worked, since that's not a valid setting.
>

With LANG set to any unknown value, the charmap falls back to ASCII, so
it works (at least there are no multibyte characters then).

Considering the described effect, I doubt that a UTF-8 decoder should
swallow an ASCII byte that follows an incomplete UTF-8 sequence. It
should rather stop at the last byte that belongs to the sequence and
treat any subsequent UTF-8 lead byte or ASCII byte as the start of a new
character.

I guess the script would still work on Linux (can't try right now,
sorry) even in a "wrong" locale, so I think something should be fixed in
the newlib conversion functions here.

------
Thomas
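
The decoder behaviour argued for above can be illustrated with a small sketch. It uses Python's built-in UTF-8 codec purely as a stand-in; it is not the newlib conversion code, and the byte string is an assumed reconstruction of the sed expression from the report (with 0xe5 in both places where the report shows "a"). The point is only that a decoder can flag the lone 0xe5 byte and still deliver the following '/' as its own character, so the `s` command keeps all three delimiters.

```python
# Illustration only: Python's UTF-8 codec, not newlib's conversion functions.
# Assumed reconstruction of the reported sed expression; 0xe5 is the
# ISO-8859-1 byte for the Swedish letter a-ring.
data = b"s/@\xe5/ \xe5/g;"

# In UTF-8, 0xe5 is the lead byte of a three-byte sequence. The next byte,
# '/' (0x2f), is not a continuation byte, so the sequence is incomplete.
# With errors="replace" the codec substitutes U+FFFD for the lone lead byte
# and then decodes '/' as a separate character instead of swallowing it.
decoded = data.decode("utf-8", errors="replace")

# Decoding with the charset the script was actually written in.
latin1 = data.decode("iso-8859-1")

print(repr(decoded))
print("delimiters seen:", decoded.count("/"))
print(repr(latin1))
```

If the '/' after each 0xe5 were consumed as part of the broken sequence, only one delimiter would remain, which would be consistent with the "unterminated `s' command" message in the report.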