Mail Archives: cygwin/2011/03/21/08:38:04
In-Reply-To: | <20110321111746.GP31220@calimero.vinschen.de>
|
References: | <4D8651F2 DOT 3000200 AT cwilson DOT fastmail DOT fm> <AANLkTi=2pKTTo0+nUFa9Qaad7FxJwhhbQ5wJqtqtCpaw AT mail DOT gmail DOT com> <20110321111746 DOT GP31220 AT calimero DOT vinschen DOT de>
|
Date: | Mon, 21 Mar 2011 12:37:50 +0000
|
Message-ID: | <AANLkTimBFu3=4UCkKL=jraDLX00-MwhYpujm-vsRYsuc@mail.gmail.com>
|
Subject: | Re: cygwin + GetConsoleOutputCP
|
From: | Andy Koppe <andy DOT koppe AT gmail DOT com>
|
To: | cygwin AT cygwin DOT com
|
On 21 March 2011 11:17, Corinna Vinschen wrote:
> On Mar 21 07:53, Andy Koppe wrote:
>> On 20 March 2011 19:13, Charles Wilson wrote:
>> > So basically if you specify -iso (or --conv iso) without any of the
>> > "input encoding specification" options like -437 etc, then dos2unix will
>> > attempt to autodetect the *console* encoding.  If it succeeds,
>> > then it will "convert" character codes from that encoding to their
>> > equivalent in ISO-8859-1 ("Latin 1") [unconvertible codes are replaced
>> > with an ascii dot]
>> >
>> > Note that this autodetect, if it works, assumes that the console's CP is
>> > the input file's CP.  Fair enough -- and it's an overridable default
>> > anyway.  However, I wonder if, in cygwin-1.7, we actually can/should use
>> > the "console codepage" in ANY way.  Here's the code:
>> >
>> > querycp.c:
>> > #elif defined (WIN32) || defined(__CYGWIN__)
>> >
>> > /* Erwin Waterlander */
>> >
>> > #include <windows.h>
>> > unsigned short query_con_codepage(void) {
>> >   return((unsigned short)GetConsoleOutputCP());
>> > }
>> > #else
>> >
>> > Or if instead, on cygwin, we should use some other mechanism (locale
>> > settings?) to determine the correct default "input" codepage.
>>
>> I think defaulting to the console codepage makes sense for the DOS
>> side of the conversion. Having said that, Windows files that aren't
>> "Unicode", i.e. UTF-16, are usually encoded in the so-called ANSI
>> codepage, e.g. CP1252, so it would make more sense to default to that.
>
> I agree with Andy here.  I don't think there are really a lot of files
> left today that are encoded using the old DOS codepages.
>
>> However, the real problem with this feature is that the Unix side of
>> the conversion is fixed to ISO-8859-1, which makes it near-useless
>> when Cygwin defaults to UTF-8. And it's no use for non-Western
>> European languages in any case.
>
> Right again.  And not only Cygwin, almost all modern UNIX systems are
> using UTF-8 now.  The -iso option just doesn't make sense.
>
>> A worthwhile conversion feature would use
>> MultiByteToWideChar()/WideCharToMultiByte() defaulting to the system's
>> ANSI codepage on the DOS side, and mbstowcs()/wcstombs() defaulting to
>
> Well, I'm not sure about that.  The complexity of codepage settings on a
> Windows system makes the whole affair guesswork that will always tend
> to do the wrong thing anyway.  There are the following codepages available:
>
> - The current input console codepage, GetConsoleCP().
>
> - The current output console codepage, GetConsoleOutputCP().
>
> - The current OEM codepage, GetOEMCP().
>
> - The current ANSI codepage, GetACP().
>
> - The default OEM codepage of the default system locale,
>   GetLocaleInfo (LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTCODEPAGE, ...).
>
> - The default ANSI codepage of the default system locale,
>   GetLocaleInfo (LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).
>
> - The default OEM codepage of the current user or process,
>   GetLocaleInfo (LOCALE_USER_DEFAULT, LOCALE_IDEFAULTCODEPAGE, ...).
>
> - The default ANSI codepage of the current user or process,
>   GetLocaleInfo (LOCALE_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).
>
> - The default OEM codepage used for system invariant operations,
>   GetLocaleInfo (LOCALE_INVARIANT, LOCALE_IDEFAULTCODEPAGE, ...).
>
> - The default ANSI codepage used for system invariant operations,
>   GetLocaleInfo (LOCALE_INVARIANT, LOCALE_IDEFAULTANSICODEPAGE, ...).
>
> Which is the right one?
GetACP(), which "retrieves the current Windows ANSI code page
identifier for the operating system". That's what programs using the
non-Unicode APIs get. It's also the default in Notepad and other
editors.
Other code pages would need to be specified explicitly by the user.
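
A minimal sketch of that default-with-override rule, in case it helps.
The helper names (system_ansi_codepage, default_input_codepage) and
the "0 means no code page given" convention are assumptions made up
for illustration; they are not taken from dos2unix. Only GetACP() is
a real Windows call here.

```c
#if defined(_WIN32) || defined(__CYGWIN__)
#include <windows.h>
/* GetACP() returns the current Windows ANSI code page, e.g. 1252. */
static unsigned int system_ansi_codepage(void)
{
    return (unsigned int)GetACP();
}
#else
/* Assumption for this sketch: off Windows there is no ANSI code page,
   so report 0 and leave it to the caller to insist on an explicit one. */
static unsigned int system_ansi_codepage(void)
{
    return 0;
}
#endif

/* Hypothetical helper: user_cp is the code page the user asked for
   (e.g. 437 for -437), or 0 if no such option was given. */
unsigned int default_input_codepage(unsigned int user_cp)
{
    return user_cp != 0 ? user_cp : system_ansi_codepage();
}
```

With this shape, an explicit -437 always wins, and the console code
page is never consulted.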
> In theory the option is not useful and should just go away.  If you
> have to keep it for backward compatibility, stick to the current
> behaviour and outlaw its use, perhaps by printing a nagging warning
> to stderr.
... and pointing them at iconv (which, to be fair, the -iso
description already does).
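
For anyone who wants the same thing from C rather than the iconv
command line, here is a rough sketch using the iconv(3) library
interface. The function name convert_buffer is made up for this
example, and the charset names that are accepted ("CP1252", "UTF-8",
...) depend on the iconv implementation installed, so treat those as
assumptions too.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes of `in` from charset
   from_cs to charset to_cs, writing at most outcap bytes to `out`.
   Returns the number of bytes written, or (size_t)-1 on any error
   (unknown charset, invalid input, or output buffer too small). */
size_t convert_buffer(const char *from_cs, const char *to_cs,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to_cs, from_cs);  /* note: target first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;   /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        /* Flush any pending shift state (a no-op for UTF-8). */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);

    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

Passing "CP1252" as from_cs and "UTF-8" as to_cs would cover the
common case of a Windows ANSI file being read on a UTF-8 system.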
Andy