Date: Mon, 21 Mar 2011 12:37:50 +0000
Subject: Re: cygwin + GetConsoleOutputCP
From: Andy Koppe
To: cygwin AT cygwin DOT com

On 21 March 2011 11:17, Corinna Vinschen wrote:
> On Mar 21 07:53, Andy Koppe wrote:
>> On 20 March 2011 19:13, Charles Wilson wrote:
>> > So basically if you specify -iso (or --conv iso) without any of the
>> > "input encoding specification" options like -437 etc, then dos2unix will
>> > attempt to autodetect the *console* encoding.  If it succeeds,
>> > then it will "convert" character codes from that encoding to their
>> > equivalent in ISO-8859-1 ("Latin 1") [unconvertible codes are replaced
>> > with an ascii dot]
>> >
>> > Note that this autodetect, if it works, assumes that the console's CP is
>> > the input file's CP.  Fair enough -- and it's an overridable default
>> > anyway.  However, I wonder if, in cygwin-1.7, we actually can/should use
>> > the "console codepage" in ANY way.
>> > Here's the code:
>> >
>> > querycp.c:
>> > #elif defined (WIN32) || defined(__CYGWIN__)
>> >
>> > /* Erwin Waterlander */
>> >
>> > #include <windows.h>
>> > unsigned short query_con_codepage(void) {
>> >   return((unsigned short)GetConsoleOutputCP());
>> > }
>> > #else
>> >
>> > Or if instead, on cygwin, we should use some other mechanism (locale
>> > settings?) to determine the correct default "input" codepage.
>>
>> I think defaulting to the console codepage makes sense for the DOS
>> side of the conversion. Having said that, Windows files that aren't
>> "Unicode", i.e. UTF-16, are usually encoded in the so-called ANSI
>> codepage, e.g. CP1252, so it would make more sense to default to that.
>
> I agree with Andy here.  I don't think there are really a lot of files
> left today which are encoded using the old DOS codepages.
>
>> However, the real problem with this feature is that the Unix side of
>> the conversion is fixed to ISO-8859-1, which makes it near-useless
>> when Cygwin defaults to UTF-8. And it's no use for non-Western
>> European languages in any case.
>
> Right again.  And not only Cygwin, almost all modern UNIX systems are
> using UTF-8 now.  The -iso option just doesn't make sense.
>
>> A worthwhile conversion feature would use
>> MultiByteToWideChar()/WideCharToMultiByte() defaulting to the system's
>> ANSI codepage on the DOS side, and mbstowcs()/wcstombs() defaulting to
>
> Well, I'm not sure about that.  The complexity of codepage settings on a
> Windows system makes the whole affair guesswork which will always tend
> to do the wrong thing anyway.  There are the following codepages available:
>
> - The current input console codepage, GetConsoleCP().
>
> - The current output console codepage, GetConsoleOutputCP().
>
> - The current OEM codepage, GetOEMCP().
>
> - The current ANSI codepage, GetACP().
>
> - The default OEM codepage of the default system locale,
>   GetLocaleInfo (LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTCODEPAGE, ...).
>
> - The default ANSI codepage of the default system locale,
>   GetLocaleInfo (LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).
>
> - The default OEM codepage of the current user or process,
>   GetLocaleInfo (LOCALE_USER_DEFAULT, LOCALE_IDEFAULTCODEPAGE, ...).
>
> - The default ANSI codepage of the current user or process,
>   GetLocaleInfo (LOCALE_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).
>
> - The default OEM codepage used for system invariant operations,
>   GetLocaleInfo (LOCALE_INVARIANT, LOCALE_IDEFAULTCODEPAGE, ...).
>
> - The default ANSI codepage used for system invariant operations,
>   GetLocaleInfo (LOCALE_INVARIANT, LOCALE_IDEFAULTANSICODEPAGE, ...).
>
> Which is the right one?

GetACP(), which "retrieves the current Windows ANSI code page
identifier for the operating system". That's what programs using the
non-Unicode APIs get. It's also the default in Notepad and other
editors. Other code pages would need to be specified explicitly by the
user.

> In theory the option is not useful and should just go away.  If you
> have to keep it for backward compatibility, stick to the current
> behaviour and outlaw its use, perhaps by printing a nagging warning
> to stderr.

... and pointing them at iconv (which, to be fair, the -iso
description already does).

Andy
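
For illustration only, the conversion suggested in this message (decode
from the system's ANSI codepage, encode as UTF-8) might be sketched
roughly as below. This is not code from dos2unix or Cygwin. The helper
name ansi_to_utf8 is hypothetical, and the non-Windows branch is merely
a stand-in that assumes ISO-8859-1 input so the sketch can be compiled
and tried on other systems; only the Windows/Cygwin branch uses
GetACP().

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#if defined(_WIN32) || defined(__CYGWIN__)
#include <windows.h>

/* Hypothetical helper: convert `len` bytes encoded in the system ANSI
   codepage (GetACP()) to a NUL-terminated UTF-8 string.
   The caller frees the result. Returns NULL on failure. */
char *ansi_to_utf8(const char *in, size_t len)
{
    WCHAR *wide;
    char *out = NULL;
    int wlen, ulen;

    if (len == 0) {                /* the Win32 calls reject empty input */
        out = malloc(1);
        if (out) out[0] = '\0';
        return out;
    }
    if (len > INT_MAX) return NULL;

    /* First call with a zero-sized buffer returns the UTF-16 length needed. */
    wlen = MultiByteToWideChar(GetACP(), 0, in, (int)len, NULL, 0);
    if (wlen <= 0) return NULL;
    wide = malloc((size_t)wlen * sizeof *wide);
    if (!wide) return NULL;
    if (MultiByteToWideChar(GetACP(), 0, in, (int)len, wide, wlen) != wlen) {
        free(wide);
        return NULL;
    }

    /* Same two-step pattern for UTF-16 -> UTF-8. */
    ulen = WideCharToMultiByte(CP_UTF8, 0, wide, wlen, NULL, 0, NULL, NULL);
    if (ulen > 0) out = malloc((size_t)ulen + 1);
    if (out) {
        WideCharToMultiByte(CP_UTF8, 0, wide, wlen, out, ulen, NULL, NULL);
        out[ulen] = '\0';
    }
    free(wide);
    return out;
}

#else

/* Stand-in for systems without the Win32 API (an assumption made for
   this sketch): treat the input as ISO-8859-1, where every byte value
   equals its Unicode code point. */
char *ansi_to_utf8(const char *in, size_t len)
{
    char *out = malloc(2 * len + 1);   /* each byte becomes at most 2 bytes */
    size_t i, n = 0;

    if (!out) return NULL;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            out[n++] = (char)c;                    /* ASCII passes through */
        } else {
            out[n++] = (char)(0xC0 | (c >> 6));    /* 2-byte UTF-8 lead    */
            out[n++] = (char)(0x80 | (c & 0x3F));  /* continuation byte    */
        }
    }
    out[n] = '\0';
    return out;
}

#endif
```

A caller would pass the raw file bytes and their length, e.g.
ansi_to_utf8(buf, nbytes), and free() the result. Any other source
codepage would still have to be given explicitly by the user, as noted
above, or the job handed to iconv.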