Mail Archives: cygwin/2011/03/21/07:18:24

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Mon, 21 Mar 2011 12:17:46 +0100
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: cygwin + GetConsoleOutputCP
Message-ID: <20110321111746.GP31220@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <4D8651F2 DOT 3000200 AT cwilson DOT fastmail DOT fm> <AANLkTi=2pKTTo0+nUFa9Qaad7FxJwhhbQ5wJqtqtCpaw AT mail DOT gmail DOT com>
MIME-Version: 1.0
In-Reply-To: <AANLkTi=2pKTTo0+nUFa9Qaad7FxJwhhbQ5wJqtqtCpaw@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Mar 21 07:53, Andy Koppe wrote:
> On 20 March 2011 19:13, Charles Wilson wrote:
> > So basically if you specify -iso (or --conv iso) without any of the
> > "input encoding specification" options like -437 etc, then dos2unix will
> > attempt to autodetect the *console* encoding.  If it succeeds,
> > then it will "convert" character codes from that encoding to their
> > equivalent in ISO-8859-1 ("Latin 1") [unconvertible codes are replaced
> > with an ASCII dot].
> >
> > Note that this autodetect, if it works, assumes that the console's CP is
> > the input file's CP.  Fair enough -- and it's an overridable default
> > anyway.  However, I wonder if, in cygwin-1.7, we actually can/should use
> > the "console codepage" in ANY way.  Here's the code:
> >
> > querycp.c:
> > #elif defined (WIN32) || defined(__CYGWIN__)
> >
> > /* Erwin Waterlander */
> >
> > #include <windows.h>
> > unsigned short query_con_codepage(void) {
> >   return((unsigned short)GetConsoleOutputCP());
> > }
> > #else
> >
> > Or if instead, on cygwin, we should use some other mechanism (locale
> > settings?) to determine the correct default "input" codepage.
> 
> I think defaulting to the console codepage makes sense for the DOS
> side of the conversion. Having said that, Windows files that aren't
> "Unicode", i.e. UTF-16, are usually encoded in the so-called ANSI
> codepage, e.g. CP1252, so it would make more sense to default to that.

I agree with Andy here.  I don't think there are really a lot of files
left today that are encoded using the old DOS codepages.

> However, the real problem with this feature is that the Unix side of
> the conversion is fixed to ISO-8859-1, which makes it near-useless
> when Cygwin defaults to UTF-8. And it's no use for non-Western
> European languages in any case.

Right again.  And it's not only Cygwin: almost all modern UNIX systems
use UTF-8 now.  The -iso option just doesn't make sense.

> A worthwhile conversion feature would use
> MultiByteToWideChar()/WideCharToMultiByte() defaulting to the system's
> ANSI codepage on the DOS side, and mbstowcs()/wcstombs() defaulting to

Well, I'm not sure about that.  The complexity of codepage settings on a
Windows system makes the whole affair guesswork which will always tend
to do the wrong thing anyway.  There are the following codepages available:

- The current input console codepage, GetConsoleCP().

- The current output console codepage, GetConsoleOutputCP().

- The current OEM codepage, GetOEMCP().

- The current ANSI codepage, GetACP().

- The default OEM codepage of the default system locale,
  GetLocaleInfo (LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTCODEPAGE, ...).

- The default ANSI codepage of the default system locale,
  GetLocaleInfo (LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).

- The default OEM codepage of the current user or process,
  GetLocaleInfo (LOCALE_USER_DEFAULT, LOCALE_IDEFAULTCODEPAGE, ...).

- The default ANSI codepage of the current user or process,
  GetLocaleInfo (LOCALE_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).

- The default OEM codepage used for system invariant operations,
  GetLocaleInfo (LOCALE_INVARIANT, LOCALE_IDEFAULTCODEPAGE, ...).

- The default ANSI codepage used for system invariant operations,
  GetLocaleInfo (LOCALE_INVARIANT, LOCALE_IDEFAULTANSICODEPAGE, ...).

Which is the right one?
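
To see how far these values can diverge on a given machine, the calls
listed above can be collected in one place.  What follows is only a
minimal sketch, not code from dos2unix or Cygwin.  The names cp_entry,
collect_codepages, add_cp and locale_cp are made up for illustration, and
the non-Windows branch exists only so that the file still compiles
elsewhere (it then reports no entries).

```c
#include <assert.h>
#include <stddef.h>

#if defined(_WIN32) || defined(__CYGWIN__)
#include <windows.h>
#define HAVE_WIN32_CODEPAGES 1
#else
#define HAVE_WIN32_CODEPAGES 0
#endif

/* Hypothetical record: one candidate codepage and the call it came from. */
struct cp_entry {
    const char *source;    /* API call that produced the value */
    unsigned int codepage; /* 0 if the query failed or no console is attached */
};

#if HAVE_WIN32_CODEPAGES
/* Ask GetLocaleInfo for a codepage as a number rather than a digit string. */
static unsigned int locale_cp(LCID lcid, LCTYPE type)
{
    DWORD cp = 0;
    if (!GetLocaleInfoA(lcid, type | LOCALE_RETURN_NUMBER,
                        (LPSTR)&cp, (int)sizeof cp))
        return 0;
    return (unsigned int)cp;
}

/* Append one entry if there is room; returns the new entry count. */
static size_t add_cp(struct cp_entry *out, size_t n, size_t max,
                     const char *source, unsigned int cp)
{
    if (n < max) {
        out[n].source = source;
        out[n].codepage = cp;
        n++;
    }
    return n;
}
#endif

/* Fill `out` with up to `max` candidates; returns how many were stored. */
size_t collect_codepages(struct cp_entry *out, size_t max)
{
    size_t n = 0;
    assert(out != NULL || max == 0);
#if HAVE_WIN32_CODEPAGES
    n = add_cp(out, n, max, "GetConsoleCP", GetConsoleCP());
    n = add_cp(out, n, max, "GetConsoleOutputCP", GetConsoleOutputCP());
    n = add_cp(out, n, max, "GetOEMCP", GetOEMCP());
    n = add_cp(out, n, max, "GetACP", GetACP());
    n = add_cp(out, n, max, "system default, OEM",
               locale_cp(LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTCODEPAGE));
    n = add_cp(out, n, max, "system default, ANSI",
               locale_cp(LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE));
    n = add_cp(out, n, max, "user default, OEM",
               locale_cp(LOCALE_USER_DEFAULT, LOCALE_IDEFAULTCODEPAGE));
    n = add_cp(out, n, max, "user default, ANSI",
               locale_cp(LOCALE_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE));
    n = add_cp(out, n, max, "invariant, OEM",
               locale_cp(LOCALE_INVARIANT, LOCALE_IDEFAULTCODEPAGE));
    n = add_cp(out, n, max, "invariant, ANSI",
               locale_cp(LOCALE_INVARIANT, LOCALE_IDEFAULTANSICODEPAGE));
#else
    (void)out;
    (void)max;
#endif
    return n;
}
```

A caller would pass an array of at least ten entries and print each
source next to its codepage.  Which numbers differ depends entirely on
the machine, so nothing here settles which one a tool should pick.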

> the charset specified by the LC_CTYPE locale category on the Unix
> side.
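
For reference, the charset named by the LC_CTYPE category can be read
with the POSIX calls setlocale() and nl_langinfo().  This is a minimal
sketch only; the function name unix_side_charset is hypothetical, and
falling back to the "C" locale when the environment names an unknown
locale is an assumption of this sketch, not something proposed in this
thread.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Copy the charset name of the environment's LC_CTYPE locale into buf.
   Returns 0 on success, -1 if the name is empty or does not fit. */
int unix_side_charset(char *buf, size_t size)
{
    const char *cs;

    assert(buf != NULL && size > 0);

    /* "" selects the locale from LC_ALL, LC_CTYPE and LANG, in that
       order.  If that locale is unavailable the call returns NULL and
       the process keeps its current locale (normally "C"); the sketch
       simply carries on with that. */
    (void)setlocale(LC_CTYPE, "");

    cs = nl_langinfo(CODESET);
    if (cs == NULL || *cs == '\0' || strlen(cs) >= size)
        return -1;

    strcpy(buf, cs);
    return 0;
}
```

With a UTF-8 locale in the environment this typically yields "UTF-8";
the exact spelling of other charset names varies between C libraries.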

In theory the option is not useful and should just go away.  If you
have to keep it for backward compatibility, stick to the current
behaviour and outlaw its use, perhaps by printing a nagging warning
to stderr.
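
Such a nagging warning could look roughly like the sketch below.  The
function name warn_iso_deprecated and the wording of the message are
made up for illustration and are not taken from dos2unix; the stream is
a parameter so that a caller passes stderr while a test can pass any
other FILE.

```c
#include <assert.h>
#include <stdio.h>

/* Print a deprecation notice for the -iso option to `err`.
   Returns the number of characters written, or a negative value
   if the write failed. */
int warn_iso_deprecated(FILE *err, const char *progname)
{
    assert(err != NULL && progname != NULL);
    return fprintf(err,
                   "%s: warning: the -iso option is deprecated; "
                   "its output is always ISO-8859-1\n",
                   progname);
}
```

The option parser would call warn_iso_deprecated(stderr, "dos2unix")
once when it sees -iso and then carry on with the current behaviour.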


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
