Date: Mon, 21 Mar 2011 12:17:46 +0100
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: cygwin + GetConsoleOutputCP
Message-ID: <20110321111746.GP31220@calimero.vinschen.de>
References: <4D8651F2 DOT 3000200 AT cwilson DOT fastmail DOT fm>

On Mar 21 07:53, Andy Koppe wrote:
> On 20 March 2011 19:13, Charles Wilson wrote:
> > So basically if you specify -iso (or --conv iso) without any of the
> > "input encoding specification" options like -437 etc, then dos2unix will
> > autodetect attempt to detect the *console* encoding.  If it succeeds,
> > then it will "convert" character codes from that encoding to their
> > equivalent in ISO-8859-1 ("Latin 1") [unconvertible codes are replaced
> > with an ascii dot]
> >
> > Note that this autodetect, if it works, assumes that the console's CP is
> > the input file's CP.  Fair enough -- and it's an overridable default
> > anyway.  However, I wonder if, in cygwin-1.7, we actually can/should use
> > the "console codepage" in ANY way.  Here's the code:
> >
> > querycp.c:
> > #elif defined (WIN32) || defined(__CYGWIN__)
> >
> > /* Erwin Waterlander */
> >
> > #include <windows.h>
> > unsigned short query_con_codepage(void) {
> >   return((unsigned short)GetConsoleOutputCP());
> > }
> > #else
> >
> > Or if instead, on cygwin, we should use some other mechanism (locale
> > settings?)
> > to determine the correct default "input" codepage.
>
> I think defaulting to the console codepage makes sense for the DOS
> side of the conversion. Having said that, Windows files that aren't
> "Unicode", i.e. UTF-16, are usually encoded in the so-called ANSI
> codepage, e.g. CP1252, so it would make more sense to default to that.

I agree with Andy here.  I don't think there are really a lot of files
left today which are encoded using the old DOS codepages.

> However, the real problem with this feature is that the Unix side of
> the conversion is fixed to ISO-8859-1, which makes it near-useless
> when Cygwin defaults to UTF-8. And it's no use for non-Western
> European languages in any case.

Right again.  And not only Cygwin, almost all modern UNIX systems are
using UTF-8 now.  The -iso option just doesn't make sense.

> A worthwhile conversion feature would use
> MultiByteToWideChar()/WideCharToMultiByte() defaulting to the system's
> ANSI codepage on the DOS side, and mbstowcs()/wcstombs() defaulting to

Well, I'm not sure about that.  The complexity of codepage settings on a
Windows system makes the whole affair guesswork which will always tend
to do the wrong thing anyway.  There are the following codepages
available:

- The current input console codepage, GetConsoleCP().
- The current output console codepage, GetConsoleOutputCP().
- The current OEM codepage, GetOEMCP().
- The current ANSI codepage, GetACP().
- The default OEM codepage of the default system locale,
  GetLocaleInfo (LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTCODEPAGE, ...).
- The default ANSI codepage of the default system locale,
  GetLocaleInfo (LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).
- The default OEM codepage of the current user or process,
  GetLocaleInfo (LOCALE_USER_DEFAULT, LOCALE_IDEFAULTCODEPAGE, ...).
- The default ANSI codepage of the current user or process,
  GetLocaleInfo (LOCALE_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).
- The default OEM codepage used for system invariant operations,
  GetLocaleInfo (LOCALE_INVARIANT, LOCALE_IDEFAULTCODEPAGE, ...).
- The default ANSI codepage used for system invariant operations,
  GetLocaleInfo (LOCALE_INVARIANT, LOCALE_IDEFAULTANSICODEPAGE, ...).

Which is the right one?

> the charset specified by the LC_CTYPE locale category on the Unix
> side.

In theory the option is not useful and should just go away.  If you have
to keep it for backward compatibility, stick to the current behaviour
and outlaw its use, perhaps by printing a nagging warning to stderr.


Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat