Mail Archives: cygwin/2011/03/21/10:34:41

Message-ID: <4D8761DE.1070300@cwilson.fastmail.fm>
Date: Mon, 21 Mar 2011 10:34:06 -0400
From: Charles Wilson <cygwin AT cwilson DOT fastmail DOT fm>
Reply-To: Charles Wilson <cygwin AT cwilson DOT fastmail DOT fm>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.2.15) Gecko/20110303 Thunderbird/3.1.9
MIME-Version: 1.0
To: cygwin AT cygwin DOT com
Subject: Re: cygwin + GetConsoleOutputCP
References: <4D8651F2 DOT 3000200 AT cwilson DOT fastmail DOT fm> <AANLkTi=2pKTTo0+nUFa9Qaad7FxJwhhbQ5wJqtqtCpaw AT mail DOT gmail DOT com>
In-Reply-To: <AANLkTi=2pKTTo0+nUFa9Qaad7FxJwhhbQ5wJqtqtCpaw@mail.gmail.com>

On 3/21/2011 3:53 AM, Andy Koppe wrote:
> I think defaulting to the console codepage makes sense for the DOS
> side of the conversion. Having said that, Windows files that aren't
> "Unicode", i.e. UTF-16, are usually encoded in the so-called ANSI
> codepage, e.g. CP1252, so it would make more sense to default to that.
> 
> However, the real problem with this feature is that the Unix side of
> the conversion is fixed to ISO-8859-1, which makes it near-useless
> when Cygwin defaults to UTF-8. And it's no use for non-Western
> European languages in any case.

Meh...the same basic set of options/conversions is provided if unix2dos
is compiled on Linux.  Only there, the "offending" function is
implemented as:

unsigned short query_con_codepage(void) {
   return(0);
}

However, each time query_con_codepage is called, it is followed by:
 if ([return value of query_con_codepage] < 2)
           pFlag->ConvMode = CONVMODE_437;

IOW, on Linux, when using -iso with no specific code page, it acts just
as if you had simply specified -437 for the "dos" side; the "unix" side
is still, as always, ISO-8859-1.
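
To make that fallback concrete, here is a minimal sketch of the logic
described above. The helper select_dos_conv_mode and the numeric value
given to CONVMODE_437 are assumptions for illustration only; the message
quotes just the stub and the "< 2" check, not the surrounding dos2unix
code.

```c
#include <stdio.h>

/* Assumption: the mode constant equals the code page number.
   The message only names CONVMODE_437; this value is illustrative. */
#define CONVMODE_437 437

/* Linux build, as quoted in the message: no console, so report 0. */
unsigned short query_con_codepage(void)
{
    return 0;
}

/* Hypothetical helper wrapping the "< 2" check quoted above:
   anything below 2 is treated as "no usable code page" and
   falls back to CP437 for the DOS side. */
int select_dos_conv_mode(unsigned short reported)
{
    int mode = (int)reported;

    if (mode < 2)
        mode = CONVMODE_437;
    return mode;
}

/* Hypothetical convenience wrapper that reports the selected mode. */
void print_dos_conv_mode(void)
{
    printf("DOS-side mode: %d\n", select_dos_conv_mode(query_con_codepage()));
}
```

With the Linux stub, select_dos_conv_mode(query_con_codepage()) always
yields CONVMODE_437, which matches the "-iso behaves like -437" behavior
described above.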

> A worthwhile conversion feature would use
> MultiByteToWideChar()/WideCharToMultiByte() defaulting to the system's
> ANSI codepage on the DOS side, and mbstowcs()/wcstombs() defaulting to
> the charset specified by the LC_CTYPE locale category on the Unix
> side.

Well, if you want full-featured charset conversion, then that's what
iconv(1) is for.  These added features of dos2unix/unix2dos are, in
reality, quick and dirty approaches to *single byte* charset conversion
for a *limited set* of charsets.
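
For comparison, here is a rough sketch of the general-purpose route
using iconv(3), the library interface behind iconv(1). The helper name
cp437_to_latin1 is hypothetical and not part of dos2unix, and the
charset names "CP437" and "ISO-8859-1" are assumed to be known to the
local iconv implementation.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes of CP437 text into
   ISO-8859-1. Returns the number of bytes written to out, or
   (size_t)-1 if the converter is unavailable or conversion fails. */
size_t cp437_to_latin1(const char *in, size_t inlen,
                       char *out, size_t outcap)
{
    iconv_t cd = iconv_open("ISO-8859-1", "CP437"); /* (to, from) */
    char *inp = (char *)in;   /* iconv() takes a non-const char ** */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    size_t rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return outcap - outleft;
}
```

Unlike the fixed tables in dos2unix, this fails outright on any CP437
byte that has no ISO-8859-1 equivalent (box-drawing characters, for
instance), so a caller would still have to decide what to substitute.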

I'm not looking to re-implement the whole thing or modify the semantics
of the options. (Or even add a new set of options.) I'm just trying to
make sure that, given the existing semantics of the options, dos2unix
selects the proper default CP for the "dos" side when those options are
used, taking it from whatever is considered the definitive source for
the current "dosish" active codepage on the cygwin platform.
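
Going by the subject line of this thread, one candidate for that source
is the Win32 GetConsoleOutputCP() call. The following is only a sketch
of how a Cygwin/Windows build might differ from the Linux stub. The
function name query_dosish_codepage is hypothetical, and whether the
console output code page is really the right answer is exactly the open
question here.

```c
#if defined(_WIN32) || defined(__CYGWIN__)
#include <windows.h>
#endif

/* Hypothetical variant of the code page query discussed above. */
unsigned short query_dosish_codepage(void)
{
#if defined(_WIN32) || defined(__CYGWIN__)
    /* GetConsoleOutputCP() returns 0 on failure (for example when no
       console is attached), which callers can treat like the Linux
       stub's 0 and fall back to a default. */
    return (unsigned short)GetConsoleOutputCP();
#else
    /* No Windows console elsewhere: report "unknown". */
    return 0;
#endif
}
```

Because a failed query and the non-Windows stub both return 0, the
existing "< 2" check would keep working unchanged for either case.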

--
Chuck
