Mail Archives: cygwin/2011/03/21/03:53:59
In-Reply-To: | <4D8651F2.3000200@cwilson.fastmail.fm>
|
References: | <4D8651F2 DOT 3000200 AT cwilson DOT fastmail DOT fm>
|
Date: | Mon, 21 Mar 2011 07:53:40 +0000
|
Message-ID: | <AANLkTi=2pKTTo0+nUFa9Qaad7FxJwhhbQ5wJqtqtCpaw@mail.gmail.com>
|
Subject: | Re: cygwin + GetConsoleOutputCP
|
From: | Andy Koppe <andy DOT koppe AT gmail DOT com>
|
To: | cygwin AT cygwin DOT com
|
On 20 March 2011 19:13, Charles Wilson wrote:
> Question about porting the upstream "dos2unix" utilities.  These
> implementations provide capabilities to convert text files from a
> certain limited set of INPUT encodings (most are DOS codepages):
>
> =====================================================
> CONVERSION MODES
>       Conversion modes ascii, 7bit, and iso are
>       similar to those of dos2unix/unix2dos under
>       SunOS/Solaris.
>
>       ascii
>           In mode "ascii" only line breaks are
>           converted. This is the default conversion
>           mode.
>
>           Although the name of this mode is ASCII,
>           which is a 7 bit standard, the actual mode
>           is 8 bit. Use always this mode when
>           converting Unicode UTF-8 files.
>
>       7bit
>           In this mode all 8 bit non-ASCII characters
>           (with values from 128 to 255) are converted
>           to a 7 bit space.
>
>       iso Characters are converted between a DOS
>           character set (code page) and ISO character
>           set ISO-8859-1 (Latin-1) on Unix. DOS
>           characters without ISO-8859-1 equivalent,
>           for which conversion is not possible, are
>           converted to a dot. The same counts for
>           ISO-8859-1 characters without DOS
>           counterpart.
>
>           When only option "-iso" is used dos2unix
>           will try to determine the active code page.
>           When this is not possible dos2unix will use
>           default code page CP437, which is mainly
>           used in the USA.  To force a specific code
>           page use options "-437" (US), "-850"
>           (Western European), "-860" (Portuguese),
>           "-863" (French Canadian), or "-865"
>           (Nordic).  Windows code page CP1252
>           (Western European) is also supported with
>           option "-1252". For other code pages use
>           dos2unix in combination with iconv(1).
>           Iconv can convert between a long list of
>           character encodings.
> =====================================================
>
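
For reference, the "7bit" and "iso" modes described in that excerpt can be sketched roughly as below. This is only an illustration under assumptions: the function names and the six-entry table are made up for this example and are not taken from the dos2unix source, whose real tables cover every byte from 128 to 255. Bytes missing from this small subset fall through to the dot even where a complete CP437 table would map them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical partial table: a few CP437 bytes and their ISO-8859-1
 * equivalents. A complete table would list all 128 high bytes. */
struct cp_pair { unsigned char dos; unsigned char iso; };

static const struct cp_pair cp437_subset[] = {
    { 0x80, 0xC7 }, /* C with cedilla   */
    { 0x81, 0xFC }, /* u with diaeresis */
    { 0x82, 0xE9 }, /* e with acute     */
    { 0x84, 0xE4 }, /* a with diaeresis */
    { 0x94, 0xF6 }, /* o with diaeresis */
    { 0xE1, 0xDF }, /* sharp s          */
};

/* "7bit" mode: every byte from 128 to 255 becomes a space. */
unsigned char to_7bit(unsigned char c)
{
    return (c >= 0x80) ? ' ' : c;
}

/* "iso" mode, DOS to Unix direction: ASCII passes through, known high
 * bytes are mapped, everything else becomes a dot. */
unsigned char cp437_to_iso(unsigned char c)
{
    size_t i;

    if (c < 0x80)
        return c;
    for (i = 0; i < sizeof cp437_subset / sizeof cp437_subset[0]; i++)
        if (cp437_subset[i].dos == c)
            return cp437_subset[i].iso;
    return '.';
}

/* Converts a whole buffer in place using cp437_to_iso(). */
void convert_buffer_cp437_to_iso(unsigned char *buf, size_t n)
{
    size_t i;

    assert(buf != NULL || n == 0);
    for (i = 0; i < n; i++)
        buf[i] = cp437_to_iso(buf[i]);
}
```

A box-drawing or shading byte such as 0xB0 has no Latin-1 counterpart, so it comes out as a dot, which is the lossy behaviour the man page describes.
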
> So basically if you specify -iso (or --conv iso) without any of the
> "input encoding specification" options like -437 etc, then dos2unix will
> attempt to detect the *console* encoding.  If it succeeds,
> then it will "convert" character codes from that encoding to their
> equivalent in ISO-8859-1 ("Latin 1") [unconvertible codes are replaced
> with an ascii dot]
>
> Note that this autodetect, if it works, assumes that the console's CP is
> the input file's CP.  Fair enough -- and it's an overridable default
> anyway.  However, I wonder if, in cygwin-1.7, we actually can/should use
> the "console codepage" in ANY way.  Here's the code:
>
> querycp.c:
> #elif defined (WIN32) || defined(__CYGWIN__)
>
> /* Erwin Waterlander */
>
> #include <windows.h>
> unsigned short query_con_codepage(void) {
>   return((unsigned short)GetConsoleOutputCP());
> }
> #else
>
> Or if instead, on cygwin, we should use some other mechanism (locale
> settings?) to determine the correct default "input" codepage.
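
As a rough illustration of the "locale settings" route: on a POSIX system, Cygwin included, the charset of the current locale can be queried with nl_langinfo(CODESET) once setlocale() has picked up the environment. The two function names below are hypothetical, chosen just for this sketch, and nothing here comes from the dos2unix source.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Returns the charset name of the current LC_CTYPE locale, for example
 * "UTF-8". The string is owned by the C library and may be overwritten
 * by later calls, so copy it if it needs to be kept. */
const char *locale_charset_name(void)
{
    /* Adopt LANG / LC_ALL / LC_CTYPE from the environment. If the
     * environment names an unknown locale, the call fails and the
     * locale stays as it was (initially "C"). */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Non-zero when the locale charset is UTF-8. */
int locale_is_utf8(void)
{
    const char *cs = locale_charset_name();

    assert(cs != NULL);
    return strcmp(cs, "UTF-8") == 0 || strcmp(cs, "utf8") == 0;
}
```

Note that this reports the charset of the Unix side only; it says nothing about how a DOS or Windows file was encoded.
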

I think defaulting to the console codepage makes sense for the DOS
side of the conversion. Having said that, Windows files that aren't
"Unicode", i.e. UTF-16, are usually encoded in the so-called ANSI
codepage, e.g. CP1252, so it would make more sense to default to that.
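
To make the difference concrete, here is a minimal sketch that queries both values. GetACP() and GetConsoleOutputCP() are the real Win32 calls; the wrapper names and the CP437 fallback (borrowed from the man page's stated default) are assumptions for this example only.

```c
#include <assert.h>
#if defined(_WIN32) || defined(__CYGWIN__)
#include <windows.h>
#endif

/* Assumption: fall back to CP437, the default the man page names. */
#define FALLBACK_CODEPAGE 437u

/* The system's ANSI code page, e.g. 1252 on Western European setups. */
unsigned int query_ansi_codepage(void)
{
#if defined(_WIN32) || defined(__CYGWIN__)
    return GetACP();
#else
    return FALLBACK_CODEPAGE;  /* no Win32 API on this platform */
#endif
}

/* The console's output code page, e.g. 437 or 850. */
unsigned int query_console_codepage(void)
{
#if defined(_WIN32) || defined(__CYGWIN__)
    UINT cp = GetConsoleOutputCP();  /* 0 signals failure */
    return cp != 0 ? cp : FALLBACK_CODEPAGE;
#else
    return FALLBACK_CODEPAGE;
#endif
}

/* The quoted query_con_codepage() returns unsigned short, so check
 * that the identifier fits before narrowing. */
unsigned short codepage_to_ushort(unsigned int cp)
{
    assert(cp <= 0xFFFFu);
    return (unsigned short)cp;
}
```

On a typical Western European Windows box the first call would be expected to give 1252 and the second 850 or 437, so the two defaults really do pick different tables.
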

However, the real problem with this feature is that the Unix side of
the conversion is fixed to ISO-8859-1, which makes it near-useless
when Cygwin defaults to UTF-8. And it's no use for non-Western
European languages in any case.
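
To show why Latin-1 output doesn't fit a UTF-8 environment: a Latin-1 "e with acute" is the single byte 0xE9, which on its own is not a valid UTF-8 sequence, while UTF-8 needs the two bytes 0xC3 0xA9. The helper below is a hypothetical sketch of the extra step that would be required, not something dos2unix does.

```c
#include <assert.h>
#include <stddef.h>

/* Converts n bytes of ISO-8859-1 to UTF-8.
 * dst must have room for 2 * n + 1 bytes.
 * Returns the number of bytes written, not counting the final NUL. */
size_t latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t i, out = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            dst[out++] = c;  /* ASCII is identical in both encodings */
        } else {
            /* Code points 0x80..0xFF need a two-byte UTF-8 sequence. */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[out] = '\0';
    return out;
}
```

Even with such a step the result stays limited to the 256 Latin-1 characters, so it would not help with non-Western European text.
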

A worthwhile conversion feature would use
MultiByteToWideChar()/WideCharToMultiByte() defaulting to the system's
ANSI codepage on the DOS side, and mbstowcs()/wcstombs() defaulting to
the charset specified by the LC_CTYPE locale category on the Unix
side.

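
Something along these lines, as an untested sketch rather than a patch. The function names are invented for this example, error handling is minimal, and the Windows branch only compiles where the Win32 API is available.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#if defined(_WIN32) || defined(__CYGWIN__)
#include <windows.h>

/* DOS side: decode text in the system ANSI code page into wide
 * characters. Returns a malloc'd string, or NULL on failure. */
wchar_t *dos_to_wide(const char *src)
{
    int n = MultiByteToWideChar(CP_ACP, 0, src, -1, NULL, 0);
    wchar_t *w;

    if (n <= 0)
        return NULL;
    w = malloc((size_t)n * sizeof *w);
    if (w != NULL && MultiByteToWideChar(CP_ACP, 0, src, -1, w, n) == 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Unix side: decode text in the LC_CTYPE charset into wide characters.
 * Returns a malloc'd string, or NULL on failure or invalid input. */
wchar_t *unix_to_wide(const char *src)
{
    size_t cap;
    wchar_t *w;

    assert(src != NULL);
    /* A decoded string never has more characters than the input has
     * bytes, so strlen + 1 elements are always enough. */
    cap = strlen(src) + 1;
    w = malloc(cap * sizeof *w);
    if (w == NULL)
        return NULL;
    if (mbstowcs(w, src, cap) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}

/* Unix side: encode wide characters in the LC_CTYPE charset.
 * Returns a malloc'd string, or NULL if a character has no encoding. */
char *wide_to_unix(const wchar_t *w)
{
    size_t cap;
    char *s;

    assert(w != NULL);
    cap = wcslen(w) * MB_CUR_MAX + 1;
    s = malloc(cap);
    if (s == NULL)
        return NULL;
    if (wcstombs(s, w, cap) == (size_t)-1) {
        free(s);
        return NULL;
    }
    return s;
}
```

The caller would need to run setlocale(LC_CTYPE, "") once at startup so that the Unix-side functions follow the user's locale instead of the default "C" locale. A DOS to Unix conversion is then dos_to_wide() followed by wide_to_unix(), and the reverse direction would pair unix_to_wide() with WideCharToMultiByte().
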
Andy