From: Andy Koppe
To: cygwin AT cygwin DOT com
Date: Mon, 21 Mar 2011 07:53:40 +0000
Subject: Re: cygwin + GetConsoleOutputCP
In-Reply-To: <4D8651F2.3000200@cwilson.fastmail.fm>

On 20 March 2011 19:13, Charles Wilson wrote:
> Question about porting the upstream "dos2unix" utilities.  These
> implementations provide capabilities to convert text files from a
> certain limited set of INPUT encodings (most are DOS codepages):
>
> =====================================================
> CONVERSION MODES
>       Conversion modes ascii, 7bit, and iso are
>       similar to those of dos2unix/unix2dos under
>       SunOS/Solaris.
>
>       ascii
>           In mode "ascii" only line breaks are
>           converted. This is the default conversion
>           mode.
>
>           Although the name of this mode is ASCII,
>           which is a 7 bit standard, the actual mode
>           is 8 bit.
>           Use always this mode when
>           converting Unicode UTF-8 files.
>
>       7bit
>           In this mode all 8 bit non-ASCII characters
>           (with values from 128 to 255) are converted
>           to a 7 bit space.
>
>       iso Characters are converted between a DOS
>           character set (code page) and ISO character
>           set ISO-8859-1 (Latin-1) on Unix. DOS
>           characters without ISO-8859-1 equivalent,
>           for which conversion is not possible, are
>           converted to a dot. The same counts for
>           ISO-8859-1 characters without DOS
>           counterpart.
>
>           When only option "-iso" is used dos2unix
>           will try to determine the active code page.
>           When this is not possible dos2unix will use
>           default code page CP437, which is mainly
>           used in the USA.  To force a specific code
>           page use options "-437" (US), "-850"
>           (Western European), "-860" (Portuguese),
>           "-863" (French Canadian), or "-865"
>           (Nordic).  Windows code page CP1252
>           (Western European) is also supported with
>           option "-1252". For other code pages use
>           dos2unix in combination with iconv(1).
>           Iconv can convert between a long list of
>           character encodings.
> =====================================================
>
> So basically if you specify -iso (or --conv iso) without any of the
> "input encoding specification" options like -437 etc, then dos2unix will
> autodetect attempt to detect the *console* encoding.  If it succeeds,
> then it will "convert" character codes from that encoding to their
> equivalent in ISO-8859-1 ("Latin 1") [unconvertible codes are replaced
> with an ascii dot]
>
> Note that this autodetect, if it works, assumes that the console's CP is
> the input file's CP.  Fair enough -- and it's an overridable default
> anyway.  However, I wonder if, in cygwin-1.7, we actually can/should use
> the "console codepage" in ANY way.  Here's the code:
>
> querycp.c:
> #elif defined (WIN32) || defined(__CYGWIN__)
>
> /* Erwin Waterlander */
>
> #include <windows.h>
> unsigned short query_con_codepage(void) {
>   return((unsigned short)GetConsoleOutputCP());
> }
> #else
>
> Or if instead, on cygwin, we should use some other mechanism (locale
> settings?) to determine the correct default "input" codepage.

I think defaulting to the console codepage makes sense for the DOS
side of the conversion. Having said that, Windows files that aren't
"Unicode", i.e. UTF-16, are usually encoded in the so-called ANSI
codepage, e.g. CP1252, so it would make more sense to default to that.

However, the real problem with this feature is that the Unix side of
the conversion is fixed to ISO-8859-1, which makes it near-useless
when Cygwin defaults to UTF-8. And it's no use for non-Western
European languages in any case.

A worthwhile conversion feature would use
MultiByteToWideChar()/WideCharToMultiByte() defaulting to the system's
ANSI codepage on the DOS side, and mbstowcs()/wcstombs() defaulting to
the charset specified by the LC_CTYPE locale category on the Unix
side.

Andy
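
For illustration, the conversion path proposed in the final paragraph of the reply (decode the DOS-side bytes to wide characters, then encode them in the LC_CTYPE charset) might look roughly like the sketch below. It is not dos2unix code and rests on several assumptions: cp1252_to_wide() is a hypothetical, portable stand-in for MultiByteToWideChar() with the ANSI codepage assumed to be CP1252, ansi_to_locale() is a made-up name, and the dot substitution for unconvertible characters simply mirrors what the quoted man page describes for the iso mode. The Unix side uses wcrtomb(), the restartable per-character relative of wcstombs(), so that one unrepresentable character does not abort the whole string.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <string.h>
#include <wchar.h>

/* Unicode code points for CP1252 bytes 0x80..0x9F.
 * 0 marks the five positions that CP1252 leaves unassigned. */
static const wchar_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical stand-in for MultiByteToWideChar(CP_ACP, ...),
 * ASSUMING the ANSI codepage is CP1252. Returns 0 for unassigned bytes. */
static wchar_t cp1252_to_wide(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9F)
        return cp1252_high[b - 0x80];
    return (wchar_t)b;  /* ASCII and 0xA0..0xFF coincide with Unicode */
}

/* Convert a NUL-terminated CP1252 string to the charset of the current
 * LC_CTYPE locale. Characters that cannot be represented become '.'.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t)-1 if the output buffer is too small. */
size_t ansi_to_locale(const char *in, char *out, size_t outsize)
{
    mbstate_t st;
    size_t used = 0;

    assert(in != NULL && out != NULL && outsize > 0);
    memset(&st, 0, sizeof st);  /* initial conversion state */

    for (; *in != '\0'; in++) {
        char buf[MB_LEN_MAX];
        wchar_t wc = cp1252_to_wide((unsigned char)*in);
        size_t n = (wc != 0) ? wcrtomb(buf, wc, &st) : (size_t)-1;

        if (n == (size_t)-1) {
            /* Not representable in the target charset (or unassigned in
             * CP1252): reset the state and substitute a dot. */
            memset(&st, 0, sizeof st);
            buf[0] = '.';
            n = 1;
        }
        if (used + n >= outsize)  /* keep one byte for the NUL */
            return (size_t)-1;
        memcpy(out + used, buf, n);
        used += n;
    }
    out[used] = '\0';
    return used;
}
```

A caller would run setlocale(LC_CTYPE, "") first so that the output charset follows the environment; without that call the program stays in the "C" locale, where non-ASCII characters will typically come out as dots.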