Mail Archives: cygwin/2011/03/21/07:18:24
On Mar 21 07:53, Andy Koppe wrote:
> On 20 March 2011 19:13, Charles Wilson wrote:
> > So basically if you specify -iso (or --conv iso) without any of the
> > "input encoding specification" options like -437 etc, then dos2unix will
> > attempt to detect the *console* encoding.  If it succeeds,
> > then it will "convert" character codes from that encoding to their
> > equivalent in ISO-8859-1 ("Latin 1") [unconvertible codes are replaced
> > with an ascii dot]
> >
> > Note that this autodetect, if it works, assumes that the console's CP is
> > the input file's CP.  Fair enough -- and it's an overridable default
> > anyway.  However, I wonder if, in cygwin-1.7, we actually can/should use
> > the "console codepage" in ANY way.  Here's the code:
> >
> > querycp.c:
> > #elif defined (WIN32) || defined(__CYGWIN__)
> >
> > /* Erwin Waterlander */
> >
> > #include <windows.h>
> > unsigned short query_con_codepage(void) {
> >   return((unsigned short)GetConsoleOutputCP());
> > }
> > #else
> >
> > Or if instead, on cygwin, we should use some other mechanism (locale
> > settings?) to determine the correct default "input" codepage.
>
> I think defaulting to the console codepage makes sense for the DOS
> side of the conversion. Having said that, Windows files that aren't
> "Unicode", i.e. UTF-16, are usually encoded in the so-called ANSI
> codepage, e.g. CP1252, so it would make more sense to default to that.
I agree with Andy here.  I don't think there are many files left
today that are still encoded in the old DOS codepages.
> However, the real problem with this feature is that the Unix side of
> the conversion is fixed to ISO-8859-1, which makes it near-useless
> when Cygwin defaults to UTF-8. And it's no use for non-Western
> European languages in any case.
Right again.  And it's not only Cygwin: almost all modern UNIX systems
use UTF-8 now.  The -iso option just doesn't make sense.
> A worthwhile conversion feature would use
> MultiByteToWideChar()/WideCharToMultiByte() defaulting to the system's
> ANSI codepage on the DOS side, and mbstowcs()/wcstombs() defaulting to
Well, I'm not sure about that.  The complexity of codepage settings on a
Windows system makes the whole affair guesswork, which will always tend
to do the wrong thing anyway.  There are the following codepages available:
- The current input console codepage, GetConsoleCP().
- The current output console codepage, GetConsoleOutputCP().
- The current OEM codepage, GetOEMCP().
- The current ANSI codepage, GetACP().
- The default OEM codepage of the default system locale,
  GetLocaleInfo (LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTCODEPAGE, ...).
- The default ANSI codepage of the default system locale,
  GetLocaleInfo (LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).
- The default OEM codepage of the current user or process,
  GetLocaleInfo (LOCALE_USER_DEFAULT, LOCALE_IDEFAULTCODEPAGE, ...).
- The default ANSI codepage of the current user or process,
  GetLocaleInfo (LOCALE_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).
- The default OEM codepage used for system invariant operations,
  GetLocaleInfo (LOCALE_INVARIANT, LOCALE_IDEFAULTCODEPAGE, ...).
- The default ANSI codepage used for system invariant operations,
  GetLocaleInfo (LOCALE_INVARIANT, LOCALE_IDEFAULTANSICODEPAGE, ...).
Which is the right one?
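
To see how far these can diverge on a given machine, something like the
following rough sketch prints all ten values side by side.  The names
print_codepages() and locale_codepage() are made up for this example, and
the use of LOCALE_RETURN_NUMBER to get a number rather than a string is my
assumption about the most convenient way to call GetLocaleInfo here.
Outside Windows/Cygwin the function only prints a note and returns 0.

```c
#include <stdio.h>

#if defined(_WIN32) || defined(__CYGWIN__)
#include <windows.h>

/* Hypothetical helper: ask GetLocaleInfoA for a codepage as a number.
   Returns 0 if the query fails. */
static unsigned int locale_codepage(LCID lcid, LCTYPE which)
{
    DWORD cp = 0;

    /* LOCALE_RETURN_NUMBER stores a DWORD in the buffer instead of text. */
    if (GetLocaleInfoA(lcid, which | LOCALE_RETURN_NUMBER,
                       (LPSTR)&cp, sizeof(cp)) == 0)
        return 0;
    return (unsigned int)cp;
}

/* Print every codepage from the list above; returns the number printed. */
int print_codepages(FILE *out)
{
    struct { const char *name; unsigned int cp; } t[] = {
        { "console input (GetConsoleCP)", GetConsoleCP() },
        { "console output (GetConsoleOutputCP)", GetConsoleOutputCP() },
        { "OEM (GetOEMCP)", GetOEMCP() },
        { "ANSI (GetACP)", GetACP() },
        { "system default locale, OEM",
          locale_codepage(LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTCODEPAGE) },
        { "system default locale, ANSI",
          locale_codepage(LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE) },
        { "user default locale, OEM",
          locale_codepage(LOCALE_USER_DEFAULT, LOCALE_IDEFAULTCODEPAGE) },
        { "user default locale, ANSI",
          locale_codepage(LOCALE_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE) },
        { "invariant locale, OEM",
          locale_codepage(LOCALE_INVARIANT, LOCALE_IDEFAULTCODEPAGE) },
        { "invariant locale, ANSI",
          locale_codepage(LOCALE_INVARIANT, LOCALE_IDEFAULTANSICODEPAGE) },
    };
    int i, n = (int)(sizeof(t) / sizeof(t[0]));

    for (i = 0; i < n; i++)
        fprintf(out, "%-40s %u\n", t[i].name, t[i].cp);
    return n;
}

#else

/* Fallback so the sketch still compiles where the Win32 API is missing. */
int print_codepages(FILE *out)
{
    fprintf(out, "codepage queries need the Win32 API\n");
    return 0;
}

#endif
```

Calling print_codepages(stdout) from main() is enough to try it.  If the
values differ, any single default is a guess.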
> the charset specified by the LC_CTYPE locale category on the Unix
> side.
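
For reference, the Unix half of what Andy describes would look roughly
like the sketch below: pick up the charset from LC_CTYPE, then let
wcstombs() do the conversion.  The helper names use_environment_ctype()
and wide_to_locale() are invented for this example; only setlocale() and
wcstombs() are standard C.  This is an illustration of the idea, not
anything taken from dos2unix.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Adopt the charset named by LC_CTYPE in the environment.
   Returns 0 on success, -1 if the locale cannot be set. */
int use_environment_ctype(void)
{
    return setlocale(LC_CTYPE, "") != NULL ? 0 : -1;
}

/* Convert a wide string to the multibyte charset of the current
   LC_CTYPE locale.  Returns 0 on success, -1 if a character cannot be
   represented or the output buffer is too small. */
int wide_to_locale(const wchar_t *in, char *out, size_t outsz)
{
    size_t n;

    if (outsz == 0)
        return -1;
    n = wcstombs(out, in, outsz);
    if (n == (size_t)-1)
        return -1;              /* character not convertible */
    if (n >= outsz)
        return -1;              /* no room left for the terminator */
    out[n] = '\0';
    return 0;
}
```

A caller would run use_environment_ctype() once at startup and then pass
each decoded line through wide_to_locale(), deciding for itself what to do
when the conversion fails.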
In theory the option is not useful and should just go away.  If you
have to keep it for backward compatibility, stick to the current
behaviour and discourage its use, perhaps by printing a nagging warning
to stderr.
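
A minimal sketch of such a warning, assuming the option spellings quoted
above (-iso and --conv iso).  The function name warn_if_iso_requested()
and the wording of the message are made up for illustration and are not
taken from the dos2unix sources.

```c
#include <stdio.h>
#include <string.h>

/* Scan the command line for the ISO conversion option.  If it is
   present, print one warning to `err` and return 1; otherwise return 0.
   The conversion itself is left untouched. */
int warn_if_iso_requested(int argc, char *const argv[], FILE *err)
{
    int i;

    for (i = 1; i < argc; i++) {
        int short_form = strcmp(argv[i], "-iso") == 0;
        int long_form = strcmp(argv[i], "--conv") == 0 &&
                        i + 1 < argc &&
                        strcmp(argv[i + 1], "iso") == 0;

        if (short_form || long_form) {
            fprintf(err,
                    "%s: warning: ISO-8859-1 conversion is deprecated "
                    "and may give wrong results in UTF-8 locales\n",
                    argv[0]);
            return 1;
        }
    }
    return 0;
}
```

Called once at the top of main() with stderr as the last argument, this
keeps the old behaviour intact and only adds the nag.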
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat