Mail Archives: cygwin/2011/03/20/15:14:21
Question about porting the upstream "dos2unix" utilities. These
implementations provide capabilities to convert text files from a
certain limited set of INPUT encodings (most are DOS codepages):
=====================================================
CONVERSION MODES
Conversion modes ascii, 7bit, and iso are
similar to those of dos2unix/unix2dos under
SunOS/Solaris.
ascii
In mode "ascii" only line breaks are
converted. This is the default conversion
mode.
Although the name of this mode is ASCII,
which is a 7 bit standard, the actual mode
is 8 bit. Use always this mode when
converting Unicode UTF-8 files.
7bit
In this mode all 8 bit non-ASCII characters
(with values from 128 to 255) are converted
to a 7 bit space.
iso Characters are converted between a DOS
character set (code page) and ISO character
set ISO-8859-1 (Latin-1) on Unix. DOS
characters without ISO-8859-1 equivalent,
for which conversion is not possible, are
converted to a dot. The same counts for
ISO-8859-1 characters without DOS
counterpart.
When only option "-iso" is used dos2unix
will try to determine the active code page.
When this is not possible dos2unix will use
default code page CP437, which is mainly
used in the USA. To force a specific code
page use options "-437" (US), "-850"
(Western European), "-860" (Portuguese),
"-863" (French Canadian), or "-865"
(Nordic). Windows code page CP1252
(Western European) is also supported with
option "-1252". For other code pages use
dos2unix in combination with iconv(1).
Iconv can convert between a long list of
character encodings.
=====================================================
So basically, if you specify -iso (or --conv iso) without any of the
"input encoding specification" options like -437 etc., then dos2unix will
attempt to detect the *console* encoding. If it succeeds, then it will
"convert" character codes from that encoding to their equivalents in
ISO-8859-1 ("Latin 1"); unconvertible codes are replaced with an ASCII
dot.
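To make the "iso" behaviour concrete, here is a minimal sketch of the per-byte mapping for the CP437 case. This is not the dos2unix source: the function name cp437_to_latin1 is hypothetical, and the table is deliberately partial (only a handful of accented letters are listed), so every other high byte falls through to the dot, including some that a complete table would map properly.

```c
/* Hypothetical sketch, NOT taken from dos2unix: map one CP437 byte to
 * ISO-8859-1. Only a few well-known code points are listed; a real
 * table would cover all 128 high values. */
static unsigned char cp437_to_latin1(unsigned char c)
{
    if (c < 0x80)
        return c;               /* 7-bit ASCII is identical in both sets */

    switch (c) {
    case 0x80: return 0xC7;     /* C with cedilla */
    case 0x81: return 0xFC;     /* u with diaeresis */
    case 0x82: return 0xE9;     /* e with acute */
    case 0x84: return 0xE4;     /* a with diaeresis */
    case 0x94: return 0xF6;     /* o with diaeresis */
    case 0xE1: return 0xDF;     /* sharp s */
    default:   return '.';      /* no entry in this partial table: use a dot */
    }
}
```

For example, the CP437 box-drawing and shading characters (such as 0xB0) have no Latin-1 counterpart, so they come out as '.' in this sketch, which matches what the man page describes.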
Note that this autodetect, if it works, assumes that the console's CP is
the input file's CP. Fair enough -- and it's an overridable default
anyway. However, I wonder if, in cygwin-1.7, we actually can/should use
the "console codepage" in ANY way. Here's the code:
querycp.c:
    #elif defined (WIN32) || defined(__CYGWIN__)
    /* Erwin Waterlander */
    #include <windows.h>
    unsigned short query_con_codepage(void) {
        return((unsigned short)GetConsoleOutputCP());
    }
    #else
Or should we instead, on cygwin, use some other mechanism (locale
settings?) to determine the correct default "input" codepage?
Comments?
--
Chuck