Message-ID: <000a01c0a8f8$a9f90940$ed08e289@mpaul>
From: "Matthias Paul" <Matthias DOT Paul AT post DOT rwth-aachen DOT de>
To: <opendos AT delorie DOT com>
References: <01FD6EC775C6D4119CDF0090273F74A4021FC5 AT emwatent02 DOT meters DOT com DOT au>
Subject: Re: Text file format .ASC ? (#2.1)
Date: Sat, 10 Mar 2001 01:16:38 +0100
Organization: Rechenzentrum RWTH Aachen
Reply-To: opendos AT delorie DOT com

On 2001-03-09, Joe da Silva asked:

>When you say "the first range is ... 40h-7Eh", do you mean
>these codings don't have the Roman characters, etc. in the
>usual place (ala. ASCII)? In other words, if they support
>Roman characters at all, they only have a two byte coding for
>them?

Well, Arkady has solved the mystery already, but FWIW
I still want to answer your question:

Those ranges were meant as maximum extents. According to
William Spencer Hall (Novell) in his article "Internationalizing
Windows Software" (from "Microsoft Windows 3.1 Developer's
Workshop", Microsoft Press, 1993, ISBN 1-55615-480-1), which
also gives a very good general description of I18N issues for DOS
at both the developer and the user level, some Code Pages might
actually have a window below 128.
For *common* DBCS Code Pages the range for the Lead Byte is above
127, so the 7-bit ASCII part is not changed for them (although in my
own experience they sometimes have non-ASCII characters at the
non-alphabetic and non-numeric Code Points). Here are a few examples:

 Codepage   Lead Byte Range                Trail Byte Range

 932        81h..9Fh, E0h..FCh             40h..7Eh, 80h..FCh
 936        A1h..A9h, B0h..F7h             A1h..FEh
 949        A1h..ACh, B0h..C8h, CAh..FDh   A1h..FEh
 950        A1h..C6h, C9h..F9h             40h..7Eh, 80h..FEh

(from Nadine Kano's "Developing International Software for
Windows 95 and Windows NT", Microsoft Press, 1995,
ISBN 1-55615-840-8, superseding "International Handbook",
MS, 1991, and "Developing International Software for
Microsoft Windows", MS, 1995).
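
As a rough sketch (not taken from either book), the table above could
be turned into a lookup like the following in Python. The names
DBCS_RANGES, in_ranges, is_lead_byte and split_chars are made up for
illustration, and the ranges are simply copied from the table, so
treat this as an approximation rather than a real decoder:

```python
# Inclusive lead-byte and trail-byte ranges, copied from the table above.
DBCS_RANGES = {
    932: {"lead": [(0x81, 0x9F), (0xE0, 0xFC)],
          "trail": [(0x40, 0x7E), (0x80, 0xFC)]},
    936: {"lead": [(0xA1, 0xA9), (0xB0, 0xF7)],
          "trail": [(0xA1, 0xFE)]},
    949: {"lead": [(0xA1, 0xAC), (0xB0, 0xC8), (0xCA, 0xFD)],
          "trail": [(0xA1, 0xFE)]},
    950: {"lead": [(0xA1, 0xC6), (0xC9, 0xF9)],
          "trail": [(0x40, 0x7E), (0x80, 0xFE)]},
}


def in_ranges(byte, ranges):
    """Return True if byte falls inside any inclusive (low, high) range."""
    return any(low <= byte <= high for low, high in ranges)


def is_lead_byte(codepage, byte):
    """Return True if byte can start a double-byte character."""
    return in_ranges(byte, DBCS_RANGES[codepage]["lead"])


def split_chars(codepage, data):
    """Split a byte string into single-byte and double-byte units.

    A lead byte followed by a valid trail byte forms one two-byte unit;
    every other byte is passed through as a one-byte unit.
    """
    table = DBCS_RANGES[codepage]
    units = []
    i = 0
    while i < len(data):
        is_pair = (
            in_ranges(data[i], table["lead"])
            and i + 1 < len(data)
            and in_ranges(data[i + 1], table["trail"])
        )
        step = 2 if is_pair else 1
        units.append(data[i:i + step])
        i += step
    return units
```

For instance, split_chars(932, b"A\x82\xa0B") gives three units, while
the two bytes 83h 41h come back as a single unit: a trail byte may well
lie in the 40h..7Eh range, so a byte below 80h cannot be taken for
ASCII without looking at the byte in front of it.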

But even if the above-mentioned Code Pages leave 7-bit ASCII
unchanged, they usually duplicate the Roman letters as double-byte
characters. Like most of the other double-byte characters, these
alternative characters are displayed in doubled width by the front-end.
Some DBCS Code Pages also contain Greek and other characters.
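
To illustrate the duplication, here is a small Python sketch. It
assumes that Python's built-in "cp932" codec is an acceptable stand-in
for Code Page 932, which is my assumption and not something taken from
the references. The variable names are made up for the example:

```python
import unicodedata

ascii_a = "A"           # ordinary single-byte Roman letter
fullwidth_a = "\uFF21"  # FULLWIDTH LATIN CAPITAL LETTER A
greek_alpha = "\u03B1"  # GREEK SMALL LETTER ALPHA

# Encode each character with Python's cp932 codec (assumed stand-in).
enc_ascii = ascii_a.encode("cp932")
enc_fullwidth = fullwidth_a.encode("cp932")
enc_greek = greek_alpha.encode("cp932")

# The ASCII letter stays one byte; the duplicate and the Greek letter
# each take a lead byte plus a trail byte.
print(len(enc_ascii), len(enc_fullwidth), len(enc_greek))

# Unicode classifies the duplicate as fullwidth ("F"), which matches
# the doubled display width described above.
print(unicodedata.east_asian_width(fullwidth_a))

# Compatibility normalization folds the duplicate back onto ASCII.
print(unicodedata.normalize("NFKC", fullwidth_a) == ascii_a)
```

The first byte of the encoded fullwidth letter falls into the
81h..9Fh lead-byte range listed for Code Page 932 in the table.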

The most complete reference on the topic I have seen so far, and
one I highly recommend, is "CJKV Information Processing - Chinese,
Japanese, Korean & Vietnamese Computing" by Ken Lunde,
O'Reilly & Associates, 1999, ISBN 1-56592-224-7 (superseding
his "Understanding Japanese Information Processing", ORA, 1993).
It contains a long list of DBCS, TBCS, and MBCS Code Pages
(with glyphs!) and associated standards.
For those interested, another - more formal - source is
"Character Data Representation Architecture Reference and Registry
(CDRA) level 2 + Extension papers", IBM, 1995, SC09-2190-00
(superseding SC09-1391-00 and SC09-1391-01), containing the
largest list of Code Pages I have ever seen (unfortunately I am
missing the enclosed CD). From my own research into NLS issues,
though, I can say that it is still far from complete. IBM does not
seem to have a more recent publication on the subject available at
the moment, but I have heard that they already have updated internal
drafts, so it seems it's just a matter of time...

 Matthias

------------------------------------------------------------
Matthias Paul, Ubierstrasse 28, D-50321 Bruehl, Germany
<Matthias DOT Paul AT post DOT rwth-aachen DOT de> <mpaul AT drdos DOT org>
http://www.uni-bonn.de/~uzs180/mpdokeng.html
------------------------------------------------------------
My homepage has moved, please update your pointers.