From: "Juan Manuel Guerrero"
Organization: Darmstadt University of Technology
To: JT Williams, Eli Zaretskii, djgpp-workers AT delorie DOT com, salvador
Date: Sat, 4 Aug 2001 11:03:22 +0200
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-transfer-encoding: 7BIT
Subject: Re: gettext port
CC: djgpp-workers AT delorie DOT com
X-mailer: Pegasus Mail for Windows (v2.54DE)
Message-ID: <34090C8714C@HRZ1.hrz.tu-darmstadt.de>
Reply-To: djgpp-workers AT delorie DOT com

I have been following the discussion about the gettext and libiconv ports and would like to make some comments about this issue.

The initial goal of the gettext port has been to allow NLS for DJGPP ports of GNU software, or of other software that uses the gettext functionality for NLS. To get NLS, it is clear that the charsets used to encode the .po/.mo files (usually some Unix charsets) must be recoded to the appropriate DOS codepages, if those exist at all.

Starting with the gettext-0.10.35 port, I wrote a small shell script that recodes the Unix charset used in the .po files into the appropriate DOS codepage. For this purpose a table was created using the information from Microsoft's MS-DOS 6.22 COUNTRY.TXT file, available as: ftp://ftp.microsoft.com/peropsys/msdos/kb/q117/8/50.txt
The Recode program was needed for this job. The recoding of the .po files became part of the DJGPP-specific configuration process. The average user who wanted to reconfigure the source package did not need to have Recode installed, because the recoded .po files were distributed with the preconfigured DJGPP source package.

Starting with the gettext-0.10.36 port, the .po files are no longer recoded; instead, the .mo files are recoded at run time. This is done with the functionality provided by libiconv.a. gettext and libiconv inspect the LANG environment variable. gettext uses the language code (e.g. es for Spanish) to construct the path used to locate the .mo file.
This means it constructs a string like "share/locale/es/LC_MESSAGES" and loads the .mo file from there. libiconv uses the language code as an index into charset.alias, e.g. "es_AR CP850" (Argentine Spanish messages will be displayed using CP850). The codepage used by libiconv.a is read once and cannot be changed while the program is running.

The second environment variable, LANGUAGE, allows selecting among different languages. E.g. LANGUAGE=es:de will display Spanish messages, or German messages if the Spanish ones are not found. This is *only* possible as long as the languages use the *same* codepage. Something like Spanish and Hebrew, LANGUAGE=es_AR:he_IL, will *not* work, due to the different codepages needed to display the messages. The LANGUAGE variable is *only* honored if the LANG variable has been set. It is only the LANG variable that defines the gettext path and the target charset (a DOS codepage in this case) to be used for displaying the messages.

It is important to realize that it is the *exclusive* responsibility of the user to make the DOS codepage loaded by autoexec.bat match the codepage defined in charset.alias. E.g.: I want to display the messages of some GNU program with NLS in German. charset.alias tells me that CP850 will be used to display the messages. In this case my autoexec.bat *must* contain the following lines:

C:\DOS\MODE CON CODEPAGE PREPARE=((850) C:\DOS\EGA.CPI)
C:\DOS\MODE CON CODEPAGE SELECT=850

It is completely impossible to load, let's say, CP437 in autoexec.bat and expect that Spanish, French, UK English, German, Italian and all the other CP850 languages will be displayed correctly. The codepage loaded at boot time (in autoexec.bat) must match the definitions in charset.alias. Please note that the definitions in charset.alias are *not* arbitrary. These definitions are the appropriate codepage values for each particular language.
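The LANG-driven catalog lookup described above can be sketched in shell. The installation prefix and the message domain ("sed") are illustrative assumptions, not values any particular port is guaranteed to use:

```shell
# Sketch of the lookup: LANG carries the locale name; the language
# code (the part before "_") selects the catalog directory.
# Prefix and domain name ("sed") are assumptions for illustration.
LANG=es_AR
lang=${LANG%%_*}                               # es_AR -> es
mo_path="share/locale/${lang}/LC_MESSAGES/sed.mo"
echo "$mo_path"                                # share/locale/es/LC_MESSAGES/sed.mo
```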
Those definitions are identical to the codepages that the original MS-DOS install program puts into the autoexec.bat of a user of that country who installs MS-DOS on his computer. All this only as background to understand how it all works.

IMHO it is crucial to realize that we are dealing with an operating system (MS-DOS) that is simply **DEAD**. This is very sad but true (maybe I am the last one on this planet using DOS in his daytime job). When I talk about an OS I mean MS-DOS 6.22 or any previous version. I do *not* mean some kind of DOS emulation in some kind of Windows like WinNT, Win2000, WinXP. MS-DOS is no longer actively developed and supported by Microsoft, AFAIK. This implies that the NLS provided by MS-DOS is limited to the actually existing codepages. A list of the existing codepages can be found in: ftp://ftp.microsoft.com/peropsys/msdos/kb/q117/8/50.txt
Because of the limited number of available codepages, the number of needed and possible recodings from some Unix charset to some DOS codepage is limited too. All this implies that only a very limited number of languages can be supported by DJGPP ports at all. The number of languages supported by MS-DOS is well known and will probably not be increased by new MS-DOS codepages produced and released by Microsoft in the future.

OTOH we have libiconv.a. This is certainly a very powerful tool. But it is, IMVHO, *too* powerful for the recoding job needed to get NLS for DJGPP. As Eli pointed out, it recodes, at runtime, from a source charset to Unicode and from Unicode to the target charset. As the number of supported charsets increases, the size of libiconv.a increases too. But at the same time the number of existing MS-DOS codepages does *not* increase anymore. All this implies that, from an NLS-specific point of view, we have *no* gain at all if we use libiconv.a.
Using libiconv.a means that we recode at runtime instead of recoding at configuration time, when the permitted and available DOS codepages are very well known. If we recode at configuration time, we will use the recode program. The recode sources include the libiconv sources; indeed, the recode program is only a handy driver for the functionality offered by libiconv.a. Using recode means that we have the same recoding power as using libiconv.a at runtime.

Eli pointed out:

> But the downside is that you need to produce a separate message
> catalogue for each possible codepage. For example, with Cyrillic
> languages, there are half a dozen possible encodings, maybe more.

This is certainly true, but the argument does not hold for NLS, IMHO. We do not need runtime recoding at all, IMHO. We know *a priori* which MS-DOS codepage exists for a particular language and country. *No* other DOS codepage can be used to display those messages. A Unix charset that cannot be recoded at configuration time to an appropriate DOS codepage by the recode program, via its built-in libiconv, also can *not* be recoded at runtime to an appropriate DOS codepage by a binary with its built-in libiconv. A very good example are the Cyrillic charsets like KOI8-R (Russia) and KOI8-U (Ukraine). Because there is no appropriate DOS codepage, these charsets are mapped to themselves in charset.alias. This means the messages are displayed using the original Unix encodings. The DJGPP port of some program will recode at runtime from KOI8-R to Unicode and from Unicode back to KOI8-R, and then display the result on the DOS machine. What the user will see on the screen is probably garbage. Of course, the user can edit charset.alias, assuming he knows what that is and where to find it, and set some DOS Cyrillic codepage like CP866.
This means he must replace the lines:

ru KOI8-R
ru_RU KOI8-R

by the lines:

ru CP866
ru_RU CP866

But this will *not* solve the difficulties, because CP866 is not the appropriate DOS codepage. Indeed, there exists no appropriate DOS codepage for KOI8-R/U, AFAIK. IMHO, the conclusion of all this is that runtime recoding is a nice feature for quickly experimenting with different charsets and codepages, but if there is *no* appropriate codepage to recode to, then a binary with runtime recoding is as useless as a binary without runtime recoding. If a user has an appropriate DOS codepage, he will certainly never change it and never change the setting of LANG. In this case he does not need any runtime recoding functionality at all. OTOH, if there is no appropriate DOS codepage available at all, then IMHO runtime recoding will not save the situation either.

The use of libiconv.a also introduces some DOS/DJGPP-specific configuration difficulties usually not seen on Linux. Linux does not need any recoding of .po/.mo files at all, so none of the huge number of gettext.m4 files floating around checks for the existence of libiconv.a. Of course, because they do not check for libiconv.a, they also do not link libiconv.a into the test binaries that check for gettext() and dcgettext() and some other GNU gettext functionality. The DJGPP-specific consequence of this is that all NLS tests fail, no matter whether a GNU gettext port is installed or not. This is certainly not a trivial issue. The DJGPP port a2ps413[bds].zip is a good example that shows what happens if a user tries to configure a package without understanding how DJGPP's GNU gettext works. For a number of reasons this package is completely broken (no offense intended here, but it is simply the truth). I will limit the explanation to the NLS-specific issue.
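The charset.alias edit described above could be done mechanically like this. The file normally lives somewhere in the DJGPP installation tree; a local copy is used here for illustration, and, as said, the result will still not be satisfactory, because CP866 is not equivalent to KOI8-R:

```shell
# Recreate the two relevant charset.alias lines (the real file lives
# in the DJGPP installation tree; a local copy is used here).
cat > charset.alias <<'EOF'
ru KOI8-R
ru_RU KOI8-R
EOF
# Point both Russian locales at the DOS Cyrillic codepage instead.
sed 's/KOI8-R/CP866/' charset.alias > charset.alias.tmp
mv charset.alias.tmp charset.alias
cat charset.alias
```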
This is a command line of configure that checks for some gettext functionality:

if { (eval echo configure:2934: \"$ac_link\") 1>&5; (eval $ac_link) 2>&5; } && test -s conftest${ac_exeext}; then

The important thing is the variable $ac_link. This variable contains the command used to create a test binary. This binary is linked with -lintl but not with -liconv. For DJGPP this implies that the test always fails, because no binary is created at all: the compilation of the binary is aborted due to unresolved externals of libintl.a. Due to the number of gettext.m4 files floating around, the above line looks different in every macro. A DJGPP porter must carefully inspect the configure script, search for those lines, and write appropriate sed commands into config.sed to fix this issue. Every porter who is not aware of this fact will distribute source packages that cannot be configured for NLS, no matter whether gtxt0NNb.zip and licvNNNb.zip are installed or not. Of course, I complained about this to Bruno Haible some months ago and he has changed the gettext.m4 file accordingly. Nevertheless, this does not solve the difficulty for all existing GNU packages: they have not used this new gettext.m4. Also, I have not yet seen a new GNU package using this new macro, but maybe I have missed something.

JT Williams wrote:

> If I add the following lines to djgpp.env
>
> +LANG=de
> +LANGUAGE=de,en
>
> then I _do_ get German text (e.g., from `sed --version'), but the
> umlauted and ess-tzett chars are incorrectly mapped on my screen
> under CP437. These chars _are_ available in the upper half of CP437,
> however. What have I done wrong? Does NLS require DOS NLSFUNC?
>
> I'm using the latest sed+NLS release with stock djdev 2.03 and DOS 5.0.

As explained above, LANG=ll (ll = language code) selects the, hopefully, appropriate DOS codepage for that language code, which libiconv.a will use for runtime recoding from whatever charset was used to create the .mo file.
In this particular case you are asking for recoding of the German .mo file, written in ISO-8859-1, into DOS CP850. But at the same time you are using CP437 for displaying messages on the screen. This cannot work. It makes no sense to recode to a codepage that is not the one used to display the messages. As explained above, you cannot set LANGUAGE to a set of different languages if these languages need different codepages to be displayed. But this is exactly your case: you have loaded CP437 at boot time for US English and want to display German text that needs CP850. As you pointed out, CP437 and CP850 are identical for the first 128 chars (the ASCII codes), but they differ for everything beyond that. This is the usual way these codepages are laid out: ASCII is always available in the first 128 positions, no matter whether it is the Chinese codepage or a European codepage or the Cyrillic one. The first 128 chars are identical in *all* codepages.

FYI: libiconv.a reads LANG only one time and does not allow for later changes. This means that the target charset, i.e. the charset that will be used to display the text on screen, can be chosen only once. If you want to change the contents of LANG, you must abort and restart the program. That is the only way to get LANG reread.

Last but not least, I have reconfigured and recompiled gettext-0.10.39, but this time without libiconv. Of course, all .po files were recoded beforehand using the recode program, so runtime recoding via built-in libiconv is no longer needed. A comparison between xgettext.exe from gtxt039b.zip and the new one shows:

xgettext.exe with libiconv (from the port available at SimTel): 689648 bytes
xgettext.exe without libiconv (new one): 72480 bytes

Binaries without runtime recoding can thus be roughly 10 times smaller than binaries with runtime recoding. The same applies to *all* other binaries of the package.
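The configuration-time recoding step described above can also be reproduced with iconv, a command-line front end to the same conversion machinery that recode drives. The file name de.po and the sample string are illustrative only, and this assumes an iconv that knows the CP850 encoding (GNU libiconv and glibc iconv both do):

```shell
# Recode a (toy) German catalog fragment from ISO-8859-1 to CP850
# at configuration time.  The file name de.po is just an example.
printf 'msgstr "Gr\374\337e"\n' > de.po     # "Gruesse" with u-umlaut (0xFC) and ess-tzett (0xDF)
iconv -f ISO-8859-1 -t CP850 de.po > de_dos.po
# In CP850 the umlaut-u becomes 0x81 and ess-tzett becomes 0xE1.
od -An -tx1 de_dos.po
```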
If one takes note of the fact that usually there is only one appropriate codepage, and that almost 99% of the users will never change this codepage (i.e. the value of LANG and the codepage selected in autoexec.bat), the built-in runtime recoding facility seems completely superfluous to me. The only reason for a runtime recoding facility seems to me to be the support of users who have no appropriate codepage. But if there is no appropriate codepage, then, as in the KOI8-R case, no matter what runtime recoding is done, the result will never be satisfactory. In this case the user should completely forget about NLS.

In conclusion:

1) It is not my intention to start a useless and prolonged discussion about the use or the discarding of libiconv.a. In view of the number of inconveniences (the amount of work needed to get working configure scripts (the gettext.m4 issue) and the huge size of the produced binaries) and the limited benefits, there is no justification for the use of runtime recoding (libiconv.a) at all, IMHO. The quality of the message output of an NLS binary is limited by the existence of appropriate DOS codepages. The actually existing codepages are defined in charset.alias. This table is identical to the table I used in gtxt035b.zip to recode .po files at configuration time using the recode program. It is really quite simple: .po files that cannot be recoded during configuration also cannot be recoded at runtime. If this is the case, runtime recoding seems superfluous.

2) I am certainly not objecting to the intention of introducing some sort of DJGPP-specific DLL functionality into libiconv.a. Nevertheless, I assume that it will become very hard to convince Bruno Haible to adopt that very MSDOS/DJGPP-specific code into his POSIX-centric code. The most important point, IMO, is that no DLL library can ever be as small as a library that is not linked into the binary at all.

Anyway, whatever the majority of the audience here decides will be OK with me.
Regards,
Guerrero, Juan Manuel