Mail Archives: djgpp-workers/2001/08/04/07:36:32

Date: Sat, 04 Aug 2001 14:35:00 +0300
From: "Eli Zaretskii" <eliz AT is DOT elta DOT co DOT il>
Sender: halo1 AT zahav DOT net DOT il
To: ST001906 AT HRZ1 DOT HRZ DOT TU-Darmstadt DOT De
Message-Id: <3405-Sat04Aug2001143459+0300-eliz@is.elta.co.il>
X-Mailer: Emacs 20.6 (via feedmail 8.3.emacs20_6 I) and Blat ver 1.8.9
CC: jeffw AT darwin DOT sfbr DOT org, djgpp-workers AT delorie DOT com, salvador AT inti DOT gov DOT ar
In-reply-to: <34090C8714C@HRZ1.hrz.tu-darmstadt.de>
(ST001906 AT HRZ1 DOT HRZ DOT TU-Darmstadt DOT De)
Subject: Re: gettext port
References: <34090C8714C AT HRZ1 DOT hrz DOT tu-darmstadt DOT de>
Reply-To: djgpp-workers AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: djgpp-workers AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com

> From: "Juan Manuel Guerrero" <ST001906 AT HRZ1 DOT HRZ DOT TU-Darmstadt DOT De>
> Date: Sat, 4 Aug 2001 11:03:22 +0200
>
> to be used for displaying the message. It is important to realise
> that it is the *exclusive* responsibility of the user to make the
> DOS codepage loaded by autoexec.bat match the codepage defined
> in charset.alias. E.g.: I want to display the messages in German of
> some GNU program that has NLS. The charset.alias tells that CP850
> will be used to display the messages. In this case my autoexec.bat
> *must* contain the following lines:
> C:\DOS\MODE CON CODEPAGE PREPARE=((850) C:\DOS\EGA.CPI)
> C:\DOS\MODE CON CODEPAGE SELECT=850
> It is completely impossible to load, let's say, CP437 in
> autoexec.bat and expect that Spanish, French, uk_english, German,
> Italian and all the other CP850 languages will be displayed
> correctly.

Why is that?  Doesn't libiconv support incomplete mappings?  IIRC,
recode does make it possible to force recoding even if the target
codepage didn't cover the original charset completely.
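
As a rough illustration of what forced recoding into an incomplete
target looks like, here is a minimal sketch. It uses Python's built-in
codecs as a stand-in, not libiconv or recode themselves; the helper
name recode_lossy and the sample string are made up for the example.

```python
# Sketch: recode text into a codepage that does not cover it completely.
# Python's built-in codecs stand in for libiconv/recode here.

def recode_lossy(text: str, codepage: str) -> bytes:
    """Encode text, substituting '?' for characters the codepage lacks."""
    return text.encode(codepage, errors="replace")

sample = "São Paulo"   # 'ã' exists in CP850 but not in CP437

try:
    sample.encode("cp437")      # strict mode refuses the incomplete mapping
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(strict_ok)
print(recode_lossy(sample, "cp437"))                  # lossy, but readable
print(recode_lossy(sample, "cp850").decode("cp850"))  # complete mapping
```

The strict conversion fails, while the forced one degrades only the
characters missing from the target codepage.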

> The loaded codepage at boot time (in autoexec.bat) must
> match the definitions in charset.alias.

Isn't it possible to override the definitions in charset.alias with
the appropriate setting of LANG or LANGUAGE?  That is, can't I
determine the codepage by an appropriate setting of one of these two
variables, instead of editing charset.alias?
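
To make the question concrete: a POSIX-style locale name has the form
language[_territory][.codeset][@modifier], so the codeset could in
principle be read from LANG. The sketch below only shows that parsing
step. Whether the gettext port actually honors such a setting is
exactly what is being asked; the function names and the fallback
codepage are assumptions for illustration.

```python
import os

def codeset_from_locale(name: str):
    """Return the codeset part of a POSIX-style locale name, or None."""
    name = name.split("@", 1)[0]        # drop an optional @modifier
    if "." not in name:
        return None
    return name.split(".", 1)[1] or None

def pick_codepage(environ, default: str) -> str:
    """Prefer a codeset named in LANG; otherwise use the default
    (standing in for whatever charset.alias would have supplied)."""
    codeset = codeset_from_locale(environ.get("LANG", ""))
    return codeset if codeset else default

print(pick_codepage({"LANG": "de_DE.CP850"}, "CP437"))
print(pick_codepage(os.environ, "CP437"))   # depends on the environment
```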

> The number of languages supported by MSDOS is well known and will
> probably not be increased by new MSDOS codepages produced and
> released by Microsoft in the future.

Microsoft will probably not produce new codepages for DOS, but we have
to keep in mind that:

  - you don't need Microsoft to produce a new codepage file;

  - some OEM versions of Windows have codepages for DOS sessions which
    differ from codepages used in plain DOS versions for the same
    locales; in other words, a small number of new codepages _is_
    being produced in some cases.

> OTOH we have libiconv.a. This
> is certainly a very powerful tool. But it is, IMVHO, *too* powerful
> for the recoding job needed for getting NLS for DJGPP. As Eli
> pointed out, it recodes, at runtime, from a source charset to
> unicode and from unicode to the target charset.  Because the number
> of charsets increases, the size of libiconv.a increases too. But at
> the same time the number of existing MSDOS codepages does *not*
> increase anymore. All this implies that, from a NLS specific point
> of view, we have *no* gain at all if we use libiconv.a.

Again, not entirely accurate: since each recoding of a message
catalog involves *two* transformations, using libiconv does free us
from the need to worry about changes in the original encoding used by
the .po file.  For example, we don't care whether the European *.po
files are encoded in ISO-8859-1, ISO-8859-9, or even UTF-8.
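
A minimal sketch of that two-step idea, again with Python's codecs
standing in for libiconv: the same message stored in three different
source encodings ends up as identical CP850 bytes, because every
conversion passes through Unicode. The function name recode and the
sample word are illustrative only.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Recode via Unicode as the pivot."""
    text = data.decode(source)   # step 1: source charset -> Unicode
    return text.encode(target)   # step 2: Unicode -> target codepage

message = "Größe"

# The same catalog entry as it might arrive in three .po encodings.
catalogs = {enc: message.encode(enc)
            for enc in ("iso8859_1", "iso8859_9", "utf_8")}

results = {enc: recode(raw, enc, "cp850") for enc, raw in catalogs.items()}

# All three inputs collapse to one and the same CP850 byte string.
print(len(set(results.values())))
```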

> Eli
> pointed out: 
>
> > But the downside is that you need to produce a separate message 
> > catalogue for each possible codepage.  For example, with Cyrillic
> > languages, there are half a dozen possible encodings, maybe more.
>
> This is certainly true but the argument does
> not hold for NLS, IMHO. We do not need runtime recoding at all,
> IMHO. We know *a priori* which MSDOS codepage exists for that
> particular language and country.

I don't think this is true.  Just to take one example, Laurynas told
me some time ago that there's a whole lot of different codepages used
in the Baltic Rim countries, some of them have support for Cyrillic
characters, others only for Baltic native character sets.  In this
situation, there's no way a person who ports a package can ever know
what codepage will be installed on the user's machine.

As another pertinent example, perhaps you saw in the news that
Azerbaijan decided just two days ago to switch from the Cyrillic
alphabet to a Latin one.

So the codepage installed by the user in a given locale is not
entirely deterministically predictable.

> Unix charset that cannot be recoded at
> configuration time to an appropriate DOS codepage by the recode
> program via its built-in libiconv.a also can *not* be recoded at
> runtime to the appropriate DOS codepage by a binary with its built-in
> libiconv.a. A very good example are the Cyrillic charsets like
> KOI8-R (Russia) and KOI8-U (Ukraine). Because there is no appropriate
> DOS codepage, these charsets are mapped to themselves in
> charset.alias.

You mean, it's not possible to recode ru.po into cp866?  I'd be
surprised: cp866 includes all characters used in Russian, even though
KOI8-R has some additional characters.
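
The claim can be checked mechanically. The sketch below uses Python's
koi8_r and cp866 codec tables (assumed here to match the tables
libiconv uses) to list which KOI8-R characters have no CP866
equivalent, and to recode a sample Russian string. The helper name
and the sample text are made up.

```python
def encodable(ch: str, codepage: str) -> bool:
    """True if the character exists in the given codepage."""
    try:
        ch.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

# Every character in the upper half of KOI8-R.
koi8r_chars = [bytes([b]).decode("koi8_r") for b in range(0x80, 0x100)]

# Those with no counterpart in CP866, and the letters among them.
missing = [ch for ch in koi8r_chars if not encodable(ch, "cp866")]
missing_letters = [ch for ch in missing if ch.isalpha()]

print("missing from cp866:", "".join(missing))
print("letters among them:", missing_letters)

# A Russian message survives KOI8-R -> Unicode -> CP866 intact.
greeting_koi8r = "Привет, мир".encode("koi8_r")
greeting_cp866 = greeting_koi8r.decode("koi8_r").encode("cp866")
```

With these tables, the characters that fail to map are symbols such as
the copyright sign, not letters.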

I think this non-support for Russian and Ukrainian catalogs is a
serious misfeature.  If this cannot be solved with libiconv's
on-the-fly recoding, maybe we should ask for such a feature to be
added (e.g., it could simply skip characters it cannot recode).  Or
maybe we should ask the Translation Project to change the guidelines
for the encodings they accept for Russian and Ukrainian translations,
so that libiconv would not refuse to convert them into the appropriate
codepage.
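
A sketch of the "skip what cannot be recoded" behaviour suggested
above, written against Python's codecs and not against libiconv, whose
actual options are not assumed here. The target CP866 for Ukrainian
is chosen only for illustration; the function name is hypothetical.

```python
def recode_skipping(data: bytes, source: str, target: str):
    """Recode via Unicode, dropping characters the target lacks.

    Returns the recoded bytes and the list of skipped characters.
    """
    kept, skipped = [], []
    for ch in data.decode(source):
        try:
            kept.append(ch.encode(target))
        except UnicodeEncodeError:
            skipped.append(ch)          # no equivalent in the target
    return b"".join(kept), skipped

# Ukrainian "Привіт"; the letter U+0456 is in KOI8-U but not in CP866.
raw = "Прив\u0456т".encode("koi8_u")
out, skipped = recode_skipping(raw, "koi8_u", "cp866")

print(out.decode("cp866"), skipped)
```

Reporting the skipped characters, instead of dropping them silently,
lets whoever builds the catalog judge whether the loss is acceptable.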

> A comparison between xgettext.exe from gtxt039b.zip and the new one shows:
> xgettext.exe with libiconv (from the port available at simtel): 689648 bytes
> xgettext.exe without libiconv (new one):                         72480 bytes
> On average, binaries without runtime recoding may be around 10 times
> smaller than binaries with runtime recoding.

A more accurate statement would probably be ``smaller by 600KB'',
since the overhead is additive, not multiplicative.
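
The arithmetic, using the two sizes quoted above; the 500000-byte
program is a hypothetical figure added only to show how the two
readings diverge.

```python
with_iconv = 689648      # bytes, xgettext.exe linked with libiconv
without_iconv = 72480    # bytes, xgettext.exe without libiconv

overhead = with_iconv - without_iconv   # fixed cost of linking libiconv
ratio = with_iconv / without_iconv      # the "10 times" figure

print(overhead, round(overhead / 1024, 1))   # bytes and KiB
print(round(ratio, 1))

# Hypothetical larger program of 500000 bytes without libiconv:
hypothetical = 500_000
additive = hypothetical + overhead               # what to expect
multiplicative = round(hypothetical * ratio)     # what "10x" would predict
print(additive, multiplicative)
```

The overhead comes to 617168 bytes, roughly 603 KB, regardless of the
size of the program that links the library.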
