Date: Sat, 04 Aug 2001 14:35:00 +0300
From: "Eli Zaretskii"
Sender: halo1 AT zahav DOT net DOT il
To: ST001906 AT HRZ1 DOT HRZ DOT TU-Darmstadt DOT De
Message-Id: <3405-Sat04Aug2001143459+0300-eliz@is.elta.co.il>
X-Mailer: Emacs 20.6 (via feedmail 8.3.emacs20_6 I) and Blat ver 1.8.9
CC: jeffw AT darwin DOT sfbr DOT org, djgpp-workers AT delorie DOT com, salvador AT inti DOT gov DOT ar
In-reply-to: <34090C8714C@HRZ1.hrz.tu-darmstadt.de> (ST001906 AT HRZ1 DOT HRZ DOT TU-Darmstadt DOT De)
Subject: Re: gettext port
References: <34090C8714C AT HRZ1 DOT hrz DOT tu-darmstadt DOT de>
Reply-To: djgpp-workers AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: djgpp-workers AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

> From: "Juan Manuel Guerrero"
> Date: Sat, 4 Aug 2001 11:03:22 +0200
>
> to be used for displaying the message.  It is important to realise
> that it is the *exclusive* responsibility of the user to make the
> DOS codepage loaded by AUTOEXEC.BAT match the codepage defined in
> charset.alias.  E.g.: I want to display the German messages of some
> GNU program that has NLS.  charset.alias says that CP850 will be
> used to display the messages.  In this case my AUTOEXEC.BAT *must*
> contain the following lines:
>    C:\DOS\MODE CON CODEPAGE PREPARE=((850) C:\DOS\EGA.CPI)
>    C:\DOS\MODE CON CODEPAGE SELECT=850
> It is completely impossible to load, let's say, CP437 in
> AUTOEXEC.BAT and expect that Spanish, French, UK English, German,
> Italian, and all the other CP850 languages will be displayed
> correctly.

Why is that?  Doesn't libiconv support incomplete mappings?  IIRC,
recode does make it possible to force recoding even if the target
codepage doesn't cover the original charset completely.

> The codepage loaded at boot time (in AUTOEXEC.BAT) must match the
> definitions in charset.alias.

Isn't it possible to override the definitions in charset.alias with
an appropriate setting of LANG or LANGUAGE?  That is, can't I
determine the codepage by an appropriate setting of one of these two
variables, instead of editing charset.alias?

> The number of languages supported by MSDOS is well known and will
> probably not be increased by new MSDOS codepages produced and
> released by Microsoft in the future.

Microsoft will probably not produce new codepages for DOS, but we
have to keep in mind that:

  - you don't need Microsoft to produce a new codepage file;

  - some OEM versions of Windows have codepages for DOS sessions
    that differ from the codepages used in plain DOS versions for
    the same locales; in other words, a small number of new
    codepages _is_ being produced in some cases.

> OTOH we have libiconv.a.  This is certainly a very powerful tool.
> But it is, IMVHO, *too* powerful for the recoding job needed for
> getting NLS for DJGPP.  As Eli pointed out, it recodes, at runtime,
> from a source charset to Unicode and from Unicode to the target
> charset.  As the number of charsets increases, the size of
> libiconv.a increases too.  But at the same time the number of
> existing MSDOS codepages does *not* increase anymore.  All this
> implies that, from an NLS-specific point of view, we have *no*
> gain at all if we use libiconv.a.

Again, not entirely accurate: since each recoding of a message
catalog involves *two* transformations, using libiconv does free us
from the need to worry about changes in the original encoding used
by the .po file.  For example, we don't care whether the European
*.po files are encoded in ISO-8859-1, ISO-8859-9, or even UTF-8.
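(To make that concrete, here's a rough sketch of what a single
catalog-string conversion looks like through the iconv API.  The
helper name and the charset names are only examples, I haven't tried
this against the DJGPP build of libiconv, and some libiconv versions
declare the input buffer as `const char **', so a cast might be
needed.)

  #include <string.h>
  #include <iconv.h>

  /* Recode one catalog string from the charset named in the .po file
     to the DOS codepage.  The caller only names the two charsets;
     the Unicode pivot in the middle is libiconv's business.  */
  int
  recode_string (const char *from, const char *to,
                 char *in, char *out, size_t outsize)
  {
    iconv_t cd = iconv_open (to, from); /* e.g. ("CP850", "ISO-8859-1") */
    size_t inleft = strlen (in), outleft = outsize - 1;
    char *outp = out;
    int status = 0;

    if (cd == (iconv_t) -1)
      return -1;              /* this pair of charsets isn't supported */
    if (iconv (cd, &in, &inleft, &outp, &outleft) == (size_t) -1)
      status = -1;            /* some character could not be converted */
    *outp = '\0';
    iconv_close (cd);
    return status;
  }

Whether the catalog was ISO-8859-1 or UTF-8 only changes the string
we pass as FROM; nothing else in our sources needs to know.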
> Eli pointed out:
>
> > But the downside is that you need to produce a separate message
> > catalogue for each possible codepage.  For example, with Cyrillic
> > languages, there are half a dozen possible encodings, maybe more.
>
> This is certainly true, but the argument does not hold for NLS,
> IMHO.  We do not need runtime recoding at all, IMHO.  We know
> *a priori* which MSDOS codepage exists for that particular language
> and country.

I don't think this is true.  Just to take one example, Laurynas told
me some time ago that there's a whole lot of different codepages used
in the Baltic Rim countries; some of them support Cyrillic
characters, others only the native Baltic character sets.  In this
situation, there's no way a person who ports a package can ever know
what codepage will be installed on the user's machine.

As another pertinent example, perhaps you saw in the news that
Azerbaijan decided just two days ago to switch from the Cyrillic
alphabet to a Latin one.

So the codepage installed by the user in a given locale is not
entirely deterministically predictable.

> A Unix charset that cannot be recoded at configuration time to an
> appropriate DOS codepage by the recode program via its built-in
> libiconv.a also can *not* be recoded at runtime to the appropriate
> DOS codepage by a binary with its built-in libiconv.a.  Good
> examples are the Cyrillic charsets KOI8-R (Russia) and KOI8-U
> (Ukraine).  Because there is no appropriate DOS codepage, these
> charsets are mapped to themselves in charset.alias.

You mean, it's not possible to recode ru.po into cp866?  I'd be
surprised: cp866 includes all the characters used in Russian, even
though KOI8-R has some additional characters.

I think this non-support for Russian and Ukrainian catalogs is a
serious misfeature.  If this cannot be solved with libiconv's
on-the-fly recoding, maybe we should ask for such a feature to be
added (e.g., it could simply skip characters it cannot recode).  Or
maybe we should ask the Translation Project to change the guidelines
for the encodings they accept for Russian and Ukrainian translations,
so that libiconv would not refuse to convert them into the
appropriate codepage.

> A comparison between xgettext.exe from gtxt039b.zip and the new one
> shows:
>   xgettext.exe with libiconv (from the port available at simtel): 689648 bytes
>   xgettext.exe without libiconv (new one): 72480 bytes
> On average, binaries without runtime recoding may be around 10
> times smaller than binaries with runtime recoding.

A more accurate statement would probably be ``smaller by about
600KB'', since the overhead is additive, not multiplicative.
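P.S.  About the idea that libiconv could simply skip characters it
cannot recode: GNU libiconv documents "//TRANSLIT" and "//IGNORE"
suffixes on the target charset name for exactly that (transliterate,
resp. silently drop, anything the target codepage lacks).  I don't
know whether the version in the DJGPP port already supports them, so
treat this as an untested sketch:

  #include <stdio.h>
  #include <iconv.h>

  int
  main (void)
  {
    /* Ask for approximations instead of a hard failure when cp866
       has no equivalent for a KOI8-R character.  */
    iconv_t cd = iconv_open ("CP866//TRANSLIT", "KOI8-R");

    if (cd == (iconv_t) -1)
      perror ("iconv_open");  /* suffix or charset pair unsupported */
    else
      iconv_close (cd);
    return 0;
  }

If that works, the Russian and Ukrainian catalogs could be converted
without waiting for any new feature in libiconv.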