From: "Juan Manuel Guerrero"
Organization: Darmstadt University of Technology
To: JT Williams, Eli Zaretskii, djgpp-workers AT delorie DOT com, salvador
Date: Sat, 4 Aug 2001 11:03:22 +0200
MIME-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-transfer-encoding: 7BIT
Subject: Re: gettext port
CC: djgpp-workers AT delorie DOT com
X-mailer: Pegasus Mail for Windows (v2.54DE)
Message-ID: <34090C8714C@HRZ1.hrz.tu-darmstadt.de>
Reply-To: djgpp-workers AT delorie DOT com

I have been following the discussion about the gettext and libiconv ports and would like to make some comments about this issue.

The initial goal of the gettext port has been to allow NLS for DJGPP ports of GNU software, or of other software that uses the gettext functionality for NLS. To get NLS, it is clear that the charsets used to encode the .po/.mo files (usually some Unix charsets) must be recoded to the appropriate DOS codepages, if those exist at all.

Starting with the gettext-0.10.35 port, I wrote a small shell script that recodes the Unix charset used in the .po files into the appropriate DOS codepage. For this purpose a table was created using the information from Microsoft's MS-DOS 6.22 COUNTRY.TXT file, available as: ftp://ftp.microsoft.com/peropsys/msdos/kb/q117/8/50.txt
The Recode program was needed for this job. The recoding of the .po files became part of the DJGPP-specific configuration process. The average user who wanted to reconfigure the source package did not need to have Recode installed, because the recoded .po files were distributed with the preconfigured DJGPP source package.

Starting with the gettext-0.10.36 port, the .po files are no longer recoded; instead, the .mo files are recoded at run time. This is done with the functionality provided by libiconv.a. gettext and libiconv inspect the LANG environment variable. gettext uses the language code (e.g. es for Spanish) to construct the path used to locate the .mo file.
This means it constructs a string like "share/locale/es/LC_MESSAGES" and loads the .mo file from there. libiconv uses the language code as an index into charset.alias, e.g. "es_AR CP850" (Argentine Spanish messages will be displayed using CP850). The codepage used by libiconv.a is read once and cannot be changed while the program is running.

The second environment variable, LANGUAGE, allows selecting among different languages. E.g. LANGUAGE=es:de will display Spanish messages, or German messages if the Spanish ones are not found. This is *only* possible as long as the languages use the *same* codepage. Something like Spanish and Hebrew, LANGUAGE=es_AR:he_IL, will *not* work, due to the different codepages needed to display the messages. The LANGUAGE variable is *only* honored if the LANG variable has been set. It is only the LANG variable that defines the gettext path and the target charset (a DOS codepage in this case) to be used for displaying the messages.

It is important to realize that it is the *exclusive* responsibility of the user to make the DOS codepage loaded by autoexec.bat match the codepage defined in charset.alias. E.g.: I want to display the messages of some GNU program with NLS in German. charset.alias tells me that CP850 will be used to display the messages. In this case my autoexec.bat *must* contain the following lines:

C:\DOS\MODE CON CODEPAGE PREPARE=((850) C:\DOS\EGA.CPI)
C:\DOS\MODE CON CODEPAGE SELECT=850

It is completely impossible to load, let's say, CP437 in autoexec.bat and expect that Spanish, French, UK English, German, Italian and all the other CP850 languages will be displayed correctly. The codepage loaded at boot time (in autoexec.bat) must match the definitions in charset.alias. Please note that the definitions in charset.alias are *not* arbitrary. These definitions are the appropriate codepage values for each particular language.
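The LANG-driven catalog lookup described above can be sketched in shell. The installation prefix and the message domain ("sed") are illustrative assumptions, not values any particular port is guaranteed to use:

```shell
# Sketch of the lookup: LANG carries the locale name; the language
# code (the part before "_") selects the catalog directory.
# Prefix and domain name ("sed") are assumptions for illustration.
LANG=es_AR
lang=${LANG%%_*}                               # es_AR -> es
mo_path="share/locale/${lang}/LC_MESSAGES/sed.mo"
echo "$mo_path"                                # share/locale/es/LC_MESSAGES/sed.mo
```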
Those definitions are identical to the codepages that the original MS-DOS install program puts into the autoexec.bat of a user of that country who installs MS-DOS on his computer. All this only as background to understand how it all works.

IMHO it is crucial to realize that we are dealing with an operating system (MS-DOS) that is simply **DEAD**. This is very sad but true (maybe I am the last one on this planet using DOS in his daytime job). When I talk about an OS I mean MS-DOS 6.22 or any previous version. I do *not* mean some kind of DOS emulation in some kind of Windows like WinNT, Win2000, WinXP. MS-DOS is no longer actively developed and supported by Microsoft, AFAIK. This implies that the NLS provided by MS-DOS is limited to the actually existing codepages. A list of the existing codepages can be found in: ftp://ftp.microsoft.com/peropsys/msdos/kb/q117/8/50.txt
Because of the limited number of available codepages, the number of needed and possible recodings from some Unix charset to some DOS codepage is limited too. All this implies that only a very limited number of languages can be supported by DJGPP ports at all. The number of languages supported by MS-DOS is well known and will probably not be increased by new MS-DOS codepages produced and released by Microsoft in the future.

OTOH we have libiconv.a. This is certainly a very powerful tool. But it is, IMVHO, *too* powerful for the recoding job needed to get NLS for DJGPP. As Eli pointed out, it recodes, at runtime, from a source charset to Unicode and from Unicode to the target charset. As the number of supported charsets increases, the size of libiconv.a increases too. But at the same time the number of existing MS-DOS codepages does *not* increase anymore. All this implies that, from an NLS-specific point of view, we have *no* gain at all if we use libiconv.a.
Using libiconv.a means that we recode at runtime instead of recoding at configuration time, when the permitted and available DOS codepages are very well known. If we recode at configuration time, we will use the recode program. The recode sources include the libiconv sources; indeed, the recode program is only a handy driver for the functionality offered by libiconv.a. Using recode means that we have the same recoding power as using libiconv.a at runtime.

Eli pointed out:

> But the downside is that you need to produce a separate message
> catalogue for each possible codepage. For example, with Cyrillic
> languages, there are half a dozen possible encodings, maybe more.

This is certainly true, but the argument does not hold for NLS, IMHO. We do not need runtime recoding at all, IMHO. We know *a priori* which MS-DOS codepage exists for a particular language and country. *No* other DOS codepage can be used to display those messages. A Unix charset that cannot be recoded at configuration time to an appropriate DOS codepage by the recode program, via its built-in libiconv, also can *not* be recoded at runtime to an appropriate DOS codepage by a binary with its built-in libiconv. A very good example are the Cyrillic charsets like KOI8-R (Russia) and KOI8-U (Ukraine). Because there is no appropriate DOS codepage, these charsets are mapped to themselves in charset.alias. This means the messages are displayed using the original Unix encodings. The DJGPP port of some program will recode at runtime from KOI8-R to Unicode and from Unicode back to KOI8-R, and then display the result on the DOS machine. What the user will see on the screen is probably garbage. Of course, the user can edit charset.alias, assuming he knows what that is and where to find it, and set some DOS Cyrillic codepage like CP866.
This means he must replace the lines:

ru KOI8-R
ru_RU KOI8-R

by the lines:

ru CP866
ru_RU CP866

But this will *not* solve the difficulties, because CP866 is not the appropriate DOS codepage. Indeed, there exists no appropriate DOS codepage for KOI8-R/U, AFAIK. IMHO, the conclusion of all this is that runtime recoding is a nice feature for quickly experimenting with different charsets and codepages, but if there is *no* appropriate codepage to recode to, then a binary with runtime recoding is as useless as a binary without runtime recoding. If a user has an appropriate DOS codepage, he will certainly never change it and never change the setting of LANG. In this case he does not need any runtime recoding functionality at all. OTOH, if there is no appropriate DOS codepage available at all, then IMHO runtime recoding will not save the situation either.

The use of libiconv.a also introduces some DOS/DJGPP-specific configuration difficulties usually not seen on Linux. Linux does not need any recoding of .po/.mo files at all, so none of the huge number of gettext.m4 files floating around checks for the existence of libiconv.a. Of course, because they do not check for libiconv.a, they also do not link libiconv.a into the test binaries that check for gettext() and dcgettext() and some other GNU gettext functionality. The DJGPP-specific consequence of this is that all NLS tests fail, no matter whether a GNU gettext port is installed or not. This is certainly not a trivial issue. The DJGPP port a2ps413[bds].zip is a good example that shows what happens if a user tries to configure a package without understanding how DJGPP's GNU gettext works. For a number of reasons this package is completely broken (no offense intended here, but it is simply the truth). I will limit the explanation to the NLS-specific issue.
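The charset.alias edit described above could be done mechanically like this. The file normally lives somewhere in the DJGPP installation tree; a local copy is used here for illustration, and, as said, the result will still not be satisfactory, because CP866 is not equivalent to KOI8-R:

```shell
# Recreate the two relevant charset.alias lines (the real file lives
# in the DJGPP installation tree; a local copy is used here).
cat > charset.alias <<'EOF'
ru KOI8-R
ru_RU KOI8-R
EOF
# Point both Russian locales at the DOS Cyrillic codepage instead.
sed 's/KOI8-R/CP866/' charset.alias > charset.alias.tmp
mv charset.alias.tmp charset.alias
cat charset.alias
```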
This is a command line of configure that checks for some gettext functionality:

if { (eval echo configure:2934: \"$ac_link\") 1>&5; (eval $ac_link) 2>&5; } && test -s conftest${ac_exeext}; then

The important thing is the variable $ac_link. This variable contains the command used to create a test binary. This binary is linked with -lintl but not with -liconv. For DJGPP this implies that the test always fails, because no binary is created at all: the compilation of the binary is aborted due to unresolved externals of libintl.a. Due to the number of gettext.m4 files floating around, the above line looks different in every macro. A DJGPP porter must carefully inspect the configure script, search for those lines, and write appropriate sed commands into config.sed to fix this issue. Every porter who is not aware of this fact will distribute source packages that cannot be configured for NLS, no matter whether gtxt0NNb.zip and licvNNNb.zip are installed or not. Of course, I complained about this to Bruno Haible some months ago and he has changed the gettext.m4 file accordingly. Nevertheless, this does not solve the difficulty for all existing GNU packages: they have not used this new gettext.m4. Also, I have not yet seen a new GNU package using this new macro, but maybe I have missed something.

JT Williams wrote:

> If I add the following lines to djgpp.env
>
> +LANG=de
> +LANGUAGE=de,en
>
> then I _do_ get German text (e.g., from `sed --version'), but the
> umlauted and ess-tzett chars are incorrectly mapped on my screen
> under CP437. These chars _are_ available in the upper half of CP437,
> however. What have I done wrong? Does NLS require DOS NLSFUNC?
>
> I'm using the latest sed+NLS release with stock djdev 2.03 and DOS 5.0.

As explained above, LANG=ll (ll = language code) selects the, hopefully, appropriate DOS codepage for that language code, which libiconv.a will use for runtime recoding from whatever charset was used to create the .mo file.
In this particular case you are asking for recoding of the German .mo file, written in ISO-8859-1, into DOS CP850. But at the same time you are using CP437 for displaying messages on the screen. This cannot work. It makes no sense to recode to a codepage that is not the one used to display the messages. As explained above, you cannot set LANGUAGE to a set of different languages if these languages need different codepages to be displayed. But this is exactly your case: you have loaded CP437 at boot time for US English and want to display German text that needs CP850. As you pointed out, CP437 and CP850 are identical for the first 128 chars (the ASCII codes), but they differ for everything beyond that. This is the usual way these codepages are laid out: ASCII is always available in the first 128 positions, no matter whether it is the Chinese codepage or a European codepage or the Cyrillic one. The first 128 chars are identical in *all* codepages.

FYI: libiconv.a reads LANG only one time and does not allow for later changes. This means that the target charset, i.e. the charset that will be used to display the text on screen, can be chosen only once. If you want to change the contents of LANG, you must abort and restart the program. That is the only way to get LANG reread.

Last but not least, I have reconfigured and recompiled gettext-0.10.39, but this time without libiconv. Of course, all .po files were recoded beforehand using the recode program, so runtime recoding via built-in libiconv is no longer needed. A comparison between xgettext.exe from gtxt039b.zip and the new one shows:

xgettext.exe with libiconv (from the port available at SimTel): 689648 bytes
xgettext.exe without libiconv (new one): 72480 bytes

Binaries without runtime recoding can thus be roughly 10 times smaller than binaries with runtime recoding. The same applies to *all* other binaries of the package.
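The configuration-time recoding step described above can also be reproduced with iconv, a command-line front end to the same conversion machinery that recode drives. The file name de.po and the sample string are illustrative only, and this assumes an iconv that knows the CP850 encoding (GNU libiconv and glibc iconv both do):

```shell
# Recode a (toy) German catalog fragment from ISO-8859-1 to CP850
# at configuration time.  The file name de.po is just an example.
printf 'msgstr "Gr\374\337e"\n' > de.po     # "Gruesse" with u-umlaut (0xFC) and ess-tzett (0xDF)
iconv -f ISO-8859-1 -t CP850 de.po > de_dos.po
# In CP850 the umlaut-u becomes 0x81 and ess-tzett becomes 0xE1.
od -An -tx1 de_dos.po
```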
If one takes note of the fact that usually there is only one appropriate codepage, and that almost 99% of the users will never change this codepage (i.e. the value of LANG and the codepage selected in autoexec.bat), the built-in runtime recoding facility seems completely superfluous to me. The only reason for a runtime recoding facility seems to me to be the support of users who have no appropriate codepage. But if there is no appropriate codepage, then, as in the KOI8-R case, no matter what runtime recoding is done, the result will never be satisfactory. In this case the user should completely forget about NLS.

In conclusion:

1) It is not my intention to start a useless and prolonged discussion about the use or the discarding of libiconv.a. In view of the number of inconveniences (the amount of work needed to get working configure scripts (the gettext.m4 issue) and the huge size of the produced binaries) and the limited benefits, there is no justification for the use of runtime recoding (libiconv.a) at all, IMHO. The quality of the message output of an NLS binary is limited by the existence of appropriate DOS codepages. The actually existing codepages are defined in charset.alias. This table is identical to the table I used in gtxt035b.zip to recode .po files at configuration time using the recode program. It is really quite simple: .po files that cannot be recoded during configuration also cannot be recoded at runtime. If this is the case, runtime recoding seems superfluous.

2) I am certainly not objecting to the intention of introducing some sort of DJGPP-specific DLL functionality into libiconv.a. Nevertheless, I assume that it will become very hard to convince Bruno Haible to adopt that very MSDOS/DJGPP-specific code into his POSIX-centric code. The most important point, IMO, is that no DLL library can ever be as small as a library that is not linked into the binary at all.

Anyway, whatever the majority of the audience here decides will be OK with me.
Regards,
Guerrero, Juan Manuel