Mail Archives: djgpp-workers/2001/02/24/14:55:24

From: "Juan Manuel Guerrero" <ST001906 AT HRZ1 DOT HRZ DOT TU-Darmstadt DOT De>
Organization: Darmstadt University of Technology
To: Eli Zaretskii <eliz AT is DOT elta DOT co DOT il>
Date: Sat, 24 Feb 2001 19:07:40 +0200
MIME-Version: 1.0
Subject: Re: gettext pretest available
CC: Bruno Haible <haible AT ilog DOT fr>, djgpp-workers AT delorie DOT com
X-mailer: Pegasus Mail for Windows (v2.54DE)
Message-ID: <2E9C0C3501E@HRZ1.hrz.tu-darmstadt.de>
Reply-To: djgpp-workers AT delorie DOT com

On Fri, 23 Feb 2001 09:44:52 +0200, Eli Zaretskii:

> > Of course, you are right. All the pertinent DJGPP libc functions
> > work as you have described (they recognize CRLF *and* LF as '\n' if
> > the file has been fopen()'ed in text mode) making the code I have
> > added redundant and superfluous.
>
> One caveat: can the files that are read as text have unprintable
> characters, such as lone CRs or ^Z?  If they can, text mode is not
> reliable enough to be used with such files.

This was the reason why I had opened all files in binary mode.
This was the way it was done in the DJGPP port of gettext 0.10.35.
But IMHO we can drop this. The only text files we will deal with
are the .po files. These files are usually created with two types
of charsets: non-Asian single-byte charsets like iso-8859-xx,
cpxxx and koi8-r/u, and Asian double-byte charsets like big5,
euc-{cn,jp,kr}, JIS-X-0208, shift-JIS and CP9XX (maybe I have
forgotten some).
There should be no difficulty with .po files written in iso/cpxxx/koi8.
IMHO, if such .po files contain *lone* CRs or ^Z then they are broken.
The Asian charsets are usually double-byte charsets.
The question is whether ASCII(0x00) to ASCII(0x20) are used in the charset
or not. Usually Asian characters are coded using two or more bytes.
These byte pairs are usually organized into a 94 x 94 matrix. This matrix
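is sometimes placed starting at ASCII(0xA0), as in EUC. Sometimes it is
placed starting at ASCII(0x21); this is 7-bit ISO-2022 (shift-JIS is a
separate 8-bit scheme whose double-byte characters use only bytes at
ASCII(0x40) or above). Some of the charsets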
use ESC, others use tilde (~), and others use shift in (SI) and shift
out (SO) as control characters to select different "character planes".
*No* character set uses CR, LF or Ctrl-Z in any combination, either for
character encoding or as a control sequence.
In conclusion: as long as *only* the single-byte and double-byte charsets
described above are used, we can safely open .po files in text mode.
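
If someone wants to verify this for a given .po file, a minimal sketch
of such a check could look like the following. It reads the file in
binary mode and counts the bytes that text mode would mishandle: CRs not
followed by LF, and ^Z. The names po_scan, po_scan_byte, po_scan_end and
po_scan_file are hypothetical; they are not part of gettext or of the
DJGPP libc.

```c
#include <stdio.h>

/* Hypothetical helper state, not part of gettext or DJGPP libc. */
struct po_scan {
  unsigned long lone_cr;  /* CRs not immediately followed by LF */
  unsigned long ctrl_z;   /* ^Z (0x1A) bytes */
  int prev_cr;            /* was the previous byte a CR? */
};

/* Feed one byte of the file to the scanner. */
void po_scan_byte(struct po_scan *s, int c)
{
  if (s->prev_cr && c != '\n')
    s->lone_cr++;         /* the previous CR was not part of a CRLF */
  s->prev_cr = (c == '\r');
  if (c == 0x1A)
    s->ctrl_z++;
}

/* Call once at end of input: a trailing CR is a lone CR too. */
void po_scan_end(struct po_scan *s)
{
  if (s->prev_cr)
    s->lone_cr++;
  s->prev_cr = 0;
}

/* Scan a whole file opened in binary mode.
   Returns 0 on success, -1 if the file cannot be opened. */
int po_scan_file(const char *name, struct po_scan *s)
{
  FILE *fp = fopen(name, "rb");
  int c;

  s->lone_cr = 0;
  s->ctrl_z = 0;
  s->prev_cr = 0;
  if (fp == NULL)
    return -1;
  while ((c = getc(fp)) != EOF)
    po_scan_byte(s, c);
  po_scan_end(s);
  fclose(fp);
  return 0;
}
```

If both counters are zero after po_scan_file(), the file contains
nothing that text mode would treat specially apart from ordinary line
ends.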

BTW, if someone is really interested in CJK encoding, look at:
  <http://www.ora.com/people/authors/lunde/cjk_inf.html>
and download CJK.INF. Read:
  PART 3: CJK ENCODING SYSTEMS

Regards,
Guerrero, Juan Manuel
