Date: Sun, 25 Feb 2001 08:06:51 +0200 (IST)
From: Eli Zaretskii <eliz AT is DOT elta DOT co DOT il>
X-Sender: eliz AT is
To: Kenichi Handa <handa AT etl DOT go DOT jp>
cc: Bruno Haible <haible AT ilog DOT fr>,
        Juan Manuel Guerrero <ST001906 AT HRZ1 DOT HRZ DOT TU-Darmstadt DOT De>,
        djgpp-workers AT delorie DOT com
Subject: Re: gettext pretest available
In-Reply-To: <2E9C0C3501E@HRZ1.hrz.tu-darmstadt.de>
Message-ID: <Pine.SUN.3.91.1010225080326.10629B-100000@is>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Reply-To: djgpp-workers AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: djgpp-workers AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

Handa-san, could you please tell us whether the encodings used for *.po
files can contain lone CR characters or ^Z characters, as per the
discussion below?  I remember you said something about that (in the
context of Ediff's use of "diff --binary"), but I forget what the bottom
line was.

This discussion is about a port of GNU gettext and libiconv to MS-DOS and 
MS-Windows.

Thanks in advance.

On Sat, 24 Feb 2001, Juan Manuel Guerrero wrote:

> On Fri, 23 Feb 2001 09:44:52 +0200, Eli Zaretskii:
> 
> > > Of course, you are right. All the pertinent DJGPP libc functions
> > > work as you have described (they recognize CRLF *and* LF as '\n' if
> > > the file has been fopen()'ed in text mode), making the code I have
> > > added redundant.
> >
> > One caveat: can the files that are read as text have unprintable
> > characters, such as lone CRs or ^Z?  If they can, text mode is not
> > reliable enough to be used with such files.
> 
> This was the reason why I had opened all files in binary mode.
> This was the way it was done in the DJGPP port of gettext 0.10.35.
> But IMHO we can drop this. The only text files we will deal with
> are the .po files. These files are usually created with two types
> of charsets: non-Asian single-byte charsets like iso-8859-xx,
> cpxxx and koi8-r/u, and the Asian double-byte charsets like
> big5, euc-{cn,jp,kr}, JIS-X-0208, shift-JIS, CP9XX (maybe I have
> forgotten some).
> There should be no difficulty with the iso/cpxxx/koi8 written .po files.
> IMHO, if iso/cpxxx/koi8 written .po files contain *lone* CRs or ^Z then
> they are broken. The Asian charsets are usually double-byte charsets.
> The question is whether the bytes ASCII(0x00) to ASCII(0x20) are used in
> the charset or not. Usually Asian characters are coded using two or more
> bytes. These byte pairs are usually organized into a 94 x 94 matrix.
> Sometimes this matrix is placed starting at ASCII(0xA1), as in the EUC
> encodings. Sometimes it is placed starting at ASCII(0x21); this is 7-bit
> ISO-2022 (e.g. ISO-2022-JP, which is not the same thing as shift-JIS).
> Some of the charsets use ESC, some others use tilde (~), some others use
> shift in (SI) and shift out (SO) as control characters to select
> different "character planes". *None* of these character sets uses CR, LF
> or Ctrl-Z in any combination, neither for character encoding nor as a
> control sequence.
> In conclusion: as long as *only* the single-byte and double-byte charsets
> described above are used, we can safely open .po files in text mode.
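
The claim above could be checked empirically.  What follows is only a
minimal sketch in Python, not part of the port.  It assumes that the
Python codec names listed in CODECS correspond closely enough to the
charsets named above, and it only looks at characters in the Basic
Multilingual Plane.  The function name is made up for illustration.

```python
# Sketch: for a given codec, look for characters (other than CR and ^Z
# themselves) whose encoded form contains a 0x0D (CR) or 0x1A (^Z) byte.
CR, CTRL_Z = 0x0D, 0x1A

# Assumption: these Python codec names stand in for the charsets
# mentioned in the discussion above.
CODECS = ["iso8859_1", "koi8_r", "big5", "euc_jp", "euc_kr",
          "gb2312", "shift_jis", "iso2022_jp"]


def chars_producing_control_bytes(codec):
    """Return the BMP characters, excluding CR and ^Z themselves, whose
    encoding in `codec` contains a 0x0D or 0x1A byte."""
    hits = []
    for cp in range(0x10000):
        if cp in (CR, CTRL_Z):
            continue
        try:
            encoded = chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue  # not representable in this charset
        if CR in encoded or CTRL_Z in encoded:
            hits.append(chr(cp))
    return hits


if __name__ == "__main__":
    for codec in CODECS:
        print(codec, len(chars_producing_control_bytes(codec)))
```

An empty result for a codec supports the conclusion for that charset.
A wide encoding such as UTF-16 would not pass this check (U+010D is
stored as the bytes 0D 01 in little-endian order), which is why the
conclusion is limited to the charsets listed.
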
> 
> BTW, if someone is really interested in CJK encoding, look at:
>   <http://www.ora.com/people/authors/lunde/cjk_inf.html>
> and download CJK.INF. Read:
>   PART 3: CJK ENCODING SYSTEMS
> 
> Regards,
> Guerrero, Juan Manuel
>
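
For an individual .po file, the two problem bytes discussed in this
thread (a CR not followed by LF, and ^Z) can be located by reading the
file in binary mode.  The following is a rough sketch under that
assumption; the function names are hypothetical and it does not model
what any particular libc does in text mode.

```python
def find_text_mode_hazards(data):
    """Return (offset, kind) pairs for every CR that is not followed by
    LF and for every ^Z (0x1A) byte in `data`."""
    hazards = []
    for i, byte in enumerate(data):
        if byte == 0x0D and data[i + 1:i + 2] != b"\n":
            hazards.append((i, "lone CR"))
        elif byte == 0x1A:
            hazards.append((i, "^Z"))
    return hazards


def scan_file(path):
    """Read `path` in binary mode, so that no newline translation is
    applied, and report the hazards found in it."""
    with open(path, "rb") as f:
        return find_text_mode_hazards(f.read())
```

A file for which scan_file() returns an empty list has only LF or CRLF
line endings and no ^Z, so by the reasoning above it should be safe to
open in text mode.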