Mail Archives: djgpp-workers/2001/02/25/01:13:28
Handa-san, could you please tell if the encodings used for *.po files can
have lone CR characters or ^Z characters, as per the discussion below? I
remember you said something about that (in the context of Ediff's use of
"diff --binary"), but I forget what was the bottom line.
This discussion is about a port of GNU gettext and libiconv to MS-DOS and
MS-Windows.
Thanks in advance.
On Sat, 24 Feb 2001, Juan Manuel Guerrero wrote:
> On Fri, 23 Feb 2001 09:44:52 +0200, Eli Zaretskii:
>
> > > Of course, you are right. All the pertinent DJGPP libc functions
> > > work as you have described (they recognize CRLF *and* LF as '\n' if
> > > the file has been fopen()'ed in text mode), making the code I have
> > > added redundant and superfluous.
> >
> > One caveat: can the files that are read as text have unprintable
> > characters, such as lone CRs or ^Z? If they can, text mode is not
> > reliable enough to be used with such files.
>
> This was the reason why I had opened all files in binary mode.
> This was the way it was done in the DJGPP port of gettext 0.10.35.
> But IMHO we can drop this. The only text files we will deal with
> are the .po files. These files are usually created with two types
> of charsets: non-Asian single-byte charsets like iso-8859-xx,
> cpxxx and koi8-r/u, and the Asian double-byte charsets like
> big5, euc-{cn,jp,kr}, JIS-X-0208, shift-JIS, CP9XX (maybe I have forgotten some).
> There should be no difficulty with the iso/cpxxx/koi8 written .po files.
> IMHO, if .po files written in iso/cpxxx/koi8 contain *lone* CRs or ^Z, then
> they are broken. The Asian charsets are usually double-byte charsets.
> The question is whether the bytes 0x00 to 0x20 are used by the charset
> or not. Usually Asian characters are coded using two or more bytes.
> These byte pairs are usually organized into a 94 x 94 matrix. Sometimes this
> matrix is placed starting at 0xA1 (as in the EUC encodings); sometimes it is
> placed starting at 0x21, as in 7-bit ISO-2022 (shift-JIS, by contrast, is an
> 8-bit encoding with its own byte ranges). Some of the charsets
> use ESC, some others use tilde (~), and some others use shift in (SI) and shift out (SO)
> as control characters to select different "character planes". *No* character
> set uses CR, LF or Ctrl-Z in any combination, neither for character encoding
> nor as a control sequence.
> In conclusion: as long as *only* the single-byte and double-byte charsets
> described above are used, we can safely open .po files in text mode.
>
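The claim above can be checked mechanically. What follows is only a sketch: the byte ranges in the table are the commonly documented ones for these encodings, typed in from memory, and should be verified against CJK.INF or the relevant standards before anyone relies on them. The names `ranges`, `byte_in_ranges` and `count_conflicts` are made up for this illustration and are not part of gettext or libiconv.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes that DOS text-mode reads treat specially: LF, CR and ^Z. */
static const unsigned char risky[] = { 0x0A, 0x0D, 0x1A };

struct byte_range { const char *name; unsigned char lo, hi; };

/* Assumed byte ranges used by the multibyte encodings discussed above,
   plus the control bytes some of them use to switch character planes. */
static const struct byte_range ranges[] = {
  { "ISO-2022 7-bit graphic bytes", 0x21, 0x7E },
  { "EUC lead and trail bytes",     0xA1, 0xFE },
  { "EUC-JP single shifts SS2/SS3", 0x8E, 0x8F },
  { "Shift-JIS lead bytes (low)",   0x81, 0x9F },
  { "Shift-JIS lead bytes (high)",  0xE0, 0xFC },
  { "Shift-JIS trail bytes (low)",  0x40, 0x7E },
  { "Shift-JIS trail bytes (high)", 0x80, 0xFC },
  { "Big5 lead bytes",              0xA1, 0xF9 },
  { "Big5 trail bytes (low)",       0x40, 0x7E },
  { "Big5 trail bytes (high)",      0xA1, 0xFE },
  { "SO and SI",                    0x0E, 0x0F },
  { "ESC",                          0x1B, 0x1B }
};

/* Return 1 if byte b falls inside any range of the table, else 0. */
static int byte_in_ranges(unsigned char b)
{
  size_t i;

  for (i = 0; i < sizeof ranges / sizeof ranges[0]; i++)
    {
      assert(ranges[i].lo <= ranges[i].hi);  /* sanity check of the table */
      if (b >= ranges[i].lo && b <= ranges[i].hi)
        return 1;
    }
  return 0;
}

/* Count how many of the risky bytes are claimed by some encoding. */
static int count_conflicts(void)
{
  size_t i;
  int conflicts = 0;

  for (i = 0; i < sizeof risky / sizeof risky[0]; i++)
    conflicts += byte_in_ranges(risky[i]);
  return conflicts;
}
```

If the table is right, count_conflicts() returns 0, which is just another way of saying that none of these encodings can produce a CR, LF or ^Z byte as part of a character or a shift sequence. Note that ESC (0x1B) sits right next to ^Z (0x1A) but is a different byte.
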
> BTW, if someone is really interested in CJK encoding, look at:
> <http://www.ora.com/people/authors/lunde/cjk_inf.html>
> and download CJK.INF. Read:
> PART 3: CJK ENCODING SYSTEMS
>
> Regards,
> Guerrero, Juan Manuel
>
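If there is any doubt about a particular .po file, it is easy to test it before trusting text mode. Here is a minimal sketch, assuming the caller has opened the stream in binary mode (for example with fopen (name, "rb")) so that the library does not strip CRs or stop at ^Z. The names `scan_result` and `scan_stream` are hypothetical, they are not taken from the gettext sources.

```c
#include <assert.h>
#include <stdio.h>

/* What the scan found: CRs not followed by LF, and ^Z bytes. */
struct scan_result { long lone_cr; long ctrl_z; };

/* Read fp to end of file and count lone CR and ^Z bytes.
   fp must have been opened in binary mode by the caller. */
static struct scan_result scan_stream(FILE *fp)
{
  struct scan_result r = { 0, 0 };
  int c;
  int prev_cr = 0;  /* nonzero if the previous byte was a CR */

  assert(fp != NULL);
  while ((c = getc(fp)) != EOF)
    {
      if (prev_cr && c != 0x0A)
        r.lone_cr++;            /* the CR was not part of a CRLF pair */
      if (c == 0x1A)
        r.ctrl_z++;
      prev_cr = (c == 0x0D);
    }
  if (prev_cr)
    r.lone_cr++;                /* a CR as the very last byte is also lone */
  return r;
}
```

A file for which both counters come back zero should read the same in text mode as in binary mode, apart from the CRLF to LF conversion.
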