Date: Sun, 25 Feb 2001 08:06:51 +0200 (IST)
From: Eli Zaretskii
X-Sender: eliz AT is
To: Kenichi Handa
cc: Bruno Haible , Juan Manuel Guerrero , djgpp-workers AT delorie DOT com
Subject: Re: gettext pretest available
In-Reply-To: <2E9C0C3501E@HRZ1.hrz.tu-darmstadt.de>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Reply-To: djgpp-workers AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: djgpp-workers AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

Handa-san, could you please tell if the encodings used for *.po files
can have lone CR characters or ^Z characters, as per the discussion
below? I remember you said something about that (in the context of
Ediff's use of "diff --binary"), but I forget what the bottom line was.

This discussion is about a port of GNU gettext and libiconv to MS-DOS
and MS-Windows.

Thanks in advance.

On Sat, 24 Feb 2001, Juan Manuel Guerrero wrote:

> On Fri, 23 Feb 2001 09:44:52 +0200, Eli Zaretskii:
>
> > > Of course, you are right. All the pertinent DJGPP libc functions
> > > work as you have described (they recognize CRLF *and* LF as '\n' if
> > > the file has been fopen()'ed in text mode) making the code I have
> > > added redundant and superfluous.
> >
> > One caveat: can the files that are read as text have unprintable
> > characters, such as lone CRs or ^Z? If they can, text mode is not
> > reliable enough to be used with such files.
>
> This was the reason why I had opened all files in binary mode.
> This was the way it was done in the DJGPP port of gettext 0.10.35.
> But IMHO we can drop this. The only text files we will deal with
> are the .po files. These files are usually created with two types
> of charsets: non-asiatic single byte charsets like iso-8859-xx,
> cpxxx and koi8-r/u, and the asiatic double byte charsets like
> big5, euc-{cn,jp,kr}, JIS-X-0208, shift-JIS, CP9XX (maybe I have forgotten some).
> There should be no difficulty with the iso/cpxxx/koi8 written .po files.
> IMHO, if iso/cpxxx/koi8 written .po files contain *lone* CRs or ^Z then
> they are broken. The asiatic charsets are usually double byte charsets.
> The question arises whether ASCII(0x00) to ASCII(0x20) is used in the charset
> or not. Usually asiatic characters are coded using two or more bytes.
> These byte pairs are usually organized into a 94 x 94 matrix. This matrix
> is placed starting at ASCII(0xA0) sometimes. Sometimes it is placed starting
> at ASCII(0x21); this is 7-bit ISO-2022 AKA shift-JIS. Some of the charsets
> use ESC, some others use tilde (~), some others use shift in (SI) and shift out (SO)
> as control characters to select different "character planes". *No* character
> set uses CR, LF or Ctrl-Z in any combination, neither for character encoding
> nor as a control sequence.
> In conclusion: as long as *only* the above described single byte and double byte
> charsets are used, we can safely open .po files in text mode.
>
> BTW, if someone is really interested in CJK encoding, look at:
>
> and download CJK.INF. Read:
> PART 3: CJK ENCODING SYSTEMS
>
> Regards,
> Guerrero, Juan Manuel
>
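
For anyone who would rather check a given .po file than reason about
its charset, here is a minimal sketch in C. It assumes the file has
already been read into memory in binary mode ("rb"), so that no
translation has happened yet. The names scan_hazards and hazard_counts
are made up for illustration; they are not part of gettext, libiconv or
the DJGPP libc.

```c
#include <assert.h>
#include <stddef.h>

/* Counts of bytes that a DOS-style text-mode read may mishandle.
   Hypothetical helper type, not taken from any existing library. */
struct hazard_counts {
    size_t lone_cr;  /* 0x0D not immediately followed by 0x0A */
    size_t ctrl_z;   /* 0x1A, which text mode may treat as end of file */
};

/* Scan a buffer that was read in binary mode and count the two kinds
   of bytes discussed above. A CR that is the last byte of the buffer
   is counted as lone, because no LF follows it. */
struct hazard_counts scan_hazards(const unsigned char *buf, size_t len)
{
    struct hazard_counts c = {0, 0};
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if (buf[i] == 0x0D) {
            if (i + 1 == len || buf[i + 1] != 0x0A)
                c.lone_cr++;
        } else if (buf[i] == 0x1A) {
            c.ctrl_z++;
        }
    }
    return c;
}
```

If both counts come back zero for a file, opening it in text mode should
not lose anything on account of these two characters. Note that this is
a plain byte scan: it does not know about multibyte encodings, so it
errs on the side of reporting a hazard.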