Mail Archives: djgpp/2002/02/20/12:46:32.1
Quoting my earlier post (> >) and Eli Zaretskii (>):
> > Suppose two or more email messages are concatenated in one file? Suddenly some
> > parts are in Latin-1 and others not, and there may even be some Korean and
> > Chinese spam mixed in.
> That's what I call a garbled file. In such a file, when you see a byte
> with a code of, say, 161 decimal--how do you interpret it? 161 means one
> thing in Latin-1, but something different in cp437, and something else in
> cp850. Unless the file has some meta-information in it, saying how
> each part is encoded, there is no way you could display the text
> correctly.
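As a rough illustration of that ambiguity, here is a minimal Python sketch that decodes the same byte under each of the three encodings mentioned. The function name `interpretations` is made up for this example, and the codec names are the ones Python ships with. One caveat: in Python's codec tables cp437 and cp850 happen to agree on byte 161 itself, but they part ways on other bytes such as 155, so the sketch prints both.

```python
# Show how one byte value decodes under several single-byte encodings.
# "interpretations" is a hypothetical helper name, not from any library.

def interpretations(data: bytes, codecs=("latin-1", "cp437", "cp850")) -> dict:
    """Return {codec name: decoded text} for the same raw bytes."""
    return {name: data.decode(name) for name in codecs}

if __name__ == "__main__":
    for value in (161, 155):
        print(f"byte {value}:")
        for name, text in interpretations(bytes([value])).items():
            # repr() keeps control characters visible instead of printing them raw
            print(f"  {name:8} -> {text!r}")
```

Without some outside hint about which table applies, nothing in the byte itself says which of these readings is the right one.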
Still, a few big files look preferable to a lot of small files. Ideally the
charset can be determined from each message's own headers, but that header is
often missing, especially on Usenet. Some Unix-based mail programs can help sort
messages into separate categories, one category to a file or directory (I don't
really like the word "folder").
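Reading the charset from a message's headers could look something like the sketch below, which uses Python's standard `email` package. The helper name `decode_body` and the choice of Latin-1 as the fallback are assumptions made for illustration, and it only handles simple single-part messages.

```python
from email import message_from_bytes

def decode_body(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode a single-part message body using its declared charset.

    "decode_body" and the Latin-1 fallback are illustrative assumptions;
    when the header is missing, the fallback is only a guess.
    """
    msg = message_from_bytes(raw)
    # get_content_charset() returns None when no charset parameter is present
    charset = msg.get_content_charset() or fallback
    # decode=True undoes any transfer encoding and returns bytes;
    # it returns None for multipart messages, which this sketch does not handle
    payload = msg.get_payload(decode=True) or b""
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # The header named a charset Python does not know about
        return payload.decode(fallback, errors="replace")

if __name__ == "__main__":
    labelled = b"Content-Type: text/plain; charset=iso-8859-1\n\n\xa1Hola!\n"
    unlabelled = b"Subject: no charset here\n\n\xa1Hola!\n"
    print(repr(decode_body(labelled)))
    print(repr(decode_body(unlabelled, fallback="cp437")))
```

In the second call there is no header to go on, so the result depends entirely on which fallback was picked, which is exactly the problem with unlabelled messages.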
> > I think the HELLO file contains several different character sets in the same
> > file with (extended?) ANSI escape sequences to switch between them.
> Yes, but you need to encode the file with those escape sequences to be
> able to mix different languages in one file. ISO-2022, the encoding used
> for the HELLO file, can do that, as well as Unicode-based encodings such
> as UTF-8. Latin-1 and friends cannot.
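To make the difference concrete, here is a small Python sketch using the standard `iso2022_jp` codec (one member of the ISO-2022 family, chosen here only because Python ships it; the HELLO file itself uses more character sets than this). The sample strings and the helper name `try_encode` are made up for the example.

```python
# Compare how three encodings cope with text that mixes scripts.
# "try_encode" is a hypothetical helper name used only in this sketch.

ESC = b"\x1b"  # ISO-2022 switches character sets with escape sequences

def try_encode(text: str, codec: str):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

if __name__ == "__main__":
    mixed = "Hello \u3053\u3093\u306b\u3061\u306f"  # ASCII plus Japanese hiragana

    iso = try_encode(mixed, "iso2022_jp")
    print("iso2022_jp:", iso)                    # escape sequences mark each switch
    print("contains ESC:", ESC in iso)

    print("latin-1   :", try_encode(mixed, "latin-1"))  # None: no way to express it

    # UTF-8 needs no switching at all, even with Korean added to the mix
    everything = mixed + " \uc548\ub155 \u00a1Hola!"
    print("utf-8     :", try_encode(everything, "utf-8"))
```

The ISO-2022 output carries its own in-band markers for where each character set begins and ends, whereas Latin-1 has no mechanism for that and simply fails.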
Anybody think they can persuade Korean and Chinese spammers to adopt ISO-2022?
I laugh as I type this.