Mail Archives: djgpp/2002/04/30/08:00:17
xeon <xeon AT dacreations DOT cjb DOT net> wrote:
> I'm wondering how to determine if a file is a text file or a binary
> file, programmatically. I'm thinking about reading 4 bytes from the
> file and testing whether they're in the range of usual text ([a-z],
> [A-Z], etc.). The 4 bytes are read from the following locations: 1st
> byte, last byte, and 2 randomly selected offsets inside the file. Is
> this enough?
Quite probably not. It's both too picky and not picky enough.
It's too picky because a file can easily be a text file without
containing a single letter in the whole file. Think of a
spreadsheet-like collection of lots of numbers in decimal figures. Or
it could be in some strange foreign character mapping where almost all
letters have codes outside the ASCII range (like all that trashy spam
recently flooding the net all coming from a particular country in East
Asia).
It's not picky enough because there's a significant probability that
all four bytes you tested happen to be printable ASCII characters, but
none of the others is.
More generally speaking: *every* file can potentially be a binary file.
To give just one example where such tests are almost guaranteed to
fail: GNU's .info files. These files look like text files (not a
single non-ASCII character in the whole file, setting aside some
control-L and control-_ ones), but they really are binary files,
because there are fseek() offsets inside the files that would break if
the file were converted to DOS text format (each LF becoming CR/LF).
The DJGPP ports of info readers know how to deal with that problem,
but that was done only because it became a burden to keep telling
users to leave these binary files alone.
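To make the offset problem concrete, here is a minimal sketch of how a
byte position moves once every LF in front of it has grown into CR/LF.
The function name offset_after_crlf is made up for this illustration,
and nothing here reflects how real .info files or info readers store
or repair their offsets.

```c
#include <stddef.h>

/* Hypothetical helper: returns the position at which the byte found at
 * `offset` in `data` ends up after every LF before it has been expanded
 * to CR LF, as happens when a file is converted to DOS text format. */
static size_t offset_after_crlf(const char *data, size_t offset)
{
    size_t i;
    size_t shifted = offset;

    for (i = 0; i < offset; i++) {
        if (data[i] == '\n')
            shifted++;          /* one extra CR inserted before this LF */
    }
    return shifted;
}
```

For the data "ab\ncd\nef", the 'e' sits at offset 6 before conversion
and at offset 8 afterwards, so a stored offset of 6 would then land on
the CR of the second line ending instead of on the 'e'.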
The usual trick is as Eli described it: read some chunk of the file
and check for any strictly forbidden characters that cannot ever
appear in text files, regardless of their encoding, e.g. null bytes.
The free zip/unzip tools read the first kilobyte for this purpose, IIRC.
That, too, obviously cannot be failsafe, but it works well most of the
time.
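A rough sketch of that heuristic in C might look like the following.
The names buffer_has_nul, file_looks_binary and SNIFF_BYTES are
invented for this example, the 1024-byte chunk size is an assumption
taken from the "first kilobyte" figure above, and this is not the
actual zip/unzip code. It only looks for null bytes; other forbidden
characters could be added to the same check.

```c
#include <stdio.h>
#include <string.h>

#define SNIFF_BYTES 1024    /* assumed chunk size: the first kilobyte */

/* Returns 1 if the buffer contains at least one null byte, else 0. */
static int buffer_has_nul(const unsigned char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}

/* Returns 1 if the file is probably binary, 0 if it is probably text,
 * and -1 if it could not be opened or read. */
static int file_looks_binary(const char *path)
{
    unsigned char buf[SNIFF_BYTES];
    size_t got;
    int failed;
    FILE *fp = fopen(path, "rb");   /* binary mode: no CR/LF translation */

    if (fp == NULL)
        return -1;

    got = fread(buf, 1, sizeof buf, fp);
    failed = ferror(fp);
    fclose(fp);

    if (failed)
        return -1;
    return buffer_has_nul(buf, got);
}
```

The result is only a hint. A file like the .info example above would
still come back as "probably text", and a binary file whose first
kilobyte happens to contain no null bytes would too.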
--
Hans-Bernhard Broeker (broeker AT physik DOT rwth-aachen DOT de)
Even if all the snow were burnt, ashes would remain.