Mail Archives: djgpp/2002/04/30/08:00:17

X-Authentication-Warning: delorie.com: mailnull set sender to djgpp-bounces using -f
From: Hans-Bernhard Broeker <broeker AT physik DOT rwth-aachen DOT de>
Newsgroups: comp.os.msdos.djgpp
Subject: Re: how to determine if a file is text/binary
Date: 30 Apr 2002 11:55:04 GMT
Organization: Aachen University of Technology (RWTH)
Lines: 43
Message-ID: <aam0mo$72o$1@nets3.rz.RWTH-Aachen.DE>
References: <c21a43ff DOT 0204291216 DOT 53eaf67c AT posting DOT google DOT com>
NNTP-Posting-Host: acp3bf.physik.rwth-aachen.de
X-Trace: nets3.rz.RWTH-Aachen.DE 1020167704 7256 137.226.32.75 (30 Apr 2002 11:55:04 GMT)
X-Complaints-To: abuse AT rwth-aachen DOT de
NNTP-Posting-Date: 30 Apr 2002 11:55:04 GMT
Originator: broeker@
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp
Reply-To: djgpp AT delorie DOT com

xeon <xeon AT dacreations DOT cjb DOT net> wrote:
> I'm wondering, how to determine is a file is a text file, or a binary
> file, programatically. I'm thinking about reading 4 bytes from the
> file and test them if they're in the range of usual text ([a-z],
> [A-Z], etc. The 4 bytes is read from the following locations : 1st
> byte, last byte, and 2 randomly selected offset inside the file. Is
> this enough?

Quite probably not.  It's both too picky and not picky enough.

It's too picky because a file can easily be a text file without
containing a single letter in the whole file.  Think of a
spreadsheet-like collection of lots of numbers in decimal figures.  Or
the text could be in some foreign character encoding where almost all
letters have codes outside the ASCII range (like all that trashy spam
recently flooding the net all coming from a particular country in East
Asia).

It's not picky enough because there's a significant probability that
all four bytes you tested happen to be printable ASCII characters, but
none of the others is.

More generally speaking: *every* file can potentially be a binary file.

To give just one example where such tests are almost guaranteed to
fail: GNU's .info files.  These files look like text files (not a
single non-ASCII character in the whole file, setting aside some
control-L and control-_ ones), but they really are binary files,
because there are fseek() offsets inside the files that would break if
the file were converted to DOS text format.  The DJGPP ports of info
readers know how to deal with that problem, but that was done only
because it became a burden to keep telling users to leave these binary
files alone.
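
To make the offset problem concrete, here is a rough sketch (not taken
from any info reader) of how a stored byte offset drifts once every
newline becomes a CR/LF pair.  The function name offset_after_crlf is
made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a byte offset in LF-only text to the offset
   the same character would have after every "\n" becomes "\r\n". */
static size_t offset_after_crlf(const char *text, size_t lf_offset)
{
    size_t i, extra = 0;

    assert(text != NULL);
    for (i = 0; i < lf_offset && text[i] != '\0'; i++)
        if (text[i] == '\n')
            extra++;            /* each earlier newline adds one CR byte */
    return lf_offset + extra;
}
```

An offset recorded inside the file still holds the old value after such
a conversion, so an fseek() to it lands one byte short for every line
that precedes the target.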

The usual trick is as Eli described it: read some chunk of the file
and check for any strictly forbidden characters that cannot ever
appear in text files, regardless of their encoding, e.g. null bytes.  The
free zip/unzip tools read the first kilobyte for this purpose, IIRC.
That, too, obviously cannot be failsafe, but it works well most of the
time.
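
A minimal sketch of that heuristic in C might look like the following.
It is not the actual zip/unzip code; the names PROBE_SIZE,
buffer_has_nul and file_looks_binary are invented here, and the
one-kilobyte probe size just follows the recollection above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PROBE_SIZE 1024     /* assumed probe size: the first kilobyte */

/* Return 1 if the buffer contains a null byte, 0 otherwise. */
static int buffer_has_nul(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return len > 0 && memchr(buf, '\0', len) != NULL;
}

/* Return 1 if the file looks binary, 0 if it looks like text,
   -1 if it could not be opened or read. */
static int file_looks_binary(const char *path)
{
    unsigned char buf[PROBE_SIZE];
    size_t n;
    int err;
    FILE *fp = fopen(path, "rb");   /* "rb": no CR/LF or ^Z translation */

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    err = ferror(fp);
    fclose(fp);
    if (err)
        return -1;
    return buffer_has_nul(buf, n);
}
```

Note that an empty file, or one whose first kilobyte happens to contain
no null byte, is reported as text, which is exactly the kind of
misjudgement this heuristic cannot avoid.
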
-- 
Hans-Bernhard Broeker (broeker AT physik DOT rwth-aachen DOT de)
Even if all the snow were burnt, ashes would remain.
