Mail Archives: djgpp-workers/1998/09/15/19:18:36

Message-ID: <35FEF59C.9B4D8BFC@vlsi.com>
Date: Tue, 15 Sep 1998 16:17:48 -0700
From: Charles Marslett <charles DOT marslett AT vlsi DOT com>
MIME-Version: 1.0
To: DJ Delorie <dj AT delorie DOT com>
CC: djgpp-workers AT delorie DOT com
Subject: Re: auto-binary-mode?
References: <199809152120 DOT RAA08510 AT delorie DOT com>

DJ Delorie wrote:
> 
> Hey, I just had an idea.  When a file is opened and the first block is
> read in, if the user didn't specify binary or text, why not look at
> the data and try to guess?  The presence of null, control, or certain
> 8-bit characters should indicate binary vs text as a default.

I have used that idea in the past (personal version of Microsoft's
compiler 6-8 years ago ;-).

The best single indicator of text vs. binary turned out not to be non-text
byte values, though.  Lots of text files had PC graphics characters
in them and they often clustered at the beginning of the file (title boxes
and such).  But I found that looking for at least 3 CR/LF pairs in the
first 512 bytes of the file worked pretty well (PC file format, of course).
It worked better if you relaxed the rule when lots of backspaces showed
up: I think I counted backspaces, and when the counter hit 100 I counted
that as a CR/LF pair, or some such thing.  If the CR/LF count was 0, 1,
or 2, I treated the file as binary; 3 or more indicated a text file.  (I
actually used assembly with scan instructions, so there really wasn't a
counter as such -- just where the program counter was.)  It's slower than
checking for a 'b' or 't' in the function call, but still pretty fast.
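
For illustration, here is a rough C sketch of the CR/LF counting rule described
above.  It is not the original assembly; the function name looks_like_text and
the constant names are made up for this example, and the backspace handling
follows my hazy recollection (every 100 backspaces count as one CR/LF pair).

```c
#include <stddef.h>

#define SNIFF_BYTES         512  /* only the first 512 bytes are examined */
#define MIN_CRLF_PAIRS      3    /* 3 or more pairs: treat as text */
#define BACKSPACES_PER_PAIR 100  /* assumed: 100 backspaces count as one pair */

/* Hypothetical helper: returns 1 if buf looks like text, 0 if binary. */
int looks_like_text(const unsigned char *buf, size_t len)
{
    size_t i, pairs = 0, backspaces = 0;

    if (len > SNIFF_BYTES)
        len = SNIFF_BYTES;

    for (i = 0; i < len; i++) {
        if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n') {
            pairs++;
            i++;                         /* skip the LF of this pair */
        } else if (buf[i] == '\b') {
            if (++backspaces == BACKSPACES_PER_PAIR) {
                pairs++;                 /* relaxed rule for overstruck text */
                backspaces = 0;
            }
        }
        if (pairs >= MIN_CRLF_PAIRS)
            return 1;                    /* enough line endings: text */
    }
    return 0;                            /* 0, 1 or 2 pairs: binary */
}
```

A caller would pass the first block read from the file and use the result
only when the user gave neither 'b' nor 't'.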

I also stuck in some extra tests that looked for a few signatures I knew
I was going to be reading ("PK" for ZIP files and "MZ" for executables,
for example).  At least back in those days, nulls were not a good binary
file indicator because they seemed to occur in a lot of files captured
from stdout streams with ">".  Like most ad hoc designs, it got pretty
baroque by the time I stopped using it.
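
The signature tests amount to comparing the first bytes of the buffer against
a short table.  A minimal sketch, assuming only the two signatures mentioned
above; the name has_binary_signature is hypothetical, and a real table would
grow with whatever formats you expect to read:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf starts with a known binary signature. */
int has_binary_signature(const unsigned char *buf, size_t len)
{
    /* "PK" begins ZIP archives, "MZ" begins DOS executables. */
    static const char *const signatures[] = { "PK", "MZ" };
    size_t i;

    for (i = 0; i < sizeof signatures / sizeof signatures[0]; i++) {
        size_t n = strlen(signatures[i]);
        if (len >= n && memcmp(buf, signatures[i], n) == 0)
            return 1;
    }
    return 0;
}
```

Running this check before any line-ending count lets a known binary format
short-circuit the guess.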

Looking for the byte values 01, 02, 03, FF and FE (hex) might work pretty
well, though.
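
As a sketch of that last suggestion (untested against real files, so treat it
as a guess rather than a proven rule), a scan for those five byte values could
look like this.  The name has_suspect_byte is made up for the example:

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if any of 0x01, 0x02, 0x03, 0xFE, 0xFF
   appears in the first len bytes of buf. */
int has_suspect_byte(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        switch (buf[i]) {
        case 0x01: case 0x02: case 0x03:
        case 0xFE: case 0xFF:
            return 1;   /* byte value unlikely in plain text */
        default:
            break;
        }
    }
    return 0;
}
```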

--Charles


