Mail Archives: djgpp-workers/1998/09/16/14:48:30

Message-ID: <3600096D.7CEF24DD@vlsi.com>
Date: Wed, 16 Sep 1998 11:54:37 -0700
From: Charles Marslett <charles DOT marslett AT vlsi DOT com>
MIME-Version: 1.0
To: Eli Zaretskii <eliz AT is DOT elta DOT co DOT il>
CC: djgpp-workers AT delorie DOT com
Subject: Re: auto-binary-mode?
References: <Pine DOT SUN DOT 3 DOT 91 DOT 980916122621 DOT 5874V-100000 AT is>

Eli Zaretskii wrote:
> 
> On Tue, 15 Sep 1998, Charles Marslett wrote:
> 
> > But I found that looking for at least 3 CR/LF pairs in the
> > first 512 bytes of the file worked pretty well (PC file format, of course)
> > and it worked better if you relaxed the rule when lots of backspaces showed
> > up (I think I counted backspaces and when the counter hit 100 I counted
> > that as a CR/LF pair or some such thing).  If the CR/LF counter was 0, 1
> > or 2 I had a binary file, more than that indicated a text file (I actually
> > used assembly with scan instructions, so there really wasn't a counter as
> > such -- just where the program counter was).
> 
> I think you are mixing two different issues: the Unix- vs DOS-style
> text files and the binary vs text files.  They are NOT the same, and
> thus using the approach you suggest would introduce subtle bugs and
> misfeatures into innocent programs like GCC, Gawk, Sed, etc.
> 
> A file that has CR/LF pairs can be a binary file (e.g., an executable
> image with text of multi-line messages inside it), but it is still a
> binary file.  OTOH, a text file can have Unix-style LF-only lines, and
> it still should be treated as text file (e.g., the ^Z character at its
> end should still be stripped).

Well, I was thinking of the issue as only being between Unix (binaryish)
text files and DOS text files.  The whole problem arises because on most
Unix systems one need not distinguish between text and binary files.  If
a file is a Unix-style text file with LF-only lines, then it should,
IMHO, never have a ^Z at the end, and should be processed by the OS-ish
part of the system as binary (ignoring issues based on the API call
used, such as end-of-line identification in gets(), for example).

That is, an application would never have been written with an "r" or "w"
fopen() call if it were important to distinguish between text and binary
I/O on the system the program was written for (a Unix most likely).

> GNU Emacs originally failed to distinguish between these two issues,
> which caused several headaches when Emacs 20 began to automatically
> detect and convert CR/LF to LF and back.  Guessing the EOL format is
> okay in text files, but reading binary files should be done with no
> guesswork and no conversions at all.  Since text files can be reliably
> read in text mode without any guessing at all, it isn't really needed.

I disagree.  The problem was and is that GNU Emacs does not have an
inherent way of specifying a file as being text or binary -- exactly the
problem addressed by "rb" or "rt", and the problem pointed out by you
and others with installing autodetect in the library.

No program that does an fopen() with "r" or "w" can possibly have a
concept of text and binary files.

As an unrelated side issue, what difference is there between a text and
a binary file in the Microsoft world except for the processing of the ^Z
and CR/LF characters?  (And, of course, the side effects of that
processing that leak into ftell(), fseek() and other functions that
depend on the number of characters read from or written to the file.)
Are there systems that need text identification other than for handling
the differences in end-of-line and end-of-file parsing?

--Charles


