Mail Archives: djgpp/2010/03/03/09:16:22

NNTP-Posting-Date: Wed, 03 Mar 2010 08:08:44 -0600
From: "Robbie Hatley" <see DOT my DOT signature AT for DOT my DOT contact DOT info>
Newsgroups: comp.os.msdos.djgpp
References: <2PydnQe72P4H_BrWnZ2dnUVZ_vmdnZ2d AT giganews DOT com> <hmbvg7$ieq$1 AT speranza DOT aioe DOT org> <Br-dnQ3aTcaMsRfWnZ2dnUVZ_uSdnZ2d AT giganews DOT com> <4b8ba6a1 AT news DOT x-privat DOT org>
Subject: Re: Bug in findfirst/findnext: mangles certain characters.
Date: Wed, 3 Mar 2010 06:08:26 -0800
Organization: Tustin Free Zone
Message-ID: <OKSdnaHEbqHx8BPWnZ2dnUVZ_h2dnZ2d@giganews.com>
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp
Reply-To: djgpp AT delorie DOT com

"Jason Hood" wrote:

> On 28/02/2010 18:43, Robbie Hatley wrote:
> > Try renaming THAT file using findfirst/findnext and rename().
> > It won't work.  (Findfirst will turn "Ìn-Ít" to "In-It",
> > for one thing.  rename() will then complain "no such file".)
>
> The problem is not with findfirst ...

Actually, I'm pretty sure it is.

> ... but with Windows.  It is /Windows/ that is renaming the file...

I don't know what you mean by "windows is renaming the file".
*I* am the one attempting to rename a file.

In the case of the example file name I gave,
Windows holds the file name to be:

   "Fíle-Nàme-Wíth-Måny-ïsõ-8859-1-Lëttêrs-Ìn-Ít.txt"

That is a fully-valid Windows Long File name, and Windows makes
no complaint about it and does not attempt to rename it.

Do you mean that Windows is remapping the character set
of the file name before handing it to findfirst()?
Are you *sure* of that?  Do you have the full source code
in front of you as you say that?  Have you thoroughly
inspected it?  Or, as I suspect, are you just *assuming*
that's what's happening?

Personally, I don't think that's what's happening at all.
I think the remapping is in the findfirst/findnext code.

In the case of the example file I give above, what
findfirst() returns is the following:

   "Fíle-Nàme-Wíth-Måny-ïsõ-8859-1-Lëttêrs-In-It.txt"

Do you see the 2 differences?

1. capital-I-grave-accent   -->   capital-I-no-accent
2. capital-I-acute-accent   -->   capital-I-no-accent

Non-injective, hence non-invertible, hence non-remappable.

And since no such file exists, rename() fails.
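
To illustrate why a many-to-one mapping cannot be undone, here is a
minimal C sketch.  It is NOT the DJGPP library code: fold_accented_I()
and names_collide() are hypothetical helpers, and the sketch assumes
the original name is in iso-8859-1, where capital-I-grave is byte 0xCC
and capital-I-acute is byte 0xCD.  It folds only those two bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT the real findfirst() code: folds the two
   iso-8859-1 bytes for accented capital I (0xCC, 0xCD) to plain 'I'.
   All other bytes are copied unchanged. */
void fold_accented_I(const unsigned char *in, unsigned char *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned char c = in[i];
        out[i] = (c == 0xCC || c == 0xCD) ? 'I' : c;
    }
}

/* Returns 1 if two DIFFERENT names of length n become identical after
   folding, i.e. the mapping is not injective for this pair.
   Returns 0 otherwise, or if n exceeds the 64-byte scratch buffers. */
int names_collide(const unsigned char *a, const unsigned char *b, size_t n)
{
    unsigned char fa[64], fb[64];
    if (n > sizeof fa)
        return 0;
    fold_accented_I(a, fa, n);
    fold_accented_I(b, fb, n);
    return memcmp(a, b, n) != 0 && memcmp(fa, fb, n) == 0;
}
```

Given the bytes { 0xCC, 'n', '-', 0xCD, 't' } and the plain string
"In-It", names_collide() reports a collision: once both have been
folded, nothing in the result says which of the two names was the
original.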

Also, I have determined that Code Pages have nothing to
do with it.  I tried the following experiment with two
different code pages, and determined that while the
file name from findfirst() LOOKS very different, if you
look at the raw numbers, they're the same regardless of
Code Page.  Code pages just re-map character sets
immediately before writing text to the screen.  They have
nothing to do with the encodings used in file names
or by findfirst()/findnext().

========================================================
Using Code Page 437:

wd=E:\TEST-R~1\FINDFI~1
%chcp 437
Active code page: 437

wd=E:\TEST-R~1\FINDFI~1
%findfirst-test

Fíle-Nàme-Wíth-Måny-ïso-8859-1-Lëttêrs-And-Symbols_¿¼½_x÷_In-It.txt
70, 161, 108, 101, 45, 78, 133, 109, 101, 45, 87, 161, 116, 104, 45, 77, 134, 11
0, 121, 45, 139, 115, 111, 45, 56, 56, 53, 57, 45, 49, 45, 76, 137, 116, 116, 13
6, 114, 115, 45, 65, 110, 100, 45, 83, 121, 109, 98, 111, 108, 115, 95, 168, 172
, 171, 95, 120, 246, 95, 73, 110, 45, 73, 116, 46, 116, 120, 116,

========================================================
Using Code Page 1252:

wd=E:\TEST-R~1\FINDFI~1
%chcp 1252
Active code page: 1252

wd=E:\TEST-R~1\FINDFI~1
%findfirst-test

F¡le-N.me-W¡th-M?ny-<so-8859-1-L?tt^rs-And-Symbols_¨¬«_xö_In-It.txt
70, 161, 108, 101, 45, 78, 133, 109, 101, 45, 87, 161, 116, 104, 45, 77, 134, 11
0, 121, 45, 139, 115, 111, 45, 56, 56, 53, 57, 45, 49, 45, 76, 137, 116, 116, 13
6, 114, 115, 45, 65, 110, 100, 45, 83, 121, 109, 98, 111, 108, 115, 95, 168, 172
, 171, 95, 120, 246, 95, 73, 110, 45, 73, 116, 46, 116, 120, 116,

========================================================

See?  Very different-looking presentations, but if you look
at just the numbers, THEY'RE EXACTLY THE SAME.  In either
case, the numbers contain exactly 2 errors.  But 2 errors
is 2 too many.
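
For anyone who wants to repeat the comparison, the "raw numbers" view
can be produced with something like the sketch below.  This is not the
source of findfirst-test; format_bytes() is a hypothetical helper that
takes any NUL-terminated name (for example, one returned by
findfirst()) and writes its bytes as unsigned decimal values.  It
assumes a C99 snprintf() is available.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes the bytes of name into out as unsigned
   decimal values separated by ", ".  Returns the number of characters
   written (not counting the terminating NUL), or -1 if out is too
   small to hold the whole list. */
int format_bytes(const char *name, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)name;
    size_t used = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';

    for (; *p != '\0'; p++) {
        /* Cast through unsigned char so bytes above 127 print as
           128..255 rather than as negative numbers. */
        int n = snprintf(out + used, outsize - used, "%s%u",
                         used ? ", " : "", (unsigned)*p);
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Because the output depends only on the byte values, it is unaffected
by whichever code page the console happens to be using for display.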

> ... so there is nothing that findfirst can do...

Sure there is.  It can stop remapping!

> The same problem even happens with Win32 console programs,
> when switching the file APIs to OEM.

I don't think that's the same thing at all.

> The ideal solution is to use a Windows Unicode program, so
> you might want to have a look at MinGW.

Unicode is not relevant here, since we're talking
about single-byte encodings.

And the ideal solution to the bug would be to fix it,
rather than to tell people to abandon djgpp as being useless.
Fixing it should not be that hard.

As near as I can tell, there's some remapping code in
findfirst()/findnext() which is remapping file names
to something very-close-to (but not quite) Code Page 437,
possibly because that's the default code page for
MS-DOS and Windows Command Consoles.  If the strings
from findfirst() are printed to the screen using that
code page (and no other), the file names then look almost
correct.
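
One way to check that guess is to decode the high bytes from the dump
above against the published CP437 table.  The following is a minimal
sketch only: cp437_to_latin1() is a hypothetical helper, and its table
covers just the ten high bytes that occur in the dump (a full table
would have 128 entries).  It returns the iso-8859-1 value of the same
character, or 0 for a high byte it does not list.

```c
#include <stddef.h>

/* Partial CP437 -> iso-8859-1 table: only the high bytes that appear
   in the findfirst-test dump.  Values are from the standard CP437 and
   iso-8859-1 code charts. */
struct cp_pair { unsigned char cp437; unsigned char latin1; };

static const struct cp_pair cp437_subset[] = {
    { 133, 0xE0 },  /* a grave      */
    { 134, 0xE5 },  /* a ring       */
    { 136, 0xEA },  /* e circumflex */
    { 137, 0xEB },  /* e diaeresis  */
    { 139, 0xEF },  /* i diaeresis  */
    { 161, 0xED },  /* i acute      */
    { 168, 0xBF },  /* inverted ?   */
    { 171, 0xBD },  /* one half     */
    { 172, 0xBC },  /* one quarter  */
    { 246, 0xF7 }   /* division     */
};

/* Hypothetical helper: ASCII passes through unchanged; listed high
   bytes are translated; unlisted high bytes yield 0. */
unsigned char cp437_to_latin1(unsigned char c)
{
    size_t i;
    if (c < 128)
        return c;
    for (i = 0; i < sizeof cp437_subset / sizeof cp437_subset[0]; i++) {
        if (cp437_subset[i].cp437 == c)
            return cp437_subset[i].latin1;
    }
    return 0;
}
```

Every high byte in the dump decodes to the expected letter or symbol
under CP437.  Note also that CP437 has no capital I with a grave or
acute accent at all, which would be consistent with those two letters
coming back as plain "I".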

But this is horribly broken, because rename() cannot
understand the remapped file names!  It expects the
original, un-remapped numerical values from the
Windows Long File Name.  If it gets a remapped version,
it will fail with error "File Not Found".

Ironically, for the names returned by findfirst()
to be USEFUL, if you print them to the screen they
should look like GIBBERISH in the default code page,
not even remotely similar to the correct file names.
(This is because CP437 is so drastically different
from iso-8859-1, which seems to be what Windows is
using for Long File Names, unless you force it to
use Unicode by using Hebrew or Chinese or some such.)

But in reality, the names returned by findfirst(), when
printed on the screen using CP437, look almost correct.
And that is a very bad sign.  It means that these
functions are broken, and cannot handle most Windows
Long File Names which contain characters over char(126).

-- 
Cheers,
Robbie Hatley
lonewolf at well dot com
www dot well dot com slant tilde lonewolf slant



