Mail Archives: djgpp/2010/02/28/02:16:32

X-Authentication-Warning: delorie.com: mail set sender to djgpp-bounces using -f
NNTP-Posting-Date: Sun, 28 Feb 2010 00:58:02 -0600
From: "Robbie Hatley" <see DOT my DOT signature AT for DOT my DOT contact DOT info>
Newsgroups: comp.os.msdos.djgpp
References: <2PydnQe72P4H_BrWnZ2dnUVZ_vmdnZ2d AT giganews DOT com> <5099c66a-fad4-42b6-8fb0-aaae2f01d35e AT 19g2000yqu DOT googlegroups DOT com>
Subject: Re: Bug in findfirst/findnext: mangles certain characters.
Date: Sat, 27 Feb 2010 23:00:07 -0800
Organization: Tustin Free Zone
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2800.1983
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1983
Message-ID: <wMadnRDVVN5njhfWnZ2dnUVZ_vednZ2d@giganews.com>
Lines: 173
X-Usenet-Provider: http://www.giganews.com
X-Trace: sv3-cEi0vlIwzjoQ1uAMPlsKOSp4sW/0o+8dh916gIG5qH+KSu1DBj1ptxvKjl/1xmqRsQpaL+JdLhh5DPQ!0lkOKsmaBSZbW5OEKtiHEvHOUOfLzcWPALY/Uex1iXX/AYbqFXo5y3YFlpHA/URGDhILZE2EIw==
X-Complaints-To: abuse AT giganews DOT com
X-DMCA-Notifications: http://www.giganews.com/info/dmca.html
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint properly
X-Postfilter: 1.3.40
Bytes: 7789
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp
Reply-To: djgpp AT delorie DOT com

"Rugxulo" wrote:

> DJGPP only properly supports "C" locale, e.g. 7-bit ASCII.
> Anything extra isn't available.

Not so; otherwise findfirst() and findnext() would not be
lossily converting ISO-8859-1 characters into some other
non-ASCII encoding, as they do.

For example, findfirst() and findnext() perform the following
conversions (among MANY others):

'à' = char(224)  findfirst/findnext convert to '.' = char(133)
'å' = char(229)  findfirst/findnext convert to '?' = char(134)
'í' = char(237)  findfirst/findnext convert to '¡' = char(161)
'ï' = char(239)  findfirst/findnext convert to '<' = char(139)

Note that many of the values BEING CONVERTED TO are over 127
(i.e., they're not ASCII).

Some of these conversions are clearly attempts to "map to
something that looks similar":

'À' = char(192)  findfirst/findnext convert to 'A' = char(65)
'Á' = char(193)  findfirst/findnext convert to 'A' = char(65)
'Â' = char(194)  findfirst/findnext convert to 'A' = char(65)
'Ã' = char(195)  findfirst/findnext convert to 'A' = char(65)

Unfortunately, this is a lossy conversion!  Since many DIFFERENT
characters are all converted to char(65), information is lost.
Even if I knew the encoding, there is no way to re-map, because
the original information has been discarded.
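To make the "no way to re-map" point concrete, here is a minimal C
sketch. It only encodes the eight byte pairs listed above; the struct
and function names (conversion, preimage_count, is_injective) are
made up for illustration and are not part of DJGPP. It counts how many
source bytes land on a given result byte, which is exactly what decides
whether a reverse table could exist.

```c
#include <assert.h>
#include <stddef.h>

/* One observed conversion: the byte in the real file name and the
   byte that findfirst()/findnext() handed back. */
struct conversion { unsigned char from; unsigned char to; };

/* Only the eight pairs listed in this post, not the full table. */
const struct conversion observed[] = {
    {224, 133}, {229, 134}, {237, 161}, {239, 139},
    {192,  65}, {193,  65}, {194,  65}, {195,  65}
};
const size_t observed_count = sizeof observed / sizeof observed[0];

/* Count how many distinct source bytes are converted to `target`. */
size_t preimage_count(const struct conversion *map, size_t n,
                      unsigned char target)
{
    size_t i, hits = 0;
    assert(map != NULL);
    for (i = 0; i < n; ++i)
        if (map[i].to == target)
            ++hits;
    return hits;
}

/* Return 1 if no two source bytes share a result (so the conversion
   could be undone with a reverse table), 0 otherwise. */
int is_injective(const struct conversion *map, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        if (preimage_count(map, n, map[i].to) > 1)
            return 0;
    return 1;
}
```

With this data, preimage_count(observed, observed_count, 65) is 4, so
a returned 'A' could have come from any of four different letters
(five, if a plain 'A' also passes through as 65). The first four pairs
taken alone are one-to-one and could in principle be reversed.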

> For pure DOS, you can try the third-party llocl102b.zip library,
> but even it may not work (haven't tested it much myself) and
> needs COUNTRY.SYS + DISPLAY + EGA?.CPI + KEYB or similar.
> (Henrique Peron of FreeDOS is the resident expert in this
> area, FYI, if you really really need help.)
>
> http://djgpp.cybermirror.org/current/v2tk/llocl02b.zip
> http://djgpp.cybermirror.org/current/v2tk/llocl02s.zip
>
> http://www.kostis.net/en/index.htm
> http://www.kostis.net/freeware/isocp101.zip
>
> isocp101.zip  V1.01
> 1993-12-19 ISO 8859-x code pages for MS-DOS

Do you mean actual MS-DOS?  Such as version 6.22?  I do
have that as one of the 3 OSs on my machine, but I rarely
use it.  (I have DOS on there mostly so I can run certain
cool old MS-DOS based games which don't work in Windows
Command Consoles because they use certain low-level features
of DOS which were not carried over to Win 9x/me/NT/2K/XP.)

> BTW, what Windows are you using? I'll guess XP. Anyways,
> I guess you know XP (even with FAT partitions?) uses UTF-16.
> So there is no Latin-1 there (nor was there any in Win9x
> either, cp850 is just an altered variant with most of the
> same glyphs).

I'm using Windows 2000.  It seems to be cognizant of what
encoding is being used in files, and uses whatever a file
is using.

Its long file names seem to be using Unicode, yes,
as they can use Hebrew (which makes it a devil of a time
to edit a file name, because pressing left arrow moves the cursor
RIGHT, and pressing right arrow moves the cursor LEFT, because
Hebrew is written right to left).

However, if I copy a Windows long file name which uses only
ISO-8859-1 glyphs (such as a file from Italy or France,
for example) to a text editor and save, and look at the
bytes, I find that:
1. Each glyph is represented by 1 byte (NOT Unicode)
2. Each glyph is represented by the ISO-8859-1 numerical
   code for that glyph.
So it looks from this that whatever Unicode version
Windows 2000 is using subsumes ISO-8859-1 as the first
256 entries of its mapping table.
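That guess matches how Unicode is defined: code points U+0000 through
U+00FF carry the same characters as ISO-8859-1 bytes 0 through 255.
A minimal sketch of what that means in C follows. The function names
are hypothetical, and the UTF-8 step is included only to show that the
code point stays the same while its byte encoding can differ.

```c
#include <assert.h>
#include <stddef.h>

/* The Unicode code point of an ISO-8859-1 byte is the byte value
   itself, zero-extended. */
unsigned long latin1_to_codepoint(unsigned char byte)
{
    return (unsigned long)byte;
}

/* Encode one ISO-8859-1 byte as UTF-8. `out` must have room for two
   bytes. Returns the number of bytes written (1 or 2). */
size_t latin1_to_utf8(unsigned char byte, unsigned char *out)
{
    assert(out != NULL);
    if (byte < 0x80) {            /* ASCII range: one byte, unchanged */
        out[0] = byte;
        return 1;
    }
    /* 0x80..0xFF: two-byte form 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (byte >> 6));
    out[1] = (unsigned char)(0x80 | (byte & 0x3F));
    return 2;
}
```

For example, 'é' is byte 233 in ISO-8859-1, code point U+00E9 in
Unicode, and the two bytes C3 A9 in UTF-8. Note that a text editor
saving one byte per glyph is probably writing the Windows "ANSI" code
page rather than raw Unicode; that is an assumption about the editor,
not something tested here.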

I don't expect findfirst()/findnext() to
correctly handle multibyte characters.  But I *DO* expect
that they will at least refrain from lossy remapping of
single-byte encodings.  They should just feed the numbers
through unaltered.  But they don't, and therein lies the
problem.

> So this is a problem of findfirst / findnext or
> of rename or both?

It's purely a bug in findfirst() and findnext().
rename() tries to find a file with the name returned
by findnext() only to fail to find any such file.
That's because findnext() mangled the file name!

> Does a simple findfirst / findnext app (e.g. ls.exe)
> report the names correctly?

For debugging, I put code in my programs that prints the names
that findfirst()/findnext() are returning; that's how I know
that the bug is in these functions.  For example, if I have
a file named "Roca-Marrón.txt" ("brown rock" in Spanish),
findfirst() gives "Roca-Marr¢n.txt" (substitutes "cent sign"
for "small-o-with-acute-accent").
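A debugging aid of the kind described above can be sketched as
follows. The helper escape_name() and the wrapper dump_matches() are
hypothetical names; only findfirst(), findnext(), struct ffblk and its
ff_name member come from DJGPP's <dir.h>, and that part is compiled
only under DJGPP. Printing every byte above 127 as a hex escape avoids
arguing about which glyph the console happens to draw. For the example
above, 'ó' is byte 0xF3 in ISO-8859-1, while the cent sign is byte
0xA2. It may be relevant that DOS code pages 437 and 850 put 'ó' at
0xA2, but this post does not establish that this is the cause.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy `name` into `out`, writing each byte above 127 as \xHH so the
   exact value is visible. Returns 1 on success, 0 if `out` is too
   small. */
int escape_name(const char *name, char *out, size_t out_size)
{
    size_t used = 0;
    assert(name != NULL && out != NULL);
    for (; *name != '\0'; ++name) {
        unsigned char c = (unsigned char)*name;
        size_t need = (c > 127) ? 4 : 1;
        if (used + need + 1 > out_size)   /* keep room for the NUL */
            return 0;
        if (c > 127)
            sprintf(out + used, "\\x%02X", (unsigned)c);
        else
            out[used] = (char)c;
        used += need;
    }
    if (used + 1 > out_size)
        return 0;
    out[used] = '\0';
    return 1;
}

#ifdef __DJGPP__
#include <dir.h>

/* Print every name matching `pattern` with non-ASCII bytes escaped.
   Attribute 0 is assumed here to select ordinary files. */
void dump_matches(const char *pattern)
{
    struct ffblk ff;
    char shown[4 * sizeof ff.ff_name + 1];
    int done = findfirst(pattern, &ff, 0);
    while (!done) {
        if (escape_name(ff.ff_name, shown, sizeof shown))
            printf("%s\n", shown);
        done = findnext(&ff);
    }
}
#endif
```

Under these assumptions, a name that comes back with the cent sign in
place of 'ó' would print as Roca-Marr\xA2n.txt, while an unmangled
ISO-8859-1 name would print as Roca-Marr\xF3n.txt.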

> Using iconv???

What is "iconv"?

If it's something that does reverse conversion,
that won't work, because the original conversion
was non-injective.
http://en.wikipedia.org/wiki/Injective

> > I'm curious if anyone has run across this bug before?
>
> Probably not English-only Americans like me. I've (very very)
> briefly dabbled in codepages "for fun" (Latin-3 ftw!), but
> nothing hardcore.    ;-)

If you have ever acquired files from other countries,
and if you write file-utility programs using djgpp,
and if you try feeding file names returned by findnext()
to rename(), then you'll run into this bug fast enough.

> ... GCC 4.2.3 ... February 1, 2008 ...
> That's not really old, IMHO.

I upgraded to latest and still get same problem.

> Does a simple "ren blah blah2" at the shell work?

Let me test that.  First, I'll make the file "C:\Tést.txt"
using a Windows Explorer window to "C:\".  Done.
Now I'll rename it in a command console:
%ren Tést.txt Test.txt
%_
No error.  Yes, Windows 2000 command console accurately
drew the "small e with acute accent" glyph for iso-8859-1
character code 233.  Yes, it correctly renamed the file.

In short, my operating system talks ISO-8859-1, but
findfirst()/findnext() don't.

> P.S. The best (only??) DJGPP program to really support
> i18n features is the text editor Mined (just released
> 2000.16). It probably has some stuff in there that you
> would find useful. Give it a whirl in addition
> to trying some of the above-mentioned stuff for
> completeness.
> http://www.towo.net/mined/

I'll check it out, but really the only things that would
help the problem I'm writing about here would be one of
the following:

1. Someone fixes the remapping bug in findfirst()/findnext()
   (or presents some way to "turn off" the remapping).

2. I learn of another library that has functions that
   correctly pull Windows long file names (at least,
   the ones that use single-byte encoding) into a C++ program.

-- 
Cheers,
Robbie Hatley
lonewolf at well dot com
www dot well dot com slant tilde lonewolf slant

