Mail Archives: djgpp/2010/02/28/02:16:32
"Rugxulo" wrote:
> DJGPP only properly supports "C" locale, e.g. 7-bit ASCII.
> Anything extra isn't available.
Not so, else findfirst() and findnext() would not be
lossily converting iso-8859-1 characters into some other
non-ASCII encoding, as they do.
For example, findfirst() and findnext() perform the following
conversions (among MANY others):
'à' = char(224) findfirst/findnext convert to '…' = char(133)
'å' = char(229) findfirst/findnext convert to '†' = char(134)
'í' = char(237) findfirst/findnext convert to '¡' = char(161)
'ï' = char(239) findfirst/findnext convert to '‹' = char(139)
Note that many of the values BEING CONVERTED TO are over 127
(i.e., they're not ASCII).
Some of these conversions are clearly attempts to "map to
something that looks similar":
'À' = char(192) findfirst/findnext convert to 'A' = char(65)
'Á' = char(193) findfirst/findnext convert to 'A' = char(65)
'Â' = char(194) findfirst/findnext convert to 'A' = char(65)
'Ã' = char(195) findfirst/findnext convert to 'A' = char(65)
Unfortunately, this is lossy conversion! Since many DIFFERENT
characters are all converted to char(65), information is lost.
Even if I knew the encoding, there is no way to re-map, because
the original information has been discarded.
> For pure DOS, you can try the third-party llocl02b.zip library,
> but even it may not work (haven't tested it much myself) and
> needs COUNTRY.SYS + DISPLAY + EGA?.CPI + KEYB or similar.
> (Henrique Peron of FreeDOS is the resident expert in this
> area, FYI, if you really really need help.)
>
> http://djgpp.cybermirror.org/current/v2tk/llocl02b.zip
> http://djgpp.cybermirror.org/current/v2tk/llocl02s.zip
>
> http://www.kostis.net/en/index.htm
> http://www.kostis.net/freeware/isocp101.zip
>
> isocp101.zip V1.01
> 1993-12-19 ISO 8859-x code pages for MS-DOS
Do you mean actual MS-DOS? Such as version 6.22? I do
have that as one of the 3 OSs on my machine, but I rarely
use it. (I have DOS on there mostly so I can run certain
cool old MS-DOS based games which don't work in Windows
Command Consoles because they use certain low-level features
of DOS which were not carried over to Win 9x/me/NT/2K/XP.)
> BTW, what Windows are you using? I'll guess XP. Anyways,
> I guess you know XP (even with FAT partitions?) uses UTF-16.
> So there is no Latin-1 there (nor was there any in Win9x
> either, cp850 is just an altered variant with most of the
> same glyphs).
I'm using Windows 2000. It seems to be cognizant of what
encoding is being used in files, and uses whatever a file
is using.
Its long file names seem to be using Unicode, yes,
as they can use Hebrew (which makes it a devil of a time
to edit a file name, because pressing left arrow moves cursor
RIGHT, and pressing right arrow moves cursor LEFT, because
Hebrew uses the other direction).
However, if I copy a Windows long file name which uses only
ISO-8859-1 glyphs (such as a file from Italy or France,
for example) to a text editor and save, and look at the
bytes, I find that:
1. Each glyph is represented by 1 byte (NOT unicode)
2. Each glyph is represented by the iso-8859-1 numerical
code for that glyph.
So it looks from this that whatever unicode version
Windows 2000 is using subsumes iso-8859-1 as the first
256 entries of its mapping table.
I won't expect of findfirst()/findnext() that they
correctly handle multibyte characters. But I *DO* expect
that they will at least refrain from lossy remapping of
single-byte encodings. They should just feed the numbers
through unaltered. But they don't, and therein lies the
problem.
> So this is a problem of findfirst / findnext or
> of rename or both?
It's purely a bug in findfirst() and findnext().
rename() tries to find a file with the name returned
by findnext() only to fail to find any such file.
That's because findnext() mangled the file name!
> Does a simple findfirst / findnext app (e.g. ls.exe)
> report the names correctly?
For debugging, I put code in my programs that prints the names
that findfirst()/findnext() are returning; that's how I know
that the bug is in these functions. For example, if I have
a file named "Roca-Marrón.txt" ("brown rock" in Spanish),
findfirst() gives "Roca-Marr¢n.txt" (substitutes "cent sign"
for "small-o-with-acute-accent").
> Using iconv???
What is "iconv"?
If it's something that does reverse conversion,
that won't work, because the original conversion
was non-injective.
http://en.wikipedia.org/wiki/Injective
> > I'm curious if anyone has run across this bug before?
>
> Probably not English-only Americans like me. I've (very very)
> briefly dabbled in codepages "for fun" (Latin-3 ftw!), but
> nothing hardcore. ;-)
If you have ever acquired files from other countries,
and if you write file-utility programs using djgpp,
and if you try feeding file names returned by findnext()
to rename(), then you'll run into this bug fast enough.
> ... GCC 4.2.3 ... February 1, 2008 ...
> That's not really old, IMHO.
I upgraded to latest and still get same problem.
> Does a simple "ren blah blah2" at the shell work?
Let me test that. First, I'll make file "C:\Tést.txt"
using a Windows Explorer window to "C:\". Done.
Now I'll rename it in a command console:
%ren Tést.txt Test.txt
%_
No error. Yes, Windows 2000 command console accurately
drew the "small e with acute accent" glyph for iso-8859-1
character code 233. Yes, it correctly renamed the file.
In short, my operating system talks iso-8859-1, but
findfirst()/findnext() don't.
> P.S. The best (only??) DJGPP program to really support
> i18n features is the text editor Mined (just released
> 2000.16). It probably has some stuff in there that you
> would find useful. Give it a whirl in addition
> to trying some of the above-mentioned stuff for
> completeness.
> http://www.towo.net/mined/
I'll check it out, but really the only things that would
help the problem I'm writing about here would be one of
the following:
1. Someone fixes the remapping bug in findfirst()/findnext()
(or presents some way to "turn off" the remapping).
2. I learn of another library that has functions that
correctly pull Windows long file names (at least,
the ones that use single-byte encoding) into a C++ program.
--
Cheers,
Robbie Hatley
lonewolf at well dot com
www dot well dot com slant tilde lonewolf slant