From: "Robbie Hatley"
Newsgroups: comp.os.msdos.djgpp
Subject: Re: Bug in findfirst/findnext: mangles certain characters.
Date: Sat, 27 Feb 2010 23:00:07 -0800
Organization: Tustin Free Zone
References: <2PydnQe72P4H_BrWnZ2dnUVZ_vmdnZ2d AT giganews DOT com> <5099c66a-fad4-42b6-8fb0-aaae2f01d35e AT 19g2000yqu DOT googlegroups DOT com>
To: djgpp AT delorie DOT com
Reply-To: djgpp AT delorie DOT com

"Rugxulo" wrote:

> DJGPP only properly supports "C" locale, e.g. 7-bit ASCII.
> Anything extra isn't available.

Not so, else findfirst() and findnext() would not be lossily
converting iso-8859-1 characters into some other non-ASCII
encoding, as they do. For example, findfirst() and findnext()
perform the following conversions (among MANY others):

'à' = char(224)   findfirst/findnext convert to   '.' = char(133)
'å' = char(229)   findfirst/findnext convert to   '?' = char(134)
'í' = char(237)   findfirst/findnext convert to   '¡' = char(161)
'ï' = char(239)   findfirst/findnext convert to   '<' = char(139)

Note that many of the values BEING CONVERTED TO are over 127
(ie, they're not ASCII).
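
The point about the converted values can be checked mechanically. The
following is only a minimal sketch in C: it hard-codes the four
conversions listed above and counts how many results fall outside
7-bit ASCII. It also compares them against the positions of the same
letters in DOS code page 437. That comparison is an assumption added
here for illustration, not something established in this message, and
all names (observed, count_non_ascii_results, cp437_code,
all_match_cp437) are hypothetical.

```c
#include <stddef.h>

/* One conversion reported above: Latin-1 code in, returned code out. */
struct conversion {
    unsigned char latin1;   /* iso-8859-1 code of the character in the name */
    unsigned char returned; /* code handed back by findfirst()/findnext()  */
};

/* The four conversions listed in this message (hypothetical table name). */
const struct conversion observed[] = {
    { 224, 133 }, /* a with grave     */
    { 229, 134 }, /* a with ring      */
    { 237, 161 }, /* i with acute     */
    { 239, 139 }, /* i with diaeresis */
};

#define OBSERVED_COUNT (sizeof observed / sizeof observed[0])

/* Count how many returned codes lie outside 7-bit ASCII (0..127). */
size_t count_non_ascii_results(void)
{
    size_t i, n = 0;
    for (i = 0; i < OBSERVED_COUNT; ++i)
        if (observed[i].returned > 127)
            ++n;
    return n;
}

/* ASSUMPTION: code page 437 positions for these four letters only.
   Returns 0 for any character not in this small illustrative list. */
unsigned char cp437_code(unsigned char latin1)
{
    switch (latin1) {
    case 224: return 0x85; /* a with grave     */
    case 229: return 0x86; /* a with ring      */
    case 237: return 0xA1; /* i with acute     */
    case 239: return 0x8B; /* i with diaeresis */
    default:  return 0;
    }
}

/* Return 1 if every observed result equals the assumed CP437 code. */
int all_match_cp437(void)
{
    size_t i;
    for (i = 0; i < OBSERVED_COUNT; ++i)
        if (cp437_code(observed[i].latin1) != observed[i].returned)
            return 0;
    return 1;
}
```

If all_match_cp437() returns 1, the observed values are at least
consistent with the names coming back in a DOS code page rather than
in iso-8859-1. That would suggest a place to look, but it does not by
itself show where the conversion happens.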

Some of these conversions are clearly attempts to "map to something
that looks similar":

'À' = char(192)   findfirst/findnext convert to   'A' = char(65)
'Á' = char(193)   findfirst/findnext convert to   'A' = char(65)
'Â' = char(194)   findfirst/findnext convert to   'A' = char(65)
'Ã' = char(195)   findfirst/findnext convert to   'A' = char(65)

Unfortunately, this is lossy conversion! Since many DIFFERENT
characters are all converted to char(65), information is lost.
Even if I knew the encoding, there is no way to re-map, because
the original information has been discarded.

> For pure DOS, you can try the third-party llocl102b.zip library,
> but even it may not work (haven't tested it much myself) and
> needs COUNTRY.SYS + DISPLAY + EGA?.CPI + KEYB or similar.
> (Henrique Peron of FreeDOS is the resident expert in this
> area, FYI, if you really really need help.)
>
> http://djgpp.cybermirror.org/current/v2tk/llocl02b.zip
> http://djgpp.cybermirror.org/current/v2tk/llocl02s.zip
>
> http://www.kostis.net/en/index.htm
> http://www.kostis.net/freeware/isocp101.zip
>
> isocp101.zip   V1.01
> 1993-12-19     ISO 8859-x code pages for MS-DOS

Do you mean actual MS-DOS? Such as version 6.22? I do have that
as one of the 3 OSs on my machine, but I rarely use it. (I have
DOS on there mostly so I can run certain cool old MS-DOS based
games which don't work in Windows Command Consoles because they
use certain low-level features of DOS which were not carried over
to Win 9x/me/NT/2K/XP.)

> BTW, what Windows are you using? I'll guess XP. Anyways,
> I guess you know XP (even with FAT partitions?) uses UTF-16.
> So there is no Latin-1 there (nor was there any in Win9x
> either, cp850 is just an altered variant with most of the
> same glyphs).

I'm using Windows 2000. It seems to be cognizant of what encoding
is being used in files, and uses whatever a file is using.
Its long file names seem to be using unicode, yes, as they can use
Hebrew (which makes it a devil of a time to edit a file name,
because pressing left arrow moves cursor RIGHT, and pressing right
arrow moves cursor LEFT, because Hebrew uses the other direction).

However, if I copy a Windows long file name which uses only
ISO-8859-1 glyphs (such as a file from Italy or France, for
example) to a text editor and save, and look at the bytes, I find
that:

1. Each glyph is represented by 1 byte (NOT unicode)
2. Each glyph is represented by the iso-8859-1 numerical code
   for that glyph.

So it looks from this that whatever unicode version Windows 2000
is using subsumes iso-8859-1 as the first 256 entries of its
mapping table.

I won't expect of findfirst()/findnext() that they correctly
handle multibyte characters. But I *DO* expect that they will
at least refrain from lossy remapping of single-byte encodings.
They should just feed the numbers through unaltered. But they
don't, and therein lies the problem.

> So this is a problem of findfirst / findnext or
> of rename or both?

It's purely a bug in findfirst() and findnext(). rename() tries
to find a file with the name returned by findnext(), only to fail
to find any such file. That's because findnext() mangled the
file name!

> Does a simple findfirst / findnext app (e.g. ls.exe)
> report the names correctly?

For debugging, I put code in my programs that prints the names
that findfirst()/findnext() are returning; that's how I know
that the bug is in these functions. For example, if I have a
file named "Roca-Marrón.txt" ("brown rock" in Spanish),
findfirst() gives "Roca-Marr¢n.txt" (substitutes "cent sign"
for "small-o-with-acute-accent").

> Using iconv???

What is "iconv"? If it's something that does reverse conversion,
that won't work, because the original conversion was non-injective.
http://en.wikipedia.org/wiki/Injective

> > I'm curious if anyone has run across this bug before?
>
> Probably not English-only Americans like me. I've (very very)
> briefly dabbled in codepages "for fun" (Latin-3 ftw!), but
> nothing hardcore. ;-)

If you have ever acquired files from other countries, and if you
write file-utility programs using djgpp, and if you try feeding
file names returned by findnext() to rename(), then you'll run
into this bug fast enough.

> ... GCC 4.2.3 ... February 1, 2008 ...
> That's not really old, IMHO.

I upgraded to the latest and still get the same problem.

> Does a simple "ren blah blah2" at the shell work?

Let me test that. First, I'll make file "C:\Tést.txt" using
a Windows Explorer window to "C:\". Done. Now I'll rename it
in a command console:

%ren Tést.txt Test.txt
%_

No error. Yes, Windows 2000 command console accurately drew
the "small e with acute accent" glyph for iso-8859-1 character
code 233. Yes, it correctly renamed the file.

In short, my operating system talks iso-8859-1, but
findfirst()/findnext() don't.

> P.S. The best (only??) DJGPP program to really support
> i18n features is the text editor Mined (just released
> 2000.16). It probably has some stuff in there that you
> would find useful. Give it a whirl in addition
> to trying some of the above-mentioned stuff for
> completeness.
> http://www.towo.net/mined/

I'll check it out, but really the only things that would help
the problem I'm writing about here would be one of the following:

1. Someone fixes the remapping bug in findfirst()/findnext()
   (or presents some way to "turn off" the remapping).
2. I learn of another library that has functions that correctly
   pull Windows long file names (at least, the ones that use
   single-byte encoding) into a C++ program.

-- 
Cheers,
Robbie Hatley
lonewolf at well dot com
www dot well dot com slant tilde lonewolf slant