Date: Thu, 30 Nov 2006 10:42:56 -0600
Subject: Windows NTFS UCS2 characters
From: John Love-Jensen
Delivered-To: mailing list cygwin AT cygwin DOT com

Hi Cygwin folks,

I have a Windows file on NTFS named (using \uXXXX representation):

xxx_\u212B_A\u030A_\u00C5_xxx.txt

# ls -alb xxx_*_xxx.txt
ls: xxx_\305_A\260_\305_xxx.txt: No such file or directory

Windows sees it just fine. The bash *-expansion is expanding it to /something/... just not a good something, it appears.

I can select the file in Explorer, and I can double-click on it to edit it. Use MS-Notepad (shudder -- Cygwin's Vim can't see the file either, neither passed on the command line nor through Vim's explorer; I don't have a Windows native Vim/gVim to test) to put some text in it. Save it.

But Cygwin / bash / ls finds that filename unpalatable.

Hmmm.

# echo -n xxx_*_xxx.txt | xxd -g 1
78 78 78 5F C5 5F 41 B0 5F C5 5F 78 78 78 2E 74 78 74
x  x  x  _  Ao _  A  ^o _  Ao _  x  x  x  .  t  x  t

(The character representation line was typed in by me, not xxd. I'm using Ao to represent the A-with-overcircle and ^o the combining overcircle.)

I presume Cygwin's bash operates using UTF8 encoded POSIX filenames. I expect the name should have been expanded as:

78 78 78 5F E2 84 AB 5F 41 CC 8A 5F C3 85 5F 78 78 78 2E 74 78 74
            ^^^^^^^^       ^^^^^    ^^^^^

E2 84 AB is UTF8 for \u212B
CC 8A is UTF8 for \u030A
C3 85 is UTF8 for \u00C5
(Assuming I didn't mess up)

Hmmm. Yep, it appears that xxx_*_xxx.txt is expanding funny.

# ls -alb -n xxx_$'\xE2\x84\xAB'_A$'\xCC\x8A'_$'\xC3\x85'_xxx.txt
ls: xxx_\342\204\253_A\314\212_\303\205_xxx.txt: No such file or directory

Drat.
Still no love. So even if hand-fed the UTF8 representation, ls is not able to digest the name. (Assuming I didn't mess up.)

Is there some sort of UCS2 or UTF8 or Unicode compatibility setting I need to set for Cygwin to be able to work in Windows' NTFS environment, when some filenames have some arbitrary UCS2 (Unicode 1.x, of course) characters?

I presume that somewhere something is set to CP1252 and causing grief.

Hmmm, I don't have LANG or LC_ALL (or any other LC_xxx) set. Maybe that's my problem. [Tries it.] Nope -- or I didn't do it correctly.

I can always fall back to using CMD.EXE scripts to manipulate these files, but I'd rather be able to do it in my Bash shell scripts.

Please don't suggest Interix, SFU or MKS alternatives. Those are fine products, I'm sure, but I'm not interested.

Thanks,
--Eljay

/* MSVS8: cl test.c */
#include <windows.h>
#include <stdio.h>

int main()
{
    /* Create file name that Cygwin does not like. */
    HANDLE h = CreateFileW(
        L"xxx_\u212B_A\u030A_\u00C5_xxx.txt",
        GENERIC_READ | GENERIC_WRITE,
        0,
        NULL,
        OPEN_ALWAYS,
        0,
        NULL);

    if (h == INVALID_HANDLE_VALUE)
    {
        fprintf(stderr, "Invalid handle\n");
    }
    else
    {
        fprintf(stderr, "Successfully opened\n");
        CloseHandle(h);
    }

    return 0;
}
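
For anyone who wants to double-check the hand-computed UTF8 bytes in the message above, here is a minimal sketch of a UTF8 encoder in plain C. The names utf8_encode and utf8_hex are hypothetical helpers made up for this illustration; they are not part of Cygwin or the Win32 API, and this says nothing about how Cygwin itself converts filenames.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   out must have room for 4 bytes. Returns the number of bytes written,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80UL) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)    /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000UL) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {                  /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: encode n code points and write the resulting bytes
   into dst as space-separated uppercase hex pairs.
   Returns 0 on success, -1 if dst is too small or a code point is invalid. */
int utf8_hex(const unsigned long *cps, size_t n, char *dst, size_t dstlen)
{
    size_t used = 0, i, j, len;
    unsigned char buf[4];

    if (dstlen == 0)
        return -1;
    dst[0] = '\0';
    for (i = 0; i < n; i++) {
        len = utf8_encode(cps[i], buf);
        if (len == 0)
            return -1;
        for (j = 0; j < len; j++) {
            if (used + 4 > dstlen)           /* "XX " plus terminating NUL */
                return -1;
            used += (size_t)sprintf(dst + used, "%02X ", (unsigned)buf[j]);
        }
    }
    if (used > 0)
        dst[used - 1] = '\0';                /* drop the trailing space */
    return 0;
}
```

Passing utf8_hex the code points 5F, 212B, 5F, 41, 030A, 5F, C5, 5F should give "5F E2 84 AB 5F 41 CC 8A 5F C3 85 5F", which agrees with the middle of the expected expansion worked out by hand in the message.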