Mail Archives: cygwin/2006/11/30/11:43:20

X-Spam-Check-By: sourceware.org
User-Agent: Microsoft-Entourage/11.2.5.060620
Date: Thu, 30 Nov 2006 10:42:56 -0600
Subject: Windows NTFS UCS2 characters
From: John Love-Jensen <eljay AT adobe DOT com>
To: <cygwin AT cygwin DOT com>
Message-ID: <C1946630.19DC4%eljay@adobe.com>
Mime-version: 1.0
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

Hi Cygwin folks,

I have a Windows file on NTFS named (using \uXXXX representation):
xxx_\u212B_A\u030A_\u00C5_xxx.txt

# ls -alb xxx_*_xxx.txt

ls: xxx_\305_A\260_\305_xxx.txt: No such file or directory

Windows sees it just fine.  Bash's *-expansion is expanding it to
/something/, just not a good something, it appears.

I can select the file in Explorer, and I can double-click it to edit it,
using MS-Notepad to put some text in it and save it.  (Shudder -- Cygwin's
Vim can't see the file either, neither when passed on the command line nor
through Vim's explorer; I don't have a Windows native Vim/gVim to test.)

But Cygwin / bash / ls finds that filename unpalatable.  Hmmm.

# echo -n xxx_*_xxx.txt | xxd -g 1

78 78 78 5F C5 5F 41 B0 5F C5 5F 78 78 78 2E 74 78 74
 x  x  x  _ Ao  _  A ^o  _ Ao  _  x  x  x  .  t  x  t

(The character representation line was typed in by me, not xxd.  Using Ao to
represent the A-with-overcircle, ^o combining overcircle.)

I presume Cygwin's bash operates using UTF8 encoded POSIX filenames.  I
expect the name should have been expanded as:

78 78 78 5F E2 84 AB 5F 41 CC 8A 5F C3 85 5F 78 78 78 2E 74 78 74
            ^^^^^^^^       ^^^^^    ^^^^^

E2 84 AB is UTF8 for \u212B
CC 8A is UTF8 for \u030A
C3 85 is UTF8 for \u00C5
(Assuming I didn't mess up)

Hmmm.  Yep, it appears that xxx_*_xxx.txt is expanding funny.

# ls -alb -n xxx_$'\xE2\x84\xAB'_A$'\xCC\x8A'_$'\xC3\x85'_xxx.txt

ls: xxx_\342\204\253_A\314\212_\303\205_xxx.txt: No such file or directory

Drat.  Still no love.  So even if hand fed the UTF8 representation, ls is
not able to digest the name.  (Assuming I didn't mess up.)

Is there some sort of UCS2 or UTF8 or Unicode compatibility setting I need
to set for Cygwin to be able to work in Windows' NTFS environment, when some
filenames have some arbitrary UCS2 (Unicode 1.x, of course) characters?

I presume that somewhere something is set to CP1252 and causing grief.

Hmmm, I have neither LANG nor LC_ALL (nor any other LC_xxx) set.  Maybe
that's my problem.  [Tries it.]  Nope -- or I didn't do it correctly.

I can always fall back to using CMD.EXE scripts to manipulate these files,
but I'd rather be able to do it in my Bash shell scripts.

Please don't suggest Interix, SFU or MKS alternatives.  Those are fine
products, I'm sure, but I'm not interested.

Thanks,
--Eljay

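/* MSVS8: cl test.c */
#include <Windows.h>
#include <stdio.h>

int main(void)
{
  /* Create (or open) a file whose name Cygwin does not like:
     U+212B ANGSTROM SIGN, A + U+030A COMBINING RING ABOVE,
     U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE. */
  HANDLE h = CreateFileW(
    L"xxx_\u212B_A\u030A_\u00C5_xxx.txt",
    GENERIC_READ | GENERIC_WRITE,
    0,                      /* no sharing */
    NULL,                   /* default security attributes */
    OPEN_ALWAYS,            /* create if missing, else open */
    FILE_ATTRIBUTE_NORMAL,
    NULL);                  /* no template file */

  if (h == INVALID_HANDLE_VALUE)
  {
    /* Report the Win32 error code so a failure can be diagnosed. */
    fprintf(stderr, "Invalid handle (error %lu)\n",
            (unsigned long)GetLastError());
    return 1;
  }

  fprintf(stderr, "Successfully opened\n");
  CloseHandle(h);
  return 0;
}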


--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Problem reports:       http://cygwin.com/problems.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/
