Mail Archives: cygwin/2010/01/23/10:07:26

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Sat, 23 Jan 2010 16:07:03 +0100
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: Please support CP932. (I have problem using subversion with SJIS)
Message-ID: <20100123150703.GY2402@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <e22ab97b1001222149r3c217decmb0da069d7049c896 AT mail DOT gmail DOT com> <20100123135020 DOT GW2402 AT calimero DOT vinschen DOT de>
MIME-Version: 1.0
In-Reply-To: <20100123135020.GW2402@calimero.vinschen.de>
User-Agent: Mutt/1.5.20 (2009-06-14)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Jan 23 14:50, Corinna Vinschen wrote:
> On Jan 23 14:49, Nayuta Taga wrote:
> > In short, '~' (U+007E TILDE) turns into U+203E (OVERLINE) when
> > LANG=ja_JP.SJIS.
> > 
> > Then I looked into cygwin and subversion again.
> > (1) cygwin1.dll converts L"foo~" (UCS-2) to "foo~" (CP932).
> > (2) Because subversion internally uses UTF-8,
> >     "foo~" (CP932) should be converted to "foo~" (UTF-8).
> > (3) It uses iconv to convert from *SJIS* to UTF-8,
> >     because nl_langinfo(CODESET) returns "SJIS" when LANG=ja_JP.SJIS.
> > (4) The final string is "foo\xe2\x80\xbe".
> >     (e2 80 be is UTF-8 representation of U+203E)
> 
> SJIS is the charset name for the Windows codepage 932.  The multibyte to
> widechar conversion (and vice versa) for SJIS even uses the Windows
> conversion functions under the hood.  And the character 0x7e in SJIS is
> identical to the Unicode character U+007e.
> 
> So, why does iconv turn U+007e into U+203E?
> 
> This sounds like a bug in iconv, not in Cygwin.  Your patch just adds an
> additional charset name CP932 for the exact same charset SJIS.  What
> this does is just cancel the recognition of the charset in iconv.  That
> sounds like a hack, rather than a solution.

Ouch.  I understand now.  Standard SJIS is *really* different from
Microsoft CP932 in two code points:

  CP932 0x5c == U+005C
  SJIS  0x5c == U+00A5

  CP932 0x7e == U+007E
  SJIS  0x7e == U+203E
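
To make the effect concrete, here is a minimal sketch in Python.  It does
not call iconv or any Cygwin function; the two tables and the helper
decode_low_range are hypothetical names introduced only for illustration,
and they cover just the single-byte range below 0x80.  It shows how the
bytes "foo~" end up as "foo\xe2\x80\xbe" once they are read as standard
SJIS and re-encoded as UTF-8.

```python
# Illustrative tables only: they model just the two bytes discussed above.
SJIS_OVERRIDES = {0x5C: 0x00A5, 0x7E: 0x203E}  # standard SJIS reading
CP932_OVERRIDES = {}                            # CP932 keeps plain ASCII here


def decode_low_range(data: bytes, overrides: dict) -> str:
    """Decode bytes below 0x80, applying the per-charset overrides."""
    out = []
    for b in data:
        if b >= 0x80:
            raise ValueError("sketch handles only bytes below 0x80")
        out.append(chr(overrides.get(b, b)))
    return "".join(out)


as_cp932 = decode_low_range(b"foo~", CP932_OVERRIDES)  # tilde stays U+007E
as_sjis = decode_low_range(b"foo~", SJIS_OVERRIDES)    # tilde becomes U+203E
utf8_from_sjis = as_sjis.encode("utf-8")               # what subversion would store

print(as_cp932, as_sjis, utf8_from_sjis)
```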

Bummer.  Actually the problem is SJIS, not CP932.  One of the basic
ideas in Cygwin is that every character set has at least an intact
ASCII code range.

Hmm.

Would it help in your case if Cygwin's SJIS conversion converted 0x5c to
U+00A5 and 0x7e to U+203E, so that the SJIS conversion is really correct
*and* bijective?  To me this sounds like a better solution than adding a
CP932 charset identifier.
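
A rough sketch of what "correct and bijective" would mean for the
single-byte range, again in Python and again under assumptions: the
tables TO_UNICODE and FROM_UNICODE and the two helpers are hypothetical
names, not Cygwin code, and only bytes below 0x80 are modelled.

```python
# Start from an identity mapping for bytes 0x00..0x7f, then apply the two
# standard-SJIS differences.
TO_UNICODE = {b: b for b in range(0x80)}
TO_UNICODE[0x5C] = 0x00A5  # YEN SIGN
TO_UNICODE[0x7E] = 0x203E  # OVERLINE

# The reverse table is just the inverted forward table.
FROM_UNICODE = {cp: b for b, cp in TO_UNICODE.items()}


def sjis_low_decode(data: bytes) -> str:
    """Bytes below 0x80 to text, using the forward table."""
    return "".join(chr(TO_UNICODE[b]) for b in data)


def sjis_low_encode(text: str) -> bytes:
    """Text back to bytes; raises KeyError for unmapped characters."""
    return bytes(FROM_UNICODE[ord(ch)] for ch in text)


all_low = bytes(range(0x80))
round_trip_ok = sjis_low_encode(sjis_low_decode(all_low)) == all_low
is_bijective = len(FROM_UNICODE) == len(TO_UNICODE)

print(round_trip_ok, is_bijective)
```

In this sketch every byte round-trips, but U+005C and U+007E have no
encoding at all, so the ASCII range is no longer intact for this charset.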


Cor

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
