Mail Archives: cygwin/2010/01/21/10:41:40

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Thu, 21 Jan 2010 16:41:20 +0100
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: Japanese/Chinese language question
Message-ID: <20100121154120.GF2402@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <20100121134055 DOT GE2402 AT calimero DOT vinschen DOT de> <f60fe001001210704o27f08b15lcb3456fb59822024 AT mail DOT gmail DOT com>
MIME-Version: 1.0
In-Reply-To: <f60fe001001210704o27f08b15lcb3456fb59822024@mail.gmail.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Jan 21 10:04, Mark J. Reed wrote:
> On Thu, Jan 21, 2010 at 8:40 AM, Corinna Vinschen wrote:
> > would somebody with Japanese and/or Chinese language background be so
> > When comparing strings linguistically (strcoll/wcscoll),
> > - are Hiragana and Katakana forms of the same character to be
> >  treated as equal or as different?
> 
> (Nit: they are not "the same character" in either the technical or
> traditional sense of "character"; they're the same syllable, but
> represented by different characters.)
> 
> From the Unicode point of view, they are distinct; there is no defined
> equivalence, either canonical or compatibility, between corresponding
> Katakana and Hiragana syllables.  The collation algorithm (which does
> take linguistic context into account) doesn't seem to say anything
> about such comparisons, though it's possible I missed something.
> 
> But as a precedent which might be helpful, I note that with
> linguistic sensitivity active, Oracle 10g does compare Hiragana and
> Katakana forms of the same syllable as equal.
> 
> > - are half-width and full-width forms of the same CJK character
> >  treated as equal or as different?
> 
> According to the Unicode normalization algorithm, half-width and
> full-width forms normalize to the same character, so they should be
> treated as equivalent.  From the point of view of Unicode, there is no
> semantic difference, and the width property is informative, not
> normative. It's primarily encoded in Unicode to preserve round-trip
> compatibility with other standards, though it's also helpful for hints
> to rendering algorithms.

Thanks for the info.  However...


  linux$ cat jp.c
  #include <stdio.h>
  #include <locale.h>
  #include <wchar.h>

  int
  main (int argc, char **argv)
  {
    setlocale (LC_ALL, "ja_JP.UTF-8");
    /* U+3042 = Hiragana letter A
       U+30a2 = Katakana letter A
       U+ff71 = Halfwidth Katakana letter A */
    printf ("%d\n", wcscoll (L"\x3042", L"\x30a2"));
    printf ("%d\n", wcscoll (L"\xff71", L"\x30a2"));
    return 0;
  }
  linux$ gcc jp.c -o jp
  linux$ ./jp
  -83
  -340

I expected at least one of the comparisons to return 0.
Am I doing something wrong?
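
For comparison, here is a minimal sketch of what "treated as equal" could
look like if done by hand instead of through the locale.  It is only an
illustration: fold_kana and kana_insensitive_cmp are hypothetical helper
names, the halfwidth handling covers just the five vowels U+FF71..U+FF75,
and the result is a plain code point comparison, not a linguistic
collation.  If the locale's collation data distinguishes the forms at any
level, wcscoll would presumably still return non-zero, which might be what
the output above shows.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: fold one character to Hiragana where a simple
   arithmetic mapping exists.  Everything else is returned unchanged.  */
wchar_t
fold_kana (wchar_t c)
{
  /* Halfwidth Katakana vowels A, I, U, E, O (U+FF71..U+FF75) map to the
     fullwidth letters U+30A2, U+30A4, U+30A6, U+30A8, U+30AA.  Other
     halfwidth forms need a lookup table and are ignored in this sketch.  */
  if (c >= 0xff71 && c <= 0xff75)
    c = 0x30a2 + 2 * (c - 0xff71);
  /* Katakana U+30A1..U+30F6 sit exactly 0x60 above the corresponding
     Hiragana U+3041..U+3096.  */
  if (c >= 0x30a1 && c <= 0x30f6)
    c -= 0x60;
  return c;
}

/* Hypothetical helper: compare two wide strings after folding every
   character.  Returns <0, 0 or >0 based on folded code point order.  */
int
kana_insensitive_cmp (const wchar_t *a, const wchar_t *b)
{
  for (;; a++, b++)
    {
      wchar_t ca = fold_kana (*a);
      wchar_t cb = fold_kana (*b);
      if (ca != cb)
        return ca < cb ? -1 : 1;
      if (ca == L'\0')
        return 0;
    }
}
```

With these helpers, kana_insensitive_cmp (L"\x3042", L"\x30a2") and
kana_insensitive_cmp (L"\xff71", L"\x30a2") both return 0, since all three
characters fold to U+3042.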


Corinna


-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
