Date: Thu, 21 Jan 2010 16:41:20 +0100
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: Japanese/Chinese language question
Message-ID: <20100121154120.GF2402@calimero.vinschen.de>
References: <20100121134055 DOT GE2402 AT calimero DOT vinschen DOT de>

On Jan 21 10:04, Mark J. Reed wrote:
> On Thu, Jan 21, 2010 at 8:40 AM, Corinna Vinschen wrote:
> > would somebody with Japanese and/or Chinese language background be so
> > When comparing strings linguistically (strcoll/wcscoll),
> > - are Hiragana and Katakana forms of the same character to be
> >   treated as equal or as different?
>
> (Nit: they are not "the same character" in either the technical or
> traditional sense of "character"; they're the same syllable, but
> represented by different characters.)
>
> From the Unicode point of view, they are distinct; there is no defined
> equivalence, either canonical or compatibility, between corresponding
> Katakana and Hiragana syllables.  The collation algorithm (which does
> take linguistic context into account) doesn't seem to say anything
> about such comparisons, though it's possible I missed something.
>
> But as a precedent which might be helpful, I note that with
> linguistic sensitivity active, Oracle 10g does compare Hiragana and
> Katakana forms of the same syllable as equal.
>
> > - are half-width and full-width forms of the same CJK character
> >   treated as equal or as different?
>
> According to the Unicode normalization algorithm, half-width and
> full-width forms normalize to the same character, so they should be
> treated as equivalent.  From the point of view of Unicode, there is no
> semantic difference, and the width property is informative, not
> normative.  It's primarily encoded in Unicode to preserve round-trip
> compatibility with other standards, though it's also helpful for hints
> to rendering algorithms.

Thanks for the info.  However...

  linux$ cat jp.c
  #include <stdio.h>
  #include <wchar.h>
  #include <locale.h>

  int main (int argc, char **argv)
  {
    setlocale (LC_ALL, "ja_JP.UTF-8");
    /* U+3042 = Hiragana letter A
       U+30a2 = Katakana letter A
       U+ff71 = Halfwidth Katakana letter A */
    printf ("%d\n", wcscoll (L"\x3042", L"\x30a2"));
    printf ("%d\n", wcscoll (L"\xff71", L"\x30a2"));
    return 0;
  }
  linux$ gcc jp.c -o jp
  linux$ ./jp
  -83
  -340

I expected that at least one of the comparisons returns 0.  Am I doing
something wrong?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat