From: "Mark J. Reed"
To: cygwin AT cygwin DOT com
Date: Thu, 21 Jan 2010 10:04:04 -0500
Subject: Re: Japanese/Chinese language question

On Thu, Jan 21, 2010 at 8:40 AM, Corinna Vinschen wrote:
> would somebody with Japanese and/or Chinese language background be so
> kind to answer the below two questions?

I have some (outdated) background in I18N and Japanese L10N, though I'm
not a native speaker of either Japanese or any Chinese language. So I
can't offer native intuition, but I can relay some technical info that
might be helpful:

> When comparing strings linguistically (strcoll/wcscoll),
> - are Hiragana and Katakana forms of the same character to be
>   treated as equal or as different?

(Nit: they are not "the same character" in either the technical or
traditional sense of "character"; they're the same syllable, but
represented by different characters.)

From the Unicode point of view, they are distinct; there is no defined
equivalence, either canonical or compatibility, between corresponding
Katakana and Hiragana syllables. The collation algorithm (which does
take linguistic context into account) doesn't seem to say anything
about such comparisons, though it's possible I missed something.
But as a precedent which might be helpful, I note that with linguistic
sensitivity active, Oracle 10g does compare Hiragana and Katakana forms
of the same syllable as equal.

> - are half-width and full-width forms of the same CJK character
>   treated as equal or as different?

According to the Unicode normalization algorithm, half-width and
full-width forms normalize to the same character under the compatibility
forms (NFKC/NFKD), so they should be treated as equivalent. From the
point of view of Unicode, there is no semantic difference, and the width
property is informative, not normative. It's primarily encoded in
Unicode to preserve round-trip compatibility with other standards,
though it's also helpful for hints to rendering algorithms.

--
Mark J. Reed
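
P.S. For anyone who wants to check the two normalization points above, here is a minimal sketch. It assumes Python 3 and its standard `unicodedata` module; the helper `same_under` and the variable names are made up for illustration. It only looks at normalization, not at what strcoll/wcscoll or any particular collation tailoring does.

```python
import unicodedata

def same_under(form, a, b):
    """Return True if a and b are identical after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

hira_ka = "\u304b"       # HIRAGANA LETTER KA
kata_ka = "\u30ab"       # KATAKANA LETTER KA
half_kata_ka = "\uff76"  # HALFWIDTH KATAKANA LETTER KA
full_latin_a = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A

# Hiragana vs. Katakana: compared under all four normalization forms.
kana_equal = {
    form: same_under(form, hira_ka, kata_ka)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# Width variants: compared under a canonical form and a compatibility form.
width_canonical = same_under("NFC", half_kata_ka, kata_ka)
width_compat = same_under("NFKC", half_kata_ka, kata_ka)
latin_compat = same_under("NFKC", full_latin_a, "A")

print("kana equal per form:", kana_equal)
print("half-width vs full-width KA, NFC: ", width_canonical)
print("half-width vs full-width KA, NFKC:", width_compat)
print("full-width A vs ASCII A, NFKC:    ", latin_compat)
```

The kana pair should stay distinct under every form, while the width variants should compare equal only under the compatibility forms. So code that wants width-insensitive comparison would need NFKC or NFKD (or an equivalent folding step) before comparing; canonical normalization alone would not be enough.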