References: | <d3342ff4-f717-f882-5c41-b27ab272dc03 AT cyberXpress DOT co DOT nz> |
<CAAr43iOdVea3YYThgdYpJxRCaVtFVhyHz_FwMTQhqTw8+YT-zg AT mail DOT gmail DOT com> | |
<606AD7CE DOT 6090606 AT tlinx DOT org> | |
In-Reply-To: | <606AD7CE.6090606@tlinx.org> |
Date: | Mon, 5 Apr 2021 19:49:39 +0900 |
Message-ID: | <CAAr43iMuc3LRxy=BqJJuZTkzU14c+XERMv2oVVc7Lg-kuMY5BQ@mail.gmail.com> |
Subject: | Re: Perl Unidecode modules - which to use (if not Text::Unidecode)? |
To: | cygwin AT cygwin DOT com |
From: | Joel Rees via Cygwin <cygwin AT cygwin DOT com> |
Reply-To: | Joel Rees <joel DOT rees AT gmail DOT com> |
On Mon, Apr 5, 2021 at 6:26 PM L A Walsh <cygwin AT tlinx DOT org> wrote:
>
> On 2021/04/04 14:26, Joel Rees via Cygwin wrote:
> >
> >> 1. What perl Unicode modules should I consider, if not Text::Unidecode?
> >> The present need is to be able to convert those few "foreign"
> >> characters (like ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ) that are basically ASCII
> >> with accent marks to their closest ASCII equivalents, but I'd like to
> >> do more with Unicode in the future, without going down any dead-ends
> >> as far as being able to run under cygwin is concerned.
> >>
> >
> > "Stripping those few foreign accent characters" is probably not really what
> > you want to do.
> >
> ----
> Why not? You don't know his use case and you are misinterpreting his
> example as random garbage.

Actually, I was specifically _not_ interpreting them as random garbage.
If they were random garbage, it wouldn't matter what he does with them.

> Those aren't a random foreign encoding -- those are C's G's then E, I O
> with accent variations that he may want to collapse for purposes of storing
> in a text storage and retrieval (search) application.

In this world many things are possible, and those may actually be
intentional strings of characters with assorted diacriticals, some sort
of example of diacriticals, and he may have some reason to force the
characters to their base form instead of regenerating the text.

Or maybe I'm misinterpreting his intent. Maybe he doesn't want to strip
the diacriticals so much as convert the combinations to something like
punycode.

> They are all well
> formed/well-coded UTF-8 characters -- they are not some 8-bit encoding
> that was remangled during a no-recoding display of them in a UTF-8
> context.

I've seen lots of strings like that that are the result of e-mail
software mangling. In Japan, we call it 文字化け (mojibake).

And, yes, the e-mail software "helpfully" converts the misinterpreted
bytes to well-formed but entirely irrelevant UTF-8 in many cases.

I will acknowledge that I don't see it as often as I used to, but it
still happens.

> I didn't know about Text::Unidecode -- but it specifically to create
> Latinized alternatives to foreign characters. That was another hint
> that it wasn't a random mistake. The manpage for it says:
>
>     It often happens that you have non-Roman text data in Unicode, but you
>     can't display it -- usually because you're trying to show it to a user
>     via an application that doesn't support Unicode, or because the fonts
>     you need aren't accessible. You could represent the Unicode characters
>     as "???????" or "\15BA\15A0\1610...", but that's nearly useless to the
>     user who actually wants to read what the text says.
>
> An example was like:
>
> tperl
> use utf8;
> use Text::Unidecode;
> my $name="\x{5317}\x{4EB0}";
>
> printf "name, %s == %s\n", $name, unidecode($name);
> '
> name, 北亰 == Bei Jing

I would not call that "stripping" accent marks. It's a process of
recognizing the characters, looking them up in a dictionary, and finding
a reasonable Latinized equivalent, which is a fairly involved process
requiring a bit of heuristics, since there is often a many-to-many
mapping involved.

> It's not just about removing accents but getting an English
> like translation based on the foreign text.

And that's actually what I was trying to point him to?

Okay, maybe my suggestions were too elliptical. Maybe I should have told
myself I was too busy and ignored his question like everybody else.

[snip]

--
Joel Rees

http://reiisi.blogspot.jp/p/novels-i-am-writing.html
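
The thread distinguishes three things: stripping diacriticals, dictionary-style transliteration, and mojibake. The first and last can be sketched with Unicode normalization and a deliberate mis-decoding. This is a minimal illustration only, written in Python with the standard library; it is not what Text::Unidecode does internally. The helper names `strip_diacritics` and `make_mojibake` are hypothetical. In Perl, the same decomposition step is available through the core Unicode::Normalize module.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks (general category Mn).

    This only removes accents; it does not transliterate. Characters
    with no canonical decomposition pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def make_mojibake(text: str) -> str:
    """Simulate one common mangling: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    sample = "ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ"
    print(strip_diacritics(sample))   # accents removed, base letters kept
    print(strip_diacritics("北亰"))    # unchanged: nothing to strip
    print(make_mojibake("é"))         # well-formed text, wrong characters
```

Plain accent stripping leaves 北亰 untouched, which is why producing "Bei Jing" needs a lookup table of the kind Text::Unidecode carries. The mojibake result is well-formed text in its own right, so it cannot be detected by checking whether the string is valid UTF-8.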