From: L A Walsh
Date: Sun, 04 Apr 2021 13:22:47 -0700
To: Mark Aitchison
Cc: cygwin AT cygwin DOT com
Subject: Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?

On 2021/04/01 13:35, Mark Aitchison wrote:
> 1. What perl Unicode modules should I consider, if not Text::Unidecode? The present need
> is to be able to convert those few "foreign" characters (like ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ)
> that are basically ASCII with accent marks to their closest ASCII equivalents,
---
Hmm... have you tried installing from cpan? I just tried it and it seems to work.

    > cpan -i Text::Unidecode;
    >
    > cat /tmp/in
    ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ
    > cat /tmp/in| perl -e '
    use Text::Unidecode;
    while (<>) { print unidecode($_); }'
    CCCCcccGGGGggggEIIIIOOOO
---
I.e. it stripped off all the accent marks. Is that what you want?
(It spewed some warnings, but seemed to test out OK, so I tried it.)

Put your characters in a file, "/tmp/in" (I know, not very creative), then:

    cat /tmp/in | perl -e 'use Text::Unidecode; while (<>) { print unidecode($_); }'
    CCCCcccGGGGggggEIIIIOOOO

Where are you seeing those characters, and how do you know they are not
already in Unicode? I.e. that I'm seeing the characters "CcGgEIO" with
accents indicates they are already in Unicode.

What are you wanting to do: just convert them to the ASCII
characters with the accent marks stripped off?

> but I'd
> like to do more with Unicode in the future, without going down any dead-ends as far as
> being able to run under cygwin is concerned.
>
> 2. I see some talk of Internationalization in Chapter 2 of "Setting up Cygwin", but
> cannot see anything relating to perl modules, and I don't see any easy way to search many
> months of the mailing list for a keyword... is there any information I should know about?
>
>
> Thanks,
>
> Mark Aitchison
>
> --
> Problem reports: https://cygwin.com/problems.html
> FAQ: https://cygwin.com/faq/
> Documentation: https://cygwin.com/docs.html
> Unsubscribe info: https://cygwin.com/ml/#unsubscribe-simple