Mail Archives: cygwin/2017/12/11/18:36:36

Subject: Re: Need help with multibyte UTF-8 characters
References: <626a3c06-e9f2-1932-f1f3-47ddb2051215 AT gmail DOT com>
To: cygwin AT cygwin DOT com
From: Thomas Taylor <tayloth AT gmail DOT com>
Message-ID: <9d3b73ff-f596-51a2-909a-30a767e3e9b3@gmail.com>
Date: Mon, 11 Dec 2017 18:36:09 -0500
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0
MIME-Version: 1.0
In-Reply-To: <626a3c06-e9f2-1932-f1f3-47ddb2051215@gmail.com>

Thank you for your advice on setting my locale to en_US.UTF-8.
Unfortunately, Cygwin still seems to have trouble displaying some
three-byte UTF-8 encoded characters correctly. For example, see the
following snippet from a sed script, which attempts to convert
XML-encoded filenames to UTF-8. It converts one- and two-byte
encodings correctly, but fails on some three-byte encodings: the
en dash, the em dash, and the ellipsis are all displayed as a
filled-in rectangle.

# Match longest strings first

# Three-byte encodings:

# En dash
s/%[Ee]2%80%93/–/g

# Em dash
s/%[Ee]2%80%94/—/g

# Horizontal ellipsis
s/%[Ee]2%80%[Aa]6/…/g

# Less-than-or-equal sign
s/%[Ee]2%89%[Aa]4/≤/g

# Euro symbol
s/%[Ee]2%82%[Aa][Cc]/€/g

# Two-byte encodings:

# Non-break space
#s/%[Cc]2%[Aa]0/⎵/g

# Lowercase a with acute accent
s/%[Cc]3%[Aa]1/á/g

# Lowercase a with umlaut (a.k.a. diaeresis)
s/%[Cc]3%[Aa]4/ä/g

# Lowercase e with acute accent
s/%[Cc]3%[Aa]9/é/g

# Lowercase i with acute accent
s/%[Cc]3%[Aa][Dd]/í/g

# Lowercase o with acute accent
s/%[Cc]3%[Bb]3/ó/g
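
One way to narrow this down might be to look at the bytes sed actually writes, rather than at what the terminal draws. The following is only a minimal sketch, assuming a POSIX shell with printf, sed, od, and tr available; the variable names (endash, out) and the sample input are made up for illustration. It runs the en dash rule on a single test string and prints the result in hex:

```shell
#!/bin/sh
# Build the en dash from its three UTF-8 bytes (E2 80 93) using octal
# escapes, so this check does not depend on the encoding of the script file.
endash=$(printf '\342\200\223')

# Feed one percent-encoded en dash through the same substitution as in the
# sed script, then dump the output as hex with all whitespace removed.
out=$(printf 'a%%E2%%80%%93b\n' \
      | sed "s/%[Ee]2%80%93/${endash}/g" \
      | od -An -tx1 \
      | tr -d ' \n')

# If the substitution worked, "e28093" should appear between 61 ('a')
# and 62 ('b'), followed by 0a for the newline.
echo "$out"
```

If the hex dump shows e28093 in the expected place, sed is emitting correct UTF-8, and the filled-in rectangle would more likely come from the display side (the terminal or its font) than from the substitution itself.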



