Message-ID: <5b8ef3af.1c69fb81.6801.f392@mx.google.com>
Date: Tue, 04 Sep 2018 14:05:51 -0700 (PDT)
From: Steven Penny
To: cygwin AT cygwin DOT com
Subject: Re: Cygwin fails to utilize Unicode replacement character

On Tue, 4 Sep 2018 13:59:10, Doug Henderson wrote:
> My preference is to remove the output fiddling code that Corrina has
> been working on. It is trying to solve the wrong problem.
> I think we have gone down a rabbit hole at the wrong end of cat's data flow.

this has nothing to do with "cat". it has to do with the unfounded design decision to use U+2592. Granted, at this point we are bikeshedding - but an official standard does exist, namely Unicode, with 2 applicable characters for this use case:

1. U+FFFD: http://unicode.org/charts/nameslist/n_FFF0.html
2. U+25A1: http://unicode.org/charts/nameslist/n_25A0.html

> Should any changes to the way a character is displayed be required, it
> needs to be in the terminal program that display the character, not in
> cygwin which should pass the character along unmodified.

the "terminal" in this case is either "cygwin" or "xterm" - in both cases code changes have already been made in response to this thread, so i don't think your comment here holds weight.

> Both cygwin and Debian 9.5 show:
>
> $ file alfa.txt
> alfa.txt: ISO-8859 text
>
> When Linux reads the file, it assumes the encoding is UTF-8.
> When cygwin reads the file, it assume the encoding is CP1252
> This command shows the problem
>
> $ iconv -f utf8 alfa.txt
> iconv: alfa.txt:1:0: incomplete character or shift sequence
>
> On Linux, this shows a slightly different message, with the same intent.
>
> Try using this string:
>
> $ printf "\xC3\xAB\353\n"
> ë▒
>
> to get a better understanding of the problem. It contains two
> representation of LATIN SMALL LETTER E WITH DIAERESIS, first encoded
> in UTF-8, then using ISO-8859-1.

now it appears *you* are going down the rabbit hole.
both Cygwin and Mintty were in violation of the Unicode standard - however this has already been remedied in the code.

> There are two different reasons for the MEDIUM SHADE. Here it
> indicates an invalid UTF-8 character, and the font does not have a
> glyph for REPLACEMENT CHARACTER. The MEDIUM SHADE is also used in
> place of an ordinary character without a glyph in the font.

this is flat wrong. U+2592 MEDIUM SHADE is *only* used in cases of invalid UTF-8. In the case of a missing character, the ".notdef" glyph is used, as has been discussed several times in this thread. This is not an actual character, so i cannot paste it here - but as an example, with "DejaVu Sans Mono" the glyph is an empty rectangle.
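
For anyone who wants to reproduce the invalid-UTF-8 case outside a terminal, here is a minimal Python sketch using the bytes from the quoted printf example. It is only an illustration: Cygwin and mintty are not written in Python, and the handler name "medium_shade" is made up for this example. It shows that a standard UTF-8 decoder in replace mode yields U+FFFD for the stray 0xEB byte, and that substituting U+2592 instead takes a deliberate custom handler.

```python
import codecs

# Bytes from the quoted printf example: "ë" in UTF-8 (C3 AB), then "ë" as a
# lone ISO-8859-1 byte (EB, octal 353), then a newline.
data = b"\xC3\xAB\xEB\n"

# Strict UTF-8 decoding rejects the lone 0xEB byte.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
replaced = data.decode("utf-8", errors="replace")


def make_handler(placeholder):
    """Build a decode error handler that emits `placeholder` for bad input."""
    def handler(err):
        if not isinstance(err, UnicodeDecodeError):
            raise err
        # Return the substitute text and the position at which to resume.
        return placeholder, err.end
    return handler


# "medium_shade" is a hypothetical handler name, registered only for this demo.
codecs.register_error("medium_shade", make_handler("\u2592"))
shade = data.decode("utf-8", errors="medium_shade")

print(strict_ok)
print([f"U+{ord(c):04X}" for c in replaced])
print([f"U+{ord(c):04X}" for c in shade])
```

The second line of output lists U+FFFD in the position of the bad byte, the third lists U+2592 in the same position. Neither has anything to do with a missing glyph, which is a font matter and never reaches the decoder.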