Mail Archives: cygwin/2010/03/03/11:36:56

In-Reply-To: <4B8A6069.4030008@towo.net>
References: <513288 DOT 14252 DOT qm AT web19014 DOT mail DOT hk2 DOT yahoo DOT com> <4B8A6069 DOT 4030008 AT towo DOT net>
Date: Wed, 3 Mar 2010 16:36:37 +0000
Message-ID: <416096c61003030836j7a56b38bpad73bfc3c4146c55@mail.gmail.com>
Subject: Re: Non-canonical mode input via tcsetattr(), under mintty console
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: cygwin AT cygwin DOT com

Thomas Wolff:
> Dave Lee wrote:
>>
>> Hi all,
>>
>> I was testing a program that uses non-canonical mode input via
>> tcsetattr().
>>
>> ...
>> Specifically, I entered the Chinese character "例" (which means "rule"
>> or "example"). It occupies 3 bytes in UTF-8 representation: E4, BE, 8B.
>>
>> On standard console, the read() call returned THREE bytes (n == 3), and
>> (not surprisingly) E4, BE and 8B were returned to buf[].
>>
>> On mintty console, the read() call returned ONE byte (n == 1), and only
>> E4 was returned to buf[]. I could grab the other two bytes if I did
>> additional calls to read().
>>
> This is absolutely in line with the specified interface of read(), whether
> or not you apply some tcsetattr settings, and whether or not there is a
> difference between cygwin console and mintty. It is a traditional
> byte-oriented function and has no knowledge or handling of character
> encoding, and there is no guarantee that a multi-byte character comes in one
> piece.

Exactly.


> (Even if mintty were changed to try to feed them in one piece, there
> would still be no guarantee that you receive them in one piece.)

As it happens, mintty sends multibyte characters in a single write()
already, but the pseudo terminal device driver is indeed entitled to
pick them apart anyway: VMIN=1 and VTIME=0 means give me at least one
byte, as soon as you have it. It's also possible that multiple
characters are delivered at once.
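
To make the VMIN/VTIME point concrete, here is a minimal sketch of the kind of setup being discussed. Dave's actual code was elided from the quote, so the helper names enter_noncanonical() and read_available() are made up for illustration, and clearing only ICANON is an assumption about what his program does. The point is that read() may legitimately return anything from 1 byte up to the buffer size, so the caller has to cope with whatever count comes back.

```c
#include <errno.h>
#include <termios.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-canonical mode with VMIN=1, VTIME=0.
 * The previous settings are stored in *saved so they can be restored later
 * with tcsetattr(fd, TCSANOW, saved). Returns 0 on success, -1 on error
 * (for example when fd is not a terminal). */
int enter_noncanonical(int fd, struct termios *saved)
{
    struct termios raw;

    if (tcgetattr(fd, saved) == -1)
        return -1;

    raw = *saved;
    raw.c_lflag &= ~(tcflag_t)ICANON;  /* no line buffering */
    raw.c_cc[VMIN] = 1;                /* block until at least one byte */
    raw.c_cc[VTIME] = 0;               /* no inter-byte timer */

    return tcsetattr(fd, TCSANOW, &raw);
}

/* Hypothetical helper: one read() call, retried only if interrupted by a
 * signal. Returns the number of bytes that happened to arrive (anywhere
 * from 1 to cap), 0 on end of input, or -1 on error. It never waits for
 * a "whole character". */
ssize_t read_available(int fd, unsigned char *buf, size_t cap)
{
    ssize_t n;

    do {
        n = read(fd, buf, cap);
    } while (n == -1 && errno == EINTR);

    return n;
}
```

With this, a program would call enter_noncanonical(STDIN_FILENO, &saved) once and then loop on read_available(), treating the result as a chunk of bytes rather than as a character.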


> You have four options (two each whether you want UTF-8 or Unicode words in
> your program):
> [...]
> * Read bytes and transform with one of the mbtowc (multi-byte to
> wide-character) functions
> [...]

I'd go with that, because that way you can support not only UTF-8, but
all the charsets supported by the OS.


> (provided you want characters as Unicode words,
> not UTF-8 sequences in your program).

In that case, one can just ignore the widechar output and only use the
length info returned by mb(r)towc.
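
As a rough sketch of that idea (the struct and function names are invented for this example, not taken from any library): feed each byte you get from read() to mbrtowc() with a NULL output pointer and look only at the return value. (size_t)-2 means "incomplete, keep going", (size_t)-1 means an invalid sequence, and anything else means the character just ended. This assumes the program has called setlocale(LC_CTYPE, "") at startup so that mbrtowc() follows the user's charset.

```c
#include <limits.h>
#include <locale.h>   /* for setlocale(), which the caller must invoke */
#include <string.h>
#include <wchar.h>

/* Hypothetical helper type: collects the bytes of one multibyte character.
 * Zero-initialise it (e.g. with memset) before first use. */
struct mb_assembler {
    mbstate_t state;          /* conversion state carried between calls */
    char bytes[MB_LEN_MAX];   /* bytes of the character seen so far */
    size_t len;               /* number of valid bytes in bytes[] */
};

/* Feed one byte. Returns:
 *   n > 0  a character is complete; it occupies a->bytes[0..n-1]
 *   0      more bytes are needed
 *  -1      invalid sequence; the assembler has been reset */
int mb_feed(struct mb_assembler *a, char byte)
{
    size_t r = (size_t)-1;
    int n;

    if (a->len < sizeof a->bytes) {
        a->bytes[a->len++] = byte;
        /* NULL output pointer: we only want the length information. */
        r = mbrtowc(NULL, &byte, 1, &a->state);
    }

    if (r == (size_t)-2)
        return 0;                     /* character still incomplete */

    if (r == (size_t)-1) {
        memset(a, 0, sizeof *a);      /* back to the initial state */
        return -1;
    }

    n = (int)a->len;                  /* r is 0 (NUL) or 1: character done */
    a->len = 0;                       /* the next byte starts a new one */
    return n;
}
```

In a UTF-8 locale, feeding E4, BE, 8B one at a time should give 0, 0, 3, which is the same result whether read() delivered them in one call or three.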

Andy

