Date: Wed, 3 Mar 2010 16:36:37 +0000
From: Andy Koppe
To: cygwin AT cygwin DOT com
Subject: Re: Non-canonical mode input via tcsetattr(), under mintty console

Thomas Wolff:
> Dave Lee wrote:
>>
>> Hi all,
>>
>> I was testing a program that uses non-canonical mode input via
>> tcsetattr().
>>
>> ...
>> Specifically, I entered the Chinese character "例" (which means "rule"
>> or "example"). It occupies 3 bytes in UTF-8 representation: E4, BE, 8B.
>>
>> On standard console, the read() call returned THREE bytes (n == 3), and
>> (not surprisingly) E4, BE and 8B were returned to buf[].
>>
>> On mintty console, the read() call returned ONE byte (n == 1), and only
>> E4 was returned to buf[]. I could grab the other two bytes if I did
>> additional calls to read().
>>
> This is absolutely in line with the specified interface of read(), whether
> or not you apply some tcsetattr settings, and whether or not there is a
> difference between cygwin console and mintty. It is a traditional
> byte-oriented function and has no knowledge or handling of character
> encoding, and there is no guarantee that a multi-byte character comes in one
> piece.

Exactly.

> (Even if mintty were changed to try to feed them in one piece, there
> would still be no guarantee that you receive them in one piece.)

As it happens, mintty sends multibyte characters in a single write()
already, but the pseudo terminal device driver is indeed entitled to
pick them apart anyway: VMIN=1 and VTIME=0 means "give me at least one
byte, as soon as you have it". It's also possible that multiple
characters are delivered at once.

> You have four options (two each whether you want UTF-8 or Unicode words in
> your program):
> [...]
> * Read bytes and transform with one of the mbtowc (multi-byte to
>   wide-character) functions
> [...]

I'd go with that, because that way you can support not only UTF-8, but
all the charsets supported by the OS.

> (provided you want characters as Unicode words,
> not UTF-8 sequences in your program).

In that case, one can just ignore the widechar output and only use the
length info returned by mb(r)towc.

Andy