Mail Archives: cygwin/2020/08/04/08:33:15

From: Brian Inglis <Brian DOT Inglis AT SystematicSw DOT ab DOT ca>
Subject: Re: Trouble with output character sets from Win32 applications
running under mksh
To: cygwin AT cygwin DOT com
References: <OF3F4D2646 DOT 3A75682C-ON852585B5 DOT 0058983D-852585B9 DOT 0055B758 AT abinitio DOT com>
<ae1f8133-948a-4497-049b-b8349a138143 AT SystematicSw DOT ab DOT ca>
<OF28060D19 DOT DB6E392B-ON852585B9 DOT 005D898D-852585B9 DOT 005E6021 AT abinitio DOT com>
<1314865780 DOT 20200803204249 AT yandex DOT ru>
<d8133245-02f0-71a7-e409-bf3b82fc7756 AT SystematicSw DOT ab DOT ca>
<OFE0AAB507 DOT AC9FD3B4-ON852585B9 DOT 0076DEA7-852585B9 DOT 007955FA AT abinitio DOT com>
Organization: Systematic Software
Message-ID: <6263e211-8751-8d61-7ceb-e9af59f0e5ce@SystematicSw.ab.ca>
Date: Tue, 4 Aug 2020 06:32:27 -0600

On 2020-08-03 16:05, Michael Shay via Cygwin wrote:
> On 2020-08-03 11:42, Andrey Repin wrote:
>>>> Doesn't help. I tried 65001 (UTF-8):

>>> Because you're confusing things.
>>> chcp has nothing to do with LANG or LC_*.
>>> Et vice versa.
>>>
>>> chcp sets console code page for native console applications.
>>> Only for those supporting it. Many do not.
>>> LANG sets output parameters for Cygwin applications (and other programs 
>>> that look for it, but these are few).

>> You cut the significant statement at the top of the OP:
>>>> I'm having a problem with Cygwin 3.1.4, changing the character set on 
>>>> the fly. It seems to work with Cygwin applications, but not with Win32 
>>>> applications.

>> He has problems with invalid characters only running win32 console 
>> applications: I changed the subject to hopefully better reflect the issue.
>> 
>> I am unsure where Cygwin 3.1.4 comes into Win32 applications - you have to 
>> use the Windows codepage conversion routines.
>> 
>> You can only change input character sets on the fly; output character sets 
>> will depend on mintty support of xterm-compatible character set support
>> and switching escape sequences; if you set up UTF16LE console output,
>> Windows and mintty should handle it.
>> 
>> Perhaps a better description of your environment, build tools, what you 
>> are trying to do, what you expect as output, and what you are getting as 
>> output, could help us better understand and help with the issue you see.

> The script I sent changes the locale information i.e. LANG and LC_ALL are 
> set to en_US.CP1252. i.e.
> 
> export LANG="en_US.CP1252"
> export LC_ALL=en_US.CP1252

FYI the variables are normally checked in the order LC_ALL, LC_CTYPE, LANG,
where the first one set wins (equivalently LANG, LC_CTYPE, LC_ALL, where the
last one set wins); the default locale may be POSIX C.ASCII or the effective
Windows locale, depending on your startup.
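
A minimal sketch of that precedence, assuming the usual POSIX rule that an
empty value counts as unset. The helper name effective_ctype and the "C"
fallback are made up for illustration; this is not how Cygwin itself
implements the lookup:

```shell
# Hypothetical helper: print the locale setting that decides the character set.
# Arguments: the values of LC_ALL, LC_CTYPE and LANG, in that order.
# An empty value is treated the same as an unset variable.
effective_ctype() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then LC_CTYPE
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG
    else
        printf 'C\n'            # assumed fallback when nothing is set
    fi
}

# Show what the current environment would resolve to.
effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```

With both LANG and LC_ALL exported as en_US.CP1252, as in your script, this
prints en_US.CP1252 whatever LC_CTYPE is set to.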

> Then, it runs a simple Win32 program that takes a single input argument, ZÇ,
> the second character being C-cedilla, an 8-bit character, hex value 0xc7.
> The Win32 program transcodes the input Unicode argument using the Cygwin
> character set to determine the codepage, 1252.

Do you mean using the environment variables to determine the codepage?

FYI the default character set, if none is specified, is the Unix equivalent of
the default Windows "ANSI"/OEM code page; in English and many European locales
that will be ISO-8859-1.

You may have to use cygpath -C OEM chars... or cygpath -C ANSI chars... to
convert a string to the required character set for console or GUI programs.

Please specify what you mean by "Unicode" in each context; that term means a
standard for representing scripts in many writing systems with a large character
glyph repertoire and a number of encodings, representations, and handling rules:
in each use case, do you mean a char/wchar representation, and/or an encoding
UTF16LE or UTF-8?
Similarly when MS uses "ANSI" they may mean an SBCS OEM code page.
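
To show why the distinction matters, here is a small sketch that prints the
bytes of the same character, Ç (U+00C7), in three encodings. It assumes iconv,
od and tr are installed and that iconv knows the names UTF-16LE and CP1252;
the helper names are made up for illustration:

```shell
# Ç (U+00C7) as UTF-8 bytes, written in octal so that the encoding of this
# script file itself does not matter.
c_cedilla_utf8() { printf '\303\207'; }

# Hypothetical helper: print the bytes of Ç in the target encoding $1 as hex.
encode_hex() {
    c_cedilla_utf8 | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

for enc in UTF-8 UTF-16LE CP1252; do
    printf '%-8s %s\n' "$enc" "$(encode_hex "$enc")"
done
```

The one code point comes out as two bytes in UTF-8 (c3 87), two bytes in
UTF-16LE (c7 00) and one byte in CP1252 (c7), so "Unicode" alone does not say
which bytes a program will see.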

To check what is available and what is in effect in Cygwin, try e.g.:

$ for o in system user no-unicode input format; do echo `locale --$o` $o; done
en_US system
en_GB user
en_CA no-unicode
en_CA input
en_CA format
$ locale

on both Cygwin versions.

FYI see:

	https://cygwin.com/cygwin-ug-net/setup-locale.html

> It then prints the transcoded characters to stdout, and the result should be
> ZÇ, identical to the input argument.
> This works fine using Cygwin 1.7.28.

Which Windows version are you running Cygwin 1.7.28 on?
Please show output from cmd /c ver.
That Cygwin version 1.7.28 is from 2014-02 and has been unsupported for years.
That version may not have completely supported international character sets and
may just assume that everything is in ISO-8859-1/Latin-1 (which is similar to
CP1252, so that may work) or in your system default OEM codepage, e.g. 437 or
850, and pass it along.

> Cygwin 3.1.4 is launching the Win32 application, and is responsible for
> transcoding the arguments passed to it by mksh, in this case CP1252
> characters ZÇ, into Unicode.

Do you mean you believe Cygwin should recode argument strings, and what do you
mean by Unicode in this context?

> That means Cygwin has to use the mb-to-uc function for transcoding codepage
> 1252 to Unicode.

I am unsure if Cygwin does any recoding internally except for input typed on the
terminal console interface.
CP1252 is an SBCS not an MBCS so MB functions are not required.
What do you expect when you use Unicode here?
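
To make the SBCS point concrete: in CP1252 the single byte 0xC7 is a complete
character (Ç), while in UTF-8 0xC7 is only a lead byte and must be followed by
a continuation byte. A rough check, assuming an iconv that knows CP1252; the
helper names are made up for illustration:

```shell
# "ZÇ" as encoded in CP1252: Z followed by the single byte 0xC7 (octal 307).
bytes_cp1252() { printf 'Z\307'; }

# Hypothetical helper: succeed only if stdin decodes cleanly as UTF-8.
is_valid_utf8() { iconv -f UTF-8 -t UTF-16LE >/dev/null 2>&1; }

# Decoding as UTF-8 fails, because 0xC7 has no continuation byte after it.
if bytes_cp1252 | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi

# Decoding as CP1252 works; re-encoded as UTF-8 the bytes are 5a c3 87.
bytes_cp1252 | iconv -f CP1252 -t UTF-8 | od -An -tx1
```

So a converter that treats CP1252 input as UTF-8 will reject the Ç, while one
that treats it as a single-byte code page maps each byte on its own.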

> It does not. It uses the UTF-8 to Unicode function (I've seen this using
> gdb). That function flags the Ç as an invalid UTF-8 sequence, not
> surprisingly since it's not a UTF-8 character.

What Windows, Cygwin, gdb versions are you seeing this on and what is the name
of the function you are seeing?

> No matter what character set I use in 'export LANG...' and 'export
> LC_ALL...', Cygwin 3.1.4 always uses the utf8-to-wc transcoding function in
> sys
... what should be there and what is the name of the function used?

> 1.7.28 Uses the correct function.

What is the name of that function?

> I'm not using mintty, I'm using mksh, a requirement since our software uses
> lots of shell scripts, and for legacy support, that means using a Korn shell.

So that means that the mksh is running on the Windows console, and you are not
running mintty.

> I could understand it if 1.7.28 didn't do the proper transcoding, but it
> does.

You may just be seeing Cygwin 1.7.28 passing the character codes along verbatim.

> I used:
> 
>         gdb mksh
> 
> to load mksh into the debugger, then started it with
> 
>         start -c 'cygtest.exe ZÇ'

Windows, Cygwin, and gdb versions?

> That allowed me to step into child_info_spawn::worker() and stop at the 
> call to CreateProcess(), where the command line (cygtest.exe) and argument 
> (ZÇ) are translated into Unicode.

In this case you mean into a UTF16LE string?

> This is the code to which I'm referring, in strfuncs.cc, which is supposed 
> to translate the command line and arguments from CP 1252 into Unicode.
> 
>   size_t __reg3
>   sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
>   {
>     mbtowc_p f_mbtowc = __MBTOWC;
>     if (f_mbtowc == __ascii_mbtowc)
>       {
>         /* <<<< THE CODE CHANGES '__ascii_mbtowc' TO '__utf8_mbtowc'
>            EVERY TIME, REGARDLESS OF THE CODEPAGE. */
>         f_mbtowc = __utf8_mbtowc;
>       }
>     return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
>   }
> 
> So 'f_mbtowc' is set to __ascii_mbtowc, the default. You said:

UTF-8 contains ASCII as the first 128 code points, so that is valid, unless the
"ASCII" used isn't really, and has character codes > 127!

> You can only change input character sets on the fly;
> 
> The input character set to Cygwin should have been changed to CP 1252, as 
> it was in 1.7.28. At least, that's what I would expect to happen. If it 
> does not, or if mintty is required, then that's a regression from 1.7.28.

As Cygwin packages are rolling releases, old releases are unsupported, and you
must upgrade to the latest release, reproduce the problem with a simple test
case, and other examples if you wish, and post that with a copy of the output from:

	$ cygcheck -hrsv > cygcheck.out

as a plain text attachment to your post.

-- 
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]