Message-ID: <c013bd50-6cef-4d8f-ad9a-2421e417a6bb@SystematicSW.ab.ca>
Date: Sun, 1 Mar 2026 00:40:16 -0700
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Memmove causing program crashes, giving SIGTRAP in GDB(?)
To: General Cygwin discussions and problem reports <cygwin@cygwin.com>
References: <547312365.1464244.1771958282029@connect.xfinity.com>
 <1670201592.1489273.1772043520008@connect.xfinity.com>
 <e91d8b5b-2690-4271-aa74-e6226440e33d@SystematicSW.ab.ca>
 <1044918836.1507810.1772086967212@connect.xfinity.com>
 <1579472684.1508349.1772092747339@connect.xfinity.com>
 <aaABFf5iEowV1l7I@xps13> <1148572549.1808180.1772097444036@mail.yahoo.com>
 <1901597260.1508573.1772100378936@connect.xfinity.com>
 <0C965DD0-856E-41FF-B5A4-15E472292A32@unified-streaming.com>
 <483908609.1508714.1772103775739@connect.xfinity.com>
 <2346fd41-2500-0db6-5849-6788174b5a1d@cs.umass.edu>
 <1462848037.1521935.1772136952077@connect.xfinity.com>
 <399745a1-429a-ebb4-0f67-c32f6282caa6@cs.umass.edu>
 <1093316506.1533154.1772157883568@connect.xfinity.com>
 <3e0de899-a7dd-8fea-7743-10e6b05cc6b6@cs.umass.edu>
 <1990836634.1545853.1772216419837@connect.xfinity.com>
 <45c133f7-8285-4cb3-9701-2642cb76ab37@SystematicSW.ab.ca>
 <103536920.1558501.1772245830440@connect.xfinity.com>
Content-Language: en-CA
Organization: Systematic Software
In-Reply-To: <103536920.1558501.1772245830440@connect.xfinity.com>
Precedence: list
From: Brian Inglis via Cygwin <cygwin@cygwin.com>
Reply-To: General Cygwin discussions and problem reports <cygwin@cygwin.com>
Cc: Brian Inglis <Brian.Inglis@SystematicSW.ab.ca>
Content-Type: text/plain; charset="utf-8"; Format="flowed"
Errors-To: cygwin-bounces~archive-cygwin=delorie.com@cygwin.com
Sender: "Cygwin" <cygwin-bounces~archive-cygwin=delorie.com@cygwin.com>
Content-Transfer-Encoding: 8bit

Very good Kennon,

Neat and well researched, and surprisingly minimal!

Hopefully some of those approaches can eliminate all problems with CPU errata or 
unfixed bugs, so you no longer hit any crashes, while managing high performance 
on fast hardware.

And given the source is in C, it will continue working okay on older and newer 
compilers, CPUs, and combos of those. Nowadays little improves; they are only 
moving the bottlenecks around, to where your code hopefully will no longer 
notice the problems.

That's the issue I always had with "optimized" assembler: it's all well and good 
with today's compiler and CPU, but give it a generation of each, and it's an 
unpredictable pile of emoji, good only on old machines (like those I have) ;^>

We have to be able to run the same code on systems ranging from whatever today's 
cheap mobile laptop celery-stick-in-the-muds are called, to GPU monster CPUs, to 
the fractional or multiple package KCPU servers, with dozens to thousands of 
threads on each, variable ISAs, uarchs, cache levels, sizes, and write policies.

That's actually an advantage for CISC ISAs, acting as an HLA, interpreted by the 
instruction decoder into highly tuned RISC-like uops for dispatch into multiple 
pipelined stages per thread, CPU, and/or package, to hopefully hide any poor 
performance issues.


On 2026-02-27 19:30, KENNON J CONRAD via Cygwin wrote:
> I just wanted to add that the stash and store idea you suggest that is also
> used in memmove has a very nice impact on the assembly code.
> 
> With the old code that does this for the last 0 to 7 words:
>          while (candidate_ptr > score_ptr) {
>            *candidate_ptr = *(candidate_ptr - 1);
>            candidate_ptr--;
>          }
> 
> the assembly code shows this from the point where the move starts:
> .L24:
> 	movdqu	-16(%rax), %xmm1
> 	subq	$16, %rax
> 	movups	%xmm1, 2(%rax)
> 	cmpq	%rdx, %rax
> 	jnb	.L24
> 	movq	%r10, %rax
> 	subq	%r9, %rax
> 	subq	$16, %rax
> 	notq	%rax
> 	andq	$-16, %rax
> 	addq	%r10, %rax
> 	cmpq	%rax, %r9
> 	jnb	.L28
> 	movq	%rax, %rcx
> 	movq	%rax, %rdx
> 	movq	%r9, 48(%rsp)
> 	subq	%r9, %rcx
> 	subq	$1, %rcx
> 	shrq	%rcx
> 	leaq	2(%rcx,%rcx), %r8
> 	negq	%rcx
> 	subq	%r8, %rdx
> 	leaq	(%rax,%rcx,2), %rcx
> 	call	memmove
> 	movq	48(%rsp), %r9
> 	jmp	.L28
> 
> But with stash and store:
>          *(uint64_t *)&candidates_index[new_score_rank + 1] = first_four;
>          *(uint64_t *)&candidates_index[new_score_rank + 5] = next_four;
> 
> the assembly code from the point where the move starts is this:
> .L24:
> 	movdqu	-16(%r9), %xmm1
> 	subq	$16, %r9
> 	movups	%xmm1, 2(%r9)
> 	cmpq	%rax, %r9
> 	jnb	.L24
> 	movups	%xmm0, 2(%rdi,%rdx)
> 	jmp	.L26
> 
> There are a couple of extra assembly instructions to stash into xmm0 before
> the move, but this is a big reduction in assembly code size for the backward
> memory move. Not as fast as memmove if the DF wasn't getting corrupted, but
> much better than the old code plus it completely avoids the risk of DF
> corruption during rep movsq in memmove for backward move sizes >= 8!  I like it
> because there is no need to worry about whether rep movsb or rep movsw could
> also be vulnerable to DF corruption.
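
For anyone following the thread, here is a minimal, self-contained sketch of the 
stash and store idea as I understand it, not Kennon's actual code. The names 
shift_up_one, lo, and hi are hypothetical, the 16-bit element type is an 
assumption inferred from the 2-byte offsets in the assembly above, and memcpy 
stands in for the uint64_t pointer casts to sidestep aliasing and alignment 
questions (compilers normally turn these fixed-size copies into single moves). 
It assumes the slot just past the moved range is writable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shift a[pos .. pos+n-1] up by one slot into a[pos+1 .. pos+n].
   Assumption: the caller guarantees that a[pos+n] is writable. */
static void shift_up_one(uint16_t *a, size_t pos, size_t n)
{
    uint16_t *lo = a + pos;   /* first element to move */
    uint16_t *hi = lo + n;    /* one past the last element to move */

    if (n < 8) {              /* too short to stash 8 elements: plain loop */
        while (hi > lo) {
            *hi = *(hi - 1);
            hi--;
        }
        return;
    }

    /* Stash the lowest 8 elements before anything overwrites them. */
    uint64_t first_four, next_four;
    memcpy(&first_four, lo, sizeof first_four);
    memcpy(&next_four, lo + 4, sizeof next_four);

    /* Move whole 16-byte chunks from the top down; each chunk goes
       through a temporary, so the one-element overlap is harmless. */
    while ((size_t)(hi - lo) >= 8) {
        unsigned char chunk[16];
        memcpy(chunk, hi - 8, sizeof chunk);
        memcpy(hi - 7, chunk, sizeof chunk);
        hi -= 8;
    }

    /* Store the stash one slot up. This covers the 0 to 7 leftover
       elements; any slots the chunk loop already wrote receive the
       same values again, so the overlap is benign. */
    memcpy(lo + 1, &first_four, sizeof first_four);
    memcpy(lo + 5, &next_four, sizeof next_four);
}
```

The final two stores replace the 0 to 7 iteration tail loop with a fixed amount 
of work, which would explain why the compiler no longer needs the remainder 
arithmetic or the memmove call. The n < 8 fallback is there because the stash 
would otherwise read and write past the range being moved.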

>> On 02/27/2026 11:49 AM PST Brian Inglis via Cygwin wrote:
>> Some perf reports and analysis imply that backward moves (with overlap?) are no
>> faster than straight rep movsb on some CPUs, so it may be better to just
>> simplify to that, unless you want to stash the final element(s) to be moved out
>> of the way in register(s), and use multiple registers in unrolled wide moves for
>> the aligned portion?
-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retrancher  but when there is no more to cut
                                 -- Antoine de Saint-Exupéry
