Mail Archives: cygwin/2017/01/09/08:29:22

Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com
From: Erik Bray <erik DOT m DOT bray AT gmail DOT com>
Date: Mon, 9 Jan 2017 14:29:00 +0100
Message-ID: <CAOTD34aMu6DLP-8kaRXZRoihGxW9Jusf1pDBeM3cTWeNRfLVWw@mail.gmail.com>
Subject: Re: Hangs on connect to UNIX socket being listened on in the same process (was: Cygwin hanging in pselect)
To: cygwin AT cygwin DOT com
X-IsSubscribed: yes

On Mon, Jan 9, 2017 at 12:01 PM, Erik Bray <erik DOT m DOT bray AT gmail DOT com> wrote:
> On Fri, Jan 6, 2017 at 12:40 PM, Erik Bray <erik DOT m DOT bray AT gmail DOT com> wrote:
>> Hello, and happy new-ish year,
>>
>> I've been working on and off over the past few months on bringing
>> Python's compatibility with Cygwin up to snuff, including having all
>> pertinent tests passing.  I've noticed that there are several tests
>> (which I currently skip) that cause the process to hang indefinitely,
>> and not respond to any signals from Cygwin (it can only be killed from
>> Windows).  This is Cygwin 64-bit--I have not tested 32-bit.
>>
>> I finally looked into this problem and found the lockup to be in
>> pselect() somewhere.  Attached I've provided the most minimal example
>> I've been able to come up with so far that reproduces the problem,
>> which I'll describe in a bit more detail next. I would attach a
>> cygcheck output if requested, but I was also able to reproduce this on
>> a recent build from source.
>>
>> So far as I've been able to tell, the problem only occurs with AF_UNIX
>> sockets.  In the example I have a 'server' socket and a 'client'
>> socket both set to non-blocking.  The client connects to the socket,
>> returning errno EINPROGRESS as expected.  Then I do a pselect on the
>> client socket to wait until it is ready to be read from.  The hang
>> only happens when I pselect on the client socket, and not on the
>> server socket.  It doesn't seem to make a difference what the timeout
>> is.  One thing I have not tried is running the client and server as
>> different processes, but the example from the Python tests that
>> this is reproducing has them both in the same process.
>>
>> Below is (I think) the most relevant output from strace on the test
>> case.  It seems to hang somewhere in socket_cleanup, but I haven't
>> investigated any further than that.
>
> I made a little bit of progress debugging this, but now I'm stumped.
> It seems the problem is this:
>
> For each socket whose fd is passed to select() a thread_socket is
> started which calls peek_socket until there are bits ready on the
> socket, or until the timeout is reached.  This in turn calls
> fhandler_socket::evaluate_events.
>
> The reason it's only locking up on my "client thread" on which
> connect() is called, is that evaluate_events notes that the socket is
> waiting to connect, and this passes control to
> fhandler_socket::af_local_connect().  af_local_connect() temporarily
> sets the socket to blocking, then sends a magic string to the socket
> (you can see in my strace log that this succeeds).  What's strange,
> and what I don't understand, is that there are no FD_READ or FD_OOB
> events recorded for the WSASendTo call from af_local_send_secret().
> Then, after af_local_send_secret() it calls af_local_recv_secret().
> This calls recv_internal() which in turn calls recursively into
> fhandler_socket::evaluate_events where it waits for an FD_READ or
> FD_OOB event that never arrives.  And since it set the socket to
> blocking it just sits in an infinite loop.
>
> Meanwhile the timer for the select() call expires and tries to shut
> down the thread_socket but it can't because it never completes.
>
> What I don't understand is why there is not an event recorded for the
> WSASendTo in send_internal.  I even wrapped it with the following
> debug code to wait for an FD_READ event immediately following the
> WSASendTo:
>
>       else if (get_socket_type () == SOCK_STREAM)
>       {
>         WSAEventSelect (get_socket (), wsock_evt, EVENT_MASK);
>         res = WSASendTo (get_socket (), out_buf, out_idx, &ret, flags,
>                          wsamsg->name, wsamsg->namelen, NULL, NULL);
>         debug_printf ("WSASendTo sent %d bytes; ret: %d", ret, res);
>         while (!(res = wait_for_events (FD_READ | FD_OOB, 0))) {
>           debug_printf ("Waiting for socket to be readable");
>         }
>       }
>
> But the strace at this point just outputs:
>    62  108286 [socksel] poll_test 24152 fhandler_socket::af_local_connect: af_local_connect called, no_getpeereid=0
>   156  108442 [socksel] poll_test 24152 fhandler_socket::send_internal: WSASendTo sent 16 bytes; ret: 0
>
> It never returns from send_internal.  I don't have deep knowledge of
> WinSock, but from what I've read ISTM WSASendTo should have triggered
> an FD_READ event on the socket, and it doesn't for some reason.

After playing around with this a bit more I came up with a much
simpler example.  The problem has nothing to do with select()
directly.

The simplified example is just:

#include <arpa/inet.h>
#include <sys/socket.h>
#include <string.h>
#include <stdio.h>
#include <sys/un.h>
#include <errno.h>

int main(void) {
    int sock_server, sock_client;
    int retval;
    struct sockaddr_un addr;

    /* Both sockets use the same AF_UNIX address. */
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, "@test.sock");

    /* Server side: bind and listen; accept() is never called. */
    sock_server = socket(AF_UNIX, SOCK_STREAM, 0);
    if (bind(sock_server, (struct sockaddr*)&addr, sizeof(addr))) {
        printf("binding server socket failed\n");
        return 1;
    }

    retval = listen(sock_server, 5);
    printf("Ret from listen: %d\n", retval);

    /* Client side, same process: blocking connect() to the listener. */
    sock_client = socket(AF_UNIX, SOCK_STREAM, 0);
    errno = 0;  /* so the value printed below is not stale */
    retval = connect(sock_client, (struct sockaddr*)&addr, sizeof(addr));
    printf("Ret from client connect: %d; errno: %d\n", retval, errno);

    return 0;
}


On Linux this example works as I expect, and the connect() call
returns immediately.  However, on Cygwin the connect() call hangs
after af_local_send_secret(), as described in my first message.

When I split this example up into separate client and server
processes, though, it works as expected: the connect() is properly
negotiated and returns immediately.

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
