From: David Humphrey
To: "'cygwin AT cygwin DOT com'"
Subject: Interactions between rshd and programs using sockets?
Date: Wed, 16 May 2001 12:16:33 -0500

I am experimenting with MPI (the Message Passing Interface). For those not familiar with MPI, it is a software specification that enables a master program to communicate with slave programs on the same or remote machines, so that work such as a large numerical computation can be split over several machines. I'm using the Windows NT implementation publicly available from Argonne National Laboratory.

Starting remote processes in the UNIX world is easy enough via rsh et al., but NT doesn't come with them, so the NT implementations usually come with programs specially tailored for starting these slave processes; one runs these programs as services on the slave machines. Unfortunately, I would have to extensively remodel my application to take advantage of their launching tools. Instead, I read the documentation that comes with the Argonne package, and I now understand the environment that the master and slave programs require.

I have been using Cygwin's rsh/rshd tools successfully, but I've run into a problem that I don't know how to resolve. In this implementation of MPI, programs get some of their parameters via environment variables. Programs on different machines communicate via sockets, and the port number is one of those parameters. I have master and slave programs that operate properly when I set up the environment manually, regardless of whether the slave runs on the same machine as the master or on a remote machine.
Then I wrote short bash scripts that set up the environment and execute the master and slave programs. When I invoke these scripts manually, again everything works fine, regardless of whether the slave is on the same machine as the master.

The problem arises when I start the master program on one machine (NT1) and then try to start the slave on a second machine (NT2) via rsh. Here's the script:

    MPICH_ROOT="roadrunner:54545"; export MPICH_ROOT
    MPICH_JOBID=roadrunner.123; export MPICH_JOBID
    MPICH_IPROC=$2; export MPICH_IPROC
    MPICH_NPROC=$3; export MPICH_NPROC
    echo Starting slave $MPICH_IPROC of $MPICH_NPROC
    $1

Here's what happens when I try to start the slave on NT2:

    $ rsh NT2 runslave.sh cpi.exe 1 2
    starting slave 1 of 2
    Error 10106, process 1 ComPortThread: NT_Tcp_create_bind_select failed
    $

The failure happens almost immediately, so it does not look like a timeout problem such as the slave waiting to talk to the master. Interestingly, if I rlogin to NT2 and execute "runslave.sh cpi.exe 1 2", everything works fine.

So, the question is: what is it about running the runslave.sh script via rsh that causes the port bind to fail? One difference I note is that rlogin asks me for a password while rsh does not. Is this a security issue?

I updated my installation with the latest Cygwin release yesterday, but that didn't change the results. I think the /etc/passwd and /etc/group files on both machines are properly built, and I've carefully tried to follow the instructions in the inetutils-1.3.2.README.

Regards,

David L. Humphrey
Manager, Software Development
Bell Geospace, Inc
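
Since the same script works under rlogin but fails under rsh, one way to narrow the problem down is to compare the environment each kind of session hands to the script. The sketch below is only that, a sketch: the function names env_names and missing_in_second are made up for illustration and are not part of Cygwin or the Argonne package. Winsock error 10106 is WSAEPROVIDERFAILEDINIT, and a plausible but unconfirmed cause is that rshd starts the command without a Windows variable such as SYSTEMROOT that Winsock needs; treat that as an assumption to test, not a diagnosis.

```shell
#!/bin/sh
# Sketch: list environment variables that exist in one session (e.g. rlogin)
# but are missing from another (e.g. a command started through rsh).
# Function names here are hypothetical helpers, not an existing tool.

# Print the names of all exported variables, one per line, sorted.
# Names that are not valid shell identifiers are skipped.
env_names() {
    env | sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' | LC_ALL=C sort -u
}

# Print the names listed in file $1 that do not appear in file $2.
# Both files must be sorted the way env_names sorts them.
missing_in_second() {
    LC_ALL=C comm -23 "$1" "$2"
}
```

A possible way to use it, with env.login and env.rsh as placeholder file names: run env_names in an rlogin session on NT2 and save the output as env.login, run the same function through rsh and save the output as env.rsh, then call "missing_in_second env.login env.rsh". Any variable it reports can be exported at the top of runslave.sh to see whether the bind then succeeds.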