Date: Tue, 9 May 2006 00:44:20 -0700
From: clayne AT anodized DOT com
To: cygwin AT cygwin DOT com
Subject: readv() questions
Message-ID: <20060509074420.GG18330@ns1.anodized.com>

Warning - LONG and network code related - do not read if not interested or not versed.

I'm currently trying to debug an issue where readv() seems to be filling iovecs with bad data, or otherwise overflowing, when it has to deal with a large receive buffer. I say large receive buffer because I can replicate the issue by writev()ing 1000+ iovecs on the sending side and readv()ing them on the cygwin side continuously.

I have verified several times that the sending side is writev()ing the correct iovecs, with the lengths intact and as I specified them - however, upon readv()ing the same data back on the cygwin end, after a varying amount of data has been transferred (usually around 100 iovecs or so), I get a spurious iovec filled with data I did not originally send out.
[ sending side (Linux 2.6.9-22.0.1.EL) ]
    --> writev() + write(variable length, sent in preceding iovec)
    --> 100mb uplink
    --> 384/1500 dsl up/downlink
    --> readv() + read(length derived from iovec received)
[ receiving side (Cygwin 1.5.20s(0.155/4/2) 20060427) ]

The iovec array itself is small, 13 bytes in total:

    1 byte (total length)
    1 byte (variable length)
    1 byte (flag)
    2 byte (header data)
    4 byte (header data)
    4 byte (header data)

On the sending side I writev() the header to the network stack, and then immediately issue another write() containing the variable length data, whose length I stored in the header (iovec[1]).

On the receiving end, same deal, just in reverse: readv(), passing a char * in iovec[1], and relying on readv() to fill it with the correct data received - which I then use as the length for a read() that fetches the variable length data following the header.

A few things:

1. Sanity test: nobody sees anything wrong with this fairly standard procedure, correct?

2. What exactly is the purpose of dummytest() within /winsup/cygwin/miscfuncs.cc?
The call to check_iovec_for_read from within readv():

    extern "C" ssize_t
    readv (int fd, const struct iovec *const iov, const int iovcnt)
    {
      extern int sigcatchers;
      const int e = get_errno ();

      int res = -1;

      const ssize_t tot = check_iovec_for_read (iov, iovcnt);

check_iovec_for_read is a macro defined as:

    winsup.h:#define check_iovec_for_read(a, b) check_iovec ((a), (b), false)

The actual check_iovec() call with the preceding dummytest():

    162 static char __attribute__ ((noinline))
    163 dummytest (volatile char *p)
    164 {
    165   return *p;
    166 }
    167 ssize_t
    168 check_iovec (const struct iovec *iov, int iovcnt, bool forwrite)
    169 {
    170   if (iovcnt <= 0 || iovcnt > IOV_MAX)
    171     {
    172       set_errno (EINVAL);
    173       return -1;
    174     }
    175
    176   myfault efault;
    177   if (efault.faulted (EFAULT))
    178     return -1;
    179
    180   size_t tot = 0;
    181
    182   while (iovcnt != 0)
    183     {
    184       if (iov->iov_len > SSIZE_MAX || (tot += iov->iov_len) > SSIZE_MAX)
    185         {
    186           set_errno (EINVAL);
    187           return -1;
    188         }
    189
    190       volatile char *p = ((char *) iov->iov_base) + iov->iov_len - 1;
    191       if (!iov->iov_len)
    192         /* nothing to do */;
    193       else if (!forwrite)
    194         *p = dummytest (p);
    195       else
    196         dummytest (p);
    197
    198       iov++;
    199       iovcnt--;
    200     }
    201
    202   assert (tot <= SSIZE_MAX);
    203
    204   return (ssize_t) tot;
    205 }

Lines 190 to 196 seem completely pointless to me unless I'm missing something, which I believe to be the case here. Can someone explain them? Due to the use of volatile and the explicit noinline attribute, I have a feeling it's some form of memory assertion - but why?

Anyway, the problem *does not* happen if I run the program under strace (which smells of a race), or if I throttle the data manually by only sending a set amount and then requesting an ack from the receiving side (which I use the flags var for). If I go fully unthrottled - no acks, just write it all to the wire and read it all from the wire - the s* hits the fan.
What I believe is causing the issue is an MTU related problem. It almost always seems to get into weirdness right around 1452 bytes transferred. I have verified, via Ethereal, that my assertions fail (they check that the variable length stored in the header I sent == what is stored in the received iovec) when readv() reads data at the border of a TCP packet in the stream (i.e. the next portion of an iovec, or the next iovec entirely, is in the next packet).

Ethereal also verifies that the data sent is exactly as I had placed it on the sending stack via writev() from the sending host. Ethereal also verifies that the problems occur as iovec data, or iovecs within the array passed to readv(), span TCP packets.

I'm slowly going through the code, which can be a mission, but I'm beginning to wonder if this section:

    219 void
    220 fhandler_base::raw_read (void *ptr, size_t& ulen)
    221 {
    222 #define bytes_read ulen
    223
    224   HANDLE h = NULL;  /* grumble */
    225   int prio = 0;     /* ditto */
    226   DWORD len = ulen;
    227
    228   ulen = (size_t) -1;
    229   if (read_state)
    230     {
    231       h = GetCurrentThread ();
    232       prio = GetThreadPriority (h);
    233       SetThreadPriority (h, THREAD_PRIORITY_TIME_CRITICAL);
    234       signal_read_state (1);
    235     }
    236   BOOL res = ReadFile (get_handle (), ptr, len, (DWORD *) &ulen, 0);
    237   if (read_state)
    238     {
    239       signal_read_state (1);
    240       SetThreadPriority (h, prio);
    241     }
    242   if (!res)
    243     {
    244       /* Some errors are not really errors.  Detect such cases here.  */
    245
    246       DWORD errcode = GetLastError ();
    247       switch (errcode)
    248         {
    249         case ERROR_BROKEN_PIPE:
    250           /* This is really EOF.  */
    251           bytes_read = 0;
    252           break;
    253         case ERROR_MORE_DATA:
    254           /* `bytes_read' is supposedly valid.  */
    255           break;
    256         case ERROR_NOACCESS:

is the culprit... There *are* some relatively spooky looking calls in there, coming from a POSIX perspective.
But according to my MS API docs on ReadFile, it shall not return until it has read the number of bytes requested (or times out, specified through SetCommTimeouts I believe - although I do not see it used under fhandler_base. I presume there is another way through the win32 API when using sockets?):

    "If hFile is not opened with FILE_FLAG_OVERLAPPED and lpOverlapped is
    NULL, the read operation starts at the current file position and
    ReadFile does not return until the operation is complete, and then the
    system updates the file pointer."

ERROR_MORE_DATA is, not surprisingly, defined as:

    "ERROR_MORE_DATA: More data is available."

The API references it here:

    "If a named pipe is being read in message mode and the next message is
    longer than the nNumberOfBytesToRead parameter specifies, ReadFile
    returns FALSE and GetLastError returns ERROR_MORE_DATA. The remainder
    of the message may be read by a subsequent call to the ReadFile or
    PeekNamedPipe function."

However, this applies to named pipes - not necessarily sockets. But I'm wary of this section:

    253         case ERROR_MORE_DATA:
    254           /* `bytes_read' is supposedly valid.  */
    255           break;

Mainly because I do not see anywhere an explicit check in the form of:

    if (len != bytes_read) /* bytes_read is really ulen */
      handle_problem();

Let's just throw out the wild assumption that win32 does something funky when data requested via ReadFile() spans an MTU size or resides in a following TCP packet associated with the stream - throwing an error and saying ERROR_MORE_DATA. An example would be my case, where I request 13 bytes and get, for instance, only 2.
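For comparison, the same kind of "did I get everything I asked for" check can be sketched at the application level. POSIX allows read() on a stream socket to return fewer bytes than requested, so a caller that needs exactly len bytes has to loop. The helper below is only a sketch under that assumption; read_full() is a hypothetical name and is not part of Cygwin or of my code.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until len bytes have arrived,
   EOF is hit, or a real error occurs.  Returns the number of bytes read
   (less than len only at EOF), or -1 on error. */
static ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t got = 0;

    while (got < len) {
        ssize_t n = read(fd, p + got, len - got);

        if (n == 0)
            break;              /* EOF: report what we have */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, just retry */
            return -1;
        }
        got += (size_t) n;      /* short read: loop for the rest */
    }

    return (ssize_t) got;
}
```

With something like this, "requested 13, got 2" turns into a second read() for the remaining 11 bytes instead of a corrupted header.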
Upon returning from raw_read(), not much is done in the way of error checking there either. Within fhandler_base::read():

      raw_read (ptr + copied_chars, len);
      if (!copied_chars)
        /* nothing */;
      else if ((ssize_t) len > 0)
        len += copied_chars;
      else
        len = copied_chars;

      if (rbinary () || len <= 0)
        goto out;

My actual readv() wrapping code is very basic and standard, so I don't think it's doing anything evil or causing a problem:

    size_t n_recv_iov(int s, const struct iovec *v, size_t c, int tout)
    {
        size_t br;
        int res;
        struct timeval to;
        fd_set fds, fds_m;

        FD_ZERO(&fds_m);
        FD_SET(s, &fds_m);

        while (1) {
            fds = fds_m;
            to.tv_sec = tout;
            to.tv_usec = 0;

            if ((br = readv(s, v, c)) == (size_t)-1) {
                switch (errno) {
                    case EWOULDBLOCK:
                    case EINTR:
                        break;
                    default:
                        perror("readv");
                        return -1;
                }
            } else {
                break;
            }

            if ((res = select(s + 1, &fds, NULL, NULL, &to)) == 0)
                return -1; /* timeout */
            else if (res == -1) {
                perror("select");
                return -1; /* never happen */
            }
        }

        return br;
    }

And my call to it is basic as well:

    IOV_SET(&packet[0], &byte_tl, sizeof(byte_tl));
    IOV_SET(&packet[1], &byte_vl, sizeof(byte_vl));
    IOV_SET(&packet[2], &byte_flags, sizeof(byte_flags));
    IOV_SET(&packet[3], &nbo_s, sizeof(nbo_s));
    IOV_SET(&packet[4], &nbo_t_onl, sizeof(nbo_t_onl));
    IOV_SET(&packet[5], &nbo_t_ofl, sizeof(nbo_t_ofl));

    for (error = 0; !error; ) {
        error = 1;

        if ((hl = n_recv_iov(s, packet, NE(packet), 60)) == (size_t)-1)
            break;

        assert(byte_vl < sizeof(byte_var));

        if ((vl = n_recv(s, byte_var, byte_vl, 60)) == (size_t)-1)
            break;
        if (hl == 0 || vl == 0)
            break;

        error = 0;

        /* process_data(); */
    }

Sorry for the very long mail, but I know for a fact that readv() on cygwin is doing bad things when faced with a lot of data to read from the wire.
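One variation worth trying, in case short reads across packet boundaries are what is really going on: a wrapper that does not assume a single readv() returns the whole 13-byte header, and instead resumes where the previous call stopped. This is only a sketch. readv_full() and MAX_IOV are hypothetical names, it assumes a blocking descriptor, and it leaves out the select() timeout handling from n_recv_iov().

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

enum { MAX_IOV = 16 };  /* arbitrary cap for this sketch */

/* Hypothetical helper: call readv() until every iovec is full, EOF is
   hit, or a real error occurs.  Works on a private copy of the array,
   so the caller's iovecs are left untouched. */
static ssize_t readv_full(int fd, const struct iovec *iov, int iovcnt)
{
    struct iovec v[MAX_IOV];
    size_t total = 0;
    int i = 0;

    if (iovcnt <= 0 || iovcnt > MAX_IOV) {
        errno = EINVAL;
        return -1;
    }
    for (int k = 0; k < iovcnt; k++)
        v[k] = iov[k];

    while (i < iovcnt) {
        ssize_t n = readv(fd, v + i, iovcnt - i);

        if (n == 0)
            break;                      /* EOF */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        total += (size_t) n;

        /* Skip the iovecs this call filled completely. */
        while (i < iovcnt && (size_t) n >= v[i].iov_len) {
            n -= (ssize_t) v[i].iov_len;
            i++;
        }
        /* Advance into a partially filled iovec, if there is one. */
        if (i < iovcnt) {
            v[i].iov_base = (char *) v[i].iov_base + n;
            v[i].iov_len -= (size_t) n;
        }
    }

    return (ssize_t) total;
}
```

If the corruption goes away with a loop like this in place of the single readv(), that would point at short reads rather than at bad data from the stack.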
Any insights?

-cl