Mail Archives: cygwin-developers/2001/06/04/05:40:19
Hi Jeff,
I'm to blame for much of the pthreads code...
Are you using shared memory mutexes or private mutexes?
----- Original Message -----
From: "Jeff Waller" <jeffw AT 141monkeys DOT org>
> The port attempt of bind9 has uncovered some serious problems with
> pthread. Probably one of the major differences between bind 8 and
> bind 9 is the use of threads for not only named but also the resolver
> with in fact the lightweight resolver being a hard link to named --
> which is another problem with the build process BTW. Also, it seems
> dig and nslookup are threaded, they share some of the same sourcecode,
> it appears that they segfault in exactly the same place:
>
> The call to pthread_cond_timedwait
>
> Using gdb, it appears that the segfault occurs on the last line of the
> following; note the FIXME comment.
>
> // FIXME: pshared mutexs have the cond count in the shared memory area.
> // We need to accomodate that.
> int
> __pthread_cond_timedwait (pthread_cond_t * cond, pthread_mutex_t * mutex,
> const struct timespec *abstime)
> {
> // and yes cond_access here is still open to a race. (we increment, context swap,
> // broadcast occurs - we miss the broadcast. the functions aren't split properly.
> int rv;
> if (!abstime)
> return EINVAL;
> pthread_mutex **themutex = NULL;
> if (*mutex == PTHREAD_MUTEX_INITIALIZER)
> __pthread_mutex_init (mutex, NULL);
> if ((((pshared_mutex *)(mutex))->flags & SYS_BASE == SYS_BASE))
> // a pshared mutex
> themutex = __pthread_mutex_getpshared (mutex);
>
> if (!verifyable_object_isvalid (*themutex, PTHREAD_MUTEX_MAGIC))
> return EINVAL;
>
>
> Even taken out of context like it is, this is obviously buggy: themutex is
> initialized to NULL and then is only re-initialized to a "valid" value if
> the mutex is a pshared mutex; if it is not, then themutex is left == to
> NULL.
This may be an artifact of a particularly horrendous patch I had to
generate at one point. That does look buggy to me. Patch coming shortly.
> And in fact when the above
>
> pthread_mutex **themutex = NULL;
>
> is replaced with
>
> pthread_mutex_t *themutex = mutex;
That should be pthread_mutex **themutex = mutex, iff *mutex !=
PTHREAD_MUTEX_INITIALIZER and ((((pshared_mutex *)(mutex))->flags &
SYS_BASE) != SYS_BASE)
That's ugly and horrible, I know. I'm not at all convinced that my
approach there is the right way to do shared memory mutexes. Hopefully
if I procrastinate on that a bit more the cygwin daemon will shape up a
bit and I can wipe the slate clean and use the daemon to create and
manage all mutexes system wide. That would then hopefully remove the
special-case conditions for shared memory mutexes.
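
A minimal sketch of the selection logic discussed above, using toy stand-ins rather than the real Cygwin types. The names `toy_mutex`, `toy_handle`, `TOY_SYS_BASE`, `TOY_MAGIC`, and `toy_lookup_pshared` are all hypothetical and invented for illustration. The only point is that the private (non-pshared) path must produce a usable pointer instead of leaving it NULL before the validity check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_SYS_BASE 0x8000u      /* hypothetical "this is a pshared mutex" flag */
#define TOY_MAGIC    0x4d555458u  /* hypothetical validity marker */

typedef struct toy_mutex { unsigned magic; } toy_mutex;

/* Hypothetical handle: refers either to a private mutex or to a shared one. */
typedef struct toy_handle {
    unsigned   flags;
    toy_mutex *priv;              /* only meaningful when not pshared */
} toy_handle;

static toy_mutex  toy_shared_slot = { TOY_MAGIC };
static toy_mutex *toy_shared_ptr  = &toy_shared_slot;

/* Stand-in for a pshared lookup; this toy always returns the one shared slot. */
static toy_mutex **toy_lookup_pshared(toy_handle *h)
{
    (void)h;
    return &toy_shared_ptr;
}

/* Reject NULL at either level before touching the magic field. */
static int toy_is_valid(toy_mutex **m)
{
    return m != NULL && *m != NULL && (*m)->magic == TOY_MAGIC;
}

/* Resolve a handle to the mutex it names; returns 0 or EINVAL. */
int toy_resolve(toy_handle *h, toy_mutex ***out)
{
    toy_mutex **themutex;

    assert(h != NULL && out != NULL);
    /* Explicit parentheses: in C, == binds tighter than &. */
    if ((h->flags & TOY_SYS_BASE) == TOY_SYS_BASE)
        themutex = toy_lookup_pshared(h);   /* pshared path */
    else
        themutex = &h->priv;                /* private path: never left NULL */

    if (!toy_is_valid(themutex))
        return EINVAL;
    *out = themutex;
    return 0;
}
```

Calling `toy_resolve` on a handle with `flags == 0` and a valid `priv` yields a pointer to that private mutex; a handle whose `priv` is NULL is rejected with `EINVAL` rather than dereferenced.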
> to mimic the initialization that takes place in pthread_cond_wait, the
> segmentation fault goes away, and the program dig ran part-way
> successfully, but not totally:
>
>
> $ ./dig 141monkeys.org
>
> ; <<>> DiG 9.1.2 <<>> 141monkeys.org
> ;; global options: printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38854
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;141monkeys.org. IN A
>
> ;; AUTHORITY SECTION:
> 141monkeys.org. 60 IN SOA bubba.141monkeys.org. root.141monkeys.org. 36 300 120 21600 60
>
> ;; Query time: 270 msec
> ;; SERVER: 192.168.1.1#53(192.168.1.1)
> ;; WHEN: Mon Jun 4 04:17:02 2001
> ;; MSG SIZE rcvd: 79
>
> 0 [unknown (0xFFFCEE81)] dig 199739 pthread_cond::BroadCast: Broadcast called with invalid mutex
>
Look at line 430. That's where the problem has occurred: the mutex
variable in the condition struct has failed the verification test, and
is either an old reference to a previously valid mutex, or is in
unreadable memory, or some related condition. This can occur through
bugs, or if your program does something silly like destroy a mutex while
a thread is waiting on a condition variable that uses that mutex.
There's not much a library can do to prevent that sort of silly
behaviour :]. We could ignore it and return. I'll try to find time to
check the spec and see if that's the expected behaviour. If it is,
consider that error a useful diagnostic for your code :].
>
> perhaps a matter of not getting it properly from the shared area as is
> done in __pthread_cond_timedwait? Unfortunately, the exact context
> could not be determined as using gdb caused the program to freeze
> and eventually, the machine had to be rebooted.
gdb - Strange. Anyway w.r.t. getting the mutex, no. A pointer to the
private mutex detail is set and stored in __pthread_cond_timedwait, and
then simply referred to until the condition variable has no waiting
threads, when it is cleared.
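
The bookkeeping described above can be sketched as a toy model. This is not the Cygwin implementation: `toy_cond`, `toy_mutex`, `TOY_MUTEX_MAGIC`, and the three functions are hypothetical names. The sketch assumes that a broadcast with no waiters should simply succeed, which matches the POSIX wording that pthread_cond_broadcast has no effect when no threads are blocked on the condition variable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MUTEX_MAGIC 0x4d555458u   /* hypothetical validity marker */

typedef struct toy_mutex { unsigned magic; } toy_mutex;

typedef struct toy_cond {
    int        waiting;   /* number of threads currently waiting */
    toy_mutex *mutex;     /* set while waiting > 0, NULL otherwise */
} toy_cond;

/* Called when a thread starts waiting: remember the mutex. */
int toy_cond_enter_wait(toy_cond *c, toy_mutex *m)
{
    assert(c != NULL);
    if (m == NULL || m->magic != TOY_MUTEX_MAGIC)
        return EINVAL;
    c->mutex = m;
    c->waiting++;
    return 0;
}

/* Called when a thread stops waiting: clear the pointer with the last waiter. */
void toy_cond_leave_wait(toy_cond *c)
{
    assert(c != NULL && c->waiting > 0);
    if (--c->waiting == 0)
        c->mutex = NULL;
}

/* Broadcast: a no-op without waiters, an error only if a waiter's mutex is bad. */
int toy_cond_broadcast(toy_cond *c)
{
    assert(c != NULL);
    if (c->waiting == 0)
        return 0;                     /* nobody to wake, so nothing to verify */
    if (c->mutex == NULL || c->mutex->magic != TOY_MUTEX_MAGIC)
        return EINVAL;                /* mutex vanished while threads wait */
    /* A real implementation would wake the waiters here. */
    return 0;
}
```

In this model the "invalid mutex" diagnostic can only fire while at least one thread is waiting, so a NULL pointer with zero waiters is not reported as an error.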
> ====================================================
> ====================================================
>
> Ok so much for the background, now the question. Apparently from
> the comments and the to-do list, the pthread impl is not completed,
> could someone give me or point me to some documentation that describes
> the architecture of cygwin and how threads fit into it? Also, what part is
> done or generally considered solid by now? Also, what IS the shared
> area BTW?
The "shared area" refers to mutexes in shared memory regions accessible by
multiple processes concurrently, such as mmap or SysV SHM shmget memory.
Mutexes in win32 are accessible across processes without any special
trickery, but because we have extra data over and above the win32 mutex
we need to have that data accessible across processes. Likewise for
condition variables.
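
As an illustration of what "a mutex in a shared memory region" means from the application side, here is a sketch using the standard POSIX process-shared attribute. It says nothing about how Cygwin implements this internally, and it assumes a platform that supports `PTHREAD_PROCESS_SHARED` and `MAP_ANONYMOUS`. The helper names `create_shared_mutex` and `destroy_shared_mutex` are made up for this example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

/* Place a mutex in a MAP_SHARED region so that child processes created
 * with fork() operate on the same mutex object. Returns NULL on failure. */
pthread_mutex_t *create_shared_mutex(void)
{
    pthread_mutexattr_t attr;
    int rc;
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return NULL;

    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(m, sizeof *m);
        return NULL;
    }
    /* Mark the mutex as usable from more than one process. */
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);

    if (rc != 0) {
        munmap(m, sizeof *m);
        return NULL;
    }
    return m;
}

/* Destroy the mutex and release its shared mapping. */
void destroy_shared_mutex(pthread_mutex_t *m)
{
    assert(m != NULL);
    pthread_mutex_destroy(m);
    munmap(m, sizeof *m);
}
```

A parent would typically call `create_shared_mutex()` before `fork()`, after which both processes can lock and unlock through the same pointer.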
Background:
There are two basic approaches for getting data on global objects across
process boundaries: parallel access such as shared memory areas, and
kernel/daemon storage.
Currently cygwin has no "active" kernel or daemon that processes can
call, so _all_ cross-process communication is effectively accessing
shared memory areas. (Note: tools such as pipes are not relevant here
because they pass messages, whereas we need ongoing access to a common
struct.)
An opaque struct that is returned by the cygwin pthreads code would
allow fairly easy management of these issues, but makes API changes very
hard (due to struct size changes). Thus the trickery you see in the
source today.
Debugging your problem: I suggest you breakpoint in
pthread_cond::Broadcast and see why the mutex isn't valid. If it's NULL
then you may have no waiting threads, and the error is a nonsense (that
I'm looking into right now :]). If it's not NULL and isn't valid, then
you have a problem.
Rob
> -Jeff
>
> P.S. Oh yes, newbie here if the last question didn't give me away.
Welcome!