X-Authentication-Warning: delorie.com: mailnull set sender to djgpp-workers-bounces using -f
Message-ID: <3C72DA25.F1AE5FF0@yahoo.com>
Date: Tue, 19 Feb 2002 18:05:09 -0500
From: CBFalconer
Organization: Ched Research
X-Mailer: Mozilla 4.75 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: djgpp-workers AT delorie DOT com
CC: Eli Zaretskii
Subject: Re: Malloc/free DJGPP code
References: <3C723299 DOT 563B5B72 AT yahoo DOT com>
 <8971-Tue19Feb2002141104+0200-eliz AT is DOT elta DOT co DOT il>
 <3C72560A DOT EA7F5099 AT yahoo DOT com>
 <6048-Tue19Feb2002180659+0200-eliz AT is DOT elta DOT co DOT il>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Reply-To: djgpp-workers AT delorie DOT com

Eli Zaretskii wrote:
>
> > Date: Tue, 19 Feb 2002 08:41:30 -0500
> > From: CBFalconer
> >
> > > > As long as merging requires an exhaustive search of a singly
> > > > linked list for merge candidates, it is sooner or later going
> > > > to require such an O(N**2) algorithm.
> > >
> > > I don't understand why; please explain.  AFAICS, `merge' simply
> > > scans the appropriate sublist linearly until it finds the block
> > > being freed.
> >
> > It has to search all the lists for adjacencies.  O(n).  And it has
> > to search for each item freed.  O(n) again, resulting in O(n*n).
>
> Can you please identify these two nested searches in malloc.c?  I
> don't see them.  What I see is this (cleaned up of debug code for
> clarity):

The outer loop isn't in malloc, it is in the application, which has
the array or linked list or whatever of allocated memory to free.

>     for (bp = freelist[bu]; bp; bpp = &(bp->next), bp = bp->next)

This is the only thing that matters.  It makes each merge O(n), and
it is executed once per block freed, i.e. O(n) times, thus O(n*n)
overall.

>
> This boils down to a single call to b2bucket, followed by a linear
> scan of a single sublist.  b2bucket loops up to 32 times to find the
> closest integral power of 2, so at worst, this is O(n+32), not
> O(n*m); and `n' here is typically smaller than the number of
> allocated blocks, but can at worst be equal to it.  Am I missing
> something?

Yes.  For a long list, free is called for each of the m items.
merge takes time proportional to n, where n is the number of free
blocks.  Once a significant portion of the m items has been freed,
the number of free blocks peaks (because the freeing is being done
in random order).

The effect will probably never show up when freeing is done in
roughly the order (or the reverse order) of allocation, because the
count of free blocks remains low.  It takes the randomization to
make it bite.  That can be caused by many things (the hashlib
system I sent you is a prime example), including sorting either a
pointer array or a linked list.

On my machine objectionable delays start at about 10,000 items, and
become completely intolerable at 50,000.  VC6 doesn't have the
problem, and doesn't die until things slop over into disk thrashing
at over 1,000,000 items.  LCCWIN32 has the same problem.

On the hashlib system I sent, run "hashtest 4 N" where N is the
count of items.  Odd values of N run without calling the free
mechanism at exit, while even values call it.  The time difference
belongs to merge.

I don't care how fast your machine is, or how big its memory is:
sooner or later you will run into objectionable delays.  I believe
the test program is actually limited to about 2**23 entries, but
that is easily increased.  You won't need to.
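To make the shape of the problem concrete, here is a standalone
sketch (made up for illustration; it is neither hashlib nor anything
from malloc.c) of the kind of teardown that triggers it: allocate N
blocks, shuffle the pointers, then free them in random order.  Every
free() that reaches merge rescans a freelist that is growing toward
N entries, so the total teardown time grows roughly as O(N*N):

   #include <stdio.h>
   #include <stdlib.h>
   #include <time.h>

   int main(int argc, char **argv)
   {
      size_t n = (argc > 1) ? (size_t)atol(argv[1]) : 50000;
      void **p = malloc(n * sizeof *p);
      size_t i;
      clock_t t0;

      if (!p) return EXIT_FAILURE;
      for (i = 0; i < n; i++) p[i] = malloc(32);

      /* shuffle so the blocks are freed in random order */
      srand((unsigned)time(NULL));
      for (i = n - 1; i > 0; i--) {
         size_t j = (size_t)rand() % (i + 1);
         void *t = p[i]; p[i] = p[j]; p[j] = t;
      }

      t0 = clock();
      for (i = 0; i < n; i++) free(p[i]);  /* n calls, each an O(n) merge scan */
      printf("freed %lu blocks in %.2f s\n", (unsigned long)n,
             (double)(clock() - t0) / CLOCKS_PER_SEC);

      free(p);
      return 0;
   }

Free the blocks in allocation order instead and the freelist scans
stay short, which is why the quadratic behaviour hides until the
order is randomized.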
>
> The profile you posted last time indicates that b2bucket took about
> 4 microseconds per call, while merge took 315 microseconds per
> call--80 times longer!  So evidently, only one loop, the one in
> merge, matters here.

Exactly.  Once that becomes an O(1) algorithm in place of an O(n)
one, the problem will go away.

>
> Since there's only one loop in merge, this would mean the number of
> free blocks on the sublist is very large.  Can you see how large it
> is?  Are all the 30K objects on the same sublist, perhaps, and is
> that the reason for the slow search?  If so, how can we modify the
> code to run faster in this situation?

Dividing into sublists changes at most the constant factor; the
O(n*n) will always dominate.  The search has to go on until either
an adjacency is found or the whole list has been searched and none
has been found.  Sublists don't change that.

>
> These are the kinds of questions, and in that order, that I'd
> recommend to discuss.  Jumping to rewriting the code with a
> different data structure design without understanding those issues
> first is not a good way of dealing with such problems.

I can easily make a portion of the revised memblock compatible with
the original, in that the size and the freelist pointer will be at
the same (negative) offsets from the user pointer.  That would
forbid the safety improvement I am making to guard against
off-by-one indexed operations.  Publishing the offsets of those two
items in implementation space would allow gradual conversion of
unclean software that uses them, so that eventually the module
would be truly independent.  If the other software uses only the
size, so much the better.

I envision a globally available variable:

   const int __msize_offset = offsetof(memblock, data)
                            - offsetof(memblock, size);

and the header contains:

   extern const int __msize_offset;

which is initially constrained (and documented) to be the same
value as in the present implementation.  (A rough sketch of the
sort of layout I have in mind is appended after my sig.)

Have I answered your questions?

BTW, please don't pass on the hashlib I sent; it has since been
revised, but the revision is in abeyance because of this.  If
desired I will attach a later version to the mailing list.

--
Chuck F (cbfalconer AT yahoo DOT com) (cbfalconer AT XXXXworldnet DOT att DOT net)
   Available for consulting/temporary embedded and systems.
   (Remove "XXXX" from reply address.  yahoo works unmodified)
   mailto:uce AT ftc DOT gov  (for spambots to harvest)
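Rough sketch of the memblock idea referred to above.  This is
illustrative only: the field names, the exact struct layout, and the
helper below are made up for the example and are not the actual
revised allocator code.  The point is just that the size and the
freelist link sit at fixed negative offsets from the user pointer,
and that __msize_offset publishes where the size lives:

   #include <stddef.h>

   typedef struct memblock {
      size_t           size;    /* usable size of the block         */
      struct memblock *next;    /* freelist link, valid when free   */
      char             data[1]; /* the user pointer is &block->data */
   } memblock;

   /* defined once in the library, declared extern in the header */
   const int __msize_offset = offsetof(memblock, data)
                            - offsetof(memblock, size);

   /* Unclean code that currently peeks at the size word could be
      converted gradually to go through the published offset, e.g.: */
   size_t stored_size(void *userptr)
   {
      return *(size_t *)((char *)userptr - __msize_offset);
   }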