Date: Fri, 29 Jul 2005 17:25:56 -0400 (EDT)
From: Igor Pechtchanski
Reply-To: cygwin AT cygwin DOT com
To: Yitzchak Scott-Thoennes
cc: cygwin AT cygwin DOT com
Subject: Re: heap_chunk_in_mb default value (Was Re: perl - segfault on "free unused scalar")
In-Reply-To: <20050729170958.GA872@efn.org>

On Fri, 29 Jul 2005, Yitzchak Scott-Thoennes wrote:

> On Wed, Jul 27, 2005 at 05:07:23PM -0700, Stephan Mueller wrote:
> > "Igor Pechtchanski wrote:
> > "
> > " On Thu, 28 Jul 2005, Krzysztof Duleba wrote:
> > "
> > " > > > I've simplified the test case. It seems that Cygwin perl can't
> > " > > > handle too much memory. For instance:
> > " > > >
> > " > > > $ perl -e '$a="a"x(200 * 1024 * 1024); sleep 9'
> > " > > >
> > " > > > OK, this could have failed because $a might require 200 MB of
> > " > > > continuous space.
> > " > >
> > " > > Actually, $a requires *more* than 200MB of continuous space. Perl
> > " > > characters are 2 bytes, so you're allocating at least 400MB of
> > " > > space!
> > " >
> > " > Right, UTF. I completely forgot about that.
> > "
> > " Unicode, actually.
> >
> > Unicode is a standard that defines 'code points' (numeric values) for a
> > whole lot of different characters. UTF-8 is a specific encoding of
> > Unicode. It has the nifty property that ASCII characters are encoded
> > just as in ASCII -- one byte, with the high bit clear, and the low seven
> > bits representing a character in the range 0..127. Characters above the
> > ASCII range require multiple bytes -- sometimes two, sometimes more.
> > The algorithm is quite clever; find it in The Unicode Standard or with a
> > quick Google search.
> >
> > Another popular encoding is UCS-2, which is roughly "16-bit words each
> > holding one Unicode character".
> >
> > The latter is frequently what people think of as "Unicode". The former
> > is what perl uses internally to encode characters.
> >
> > End result is that the perl internal representation in the example above
> > probably only needs about 200MB of space, and not double that, as
> > suggested.
>
> Correct; perl uses UTF-8 (actually, an extension of UTF-8 which allows
> codepoints up to 2**72-1).

As I said before, it might be nice if this were clearer from the
perlunicode man page.

> However code like the above does end up using twice the space; it's
> allocated once to store the result of the x operation and again when
> it's copied to $a.

D'oh! I forgot that this was an assignment, not an initialization.
I feel properly chastised. :-)
	Igor
-- 
				http://cs.nyu.edu/~pechtcha/
      |\      _,,,---,,_		pechtcha AT cs DOT nyu DOT edu
ZZZzz /,`.-'`'    -.  ;-;;,_		igor AT watson DOT ibm DOT com
     |,4-  ) )-,_. ,\ (  `'-'		Igor Pechtchanski, Ph.D.
    '---''(_/--'  `-'\_) fL	a.k.a JaguaR-R-R-r-r-r-.-.-.  Meow!

If there's any real truth it's that the entire multidimensional infinity
of the Universe is almost certainly being run by a bunch of maniacs. /DA