
Mailing List Archive: Perl: porters

Seeking guidance on Benchmarking and potentially removing swash caching

 

 



public at khwilliamson

Aug 19, 2012, 11:14 AM

Seeking guidance on Benchmarking and potentially removing swash caching

The data structure used to store and access Unicode properties and
[bracketed character classes] in Perl regular expressions is a swash.
A swash is a hash with a few keys. The value of one of those keys is a
list of code points that match the swash, stored as an inversion list.
The value of another of the keys is a sub-hash which contains a cache of
the results for every code point that the program has so-far checked
against the list. I am considering changing to omit this sub-hash
cache, so every time the result is needed, it would be re-looked up
(using the same binary search of the inversion list that is done now for
the first look-up). I am seeking guidance for suitable benchmarks and
the advisability of making this change. I was inspired to do this now
because it makes other changes I'm in the process of making easier,
and by the discussion at
http://blogs.perl.org/users/ovid/2011/08/binary-search-versus-hash-lookup.html
which indicates that a binary search can be faster than a hash.

In this case the binary search is faster than the blog posting would
indicate, as it is done on a C array in a C loop.
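
For concreteness, here is a minimal Perl sketch of an inversion-list
membership test of the kind described above (the function name and
sample data are invented for illustration; the real lookup is done on a
C array in C):

    use strict;
    use warnings;

    # An inversion list is a sorted array of code points at which
    # membership toggles; even indices begin runs that are in the set.
    sub in_inversion_list {
        my ($invlist, $cp) = @_;
        my ($lo, $hi) = (0, scalar @$invlist);
        while ($lo < $hi) {                  # binary search for the count
            my $mid = int(($lo + $hi) / 2);  # of elements <= $cp
            if ($invlist->[$mid] <= $cp) { $lo = $mid + 1 }
            else                         { $hi = $mid }
        }
        return 0 if $lo == 0;       # $cp precedes the first range
        return ($lo - 1) % 2 == 0;  # in the set iff that index is even
    }

    # [A-Za-z] as an inversion list: on at 0x41, off at 0x5B,
    # on again at 0x61, off at 0x7B.
    my @latin_alpha = (0x41, 0x5B, 0x61, 0x7B);
    print in_inversion_list(\@latin_alpha, ord 'Q') ? "in\n" : "out\n";  # in
    print in_inversion_list(\@latin_alpha, ord '!') ? "in\n" : "out\n";  # out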

A problem with the hash cache is that it can conceivably grow very large
with time. Potentially, every code point representable on the machine
could have a bit allocated for it. This means that the upper limit of
memory a swash could use is 2**64 bits, plus keys and overhead, on a
64-bit platform; 2**32 bits on a 32-bit one. More realistically, if we
limit ourselves to code points used by Unicode, the number is 2**21
bits; but only about 25% of Unicode has been allocated, which brings
the likely maximum down to around 2**19 bits, which is 2**16 bytes =
64 KB. In practice, the
number is much smaller, as bits only get allocated for code points that
are actually looked up. But, if you have ever done, as I have, a loop
in which you start at 0 and test every code point, you will eat up more
and more memory. This all could be solved by limiting the size of the
cache with an LRU algorithm, but at the expense of additional overhead.
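
To make that failure mode concrete, here is a sketch of the kind of
exhaustive loop just described (the property chosen is arbitrary);
each distinct code point tested would add an entry to that property's
cache sub-hash:

    use strict;
    use warnings;

    my $alpha = 0;
    for my $cp (0 .. 0x10FFFF) {
        no warnings 'utf8';    # silence surrogate/noncharacter warnings
        $alpha++ if chr($cp) =~ /\p{Alphabetic}/;
    }
    print "$alpha alphabetic code points\n";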

I decided to do some experimentation. I ran a version of the Perl core
that skips the caching against the regression test suite (make test),
and compared the times to the standard one. The differences were in the
noise.

I then Benchmarked t/re/uniprops.t, which exercises the Unicode
properties, and thus doesn't have a lot of unrelated tests that would
mask the differences in the approaches. The results indicate that the
binary search alone is very very slightly faster on this set of tests.

Binary search alone:
    timethis 100: 1340 wallclock secs (1336.69 usr + 1.44 sys = 1338.13 CPU) @ 0.07/s (n=100)

Sub-hash cache:
    timethis 100: 1348 wallclock secs (1345.35 usr + 1.31 sys = 1346.66 CPU) @ 0.07/s (n=100)

It is quite easy to change the binary search-only version to cache the
last search result. This speeds up this benchmark a little:

Binary search + cached last result:
    timethis 100: 1332 wallclock secs (1328.83 usr + 1.16 sys = 1329.99 CPU) @ 0.08/s (n=100)
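
For reference, output in this format comes from the core Benchmark
module's timethis(); a minimal sketch of driving it (not the actual
harness used for the numbers above, and the workload is a stand-in):

    use Benchmark 'timethis';

    my $text = "\x{3B1}\x{3B2}\x{3B3} " x 10_000;    # stand-in workload
    timethis(100, sub {
        my $n = 0;
        $n++ while $text =~ /\p{L}/g;    # exercises the property lookup
    });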

So the modified binary search looks very slightly faster (1% over the
hash cache) on this data, and doesn't suffer from creeping memory use.
I suspect also that the hash cache gets slower over time as more code
points get added to it, increasing collisions.

I would prefer to benchmark more realistic real-world data, and wonder
if anyone has some ideas on such, or any other thoughts on this issue.


jhi at iki

Aug 19, 2012, 3:53 PM

Re: Seeking guidance on Benchmarking and potentially removing swash caching

>
> I would prefer to benchmark more realistic real-world data, and wonder
> if anyone has some ideas on such, or any other thoughts on this issue.

Just a very random idea: pull in a few Wikipedia pages in, say, French,
Russian, and Japanese, and run while (m{\pL}g) { $n++ } over them?
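
A minimal sketch of that suggestion, assuming a page has already been
saved locally as UTF-8 text (the file name is hypothetical):

    use strict;
    use warnings;

    open my $fh, '<:encoding(UTF-8)', 'ru.wikipedia.txt' or die $!;
    my $text = do { local $/; <$fh> };    # slurp the whole page
    my $n = 0;
    $n++ while $text =~ m{\p{L}}g;        # count code points matching \p{L}
    print "$n letters\n";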



davem at iabyn

Aug 20, 2012, 3:22 AM

Re: Seeking guidance on Benchmarking and potentially removing swash caching

On Sun, Aug 19, 2012 at 12:14:39PM -0600, Karl Williamson wrote:
> I would prefer to benchmark more realistic real-world data, and
> wonder if anyone has some ideas on such, or any other thoughts on
> this issue.

Well, one thing would be to benchmark a worst-case scenario. Pick a
swatch that has a large inversion list: i.e. something that is maximally
bad for binary search; then run a loop where only the same character is
repeatedly looked up (so minimally small cache). And see whether the
result is significantly bad.
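
A sketch of that worst case (the property and repeat count are
arbitrary choices for illustration):

    use Benchmark 'timethis';

    # One fixed character looked up repeatedly: the hash cache holds a
    # single entry, while the uncached version re-runs the binary search.
    my $ch = "\x{4E2D}";    # an arbitrary CJK character
    timethis(1_000_000, sub { $ch =~ /\p{Word}/ });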

PS: If a swatch has only a small, fixed number of fields, would it be
better to use an array rather than a hash to represent it?

--
I've often wanted to drown my troubles, but I can't get my wife to go
swimming.


public at khwilliamson

Aug 21, 2012, 9:09 AM

Re: Seeking guidance on Benchmarking and potentially removing swash caching

On 08/20/2012 04:22 AM, Dave Mitchell wrote:
> On Sun, Aug 19, 2012 at 12:14:39PM -0600, Karl Williamson wrote:
>> I would prefer to benchmark more realistic real-world data, and
>> wonder if anyone has some ideas on such, or any other thoughts on
>> this issue.
>
> Well, one thing would be to benchmark a worst-case scenario. Pick a
> swatch that has a large inversion list: i.e. something that is maximally
> bad for binary search; then run a loop where only the same character is
> repeatedly looked up (so minimally small cache). And see whether the
> result is significantly bad.
>
> PS: If a swatch has only a small, fixed number of fields, would it be
> better to use an array rather than a hash to represent it?
>


I'm not understanding you here very well.

You use the term 'swatch'; I think you meant what I called 'swash'.
The comments in the code use 'swatch' for the 64-bit bit map (8 bytes)
that each value in a swash's hash is.

If so, then I still don't understand. My email gave the results of
using binary search with and without caching the last look-up result;
caching gave somewhat better results, so my proposal included it. Your
test would just keep returning that cached value, so it would be fine.

I also don't understand the postscript. An inversion list is an ordered
array, so that would seem to fit your description. Perhaps you meant an
adaptive algorithm, which I hadn't thought of: use the inversion list
if the number of elements is small, and use a swash if it is larger.
(The largest inversion list for an official Unicode property is about
1600 elements, which is 11 probes (ceil(log2 1600)) to search, BTW.)
regcomp.c could decide for any given property or bracketed character
class which one to use. Possibly, also, a swash that was getting too
large and losing performance could be switched to an inversion list,
but that is probably more trouble than the possible gain.


public at khwilliamson

Aug 21, 2012, 1:26 PM

Re: Seeking guidance on Benchmarking and potentially removing swash caching

On 08/19/2012 04:53 PM, Jarkko Hietaniemi wrote:
>>
>> I would prefer to benchmark more realistic real-world data, and wonder
>> if anyone has some ideas on such, or any other thoughts on this issue.
>
> Just a very random idea: pull in a few wikipedia pages in, say, French,
> Russian, and Japanese, and while (m{\pL}g) { $n++ } through them?
>
>>
>

Thank you for this idea. I did it for Russian, and it showed the
current scheme had a 20-25% advantage over my proposed one, so I won't
be pursuing the proposal as-is.

It got me thinking more realistically about what real-world
applications look like. Most things are written in just one language,
or at most a few, and so processing will involve just a relatively few
code points. The hash implementation shines in this regard. Modern
Cyrillic has 32 characters IIRC, times 2 for upper/lower case; swash
hashes will handle these in just a couple of keys. Chinese has a lot
more characters, so I tried it on a Chinese Wikipedia page, and got
similar results: even though there are more keys in the hash, it's not
enough to degrade hash performance.

I did an experiment with a property for Korean with a small number (4)
of entries in its inversion list, and ran it on a Korean Wikipedia
entry. That showed the inversion list was about 8% faster than a swash.

As I said at first, the swash implementation has the disadvantage that
it continues to use memory and add keys as different, not-already-tried
code points are tested against it. That should lead to a performance
degradation over time in such a situation. (I'm saying this not
knowing anything about the Perl implementation, but from my general
knowledge of how hashes work, which is decades old.) But I don't think
there are that many real world situations where this likely happens.
One possible one is a long-running web server that serves up pages in
many different languages. Perhaps Wikipedia's is structured this way.
Other situations could include linguistics analysis code, and code that
is analyzing Unicode itself. These last two possibilities are so
specialized that we shouldn't worry about them at all.

I'm now leaning towards a composite algorithm determined at regex
compile time: if the inversion list is sufficiently small, use it;
otherwise use the current method. If it's easy to detect when the hash
performance is suffering, we could check at every insertion, and
probably clear the hash completely to start over.
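
A minimal sketch of that compile-time choice (the threshold and names
are invented for illustration, not measured values):

    # At regex compile time, pick a representation per property or
    # bracketed character class, based on inversion-list size.
    use constant INVLIST_MAX_ELEMENTS => 64;    # hypothetical cutoff

    sub choose_representation {
        my ($invlist) = @_;
        return @$invlist <= INVLIST_MAX_ELEMENTS
            ? 'binary search of inversion list'
            : 'swash hash with cache';
    }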

In the long run, it would be best to get most or all of the standard
Unicode properties into memory, using the techniques that ICU does; then
this wouldn't much matter.


jhi at iki

Aug 21, 2012, 1:50 PM

Re: Seeking guidance on Benchmarking and potentially removing swash caching

>
> Thank you for this idea. I did it for Russian, and it showed the current
> scheme had between 20-25% advantage over my proposed one, so I won't be
> pursuing the proposal as-is.

Glad it gave you some results. In the meanwhile I remembered another
source for more Unicode text, but this time it is much shorter (though
you can probably just self-concat it enough times), and at least in
principle the same text:

http://www.unicode.org/standard/WhatIsUnicode-more.html

--
There is this special biologist word we use for 'stable'. It is
'dead'. -- Jack Cohen


public at khwilliamson

Aug 21, 2012, 2:36 PM

Re: Seeking guidance on Benchmarking and potentially removing swash caching

On 08/21/2012 02:50 PM, Jarkko Hietaniemi wrote:
>>
>> Thank you for this idea. I did it for Russian, and it showed the current
>> scheme had between 20-25% advantage over my proposed one, so I won't be
>> pursuing the proposal as-is.
>
> Glad it gave you some results. In the meanwhile I remembered another
> source for more Unicode text,
> but this time it is much shorter (though you can probably just
> self-concat it enough times), and at least
> in principle the same text:
>
> http://www.unicode.org/standard/WhatIsUnicode-more.html
>

In the meantime, I looked at the Unicode 6.2 properties. There are 842
that match distinct sets of code points (not including the ones that
are complements of them). Some Unicode properties match the exact same
set of code points as others; for example, Line_Break=CR and
Grapheme_Cluster_Break=CR both match exactly a carriage return, and
there are other, non-trivial duplicates.

Of those:

    47% have just one or two elements in their inversion lists
    55% have up to four
    63% have up to eight
    70% have up to 16
    77% have up to 32
    81% have up to 64
    85% have up to 128
    90% have up to 256

This indicates that we shouldn't be generating swashes for most official
Unicode properties.


davem at iabyn

Aug 21, 2012, 2:45 PM

Re: Seeking guidance on Benchmarking and potentially removing swash caching

On Tue, Aug 21, 2012 at 10:09:31AM -0600, Karl Williamson wrote:
> On 08/20/2012 04:22 AM, Dave Mitchell wrote:
> >On Sun, Aug 19, 2012 at 12:14:39PM -0600, Karl Williamson wrote:
> >>I would prefer to benchmark more realistic real-world data, and
> >>wonder if anyone has some ideas on such, or any other thoughts on
> >>this issue.
> >
> >Well, one thing would be to benchmark a worst-case scenario. Pick a
> >swatch that has a large inversion list: i.e. something that is maximally
> >bad for binary search; then run a loop where only the same character is
> >repeatedly looked up (so minimally small cache). And see whether the
> >result is significantly bad.
> >
> >PS: If a swatch has only a small, fixed number of fields, would it be
> >better to use an array rather than a hash to represent it?
> >
>
>
> I'm not understanding you here very well.
>
> You use the term 'swatch'. I think you meant what I called 'swash'.
> The comments in the code refer to a 'swatch' as the 64-bit bit map
> (8 bytes) that each value in swash hashes is.

Brain fart on my part. I meant swash.

> If so, then I still don't understand. My email gave the results of
> using binary search with and without caching the last look-up
> result, and caching gave somewhat better results, so my proposal
> included caching, and so your test would just keep returning that
> cached value, so would be fine.

I wasn't paying close enough attention. Ignore me.
>
> I also don't understand the postscript. An inversion list is an
> ordered array, so that would seem to fit your description. Perhaps
> you meant an adaptive algorithm, which I hadn't thought of. Use the
> inversion list if the number of elements is small, and use a swash
> if it is larger. (The largest inversion list for an official Unicode
> property is about 1600 elements, which is 11 probes to search, BTW.)
> regcomp.c could decide for any given property or bracketed character
> class which one to use. Possibly, also, a swash that was getting
> too large and losing performance could be switched to an inversion
> list. That seems probably more trouble than the possible gain.

From your original email: "A swash is a hash with a few keys." I was
just randomly wondering, if the key set is small and well known,
whether swashes could be implemented using arrays, e.g.

    $swash{BITS}

becomes

    use constant BITS => 3;
    $swash[BITS]

But knowing almost nothing about swashes, I have no idea whether that
would provide any benefit.
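
A self-contained sketch of that comparison (the keys other than BITS,
the stored values, and the constant numbering are invented):

    use strict;
    use warnings;
    use Benchmark 'cmpthese';
    use constant { BITS => 0, INVLIST => 1 };

    my %swash_h = ( BITS => 1, INVLIST => [0x41, 0x5B] );
    my @swash_a = ( 1,         [0x41, 0x5B] );

    # In the hash subscript the bareword BITS is autoquoted to the
    # string "BITS"; in the array subscript it is the compile-time
    # constant 0, so no string needs hashing on each access.
    cmpthese(-1, {
        hash  => sub { my $x = $swash_h{BITS} },
        array => sub { my $x = $swash_a[BITS] },
    });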

--
I've often wanted to drown my troubles, but I can't get my wife to go
swimming.


nick at ccl4

Aug 29, 2012, 8:02 AM

Re: Seeking guidance on Benchmarking and potentially removing swash caching

On Tue, Aug 21, 2012 at 02:26:42PM -0600, Karl Williamson wrote:

> It got me to being more realistic about what real-world applications
> look like. Most things are written in just one language, or at most a
> few. And so processing will be of just a relatively few code points.
> The hash implementation shines in this regard. Modern Cyrillic has 32
> characters IIRC, times 2 for upper/lower case. Swash hashes will handle
> these in just a couple of keys. Chinese has a lot more characters, so I
> tried it on a Chinese wikipedia page, and got similar results. So even
> though there are more keys in the hash, it's not enough to degrade hash
> performance.

IIRC this was Larry's rough assumption about 10-12 years ago when he
chose the implementation he did: most programs won't be dealing with
more than one language, code points used by a language are in
clusters, so most programs won't ever have to load more than a few
bits of Unicode.

Hence:

> knowledge of how hashes work, which is decades old.) But I don't think
> there are that many real world situations where this likely happens.

> In the long run, it would be best to get most or all of the standard
> Unicode properties into memory, using the techniques that ICU does; then
> this wouldn't much matter.

Agree.

Nicholas Clark
