Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Python: Python

Code that ought to run fast, but can't due to Python limitations.

 

 

First page Previous page 1 2 Next page Last page  View All Python python RSS feed   Index | Next | Previous | View Threaded


nagle at animats

Jul 4, 2009, 10:33 AM

Post #1 of 46 (701 views)
Permalink
Code that ought to run fast, but can't due to Python limitations.

As an example of code that really needs to run fast, but is
speed-limited by Python's limitations, see "tokenizer.py" in

http://code.google.com/p/html5lib/

This is a parser for HTML 5, a piece of code that will be needed
in many places and will process large amounts of data. It's written
entirely in Python. Take a look at how much work has to be performed
per character.

This is a good test for Python implementation bottlenecks. Run
that tokenizer on HTML, and see where the time goes.

("It should be written in C" is not an acceptable answer.)

Python doesn't have a "switch" or "case" statement, and when
you need a state machine with many states, that makes for painful,
slow code. There's a comment in the code that it would be useful
to run a few billion lines of HTML through an instrumented version
of the parser to decide in which order the IF statements should be
executed. You shouldn't have to do that.

Yes, I've read PEP 3103. The big problem is the difficulty of figuring
out what's a constant and what might change. If all the cases are constants,
case statements are easy. But in Python, the compiler can't tell.

Parsers have many named compile-time constants. Python doesn't support
named compile-time constants, and this is one of the places where we
have to pay the bill for that limitation.

Something to think about when you need three more racks of servers
because the HTML parser is slow.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


benjamin.kaplan at case

Jul 4, 2009, 11:03 AM

Post #2 of 46 (685 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

On Sat, Jul 4, 2009 at 1:33 PM, John Nagle<nagle[at]animats.com> wrote:
>   As an example of code that really needs to run fast, but is
> speed-limited by Python's limitations, see "tokenizer.py" in
>
>        http://code.google.com/p/html5lib/
>
> This is a parser for HTML 5, a piece of code that will be needed
> in many places and will process large amounts of data. It's written
> entirely in Python.  Take a look at how much work has to be performed
> per character.
>
> This is a good test for Python implementation bottlenecks.  Run
> that tokenizer on HTML, and see where the time goes.
>
> ("It should be written in C" is not an acceptable answer.)
>
> Python doesn't have a "switch" or "case" statement, and when
> you need a state machine with many states, that makes for painful,
> slow code.  There's a comment in the code that it would be useful
> to run a few billion lines of HTML through an instrumented version
> of the parser to decide in which order the IF statements should be
> executed.  You shouldn't have to do that.
>

If you're cases are hashable, just use a dict instead of the if chain.
Then you get the constant time access to it.

def func_a() :
...
def func_b():
...
def func_c():
....

case = {"a":func_a, "b":func_b, "c":func_c}

case[value]()

> Yes, I've read PEP 3103.  The big problem is the difficulty of figuring
> out what's a constant and what might change.  If all the cases are
> constants,
> case statements are easy.  But in Python, the compiler can't tell.
>
> Parsers have many named compile-time constants.  Python doesn't support
> named compile-time constants, and this is one of the places where we
> have to pay the bill for that limitation.
>
> Something to think about when you need three more racks of servers
> because the HTML parser is slow.
>
>                                John Nagle
> --
> http://mail.python.org/mailman/listinfo/python-list
>
--
http://mail.python.org/mailman/listinfo/python-list


mwilson at the-wire

Jul 4, 2009, 11:20 AM

Post #3 of 46 (685 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

John Nagle wrote:
[ ... ]
> Parsers have many named compile-time constants. Python doesn't support
> named compile-time constants, and this is one of the places where we
> have to pay the bill for that limitation.
>
> Something to think about when you need three more racks of servers
> because the HTML parser is slow.

One technique used in such a case is to dispatch different case-handling
functions via a dictionary lookup.

Mel.


--
http://mail.python.org/mailman/listinfo/python-list


http://phr.cx at NOSPAM

Jul 4, 2009, 11:40 AM

Post #4 of 46 (685 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

John Nagle <nagle[at]animats.com> writes:
> Python doesn't have a "switch" or "case" statement, and when
> you need a state machine with many states, that makes for painful,
> slow code. ...
> There's a comment in the code that it would be useful
> to run a few billion lines of HTML through an instrumented version
> of the parser to decide in which order the IF statements should be
> executed. You shouldn't have to do that.

In that particular program it would probably be better to change those
if/elif/elif/else constructs to dictionary lookups. I see the program
already does that for some large tables.

--
http://mail.python.org/mailman/listinfo/python-list


nagle at animats

Jul 4, 2009, 4:35 PM

Post #5 of 46 (676 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Paul Rubin wrote:
> John Nagle <nagle[at]animats.com> writes:
>> Python doesn't have a "switch" or "case" statement, and when
>> you need a state machine with many states, that makes for painful,
>> slow code. ...
>> There's a comment in the code that it would be useful
>> to run a few billion lines of HTML through an instrumented version
>> of the parser to decide in which order the IF statements should be
>> executed. You shouldn't have to do that.
>
> In that particular program it would probably be better to change those
> if/elif/elif/else constructs to dictionary lookups. I see the program
> already does that for some large tables.

A dictionary lookup (actually, several of them) for every
input character is rather expensive. Tokenizers usually index into
a table of character classes, then use the character class index in
a switch statement.

This is an issue that comes up whenever you have to parse some formal
structure, from XML/HTML to Pickle to JPEG images to program source.

If Python could figure out what's a constant and what isn't during
compilation, this sort of thing could be much more efficient. In fact,
you don't even need a switch statement at the source level, if the
language is such that the compiler can figure out when "elif" clauses
are mutually exclusive.

The temptation is to write tokenizers in C, but that's an admission
of language design failure.

(A general problem with Python is "hidden dynamism". That is,
changes to variables that can't be found by examining the source.
This is a killer for optimizations. One could take the position that any
module variable with exactly one visible assignment to it is, in fact,
only assigned in one place, and if the right hand side is a constant,
the variable is a constant. This would break some programs doing funny
stuff with "eval", or using some of the backdoor ways to modify variables,
but that's very rare in practice. In return, you get the ability to
hard-compile more of Python into fast code. I'm thinking Shed Skin here,
not yet another attempt at a JIT system.)

On the other hand, trying to do this in Perl, where you can't even index
strings, is far worse.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


http://phr.cx at NOSPAM

Jul 4, 2009, 4:48 PM

Post #6 of 46 (678 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

John Nagle <nagle[at]animats.com> writes:
> A dictionary lookup (actually, several of them) for every
> input character is rather expensive. Tokenizers usually index into
> a table of character classes, then use the character class index in
> a switch statement.

Maybe you could use a regexp (and then have -two- problems...) to
find the token boundaries, then a dict to identify the actual token.
Tables of character classes seem a bit less attractive in the Unicode
era than in the old days.
--
http://mail.python.org/mailman/listinfo/python-list


nobody at nowhere

Jul 4, 2009, 6:25 PM

Post #7 of 46 (673 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

On Sat, 04 Jul 2009 16:35:08 -0700, John Nagle wrote:

> The temptation is to write tokenizers in C, but that's an admission
> of language design failure.

The only part that really needs to be written in C is the DFA loop. The
code to construct the state table from regexps could be written
entirely in Python, but I don't see any advantage to doing so.

--
http://mail.python.org/mailman/listinfo/python-list


ben+python at benfinney

Jul 4, 2009, 7:09 PM

Post #8 of 46 (673 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

John Nagle <nagle[at]animats.com> writes:

> A dictionary lookup (actually, several of them) for every input
> character is rather expensive. Tokenizers usually index into a table
> of character classes, then use the character class index in a switch
> statement.
>
> This is an issue that comes up whenever you have to parse some
> formal structure, from XML/HTML to Pickle to JPEG images to program
> source.
> […]
> The temptation is to write tokenizers in C, but that's an admission
> of language design failure.

This sounds like a job for <URL:http://pyparsing.wikispaces.com/>
Pyparsing.

--
\ “Better not take a dog on the space shuttle, because if he |
`\ sticks his head out when you're coming home his face might burn |
_o__) up.†—Jack Handey |
Ben Finney
--
http://mail.python.org/mailman/listinfo/python-list


nagle at animats

Jul 4, 2009, 8:15 PM

Post #9 of 46 (673 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Paul Rubin wrote:
> John Nagle <nagle[at]animats.com> writes:
>> A dictionary lookup (actually, several of them) for every
>> input character is rather expensive. Tokenizers usually index into
>> a table of character classes, then use the character class index in
>> a switch statement.
>
> Maybe you could use a regexp (and then have -two- problems...) to
> find the token boundaries, then a dict to identify the actual token.
> Tables of character classes seem a bit less attractive in the Unicode
> era than in the old days.

I want to see a regular expression that expresses the HTML 5 token
parsing rules, including all the explicitly specified error handling.

Here's some actual code, from "tokenizer.py". This is called once
for each character in an HTML document, when in "data" state (outside
a tag). It's straightforward code, but look at all those
dictionary lookups.

def dataState(self):
data = self.stream.char()

# Keep a charbuffer to handle the escapeFlag
if self.contentModelFlag in\
(contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
if len(self.lastFourChars) == 4:
self.lastFourChars.pop(0)
self.lastFourChars.append(data)

# The rest of the logic
if data == "&" and self.contentModelFlag in\
(contentModelFlags["PCDATA"], contentModelFlags["RCDATA"]) and not\
self.escapeFlag:
self.state = self.states["entityData"]
elif data == "-" and self.contentModelFlag in\
(contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and not\
self.escapeFlag and "".join(self.lastFourChars) == "<!--":
self.escapeFlag = True
self.tokenQueue.append({"type": "Characters", "data":data})
elif (data == "<" and (self.contentModelFlag == contentModelFlags["PCDATA"]
or (self.contentModelFlag in
(contentModelFlags["CDATA"],
contentModelFlags["RCDATA"]) and
self.escapeFlag == False))):
self.state = self.states["tagOpen"]
elif data == ">" and self.contentModelFlag in\
(contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and\
self.escapeFlag and "".join(self.lastFourChars)[1:] == "-->":
self.escapeFlag = False
self.tokenQueue.append({"type": "Characters", "data":data})
elif data == EOF:
# Tokenization ends.
return False
elif data in spaceCharacters:
# Directly after emitting a token you switch back to the "data
# state". At that point spaceCharacters are important so they are
# emitted separately.
self.tokenQueue.append({"type": "SpaceCharacters", "data":
data + self.stream.charsUntil(spaceCharacters, True)})
# No need to update lastFourChars here, since the first space will
# have already broken any <!-- or --> sequences
else:
chars = self.stream.charsUntil(("&", "<", ">", "-"))
self.tokenQueue.append({"type": "Characters", "data":
data + chars})
self.lastFourChars += chars[-4:]
self.lastFourChars = self.lastFourChars[-4:]
return True



John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


aahz at pythoncraft

Jul 4, 2009, 8:37 PM

Post #10 of 46 (670 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

In article <4a501a5e$0$1640$742ec2ed[at]news.sonic.net>,
John Nagle <nagle[at]animats.com> wrote:
>
> Here's some actual code, from "tokenizer.py". This is called once
>for each character in an HTML document, when in "data" state (outside
>a tag). It's straightforward code, but look at all those
>dictionary lookups.
>
> def dataState(self):
> data = self.stream.char()
>
> # Keep a charbuffer to handle the escapeFlag
> if self.contentModelFlag in\
> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
> if len(self.lastFourChars) == 4:
> self.lastFourChars.pop(0)
> self.lastFourChars.append(data)
>
> # The rest of the logic
> if data == "&" and self.contentModelFlag in\
> (contentModelFlags["PCDATA"], contentModelFlags["RCDATA"]) and not\
> self.escapeFlag:
> self.state = self.states["entityData"]
> elif data == "-" and self.contentModelFlag in\
> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and not\
> self.escapeFlag and "".join(self.lastFourChars) == "<!--":
> self.escapeFlag = True
> self.tokenQueue.append({"type": "Characters", "data":data})
> elif (data == "<" and (self.contentModelFlag == contentModelFlags["PCDATA"]
> or (self.contentModelFlag in
> (contentModelFlags["CDATA"],
> contentModelFlags["RCDATA"]) and
> self.escapeFlag == False))):
> self.state = self.states["tagOpen"]
> elif data == ">" and self.contentModelFlag in\
> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and\
> self.escapeFlag and "".join(self.lastFourChars)[1:] == "-->":
> self.escapeFlag = False
> self.tokenQueue.append({"type": "Characters", "data":data})
> elif data == EOF:
> # Tokenization ends.
> return False
> elif data in spaceCharacters:
> # Directly after emitting a token you switch back to the "data
> # state". At that point spaceCharacters are important so they are
> # emitted separately.
> self.tokenQueue.append({"type": "SpaceCharacters", "data":
> data + self.stream.charsUntil(spaceCharacters, True)})
> # No need to update lastFourChars here, since the first space will
> # have already broken any <!-- or --> sequences
> else:
> chars = self.stream.charsUntil(("&", "<", ">", "-"))
> self.tokenQueue.append({"type": "Characters", "data":
> data + chars})
> self.lastFourChars += chars[-4:]
> self.lastFourChars = self.lastFourChars[-4:]
> return True

Every single "self." is a dictionary lookup. Were you referring to
those? If not, I don't see your point. If yes, well, that's kind of the
whole point of using Python. You do pay a performance penalty. You can
optimize out some lookups, but you need to switch to C for some kinds of
computationally intensive algorithms. In this case, you can probably get
a large boost out of Pysco or Cython or Pyrex.
--
Aahz (aahz[at]pythoncraft.com) <*> http://www.pythoncraft.com/

"as long as we like the same operating system, things are cool." --piranha
--
http://mail.python.org/mailman/listinfo/python-list


pavlovevidence at gmail

Jul 4, 2009, 10:51 PM

Post #11 of 46 (667 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

On Jul 4, 4:35 pm, John Nagle <na...@animats.com> wrote:
>     The temptation is to write tokenizers in C, but that's an admission
> of language design failure.

No it isn't. It's only a failure of Python to be the language that
does everything *you* want.

Carl Banks
--
http://mail.python.org/mailman/listinfo/python-list


nick at craig-wood

Jul 5, 2009, 12:30 AM

Post #12 of 46 (665 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

John Nagle <nagle[at]animats.com> wrote:
> As an example of code that really needs to run fast, but is
> speed-limited by Python's limitations, see "tokenizer.py" in
>
> http://code.google.com/p/html5lib/
>
> This is a parser for HTML 5, a piece of code that will be needed
> in many places and will process large amounts of data. It's written
> entirely in Python. Take a look at how much work has to be performed
> per character.
>
> This is a good test for Python implementation bottlenecks. Run
> that tokenizer on HTML, and see where the time goes.
>
> ("It should be written in C" is not an acceptable answer.)

You could compile it with Cython though. lxml took this route...

--
Nick Craig-Wood <nick[at]craig-wood.com> -- http://www.craig-wood.com/nick
--
http://mail.python.org/mailman/listinfo/python-list


mail at microcorp

Jul 5, 2009, 1:12 AM

Post #13 of 46 (664 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

"John Nagle" <nagle[at]...ats.com> wrote:

> Python doesn't have a "switch" or "case" statement, and when
> you need a state machine with many states, that makes for painful,
> slow code. There's a comment in the code that it would be useful
> to run a few billion lines of HTML through an instrumented version
> of the parser to decide in which order the IF statements should be
> executed. You shouldn't have to do that.

You do not have to implement a state machine in a case statement,
or in a series of if... elifs.

Python is not C.

Use a dispatch dict, and have each state return the next state.
Then you can use strings representing state names, and
everybody will be able to understand the code.

toy example, not tested, nor completed:

protocol = {"start":initialiser,"hunt":hunter,"classify":classifier,....other
states}

def state_machine():
next_step = protocol["start"]()
while True:
next_step = protocol[next_step]()

Simple, and almost as fast as if you did the same thing
in assembler using pointers.

And each state will have a finite set of reasons to
either stay where its at, or move on. Not a lot you
can do about that, but test for them one at a time.
But at least you will have split the problem up,
and you won't be doing irrelevant tests.

You can even do away with the dict, by having
each state return the actual next state routine:

next_state = protocol_initialiser()
while True:
next_state = next_state()
time.sleep(0.001) # this prevents thrashing when monitoring real events

If you are using a gui, and you have access
to an after callback, then you can make a
ticking stutter thread to run some monitoring
machine in the background, using the same
"tell me what to do next" technique.

To take the timing thing further, you can do:

wait_time, next_state = protocol_initialiser()
while True:
if wait_time:
time.sleep(wait_time)
wait_time, next_state = next_state()

This gives you control over busy-wait loops,
and lets you speed up when it is needed.

Python really is not C.

- Hendrik




--
http://mail.python.org/mailman/listinfo/python-list


steve at REMOVE-THIS-cybersource

Jul 5, 2009, 1:28 AM

Post #14 of 46 (664 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

On Sat, 04 Jul 2009 20:15:11 -0700, John Nagle wrote:

> Paul Rubin wrote:
>> John Nagle <nagle[at]animats.com> writes:
>>> A dictionary lookup (actually, several of them) for every
>>> input character is rather expensive. Tokenizers usually index into a
>>> table of character classes, then use the character class index in a
>>> switch statement.
>>
>> Maybe you could use a regexp (and then have -two- problems...) to find
>> the token boundaries, then a dict to identify the actual token. Tables
>> of character classes seem a bit less attractive in the Unicode era than
>> in the old days.
>
> I want to see a regular expression that expresses the HTML 5 token
> parsing rules, including all the explicitly specified error handling.

Obviously the regex can't do the error handling. Nor should you expect a
single regex to parse an entire HTML document. But you could (perhaps)
use regexes to parse pieces of the document, as needed.

Have you investigated the pyparsing module? Unless you have some reason
for avoiding it, for any complicated parsing job I'd turn to that before
trying to roll your own.


> Here's some actual code, from "tokenizer.py". This is called once
> for each character in an HTML document, when in "data" state (outside a
> tag). It's straightforward code, but look at all those dictionary
> lookups.

Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't
see how a case statement would help you here: you're not dispatching on a
value, but running through a series of tests until one passes. There are
languages where you can write something like:

case:
x > 0: process_positive(x)
x < 0: process_negative(x)
x == 0: process_zero(x)

but that's generally just syntactic sugar for the obvious if...elif...
block. (Although clever compilers might recognise that it's the same x in
each expression, and do something clever to optimize the code.)


Nor do I see why you were complaining about Python not having true
constants. I don't see how that would help you... most of your explicit
dict lookups are against string literals e.g. contentModelFlags["RCDATA"].

So while I feel your pain, I'm not sure I understand why you're blaming
this on *Python*.


--
Steven
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 5, 2009, 1:41 AM

Post #15 of 46 (664 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

John Nagle wrote:
> Python doesn't have a "switch" or "case" statement, and when
> you need a state machine with many states, that makes for painful,
> slow code.

Cython has a built-in optimisation that maps if-elif-else chains to C's
switch statement if they only test a single int/char variable, even when
you write things like "elif x in [1,5,9,12]". This works in Cython, because
we know that the comparison to a C int/char is side-effect free. It may not
always be side-effect free in Python, so this won't work in general. It
would be perfect for your case, though.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


http://phr.cx at NOSPAM

Jul 5, 2009, 1:58 AM

Post #16 of 46 (664 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Steven D'Aprano <steve[at]REMOVE-THIS-cybersource.com.au> writes:
> Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't
> see how a case statement would help you here: you're not dispatching on a
> value, but running through a series of tests until one passes.

A case statement switch(x):... into a bunch of constant case labels
would be able to use x as an index into a jump vector, and/or do an
unrolled logarithmic (bisection-like) search through the tests,
instead of a linear search.
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 5, 2009, 1:58 AM

Post #17 of 46 (665 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

John Nagle wrote:
> Here's some actual code, from "tokenizer.py". This is called once
> for each character in an HTML document, when in "data" state (outside
> a tag). It's straightforward code, but look at all those
> dictionary lookups.
>
> def dataState(self):
> data = self.stream.char()
>
> # Keep a charbuffer to handle the escapeFlag
> if self.contentModelFlag in\
> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):

Is the tuple

(contentModelFlags["CDATA"], contentModelFlags["RCDATA"])

constant? If that is the case, I'd cut it out into a class member (or
module-local variable) first thing in the morning. And I'd definitely keep
the result of the "in" test in a local variable for reuse, seeing how many
times it's used in the rest of the code.

Writing inefficient code is not something to blame the language for.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 5, 2009, 2:04 AM

Post #18 of 46 (664 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

John Nagle wrote:
> Paul Rubin wrote:
>> John Nagle <nagle[at]animats.com> writes:
>>> Python doesn't have a "switch" or "case" statement, and when
>>> you need a state machine with many states, that makes for painful,
>>> slow code. ...
>>> There's a comment in the code that it would be useful
>>> to run a few billion lines of HTML through an instrumented version
>>> of the parser to decide in which order the IF statements should be
>>> executed. You shouldn't have to do that.
>>
>> In that particular program it would probably be better to change those
>> if/elif/elif/else constructs to dictionary lookups. I see the program
>> already does that for some large tables.
>
> A dictionary lookup (actually, several of them) for every
> input character is rather expensive.

Did you implement this and prove your claim in benchmarks? Taking a look at
the current implementation, I'm pretty sure a dict-based implementation
would outrun it in your first try.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


steve at REMOVE-THIS-cybersource

Jul 5, 2009, 2:07 AM

Post #19 of 46 (664 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:

> Python is not C.

John Nagle is an old hand at Python. He's perfectly aware of this, and
I'm sure he's not trying to program C in Python.

I'm not entirely sure *what* he is doing, and hopefully he'll speak up
and say, but whatever the problem is it's not going to be as simple as
that.


--
Steven
--
http://mail.python.org/mailman/listinfo/python-list


steve at REMOVE-THIS-cybersource

Jul 5, 2009, 2:13 AM

Post #20 of 46 (664 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

On Sun, 05 Jul 2009 01:58:13 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve[at]REMOVE-THIS-cybersource.com.au> writes:
>> Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't
>> see how a case statement would help you here: you're not dispatching on
>> a value, but running through a series of tests until one passes.
>
> A case statement switch(x):... into a bunch of constant case labels
> would be able to use x as an index into a jump vector, and/or do an
> unrolled logarithmic (bisection-like) search through the tests, instead
> of a linear search.

Yes, I'm aware of that, but that's not what John's code is doing -- he's
doing a series of if expr ... elif expr tests. I don't think a case
statement can do much to optimize that.



--
Steven
--
http://mail.python.org/mailman/listinfo/python-list


http://phr.cx at NOSPAM

Jul 5, 2009, 2:38 AM

Post #21 of 46 (661 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Steven D'Aprano <steve[at]REMOVE-THIS-cybersource.com.au> writes:
> Yes, I'm aware of that, but that's not what John's code is doing -- he's
> doing a series of if expr ... elif expr tests. I don't think a case
> statement can do much to optimize that.

The series of tests is written that way because there is no case
statement available. It is essentially switching on a bunch of
character constants and then doing some additional tests in each
branch.

It could be that using ord(c) as an index into a list of functions
might be faster than a dict lookup on c to get a function. I think
John is hoping to avoid a function call and instead get an indexed
jump within the Python bytecode for the big function.
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 5, 2009, 2:41 AM

Post #22 of 46 (661 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Stefan Behnel wrote:
> John Nagle wrote:
>> Here's some actual code, from "tokenizer.py". This is called once
>> for each character in an HTML document, when in "data" state (outside
>> a tag). It's straightforward code, but look at all those
>> dictionary lookups.
>>
>> def dataState(self):
>> data = self.stream.char()
>>
>> # Keep a charbuffer to handle the escapeFlag
>> if self.contentModelFlag in\
>> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
>
> Is the tuple
>
> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"])
>
> constant? If that is the case, I'd cut it out into a class member (or
> module-local variable) first thing in the morning.

Ah, and there's also this little trick to make it a (fast) local variable
in that method:

def some_method(self, some_const=(1,2,3,4)):
...

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 5, 2009, 2:48 AM

Post #23 of 46 (661 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Paul Rubin wrote:
> Steven D'Aprano writes:
>> Yes, I'm aware of that, but that's not what John's code is doing -- he's
>> doing a series of if expr ... elif expr tests. I don't think a case
>> statement can do much to optimize that.
>
> The series of tests is written that way because there is no case
> statement available. It is essentially switching on a bunch of
> character constants and then doing some additional tests in each
> branch.

Although doing some of the tests first and then checking the input
conditionally might be faster here.

Another idea: You could exchange the methods whenever self.contentModelFlag
changes, i.e. you'd have a "dataState_CDATA", a "dataState_PCDATA" etc.


> It could be that using ord(c) as an index into a list of functions
> might be faster than a dict lookup on c to get a function.

Rather unlikely, given that calling "ord(c)" involves a dict lookup for
"ord". You might get away with the pre-initialised keywords trick, though.


> I think
> John is hoping to avoid a function call and instead get an indexed
> jump within the Python bytecode for the big function.

Hmm, yes, the actual code inside the conditionals is pretty short, so the
call overhead might hurt here.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


http://phr.cx at NOSPAM

Jul 5, 2009, 2:52 AM

Post #24 of 46 (661 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Stefan Behnel <stefan_ml[at]behnel.de> writes:
> > # Keep a charbuffer to handle the escapeFlag
> > if self.contentModelFlag in\
> > (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
> Is the tuple
> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"])
> constant? If that is the case, I'd cut it out into a class member ...

I think the main issue for that function comes after that if statement.
There is a multi-way switch on a bunch of different possible character
values. I do agree with you that the first "if" can also be optimized.
--
http://mail.python.org/mailman/listinfo/python-list


http://phr.cx at NOSPAM

Jul 5, 2009, 2:54 AM

Post #25 of 46 (661 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Stefan Behnel <stefan_ml[at]behnel.de> writes:
> > The series of tests is written that way because there is no case
> > statement available. It is essentially switching on a bunch of
> > character constants and then doing some additional tests in each
> > branch.
> Although doing some of the tests first and then checking the input
> conditionally might be faster here.

That is essentially what happens. There are a bunch of tests of the
form
if data=='<' and [some other stuff]: ...

Because of short-circuit evaluation of "and", the additional tests
only happen once the character has been matched.
--
http://mail.python.org/mailman/listinfo/python-list

First page Previous page 1 2 Next page Last page  View All Python python RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact lists@gossamer-threads.com
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.