Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Python: Python

Code that ought to run fast, but can't due to Python limitations.

 

 

First page Previous page 1 2 Next page Last page  View All Python python RSS feed   Index | Next | Previous | View Threaded


stefan_ml at behnel

Jul 5, 2009, 3:04 AM

Post #26 of 46 (723 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Paul Rubin wrote:
> Stefan Behnel writes:
>>> The series of tests is written that way because there is no case
>>> statement available. It is essentially switching on a bunch of
>>> character constants and then doing some additional tests in each
>>> branch.
>> Although doing some of the tests first and then checking the input
>> conditionally might be faster here.
>
> That is essentially what happens. There are a bunch of tests of the
> form
> if data=='<' and [some other stuff]: ...

That's what I meant. Some of the "other stuff" is redundant enough to do it
once at the beginning of the function (or even before entering the
function, by writing specialised methods), i.e. I'd (partially) reverse the
order of the "and" operands.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 5, 2009, 3:06 AM

Post #27 of 46 (723 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Paul Rubin wrote:
> Stefan Behnel writes:
>>> # Keep a charbuffer to handle the escapeFlag
>>> if self.contentModelFlag in\
>>> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
>> Is the tuple
>> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"])
>> constant? If that is the case, I'd cut it out into a class member ...
>
> I think the main issue for that function comes after that if statement.
> There is a multi-way switch on a bunch of different possible character
> values. I do agree with you that the first "if" can also be optimized.

You may notice that the creation of this exact tuple appears in almost all
if the conditionals of this method. So it is part of the bottleneck.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


http://phr.cx at NOSPAM

Jul 5, 2009, 4:09 AM

Post #28 of 46 (723 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Stefan Behnel <stefan_ml [at] behnel> writes:
> You may notice that the creation of this exact tuple appears in almost all
> if the conditionals of this method. So it is part of the bottleneck.

I don't think so. The tuple is only created when the character has
already matched, and for the vast majority of the chars in the input
stream (ordinary text chars rather than html delimiters) none of them
match.
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 5, 2009, 4:23 AM

Post #29 of 46 (725 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Paul Rubin wrote:
> Stefan Behnel writes:
>> You may notice that the creation of this exact tuple appears in almost all
>> if the conditionals of this method. So it is part of the bottleneck.
>
> I don't think so. The tuple is only created when the character has
> already matched, and for the vast majority of the chars in the input
> stream (ordinary text chars rather than html delimiters) none of them
> match.

Well, it's the second thing that happens when entering the method, it
happens several times later on when specific characters are matched, and it
also happens at the end when none of the special characters did match. So
it /is/ part of the bottleneck because the dict lookups, the tuple
creation, the "in" test and the tuple deallocation happen *twice* for
almost all characters in the stream.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


ptmcg at austin

Jul 5, 2009, 4:41 AM

Post #30 of 46 (722 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

On Jul 5, 3:12 am, "Hendrik van Rooyen" <m...@microcorp.co.za> wrote:
>
> Use a dispatch dict, and have each state return the next state.
> Then you can use strings representing state names, and
> everybody will be able to understand the code.
>
> toy example, not tested, nor completed:
>
> protocol = {"start":initialiser,"hunt":hunter,"classify":classifier,....other
> states}
>
> def state_machine():
> next_step = protocol["start"]()
> while True:
> next_step = protocol[next_step]()
>

I've just spent about an hour looking over this code, with a few
comments to inject to the thread here:

- To all those suggesting the OP convert to a dispatch table, be
assured that this code is well aware of this idiom. It is used
HEAVILY at a macro level, picking through the various HTML states
(starting a tag, reading attributes, reading body, etc.). There still
are a number of cascading if-elif's within some of these states, and
some of them *may* be candidates for further optimization.

- There is an underlying HTMLInputStream that seems to be doing some
unnecessary position bookkeeping (positionLine and positionCol).
Commenting this out increases my test speed by about 13%. In my
ignorance, I may be removing some important behavior, but this does
not seem to be critical as I tested against a few megs of HTML
source. Before blaming the tokenizer for everything, there may be
more performance to be wrung from the input stream processor. For
that matter, I would guess that about 90% of all HTML files that this
code would process would easily fit in memory - in that case, the
stream processing (and all of the attendant "if I'm not at the end of
the current chunk" code) could be skipped/removed entirely.

- The HTMLInputStream's charsUntil code is an already-identified
bottleneck, and some re enhancements have been applied here to help
out.

- Run-time construction of tuple literals where the tuple members are
constants can be lifted out. emitCurrentToken rebuilds this tuple
every time it is called (which is a lot!):

if (token["type"] in (tokenTypes["StartTag"], tokenTypes
["EndTag"], tokenTypes["EmptyTag"])):

Move this tuple literal into a class constant (or if you can tolerate
it, a default method argument to gain LOAD_FAST benefits - sometimes
optimization isn't pretty).

- These kinds of optimizations are pretty small, and only make sense
if they are called frequently. Tallying which states are called in my
test gives the following list in decreasing frequency. Such a list
would help guide your further tuning efforts:

tagNameState 194848
dataState 182179
attributeNameState 116507
attributeValueDoubleQuotedState 114931
tagOpenState 105556
beforeAttributeNameState 58612
beforeAttributeValueState 58216
afterAttributeValueState 58083
closeTagOpenState 50547
entityDataState 1673
attributeValueSingleQuotedState 1098
commentEndDashState 372
markupDeclarationOpenState 370
commentEndState 364
commentStartState 362
commentState 362
selfClosingStartTagState 359
doctypePublicIdentifierDoubleQuotedState 291
doctypeSystemIdentifierDoubleQuotedState 247
attributeValueUnQuotedState 191
doctypeNameState 32
beforeDoctypePublicIdentifierState 16
afterDoctypePublicIdentifierState 14
afterDoctypeNameState 9
doctypeState 8
beforeDoctypeNameState 8
afterDoctypeSystemIdentifierState 6
afterAttributeNameState 5
commentStartDashState 2
bogusCommentState 2

For instance, I wouldn't bother doing much tuning of the
bogusCommentState. Anything called fewer than 50,000 times in this
test doesn't look like it would be worth the trouble.


-- Paul

(Thanks to those who suggested pyparsing as an alternative, but I
think this code is already beyond pyparsing in a few respects. For
one thing, this code works with an input stream, in order to process
large HTML files; pyparsing *only* works with an in-memory string.
This code can also take advantage of some performance short cuts,
knowing that it is parsing HTML; pyparsing's generic classes can't do
that.)
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 5, 2009, 4:52 AM

Post #31 of 46 (722 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

John Nagle wrote:
> Here's some actual code, from "tokenizer.py". This is called once
> for each character in an HTML document, when in "data" state (outside
> a tag). It's straightforward code, but look at all those
> dictionary lookups.
>
> def dataState(self):
> data = self.stream.char()
>
> # Keep a charbuffer to handle the escapeFlag
> if self.contentModelFlag in\
> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
> if len(self.lastFourChars) == 4:
> self.lastFourChars.pop(0)
> self.lastFourChars.append(data)
>
> # The rest of the logic
> if data == "&" and self.contentModelFlag in\
> (contentModelFlags["PCDATA"], contentModelFlags["RCDATA"]) and
> not\
> self.escapeFlag:
> self.state = self.states["entityData"]
> elif data == "-" and self.contentModelFlag in\
> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and
> not\
> self.escapeFlag and "".join(self.lastFourChars) == "<!--":
> self.escapeFlag = True
> self.tokenQueue.append({"type": "Characters", "data":data})
> elif (data == "<" and (self.contentModelFlag ==
> contentModelFlags["PCDATA"]
> or (self.contentModelFlag in
> (contentModelFlags["CDATA"],
> contentModelFlags["RCDATA"]) and
> self.escapeFlag == False))):
> self.state = self.states["tagOpen"]
> elif data == ">" and self.contentModelFlag in\
> (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and\
> self.escapeFlag and "".join(self.lastFourChars)[1:] == "-->":
> self.escapeFlag = False
> self.tokenQueue.append({"type": "Characters", "data":data})
> elif data == EOF:
> # Tokenization ends.
> return False
> elif data in spaceCharacters:
> # Directly after emitting a token you switch back to the "data
> # state". At that point spaceCharacters are important so
> they are
> # emitted separately.
> self.tokenQueue.append({"type": "SpaceCharacters", "data":
> data + self.stream.charsUntil(spaceCharacters, True)})
> # No need to update lastFourChars here, since the first
> space will
> # have already broken any <!-- or --> sequences
> else:
> chars = self.stream.charsUntil(("&", "<", ">", "-"))
> self.tokenQueue.append({"type": "Characters", "data":
> data + chars})
> self.lastFourChars += chars[-4:]
> self.lastFourChars = self.lastFourChars[-4:]
> return True

Giving this some more thought, I'd also try is to split the huge
if-elif-else block like this:

if data in string_with_all_special_characters:
if data == '&' ...:
...
else:
...

So there are three things to improve:

- eliminate common subexpressions which you know are constant
- split the large conditional sequence as shown above
- use separate dataState() methods when inside and outside of CDATA/RCDATA
blocks and (maybe) escaped blocks

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


mail at microcorp

Jul 5, 2009, 6:45 AM

Post #32 of 46 (714 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

"Steven D'Aprano" <steve [at] REMOVE-THIS-cy> wrote:

>On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:
>
>> Python is not C.
>
>John Nagle is an old hand at Python. He's perfectly aware of this, and
>I'm sure he's not trying to program C in Python.
>
>I'm not entirely sure *what* he is doing, and hopefully he'll speak up
>and say, but whatever the problem is it's not going to be as simple as
>that.

I am well aware that John is not a newbie.
He was complaining about Python's lack of a case statement
in the context of a state machine.

The point I was trying to make is that, if any state machine
is examined, then, if you examine any one state, the reasons
for leaving it ("state transitions") is always a subset of the
choices that _can_ be made.

So that drawing a circle round each state in a state diagram,
and making a routine to examine the arrows leaving that circle,
and returning the destination point of the chosen arrow,
is a way of splitting the job up, and results in making only
the relevant decisions at the time of their relevance.

This is in contrast to the classic C way of making one big
case statement to implement a finite state machine, which
gets its efficiency (if any) out of compiler optimisations
such as replacing a skip chain with a jump table.

I understand that it leads to a lot of what looks like
boilerplate code, but he was looking for speed...

- Hendrik



--
http://mail.python.org/mailman/listinfo/python-list


mail at microcorp

Jul 5, 2009, 7:03 AM

Post #33 of 46 (715 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

"Paul Rubin" <http://phr.cx [at] NOSPAM> wrote:

> The series of tests is written that way because there is no case
> statement available. It is essentially switching on a bunch of
> character constants and then doing some additional tests in each
> branch.
>
> It could be that using ord(c) as an index into a list of functions
> might be faster than a dict lookup on c to get a function. I think
> John is hoping to avoid a function call and instead get an indexed
> jump within the Python bytecode for the big function.

I agree about the ord(c). However, avoiding the function call
is, I think, right now, in the realms of wishful thinking.

I cannot see how you could avoid a python function call - even if he
bites the bullet and implements my laborious scheme, he would still
have to fetch the next character to test against, inside the current state.

So if it is the function calls that is slowing him down, I cannot
imagine a solution using less than one per character, in which
case he is screwed no matter what he does.

But wait - maybe if he passes an iterator around - the equivalent of
for char in input_stream...
Still no good though, unless the next call to the iterator is faster
than an ordinary python call.

- Hendrik



--
http://mail.python.org/mailman/listinfo/python-list


l.mastrodomenico at gmail

Jul 5, 2009, 7:25 AM

Post #34 of 46 (718 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

2009/7/5 Hendrik van Rooyen <mail [at] microcorp>:
> I cannot see how you could avoid a python function call - even if he
> bites the bullet and implements my laborious scheme, he would still
> have to fetch the next character to test against, inside the current state.
>
> So if it is the function calls that is slowing him down, I cannot
> imagine a solution using less than one per character, in which
> case he is screwed no matter what he does.

A simple solution may be to read the whole input HTML file in a
string. This potentially requires lots of memory but I suspect that
the use case by far most common for this parser is to build a DOM (or
DOM-like) tree of the whole document. This tree usually requires much
more memory that the HTML source itself.

So, if the code duplication is acceptable, I suggest keeping this
implementation for cases where the input is extremely big *AND* the
whole program will work on it in "streaming", not just the parser
itself.

Then write a simpler and faster parser for the more common case when
the data is not huge *OR* the user will keep the whole document in
memory anyway (e.g. on a tree).

Also: profile, profile a lot. HTML pages are very strange beasts and
the bottlenecks may be in innocent-looking places!

--
Lino Mastrodomenico
--
http://mail.python.org/mailman/listinfo/python-list


aahz at pythoncraft

Jul 5, 2009, 8:08 AM

Post #35 of 46 (717 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

In article <mailman.2639.1246802753.8015.python-list [at] python>,
Hendrik van Rooyen <mail [at] microcorp> wrote:
>
>But wait - maybe if he passes an iterator around - the equivalent of
>for char in input_stream... Still no good though, unless the next call
>to the iterator is faster than an ordinary python call.

Calls to iterators created by generators are indeed faster than an
ordinary Python call, because the stack frame is already mostly set up.
--
Aahz (aahz [at] pythoncraft) <*> http://www.pythoncraft.com/

"as long as we like the same operating system, things are cool." --piranha
--
http://mail.python.org/mailman/listinfo/python-list


martin at v

Jul 5, 2009, 12:23 PM

Post #36 of 46 (703 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

> This is a good test for Python implementation bottlenecks. Run
> that tokenizer on HTML, and see where the time goes.

I looked at it with cProfile, and the top function that comes up
for a larger document (52k) is
...validator.HTMLConformanceChecker.__iter__.

This method dispatches various validation routines, and it computes
the method names from the input over and over again, doing lots
of redundant string concatenations. It also capitalizes the element
names, even though the spelling in the original document is probably
not capitalized (but either upper-case or lower case).

In my patch below, I create a dictionary of bound methods, indexed
by (syntax) type and name, following the logic of falling back to
just type-based validation if no type/name routine exists. However,
in order to reduce the number of dictionary lookups, it will also
cache type/name pairs (both in the original spelling, and the
capitalized spelling), so that subsequent occurrences of the same
element will hit the method cache.

With this simple optimization, I get a 20% speedup on my test
case. In my document, there are no attributes - the same changes
should be made to attribute validation routines.

I don't think this has anything to do with the case statement.

Regards,
Martin
Attachments: methodlookup.diff (2.71 KB)


david.m.cooke at gmail

Jul 6, 2009, 1:16 AM

Post #37 of 46 (694 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Martin v. Löwis <martin <at> v.loewis.de> writes:

> > This is a good test for Python implementation bottlenecks. Run
> > that tokenizer on HTML, and see where the time goes.
>
> I looked at it with cProfile, and the top function that comes up
> for a larger document (52k) is
> ...validator.HTMLConformanceChecker.__iter__.
[...]
> With this simple optimization, I get a 20% speedup on my test
> case. In my document, there are no attributes - the same changes
> should be made to attribute validation routines.
>
> I don't think this has anything to do with the case statement.

I agree. I ran cProfile over just the tokenizer step; essentially

tokenizer = html5lib.tokenizer.HTMLStream(htmldata)
for tok in tokenizer:
pass

It mostly *isn't* tokenizer.py that's taking the most time, it's
inputstream.py. (There is one exception:
tokenizer.py:HTMLStream.__init__ constructs a dictionary of states
each time -- this is unnecessary, replace all expressions like
self.states["attributeName"] with self.attributeNameState.)

I've done several optimisations -- I'll upload the patch to the
html5lib issue tracker. In particular,

* The .position property of EncodingBytes is used a lot. Every
self.position +=1 calls getPosition() and setPosition(). Another
getPosition() call is done in the self.currentByte property. Most of
these can be optimised away by using methods that move the position
and return the current byte.

* In HTMLInputStream, the current line number and column are updated
every time a new character is read with .char(). The current position
is *only* used in error reporting, so I reworked it to only calculate
the position when .position() is called, by keeping track of the
number of lines in previous read chunks, and computing the number of
lines to the current offset in the current chunk.

These give me about a 20% speedup.

This just illustrates that the first step in optimisation is profiling :D

As other posters have said, slurping the whole document into memory
and using a regexp-based parser (such as pyparsing) would likely give
you the largest speedups. If you want to keep the chunk- based
approach, you can still use regexp's, but you'd have to think about
matching on chunk boundaries. One way would be to guarantee a minimum
number of characters available, say 10 or 50 (unless end-of-file, of
course) -- long enough such that any *constant* string you'd want to
match like <![.CDATA[. would fit inside that guaranteed length. Any
arbitrary-length tokens (such as attribute names and values) would be
matched, not with regexps like [a-z]+, but with [a-z]{1,10} (match
[a-z] from 1 to 10 times), and joining the individual matches together
to make one token.

Since html5lib has several implementations for several languages, it
may actually be worth it to generate lexers for each language from one
specification file.

Take care,
David M. Cooke <david.m.cooke [at] gmail>

--
http://mail.python.org/mailman/listinfo/python-list


ldo at geek-central

Jul 6, 2009, 1:32 AM

Post #38 of 46 (685 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

In message <4a4f91f9$0$1587$742ec2ed [at] news>, John Nagle wrote:

> ("It should be written in C" is not an acceptable answer.)

I don't see why not. State machines that have to process input byte by byte
are well known to be impossible to implement efficiently in high-level
languages. That's why lex/flex isn't worth using. Even GCC has been moving
to hand-coded parsers, first the lexical analyzer, and more recently even
for the syntax parser (getting rid of yacc/bison), certainly for C.

--
http://mail.python.org/mailman/listinfo/python-list


jeanmichel at sequans

Jul 6, 2009, 5:54 AM

Post #39 of 46 (680 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

> protocol = {"start":initialiser,"hunt":hunter,"classify":classifier,....other
> states}
>
> def state_machine():
> next_step = protocol["start"]()
> while True:
> next_step = protocol[next_step]()
>
>
Woot ! I'll keep this one in my mind, while I may not be that concerned
by speed unlike the OP, I still find this way of doing very simple and
so intuitive (one will successfully argue how I was not figuring this
out by myself if it was so intuitive).
Anyway I wanted to participated to this thread, as soon as I saw 'due to
python limitations' in the title, I foretold a hell of a thread ! This
is just provocation ! :-)

JM
--
http://mail.python.org/mailman/listinfo/python-list


mail at microcorp

Jul 6, 2009, 7:04 AM

Post #40 of 46 (687 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

"Jean-Michel Pichavant" <jeanmichel [at] s> wrote:

> Woot ! I'll keep this one in my mind, while I may not be that concerned
> by speed unlike the OP, I still find this way of doing very simple and
> so intuitive (one will successfully argue how I was not figuring this
> out by myself if it was so intuitive).
> Anyway I wanted to participated to this thread, as soon as I saw 'due to
> python limitations' in the title, I foretold a hell of a thread ! This
> is just provocation ! :-)

The OP was not being provocative - he has a real problem, and the
code he is complaining about already does more or less what my
snippet showed, as I rushed in where angels fear to tread...

The bit that was not clearly shown in what I proposed, is that you
should stay in the individual states, testing for the reasons for the
state transitions, until it is time to change - so there is a while loop
in each of the individual states too. It becomes a terribly big structure
if you have a lot of states, it duplicates a lot of tests across the different
states, and it is very awkward if the states nest.

Have a look at the one without the dict too - it is even faster as it
avoids the dict lookup.

That, however, is a bit like assembler code, as it kind of "jumps"
from state to state, and there is no central thing to show what does,
and what does not, belong together, as there is no dict. Not an easy
beast to fix if it's big and it's wrong.

- Hendrik



--
http://mail.python.org/mailman/listinfo/python-list


james at agentultra

Jul 6, 2009, 8:06 AM

Post #41 of 46 (677 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

aahz [at] pythoncraft (Aahz) writes:

> In article <mailman.2639.1246802753.8015.python-list [at] python>,
> Hendrik van Rooyen <mail [at] microcorp> wrote:
>>
>>But wait - maybe if he passes an iterator around - the equivalent of
>>for char in input_stream... Still no good though, unless the next call
>>to the iterator is faster than an ordinary python call.
>
> Calls to iterators created by generators are indeed faster than an
> ordinary Python call, because the stack frame is already mostly set up.

I think Beazely demonstrated this in his talk on using the python 2.5
co-routines to setup an xml parser. I believe he benchmarked it roughly
and the initial results were rather impressive.

http://www.dabeaz.com/coroutines/
--
http://mail.python.org/mailman/listinfo/python-list


jeanmichel at sequans

Jul 6, 2009, 8:24 AM

Post #42 of 46 (672 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Hendrik van Rooyen wrote:
> "Jean-Michel Pichavant" <jeanmichel [at] s> wrote:
>
>
>> Woot ! I'll keep this one in my mind, while I may not be that concerned
>> by speed unlike the OP, I still find this way of doing very simple and
>> so intuitive (one will successfully argue how I was not figuring this
>> out by myself if it was so intuitive).
>> Anyway I wanted to participated to this thread, as soon as I saw 'due to
>> python limitations' in the title, I foretold a hell of a thread ! This
>> is just provocation ! :-)
>>
>
> The OP was not being provocative - he has a real problem,

I was just kidding, asserting for python limitations in this list
guarantees that the thread will last for several days, whether or not
the assertion is right.

JM
--
http://mail.python.org/mailman/listinfo/python-list


nagle at animats

Jul 6, 2009, 11:21 PM

Post #43 of 46 (657 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Steven D'Aprano wrote:
> On Sun, 05 Jul 2009 01:58:13 -0700, Paul Rubin wrote:
>
>> Steven D'Aprano <steve [at] REMOVE-THIS-cybersource> writes:
>>> Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't
>>> see how a case statement would help you here: you're not dispatching on
>>> a value, but running through a series of tests until one passes.
>> A case statement switch(x):... into a bunch of constant case labels
>> would be able to use x as an index into a jump vector, and/or do an
>> unrolled logarithmic (bisection-like) search through the tests, instead
>> of a linear search.
>
> Yes, I'm aware of that, but that's not what John's code is doing -- he's
> doing a series of if expr ... elif expr tests. I don't think a case
> statement can do much to optimize that.

(I didn't write that code; it's from "http://code.google.com/p/html5lib/",
which is a general purpose HTML 5 parser written in Python. It's compatible
with ElementTree and/or BeautifulSoup. I currently use a modified
BeautifulSoup for parsing real-world HTML in a small-scale crawler, and
I'm looking at this as an HTML 5 compatible replacement.)

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


nagle at animats

Jul 6, 2009, 11:31 PM

Post #44 of 46 (658 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Steven D'Aprano wrote:
> On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote:
>
>> Python is not C.
>
> John Nagle is an old hand at Python. He's perfectly aware of this, and
> I'm sure he's not trying to program C in Python.
>
> I'm not entirely sure *what* he is doing, and hopefully he'll speak up
> and say, but whatever the problem is it's not going to be as simple as
> that.

I didn't write this code; I'm just using it. As I said in the
original posting, it's from "http://code.google.com/p/html5lib".
It's from an effort to write a clean HTML 5 parser in Python for
general-purpose use. HTML 5 parsing is well-defined for the awful
cases that make older browsers incompatible, but quite complicated.
The Python implementation here is intended partly as a reference
implementation, so browser writers have something to compare with.

I have a small web crawler robust enough to parse
real-world HTML, which can be appallingly bad. I currently use
an extra-robust version of BeautifulSoup, and even that sometimes
blows up. So I'm very interested in a new Python parser which supposedly
handles bad HTML in the same way browsers do. But if it's slower
than BeautifulSoup, there's a problem.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 7, 2009, 2:09 AM

Post #45 of 46 (658 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

John Nagle wrote:
> I have a small web crawler robust enough to parse
> real-world HTML, which can be appallingly bad. I currently use
> an extra-robust version of BeautifulSoup, and even that sometimes
> blows up. So I'm very interested in a new Python parser which supposedly
> handles bad HTML in the same way browsers do. But if it's slower
> than BeautifulSoup, there's a problem.

Well, if performance matters in any way, you can always use lxml's
blazingly fast parser first, possibly trying a couple of different
configurations, and only if all fail, fall back to running html5lib over
the same input. That should give you a tremendous speed-up over your
current code in most cases, while keeping things robust in the hard cases.

Note the numbers that Ian Bicking has for HTML parser performance:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

You should be able to run lxml's parser ten times in different
configurations (e.g. different charset overrides) before it even reaches
the time that BeautifulSoup would need to parse a document once. Given that
undeclared character set detection is something where BS is a lot better
than lxml, you can also mix the best of both worlds and use BS's character
set detection to configure lxml's parser if you notice that the first
parsing attempts fail.

And yes, html5lib performs pretty badly in comparison (or did, at the
time). But the numbers seem to indicate that if you can drop the ratio of
documents that require a run of html5lib below 30% and use lxml's parser
for the rest, you will still be faster than with BeautifulSoup alone.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


nagle at animats

Jul 7, 2009, 9:40 AM

Post #46 of 46 (666 views)
Permalink
Re: Code that ought to run fast, but can't due to Python limitations. [In reply to]

Stefan Behnel wrote:
> John Nagle wrote:
>> I have a small web crawler robust enough to parse
>> real-world HTML, which can be appallingly bad. I currently use
>> an extra-robust version of BeautifulSoup, and even that sometimes
>> blows up. So I'm very interested in a new Python parser which supposedly
>> handles bad HTML in the same way browsers do. But if it's slower
>> than BeautifulSoup, there's a problem.
>
> Well, if performance matters in any way, you can always use lxml's
> blazingly fast parser first, possibly trying a couple of different
> configurations, and only if all fail, fall back to running html5lib over
> the same input.

Detecting "fail" is difficult. A common problem is badly terminated
comments which eat most of the document if you follow the spec. The
document seems to parse correctly, but most of it is missing. The
HTML 5 spec actually covers things like

<!This is a bogus SGML directive>

and treats it as a bogus comment. (That's because HTML 5 doesn't
include general SGML; the only directive recognized is DOCTYPE.
Anything else after "<!" is treated as a token-level error.)

So using an agreed-upon parsing method, in the form of html5lib,
is desirable, in that it should mimic browser behavior.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list

First page Previous page 1 2 Next page Last page  View All Python python RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.