Code that ought to run fast, but can't due to Python limitations.
Aahz
aahz at pythoncraft.com
Sat Jul 4 23:37:16 EDT 2009
In article <4a501a5e$0$1640$742ec2ed at news.sonic.net>,
John Nagle <nagle at animats.com> wrote:
>
> Here's some actual code, from "tokenizer.py". This is called once
>for each character in an HTML document, when in "data" state (outside
>a tag). It's straightforward code, but look at all those
>dictionary lookups.
>
>    def dataState(self):
>        data = self.stream.char()
>
>        # Keep a charbuffer to handle the escapeFlag
>        if self.contentModelFlag in\
>          (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
>            if len(self.lastFourChars) == 4:
>                self.lastFourChars.pop(0)
>            self.lastFourChars.append(data)
>
>        # The rest of the logic
>        if data == "&" and self.contentModelFlag in\
>          (contentModelFlags["PCDATA"], contentModelFlags["RCDATA"]) and not\
>          self.escapeFlag:
>            self.state = self.states["entityData"]
>        elif data == "-" and self.contentModelFlag in\
>          (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and not\
>          self.escapeFlag and "".join(self.lastFourChars) == "<!--":
>            self.escapeFlag = True
>            self.tokenQueue.append({"type": "Characters", "data":data})
>        elif (data == "<" and (self.contentModelFlag == contentModelFlags["PCDATA"]
>                               or (self.contentModelFlag in
>                                   (contentModelFlags["CDATA"],
>                                    contentModelFlags["RCDATA"]) and
>                                   self.escapeFlag == False))):
>            self.state = self.states["tagOpen"]
>        elif data == ">" and self.contentModelFlag in\
>          (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and\
>          self.escapeFlag and "".join(self.lastFourChars)[1:] == "-->":
>            self.escapeFlag = False
>            self.tokenQueue.append({"type": "Characters", "data":data})
>        elif data == EOF:
>            # Tokenization ends.
>            return False
>        elif data in spaceCharacters:
>            # Directly after emitting a token you switch back to the "data
>            # state". At that point spaceCharacters are important so they are
>            # emitted separately.
>            self.tokenQueue.append({"type": "SpaceCharacters", "data":
>              data + self.stream.charsUntil(spaceCharacters, True)})
>            # No need to update lastFourChars here, since the first space will
>            # have already broken any <!-- or --> sequences
>        else:
>            chars = self.stream.charsUntil(("&", "<", ">", "-"))
>            self.tokenQueue.append({"type": "Characters", "data":
>              data + chars})
>            self.lastFourChars += chars[-4:]
>            self.lastFourChars = self.lastFourChars[-4:]
>        return True
Every single "self." is a dictionary lookup. Were you referring to
those? If not, I don't see your point. If yes, well, that's kind of the
whole point of using Python. You do pay a performance penalty. You can
optimize out some lookups, but you need to switch to C for some kinds of
computationally intensive algorithms. In this case, you can probably get
a large boost out of Psyco or Cython or Pyrex.
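
To make "optimize out some lookups" concrete, here is a rough sketch of the
usual trick: build constant tuples once, and bind attributes to local names
before the hot loop. This is not the html5lib code. The Tokenizer class, the
two count_entities_* methods, ENTITY_MODELS, and the values stored in
contentModelFlags are all made up for illustration; only the lookup pattern
is borrowed from the quoted dataState().

```python
# Hypothetical stand-in for the module-level table used by the quoted code.
contentModelFlags = {"PCDATA": 0, "RCDATA": 1, "CDATA": 2, "PLAINTEXT": 3}

# Built once at import time instead of once per character.
ENTITY_MODELS = (contentModelFlags["PCDATA"], contentModelFlags["RCDATA"])


class Tokenizer:
    """Toy class; only the lookup pattern mirrors the quoted dataState()."""

    def __init__(self, flag):
        self.contentModelFlag = flag

    def count_entities_slow(self, text):
        n = 0
        for ch in text:
            # Per character: one attribute lookup, two global lookups of
            # contentModelFlags, two dict subscripts, and a fresh tuple.
            if self.contentModelFlag in (
                    contentModelFlags["PCDATA"],
                    contentModelFlags["RCDATA"]) and ch == "&":
                n += 1
        return n

    def count_entities_fast(self, text):
        # Hoist the lookups into locals. This is only valid because nothing
        # inside the loop changes self.contentModelFlag.
        flag = self.contentModelFlag
        entity_models = ENTITY_MODELS
        n = 0
        for ch in text:
            if flag in entity_models and ch == "&":
                n += 1
        return n
```

Both methods return the same count; how much the second one saves depends on
the interpreter and the input, so measure it (the timeit module works) before
relying on it. In the real tokenizer the state changes between calls, so any
cached value would have to be refreshed wherever the flag is assigned.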
--
Aahz (aahz at pythoncraft.com) <*> http://www.pythoncraft.com/
"as long as we like the same operating system, things are cool." --piranha