Mailman 3 HTML Parser? - pypy-dev - python.org


HTML Parser?


Joe Hillenbrand

Feb. 20, 2013

5:50 p.m.

What is the recommended HTML parser to run in PyPy? The typical go-to for Python is lxml, but of course that doesn't work with PyPy. Has anyone tested any other libraries? Are there any benchmarks?

Thanks,
-Joe


Amaury Forgeot d'Arc

February 2013

6:02 p.m.

2013/2/20 Joe Hillenbrand <joehillen@gmail.com>

What is the recommended HTML parser to run in PyPy?

The typical goto for Python is lxml, but of course that doesn't work with PyPy.

This is not true anymore. There has been a lot of work on both sides to make lxml work with PyPy. You should try the latest versions.

In addition, there is a port of lxml that uses neither Cython nor the C API: https://github.com/amauryfa/lxml/tree/lxml-cffi

Most of the tests pass (except objectify), but "setup.py install" does not work yet. It works from the source tree, though.

-- Amaury Forgeot d'Arc


Maciej Fijalkowski

6:07 p.m.


Does it work with the released cffi, or with the in-development cffi, or does it need patches?


Amaury Forgeot d'Arc

7:28 p.m.

2013/2/20 Maciej Fijalkowski <fijall@gmail.com>

Is it working on released cffi or on cffi that's in-development or you need patches?

I developed it with a nightly build from mid-January, and the cffi library that was available at the time. It's now released as cffi 0.5, I think.

I did not test with CPython at all.

At the time cffi used to return enum values as strings, but I just tested with the latest version of cffi and a PyPy nightly build, and the tests still pass!

    Ran 1006 tests in 34.730s
    FAILED (failures=1)

and the only failure is::

    self.assertTrue(hasattr(self.etree, '_import_c_api'))

:-)

-- Amaury Forgeot d'Arc


Joe Hillenbrand

6:39 a.m.

Great to hear! I just got it working with scrapy. Unfortunately there wasn't any speedup.

A normal crawl in CPython takes:

    real 1m32.238s
    user 0m56.576s
    sys  0m1.208s

In PyPy:

    real 1m54.098s
    user 1m18.105s
    sys  0m1.372s

Thanks for all your hard work.

-Joe
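For reference, the ratios implied by the crawl timings above can be worked out with a few lines of Python. This is only an illustrative sketch: the helper names (`to_seconds`, `slowdown`) are made up here, and the figures are copied from Joe's message.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '1m32.238s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Figures copied from the crawl timings reported in the thread.
cpython = {"real": "1m32.238s", "user": "0m56.576s", "sys": "0m1.208s"}
pypy = {"real": "1m54.098s", "user": "1m18.105s", "sys": "0m1.372s"}


def slowdown(kind):
    """Ratio of PyPy time to CPython time for one row of the output."""
    return to_seconds(pypy[kind]) / to_seconds(cpython[kind])


for kind in ("real", "user", "sys"):
    print(f"{kind}: PyPy/CPython = {slowdown(kind):.2f}")
```

The ratios come out at about 1.24 for real, 1.38 for user, and 1.14 for sys, so the PyPy crawl took roughly 24% longer in wall-clock time and about 38% more user CPU time.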


Maciej Fijalkowski

10:19 a.m.


lxml-cffi is known to be slower than normal lxml. You'll probably get speedups if you start doing non-trivial logic in Python. For what it's worth, cffi is missing a lot of trivial optimizations (and one non-trivial one), so there is a lot of room for improvement.
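One way to check whether the Python-side logic (as opposed to the parsing done by lxml) benefits from PyPy is to time a pure-Python step on its own and run the same script under both interpreters. The sketch below is an assumption-laden stand-in: `count_words` and `time_it` are hypothetical helpers, not part of scrapy or lxml, and a real test would time the crawler's own processing code.

```python
import platform
import time


def count_words(text):
    """Pure-Python word count: the kind of loop a JIT may speed up."""
    counts = {}
    for word in text.split():
        word = word.strip(".,;:!?").lower()
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts


def time_it(func, arg, repeat=5):
    """Return the best wall-clock time of `repeat` calls to func(arg)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Synthetic input; replace with data representative of the real workload.
    sample = "The quick brown fox jumps over the lazy dog. " * 20000
    elapsed = time_it(count_words, sample)
    print(f"{platform.python_implementation()}: {elapsed:.3f}s")
```

Running the same file with `python` and with `pypy` prints the interpreter name next to the best time, which makes the two results easy to compare.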


Armin Rigo

7:29 p.m.

Hi all,

Just so everybody knows, the plan is to release CFFI 0.6 at the latest when we do the PyPy 2.0 release, and to include it fully inside PyPy too. (The idea is to avoid "pip install cffi", which would get a potentially incompatible version: PyPy includes the "_cffi_backend" module, which only works with a specific version of CFFI.)

A bientôt,

Armin.


Alex Gaynor

7:39 p.m.

Are we also planning to bundle ply and cparser?

Alex


Maciej Fijalkowski

8:03 p.m.

On Wed, Feb 20, 2013 at 9:29 PM, Armin Rigo <arigo@tunes.org> wrote:

Hi all,

Just so everybody knows, the plan is to release CFFI 0.6 latest when we do the PyPy 2.0 release, and include it fully inside PyPy too. (The idea is to avoid "pip install cffi", which would get a potentially incompatible version: PyPy includes the "_cffi_backend" module, which only works with a specific version of CFFI).

A bientôt,

Armin.

One thing we have to consider is how to write setup.py (or requirements.txt) when you need to install cffi on CPython but not on PyPy.
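A minimal sketch of one possible answer, assuming the project only needs to skip the cffi dependency when running on PyPy: `platform.python_implementation()` from the standard library reports the interpreter, and PEP 508 environment markers can express the same condition declaratively. The function name `install_requires` and the project name in the comment are hypothetical, and this is not something the thread itself settled on.

```python
import platform


def install_requires(implementation=None):
    """Return the dependency list, adding cffi everywhere except on PyPy."""
    if implementation is None:
        implementation = platform.python_implementation()
    requires = []  # other (hypothetical) dependencies would go here
    if implementation != "PyPy":
        requires.append("cffi")
    return requires


# Declarative equivalent as a PEP 508 environment marker, usable in
# install_requires or requirements.txt with tools that support markers.
CFFI_REQUIREMENT = 'cffi; platform_python_implementation != "PyPy"'

# In setup.py the list would be passed along as, for example:
#     setup(name="myproject", install_requires=install_requires())
```

The marker form has the advantage that the decision is made by the installer on the target interpreter, whereas the function form decides when setup.py runs.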



participants (5)

Alex Gaynor
Amaury Forgeot d'Arc
Armin Rigo
Joe Hillenbrand
Maciej Fijalkowski