[lxml-dev] HTMLParser status and issues
Howdy. I was giving the htmlparser branch a try. In trying to compile it, I got:

    python setup.py build_ext -i
    running build_ext
    building 'lxml.etree' extension
    gcc -fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd -fno-common -dynamic -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -I/Library/Frameworks/Python.framework/Versions/2.4/include/python2.4 -c src/lxml/etree.c -o build/temp.darwin-8.6.0-Power_Macintosh-2.4/src/lxml/etree.o -w -I/usr/include/libxml2
    src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__':
    src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function)
    src/lxml/etree.c:17245: error: (Each undeclared identifier is reported only once
    src/lxml/etree.c:17245: error: for each function it appears in.)
    src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function)
    src/lxml/etree.c: In function 'initetree':
    src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function)
    src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function)
    error: command 'gcc' failed with exit status 1

--Paul
Forgot to ask the question about status. :^)

First, there are two branches:

http://codespeak.net/svn/lxml/branch/htmlparse/
http://codespeak.net/svn/lxml/branch/htmlparser/

I'm presuming the latter is the one I want. Perhaps the former should get renamed to something less of a decoy?

Next, once I get the parser working, I'd also like to use extensions as described here:

http://codespeak.net/svn/lxml/trunk/doc/extensions.txt

However, the htmlparser branch is older than the extensions work (I believe). Stefan, any chance the htmlparser branch could get the changes from the trunk?

I'm particularly eager to get this combination working. The pipeline templating stuff I'm working on needs to handle non-well-formed HTML. It also needs a workaround for the fact that DOCTYPE (and encoding) information isn't available in the parse tree and thus isn't available in an XSLT template.

As a workaround, I'd like to retrieve the information out-of-band and make it available as an extension function.

--Paul

Paul Everitt wrote:
> Howdy. I was giving the htmlparser branch a try. In trying to compile it, I got:
>
> python setup.py build_ext -i
> running build_ext
> building 'lxml.etree' extension
> gcc -fno-strict-aliasing -Wno-long-double -no-cpp-precomp -mno-fused-madd -fno-common -dynamic -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -I/Library/Frameworks/Python.framework/Versions/2.4/include/python2.4 -c src/lxml/etree.c -o build/temp.darwin-8.6.0-Power_Macintosh-2.4/src/lxml/etree.o -w -I/usr/include/libxml2
> src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__':
> src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function)
> src/lxml/etree.c:17245: error: (Each undeclared identifier is reported only once
> src/lxml/etree.c:17245: error: for each function it appears in.)
> src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function)
> src/lxml/etree.c: In function 'initetree':
> src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function)
> src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function)
> error: command 'gcc' failed with exit status 1
>
> --Paul
Hi Paul,

Paul Everitt wrote:
> First, there are two branches:
>
> http://codespeak.net/svn/lxml/branch/htmlparse/
> http://codespeak.net/svn/lxml/branch/htmlparser/
>
> I'm presuming the latter is the one I want.

Yes. I actually created that branch before I noticed that there already was a branch called "htmlparse"...

> Perhaps the former should get renamed to something less of a decoy?

Would be better, yes. Anyway, if "htmlparser" gets merged into the trunk, that won't matter too much...

> Next, once I get the parser working, I'd also like to use extensions as described here:
>
> http://codespeak.net/svn/lxml/trunk/doc/extensions.txt
>
> However, the htmlparser branch is older than the extensions work (I believe). Stefan, any chance the htmlparser branch could get the changes from the trunk?

Hmm, they should actually be in the branch. I merged them a while ago in order to make the diff usable.

> I'm particularly eager to get this combination working. The pipeline templating stuff I'm working on needs to handle non-well-formed HTML. It also needs a workaround for the fact that DOCTYPE (and encoding) information isn't available in the parse tree and thus isn't available in an XSLT template.
>
> As a workaround, I'd like to retrieve the information out-of-band and make it available as an extension function.

Just try, it should work. The more you test the branch, the faster it can be merged into the trunk. Then you will have everything in there that the current trunk supports.

Stefan
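The out-of-band retrieval Paul describes could be sketched roughly as follows. This is an illustration under assumptions, not code from the branch: the helper name `sniff_doctype_and_charset` and both regular expressions are hypothetical, the DOCTYPE pattern ignores internal subsets, and the sketch only scans the raw markup text before any parser sees it. Registering the result as an extension function (per extensions.txt) is deliberately left out.

```python
import re

# Hypothetical helper, not part of lxml: pulls the DOCTYPE declaration and a
# declared charset straight out of the raw markup, before any parsing happens.
_DOCTYPE_RE = re.compile(r"<!DOCTYPE\b[^>]*>", re.IGNORECASE)
_CHARSET_RE = re.compile(r"<meta\b[^>]*\bcharset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def sniff_doctype_and_charset(markup):
    """Return (doctype, charset); each item is None when it is not found."""
    doctype = _DOCTYPE_RE.search(markup)
    charset = _CHARSET_RE.search(markup)
    return (
        doctype.group(0) if doctype else None,
        charset.group(1) if charset else None,
    )


# Deliberately non-well-formed sample: the <p> element is never closed.
sample = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">\n'
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-1"></head>'
    "<body><p>unclosed</body></html>"
)
doctype, charset = sniff_doctype_and_charset(sample)
print(doctype)
print(charset)
```

Because the helper works on text rather than on the parse tree, it does not depend on what the HTML parser keeps or drops; the trade-off is that it trusts whatever the document declares about itself.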
Paul Everitt wrote:
> Howdy. I was giving the htmlparser branch a try. In trying to compile it, I got:
>
> src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__':
> src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function)
> src/lxml/etree.c:17245: error: (Each undeclared identifier is reported only once
> src/lxml/etree.c:17245: error: for each function it appears in.)
> src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function)
> src/lxml/etree.c: In function 'initetree':
> src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function)
> src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function)
> error: command 'gcc' failed with exit status 1

Hmm, I don't see a reason for that error. My clean checkout compiles nicely.

What's your libxml2 version on MacOS? In my include/libxml2/HTMLparser.h it says somewhere around line 175:

    typedef enum {
        HTML_PARSE_RECOVER  = 1<<0,  /* Relaxed parsing */
        HTML_PARSE_NOERROR  = 1<<5,  /* suppress error reports */
        HTML_PARSE_NOWARNING= 1<<6,  /* suppress warning reports */
        HTML_PARSE_PEDANTIC = 1<<7,  /* pedantic error reporting */
        HTML_PARSE_NOBLANKS = 1<<8,  /* remove blank nodes */
        HTML_PARSE_NONET    = 1<<11, /* Forbid network access */
        HTML_PARSE_COMPACT  = 1<<16  /* compact small text nodes */
    } htmlParserOption;

All options known in my place - but then, that's libxml 2.6.23 ...

If the above enum contains the variables in your system, would you mind sending me the etree.c that Pyrex generated for you?

Stefan
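The header check Stefan asks for can be done mechanically. Below is a minimal sketch, assuming only that the two names from the build log should appear somewhere in the HTMLparser.h that gcc actually reads; the function name `missing_options` is hypothetical, and the real header path on a given machine is an assumption to verify (the gcc command above passes -I/usr/include/libxml2, which may or may not belong to the same libxml2 installation that other tools report).

```python
import re
import tempfile
from pathlib import Path

# The two names gcc reported as undeclared in the build log above.
REQUIRED = ("HTML_PARSE_RECOVER", "HTML_PARSE_COMPACT")


def missing_options(header_path, required=REQUIRED):
    """Return the names in `required` that never occur as whole words in the header."""
    text = Path(header_path).read_text(errors="replace")
    return [
        name for name in required
        if not re.search(r"\b%s\b" % re.escape(name), text)
    ]


# Demo on a stand-in header that lacks both options. On a real system, pass
# the HTMLparser.h that the -I flag resolves to (that path is an assumption).
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "HTMLparser.h"
    stand_in.write_text(
        "typedef enum { HTML_PARSE_NOERROR = 1<<5 } htmlParserOption;\n"
    )
    demo_missing = missing_options(stand_in)
    print(demo_missing)
```

An empty list means both names are present in that header, in which case the header on the include path is probably not the culprit.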
Stefan Behnel wrote:
> Paul Everitt wrote:
>> Howdy. I was giving the htmlparser branch a try. In trying to compile it, I got:
>>
>> src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__':
>> src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function)
>> src/lxml/etree.c:17245: error: (Each undeclared identifier is reported only once
>> src/lxml/etree.c:17245: error: for each function it appears in.)
>> src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function)
>> src/lxml/etree.c: In function 'initetree':
>> src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function)
>> src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function)
>> error: command 'gcc' failed with exit status 1
>
> Hmm, I don't see a reason for that error. My clean checkout compiles nicely.
>
> What's your libxml2 version on MacOS? In my include/libxml2/HTMLparser.h it says somewhere around line 175:

    $ xmllint --version
    xmllint: using libxml version 20622

You're not OS X, right?

> typedef enum {
>     HTML_PARSE_RECOVER  = 1<<0,  /* Relaxed parsing */
>     HTML_PARSE_NOERROR  = 1<<5,  /* suppress error reports */
>     HTML_PARSE_NOWARNING= 1<<6,  /* suppress warning reports */
>     HTML_PARSE_PEDANTIC = 1<<7,  /* pedantic error reporting */
>     HTML_PARSE_NOBLANKS = 1<<8,  /* remove blank nodes */
>     HTML_PARSE_NONET    = 1<<11, /* Forbid network access */
>     HTML_PARSE_COMPACT  = 1<<16  /* compact small text nodes */
> } htmlParserOption;
>
> All options known in my place - but then, that's libxml 2.6.23 ...

That will be kinda funny if .22 is the smoking gun. ;^)

> If the above enum contains the variables in your system, would you mind sending me the etree.c that Pyrex generated for you?

Yep, I'll send it in a private note. Thanks!

--Paul
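For readers comparing the "20622" that xmllint prints with the dotted "2.6.23" Stefan mentions: libxml2 packs its version as major*10000 + minor*100 + micro. A small sketch of the decoding (the function name `decode_libxml_version` is made up for illustration):

```python
def decode_libxml_version(number):
    """Split a packed libxml2 version such as 20622 into (major, minor, micro)."""
    major, rest = divmod(number, 10000)
    minor, micro = divmod(rest, 100)
    return major, minor, micro


# The value from the xmllint output quoted above.
print(decode_libxml_version(20622))  # → (2, 6, 22)
```

So the xmllint on Paul's machine reports 2.6.22, one micro release behind the 2.6.23 on Stefan's.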
Paul Everitt wrote:
> Stefan Behnel wrote:
>> Paul Everitt wrote:
>>> Howdy. I was giving the htmlparser branch a try. In trying to compile it, I got:
>>>
>>> src/lxml/etree.c: In function '__pyx_f_5etree_10HTMLParser___init__':
>>> src/lxml/etree.c:17245: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function)
>>> src/lxml/etree.c:17245: error: (Each undeclared identifier is reported only once
>>> src/lxml/etree.c:17245: error: for each function it appears in.)
>>> src/lxml/etree.c:17256: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function)
>>> src/lxml/etree.c: In function 'initetree':
>>> src/lxml/etree.c:31135: error: 'HTML_PARSE_RECOVER' undeclared (first use in this function)
>>> src/lxml/etree.c:31135: error: 'HTML_PARSE_COMPACT' undeclared (first use in this function)
>>> error: command 'gcc' failed with exit status 1
>>
>> Hmm, I don't see a reason for that error. My clean checkout compiles nicely.
>>
>> What's your libxml2 version on MacOS? In my include/libxml2/HTMLparser.h it says somewhere around line 175:
>
> $ xmllint --version
> xmllint: using libxml version 20622
>
> You're not OS X, right?

I'm on Linux. 2.6.22 should work perfectly, I just checked.

>> typedef enum {
>>     HTML_PARSE_RECOVER  = 1<<0,  /* Relaxed parsing */
>>     HTML_PARSE_NOERROR  = 1<<5,  /* suppress error reports */
>>     HTML_PARSE_NOWARNING= 1<<6,  /* suppress warning reports */
>>     HTML_PARSE_PEDANTIC = 1<<7,  /* pedantic error reporting */
>>     HTML_PARSE_NOBLANKS = 1<<8,  /* remove blank nodes */
>>     HTML_PARSE_NONET    = 1<<11, /* Forbid network access */
>>     HTML_PARSE_COMPACT  = 1<<16  /* compact small text nodes */
>> } htmlParserOption;
>>
>> All options known in my place - but then, that's libxml 2.6.23 ...
>
> That will be kinda funny if .22 is the smoking gun. ;^)
>
>> If the above enum contains the variables in your system, would you mind sending me the etree.c that Pyrex generated for you?
>
> Yep, I'll send it in a private note. Thanks!

Thanks. I really can't see a problem in there. Maybe it's a compiler issue. I rewrote a part that might have shown a different usage of those two enum values. Could you retry with the current SVN?

Stefan
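As background on "those two enum values": the members of the enum quoted above are single-bit flags meant to be OR-ed together into one options integer. The sketch below mirrors the quoted values in a Python `IntFlag` purely for illustration; the class name `HtmlParseOption` is hypothetical and this is not how the branch or lxml represents them.

```python
from enum import IntFlag


class HtmlParseOption(IntFlag):
    """Illustrative mirror of the htmlParserOption values quoted in the thread."""
    RECOVER = 1 << 0     # relaxed parsing
    NOERROR = 1 << 5     # suppress error reports
    NOWARNING = 1 << 6   # suppress warning reports
    PEDANTIC = 1 << 7    # pedantic error reporting
    NOBLANKS = 1 << 8    # remove blank nodes
    NONET = 1 << 11      # forbid network access
    COMPACT = 1 << 16    # compact small text nodes


# Combine the two flags the compiler complained about.
options = HtmlParseOption.RECOVER | HtmlParseOption.COMPACT
print(int(options))  # → 65537
print(HtmlParseOption.COMPACT in options)
print(HtmlParseOption.NONET in options)
```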
Participants (2): Paul Everitt, Stefan Behnel