[lxml-dev] lxml.html.parse() should behave like lxml.etree.parse()
Hi,

I noticed that lxml.html.parse() currently takes a string argument that is interpreted as HTML code. However, lxml.etree.parse() also takes a string, which is interpreted as a filename. I think we should not diverge from the other lxml APIs here. At least, it surprised me when I called etree.tostring(html.parse("doc/html/api.html")) and got "<p>doc/html/api.html</p>" as a result.

I really want lxml to stay an integrated set of tools, things that work together smoothly, and a common base API is very important here. I don't mind Elements having different APIs in different packages (that's the main idea after all), but I would like to keep functions and methods with similar names semantically close wherever possible.

It's hard to come up with a good name for the functions, though, as the function that comes closest is called HTML(). Not really a perfect name for a function (but OK for a factory). What about "parse_chunk()" or "parse_string()"? Then the other functions would become something like "parse_string_element()" and "parse_string_elements()". I know that's long, but I wouldn't mind, since the meaning is clear and most of the time you'd use "parse_string()" anyway.

We could then add a "parse()" function that basically does what the current "parse()" function does, but for files as input.

Would that be OK for you?

Stefan
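To make the proposed split concrete, here is a minimal sketch of the convention being asked for: one function that treats its argument as markup and one that treats it as a file. It uses the standard library's xml.etree.ElementTree rather than lxml.html, so it only illustrates the naming convention, not lxml's HTML parsing. The name parse_string() is the one floated in the message above and is hypothetical, as is the demo() helper.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_string(text):
    """Hypothetical name from the proposal: the argument is markup."""
    return ET.fromstring(text)


def parse(source):
    """The argument is a filename or file object, as with etree.parse()."""
    return ET.parse(source).getroot()


def demo():
    """Parse the same markup once from a string and once from a file."""
    markup = "<p>hello</p>"
    # Write the markup to a temporary file so both entry points see the same content.
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as handle:
        handle.write(markup)
        path = handle.name
    try:
        from_text = parse_string(markup)
        from_file = parse(path)
    finally:
        os.remove(path)
    return from_text.tag, from_text.text, from_file.tag, from_file.text
```

With this split, a path passed to parse() is always opened as a file. Passing the same path to parse_string() is a visible mistake: the strict XML parser used in this sketch raises a ParseError instead of quietly wrapping the path in a paragraph.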
Stefan Behnel wrote:
+1 from me.
--
Tres Seaver +1 540-429-0999 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
Stefan Behnel wrote:
Sure; I wasn't even aware of the existing parse function. I'm not intimately familiar with all of the ET or lxml API, so I've made things up at times in the hope that someone will correct me if there's a better name, etc. So mission accomplished?
Yeah, I'm not that happy with all the different parsing functions. I wish HTML() itself worked a little more cleanly, but I guess there are real differing expectations built in. If you want an HTML page, you want what HTML() currently does, which is to interpret the string like the browser does. If you are dealing with fragments, what HTML() does is very annoying. And even what HTML() does is a tad weird, because it can give you a page without a head sometimes, or without a body other times, depending on what you pass in. It does some normalization, but not as much as I'd want (if I actually want normalization).

All of which leads to more options than I like, and I know that even internally I often choose arbitrarily which parsing function to use.

--
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org
Write code, do good | http://topp.openplans.org/careers
participants (3)
- Ian Bicking
- Stefan Behnel
- Tres Seaver