[lxml-dev] Benchmark results on parse/iterparse

Hi all,
I extended bench.py to support benchmarks on serialised XML data, so now there are micro-benchmarks for the plain parser performance. I attached some results which are pretty impressive.
Expat claims to be the fastest XML parser on earth - and it looks like there's some truth in that statement. cET (1.0.5) beats lxml (trunk and iterparse branch) in virtually every parser benchmark. It's even up to 80% faster for large trees. On the other hand, cET is still much slower for serialisation (up to 30x!), so lxml easily wins the round-trip benchmarks for reading and writing XML and is some 3-4x faster at the end.
For the iterparse benchmarks, lxml suffers slightly from its object creation overhead (and maybe also the young implementation), but the overhead is acceptable compared to the overall parse time. As expected, the additional overhead for cET is extremely small here, it's almost as fast as the plain parser.
So, for (typical?) iterparse() applications that extract small amounts of data from large XML streams on the fly, cET is pretty much unbeatable. For everything else, lxml is still my favourite. :)
Stefan
Preparing test suites and trees ...
Running benchmark on lxe, cET, ET
Setup times for trees in seconds:
lxe:      --     S-     U-     -A     SA     UA
  T1: 0.1323 0.1295 0.1309 0.1400 0.1294 0.1281
  T2: 0.1345 0.1326 0.1324 0.1439 0.1443 0.1436
  T3: 0.0358 0.0286 0.0290 0.0872 0.0852 0.0867
  T4: 0.0007 0.0006 0.0006 0.0018 0.0018 0.0018
cET:      --     S-     U-     -A     SA     UA
  T1: 0.0427 0.0433 0.0421 0.0418 0.0426 0.0425
  T2: 0.0440 0.0426 0.0448 0.0496 0.0424 0.0445
  T3: 0.0109 0.0102 0.0110 0.0185 0.0146 0.0148
  T4: 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
ET :      --     S-     U-     -A     SA     UA
  T1: 0.2288 0.2882 0.2279 0.2701 0.3036 0.2273
  T2: 0.2972 0.2352 0.2890 0.3218 0.2337 0.3112
  T3: 0.0533 0.0572 0.0534 0.0616 0.0573 0.0949
  T4: 0.0009 0.0009 0.0008 0.0008 0.0009 0.0009
lxe: iterparse_stringIO (S-X T1 ) 222.9110 msec/pass, best of ( 227.3209 224.8156 222.9110 )
cET: iterparse_stringIO (S-X T1 ) 38.3651 msec/pass, best of ( 38.6973 38.3651 38.4331 )
ET : iterparse_stringIO (S-X T1 ) 362.5654 msec/pass, best of ( 399.9243 364.9337 362.5654 )
lxe: iterparse_stringIO (SAX T1 ) 223.7216 msec/pass, best of ( 229.2649 225.4670 223.7216 )
cET: iterparse_stringIO (SAX T1 ) 38.7985 msec/pass, best of ( 38.8003 38.7985 39.0118 )
ET : iterparse_stringIO (SAX T1 ) 365.6647 msec/pass, best of ( 365.7858 368.4628 365.6647 )
lxe: iterparse_stringIO (U-X T1 ) 223.2167 msec/pass, best of ( 228.4652 225.6853 223.2167 )
cET: iterparse_stringIO (U-X T1 ) 38.6407 msec/pass, best of ( 38.9079 38.6407 38.7212 )
ET : iterparse_stringIO (U-X T1 ) 361.4145 msec/pass, best of ( 361.5446 362.3198 361.4145 )
lxe: iterparse_stringIO (UAX T1 ) 223.9983 msec/pass, best of ( 230.2715 224.7750 223.9983 )
cET: iterparse_stringIO (UAX T1 ) 38.9739 msec/pass, best of ( 39.1283 38.9739 38.9918 )
ET : iterparse_stringIO (UAX T1 ) 365.9984 msec/pass, best of ( 365.9984 368.7172 367.6013 )
lxe: iterparse_stringIO (S-X T2 ) 232.9033 msec/pass, best of ( 237.3149 233.7498 232.9033 )
cET: iterparse_stringIO (S-X T2 ) 41.1875 msec/pass, best of ( 41.4992 41.2292 41.1875 )
ET : iterparse_stringIO (S-X T2 ) 383.4325 msec/pass, best of ( 383.8255 386.8770 383.4325 )
lxe: iterparse_stringIO (SAX T2 ) 248.6702 msec/pass, best of ( 262.2535 251.9383 248.6702 )
cET: iterparse_stringIO (SAX T2 ) 46.2867 msec/pass, best of ( 46.3457 46.2998 46.2867 )
ET : iterparse_stringIO (SAX T2 ) 406.4786 msec/pass, best of ( 407.3699 406.4786 408.3880 )
lxe: iterparse_stringIO (U-X T2 ) 232.0476 msec/pass, best of ( 239.1271 232.3972 232.0476 )
cET: iterparse_stringIO (U-X T2 ) 41.0636 msec/pass, best of ( 41.0744 41.0636 41.1089 )
ET : iterparse_stringIO (U-X T2 ) 383.7441 msec/pass, best of ( 383.8562 384.8858 383.7441 )
lxe: iterparse_stringIO (UAX T2 ) 250.2190 msec/pass, best of ( 255.8399 250.2190 250.2771 )
cET: iterparse_stringIO (UAX T2 ) 46.3599 msec/pass, best of ( 46.5231 46.3599 47.1866 )
ET : iterparse_stringIO (UAX T2 ) 406.6847 msec/pass, best of ( 410.7563 406.6847 415.3738 )
lxe: iterparse_stringIO (S-X T3 ) 21.0934 msec/pass, best of ( 21.1457 21.0934 21.1917 )
cET: iterparse_stringIO (S-X T3 ) 9.9643 msec/pass, best of ( 9.9643 9.9952 9.9998 )
ET : iterparse_stringIO (S-X T3 ) 119.0730 msec/pass, best of ( 119.0730 119.2219 119.3496 )
lxe: iterparse_stringIO (SAX T3 ) 46.2915 msec/pass, best of ( 51.1358 46.5834 46.2915 )
cET: iterparse_stringIO (SAX T3 ) 41.7721 msec/pass, best of ( 41.7721 43.1893 41.8128 )
ET : iterparse_stringIO (SAX T3 ) 252.5569 msec/pass, best of ( 253.2359 252.5569 257.7912 )
lxe: iterparse_stringIO (U-X T3 ) 21.2006 msec/pass, best of ( 21.6766 21.2006 21.6195 )
cET: iterparse_stringIO (U-X T3 ) 10.1142 msec/pass, best of ( 10.1593 10.1525 10.1142 )
ET : iterparse_stringIO (U-X T3 ) 118.9059 msec/pass, best of ( 119.9316 118.9059 119.0838 )
lxe: iterparse_stringIO (UAX T3 ) 46.0920 msec/pass, best of ( 50.7096 46.3383 46.0920 )
cET: iterparse_stringIO (UAX T3 ) 41.7240 msec/pass, best of ( 41.9300 41.7240 41.9742 )
ET : iterparse_stringIO (UAX T3 ) 253.6689 msec/pass, best of ( 253.6689 253.6885 255.1465 )
lxe: iterparse_stringIO (S-X T4 ) 1.0302 msec/pass, best of ( 1.1030 1.0365 1.0302 )
cET: iterparse_stringIO (S-X T4 ) 0.3072 msec/pass, best of ( 0.3244 0.3115 0.3072 )
ET : iterparse_stringIO (S-X T4 ) 2.8629 msec/pass, best of ( 2.8674 2.8629 2.9106 )
lxe: iterparse_stringIO (SAX T4 ) 1.6899 msec/pass, best of ( 1.6899 1.6941 1.6969 )
cET: iterparse_stringIO (SAX T4 ) 0.9437 msec/pass, best of ( 0.9663 0.9437 0.9659 )
ET : iterparse_stringIO (SAX T4 ) 5.9165 msec/pass, best of ( 5.9764 5.9165 5.9525 )
lxe: iterparse_stringIO (U-X T4 ) 1.0378 msec/pass, best of ( 1.0475 1.0444 1.0378 )
cET: iterparse_stringIO (U-X T4 ) 0.3046 msec/pass, best of ( 0.3141 0.3144 0.3046 )
ET : iterparse_stringIO (U-X T4 ) 2.8697 msec/pass, best of ( 2.8855 2.8799 2.8697 )
lxe: iterparse_stringIO (UAX T4 ) 1.7076 msec/pass, best of ( 1.7076 1.7481 1.7081 )
cET: iterparse_stringIO (UAX T4 ) 0.9469 msec/pass, best of ( 0.9611 0.9520 0.9469 )
ET : iterparse_stringIO (UAX T4 ) 6.0304 msec/pass, best of ( 6.0437 6.0304 6.0345 )
lxe: iterparse_stringIO_clear (S-X T1 ) 234.3113 msec/pass, best of ( 234.3113 246.4326 256.5073 )
cET: iterparse_stringIO_clear (S-X T1 ) 45.1311 msec/pass, best of ( 45.1311 46.6361 45.1353 )
ET : iterparse_stringIO_clear (S-X T1 ) 398.4236 msec/pass, best of ( 411.1450 417.0196 398.4236 )
lxe: iterparse_stringIO_clear (SAX T1 ) 232.9236 msec/pass, best of ( 234.6838 235.7229 232.9236 )
cET: iterparse_stringIO_clear (SAX T1 ) 44.5997 msec/pass, best of ( 44.5997 44.6131 44.6254 )
ET : iterparse_stringIO_clear (SAX T1 ) 391.3037 msec/pass, best of ( 391.3037 392.3368 391.4310 )
lxe: iterparse_stringIO_clear (U-X T1 ) 230.3221 msec/pass, best of ( 240.0502 230.3221 230.4537 )
cET: iterparse_stringIO_clear (U-X T1 ) 44.5020 msec/pass, best of ( 44.5403 44.5163 44.5020 )
ET : iterparse_stringIO_clear (U-X T1 ) 392.5360 msec/pass, best of ( 392.5360 392.7761 392.8283 )
lxe: iterparse_stringIO_clear (UAX T1 ) 232.4709 msec/pass, best of ( 233.1237 232.5471 232.4709 )
cET: iterparse_stringIO_clear (UAX T1 ) 44.6517 msec/pass, best of ( 44.8149 44.6950 44.6517 )
ET : iterparse_stringIO_clear (UAX T1 ) 391.4469 msec/pass, best of ( 393.7384 393.8401 391.4469 )
lxe: iterparse_stringIO_clear (S-X T2 ) 237.8001 msec/pass, best of ( 244.5833 239.2776 237.8001 )
cET: iterparse_stringIO_clear (S-X T2 ) 46.5421 msec/pass, best of ( 46.5820 46.5421 46.6870 )
ET : iterparse_stringIO_clear (S-X T2 ) 404.6504 msec/pass, best of ( 405.8105 404.6504 406.5170 )
lxe: iterparse_stringIO_clear (SAX T2 ) 256.2644 msec/pass, best of ( 256.2644 256.6346 256.5515 )
cET: iterparse_stringIO_clear (SAX T2 ) 51.3243 msec/pass, best of ( 51.4606 51.3311 51.3243 )
ET : iterparse_stringIO_clear (SAX T2 ) 428.6695 msec/pass, best of ( 428.6695 429.9673 429.0031 )
lxe: iterparse_stringIO_clear (U-X T2 ) 238.2270 msec/pass, best of ( 238.2270 238.4732 238.4163 )
cET: iterparse_stringIO_clear (U-X T2 ) 46.8345 msec/pass, best of ( 46.8345 46.9356 47.8009 )
ET : iterparse_stringIO_clear (U-X T2 ) 407.4448 msec/pass, best of ( 407.4448 417.0420 412.9830 )
lxe: iterparse_stringIO_clear (UAX T2 ) 255.1546 msec/pass, best of ( 255.1546 256.3821 258.1754 )
cET: iterparse_stringIO_clear (UAX T2 ) 51.4117 msec/pass, best of ( 51.7307 51.6235 51.4117 )
ET : iterparse_stringIO_clear (UAX T2 ) 427.9256 msec/pass, best of ( 428.1150 435.2808 427.9256 )
lxe: iterparse_stringIO_clear (S-X T3 ) 24.3242 msec/pass, best of ( 24.3841 24.3453 24.3242 )
cET: iterparse_stringIO_clear (S-X T3 ) 11.3448 msec/pass, best of ( 11.5487 11.3886 11.3448 )
ET : iterparse_stringIO_clear (S-X T3 ) 127.7567 msec/pass, best of ( 128.1610 127.7567 128.4989 )
lxe: iterparse_stringIO_clear (SAX T3 ) 53.2374 msec/pass, best of ( 53.4208 53.2374 53.7359 )
cET: iterparse_stringIO_clear (SAX T3 ) 38.3572 msec/pass, best of ( 38.4457 38.3572 38.3621 )
ET : iterparse_stringIO_clear (SAX T3 ) 254.7633 msec/pass, best of ( 257.4664 254.7633 255.7398 )
lxe: iterparse_stringIO_clear (U-X T3 ) 24.6667 msec/pass, best of ( 24.7798 24.6667 24.7138 )
cET: iterparse_stringIO_clear (U-X T3 ) 11.3932 msec/pass, best of ( 11.5611 11.3932 11.5425 )
ET : iterparse_stringIO_clear (U-X T3 ) 127.7254 msec/pass, best of ( 127.8360 127.9421 127.7254 )
lxe: iterparse_stringIO_clear (UAX T3 ) 52.9185 msec/pass, best of ( 52.9185 53.2012 52.9428 )
cET: iterparse_stringIO_clear (UAX T3 ) 38.2394 msec/pass, best of ( 38.4341 38.2394 38.3053 )
ET : iterparse_stringIO_clear (UAX T3 ) 253.7129 msec/pass, best of ( 254.9093 254.8415 253.7129 )
lxe: iterparse_stringIO_clear (S-X T4 ) 1.1246 msec/pass, best of ( 1.2300 1.1246 1.1279 )
cET: iterparse_stringIO_clear (S-X T4 ) 0.3431 msec/pass, best of ( 0.3467 0.3463 0.3431 )
ET : iterparse_stringIO_clear (S-X T4 ) 3.0261 msec/pass, best of ( 3.0261 3.0825 3.0340 )
lxe: iterparse_stringIO_clear (SAX T4 ) 1.8222 msec/pass, best of ( 1.8494 1.8222 1.8429 )
cET: iterparse_stringIO_clear (SAX T4 ) 0.9881 msec/pass, best of ( 0.9933 0.9881 1.0033 )
ET : iterparse_stringIO_clear (SAX T4 ) 6.1015 msec/pass, best of ( 6.1317 6.1015 6.1112 )
lxe: iterparse_stringIO_clear (U-X T4 ) 1.1314 msec/pass, best of ( 1.1320 1.1569 1.1314 )
cET: iterparse_stringIO_clear (U-X T4 ) 0.3512 msec/pass, best of ( 0.3577 0.3564 0.3512 )
ET : iterparse_stringIO_clear (U-X T4 ) 3.0102 msec/pass, best of ( 3.0351 3.0412 3.0102 )
lxe: iterparse_stringIO_clear (UAX T4 ) 1.8182 msec/pass, best of ( 1.8478 1.8182 1.8598 )
cET: iterparse_stringIO_clear (UAX T4 ) 0.9898 msec/pass, best of ( 0.9898 0.9990 0.9943 )
ET : iterparse_stringIO_clear (UAX T4 ) 6.1685 msec/pass, best of ( 6.2379 6.2236 6.1685 )
lxe: parse_stringIO (S-X T1 ) 170.6294 msec/pass, best of ( 173.7078 175.0854 170.6294 )
cET: parse_stringIO (S-X T1 ) 31.2479 msec/pass, best of ( 31.4530 31.2479 31.2610 )
ET : parse_stringIO (S-X T1 ) 320.9497 msec/pass, best of ( 321.8702 322.3815 320.9497 )
lxe: parse_stringIO (SAX T1 ) 171.2800 msec/pass, best of ( 171.2800 175.6441 173.7187 )
cET: parse_stringIO (SAX T1 ) 31.3205 msec/pass, best of ( 32.3134 31.3205 31.3392 )
ET : parse_stringIO (SAX T1 ) 323.7324 msec/pass, best of ( 323.7324 323.7890 328.4510 )
lxe: parse_stringIO (U-X T1 ) 171.6893 msec/pass, best of ( 172.5678 174.2148 171.6893 )
cET: parse_stringIO (U-X T1 ) 31.2387 msec/pass, best of ( 32.1226 31.2387 31.4539 )
ET : parse_stringIO (U-X T1 ) 321.3753 msec/pass, best of ( 323.4447 321.7778 321.3753 )
lxe: parse_stringIO (UAX T1 ) 171.9517 msec/pass, best of ( 171.9517 174.0736 176.9163 )
cET: parse_stringIO (UAX T1 ) 31.5030 msec/pass, best of ( 34.8593 31.5832 31.5030 )
ET : parse_stringIO (UAX T1 ) 323.3877 msec/pass, best of ( 323.6831 323.3877 329.6061 )
lxe: parse_stringIO (S-X T2 ) 179.9828 msec/pass, best of ( 179.9828 181.4266 182.8689 )
cET: parse_stringIO (S-X T2 ) 33.5708 msec/pass, best of ( 34.6295 33.6188 33.5708 )
ET : parse_stringIO (S-X T2 ) 340.2606 msec/pass, best of ( 340.2606 340.5859 342.8782 )
lxe: parse_stringIO (SAX T2 ) 197.7678 msec/pass, best of ( 197.7678 199.4126 199.2439 )
cET: parse_stringIO (SAX T2 ) 38.9390 msec/pass, best of ( 40.0341 38.9390 39.0063 )
ET : parse_stringIO (SAX T2 ) 364.3468 msec/pass, best of ( 365.5197 364.3468 365.6537 )
lxe: parse_stringIO (U-X T2 ) 182.7992 msec/pass, best of ( 182.9719 182.9173 182.7992 )
cET: parse_stringIO (U-X T2 ) 33.7937 msec/pass, best of ( 34.7863 33.8375 33.7937 )
ET : parse_stringIO (U-X T2 ) 340.1892 msec/pass, best of ( 340.1892 341.4799 342.9891 )
lxe: parse_stringIO (UAX T2 ) 197.9534 msec/pass, best of ( 197.9534 199.2566 206.4527 )
cET: parse_stringIO (UAX T2 ) 38.9602 msec/pass, best of ( 40.2117 38.9602 39.0238 )
ET : parse_stringIO (UAX T2 ) 365.2490 msec/pass, best of ( 366.4139 366.0356 365.2490 )
lxe: parse_stringIO (S-X T3 ) 9.3905 msec/pass, best of ( 9.4029 9.3905 9.4458 )
cET: parse_stringIO (S-X T3 ) 7.6507 msec/pass, best of ( 7.6840 7.6507 7.7423 )
ET : parse_stringIO (S-X T3 ) 108.2791 msec/pass, best of ( 108.2791 110.4766 108.3301 )
lxe: parse_stringIO (SAX T3 ) 48.6069 msec/pass, best of ( 48.6069 48.8698 48.9391 )
cET: parse_stringIO (SAX T3 ) 39.8525 msec/pass, best of ( 40.9317 39.8525 39.9056 )
ET : parse_stringIO (SAX T3 ) 237.2957 msec/pass, best of ( 237.5886 237.2957 237.4677 )
lxe: parse_stringIO (U-X T3 ) 9.3709 msec/pass, best of ( 9.4539 9.3835 9.3709 )
cET: parse_stringIO (U-X T3 ) 7.5603 msec/pass, best of ( 7.7213 7.6098 7.5603 )
ET : parse_stringIO (U-X T3 ) 108.1631 msec/pass, best of ( 108.1631 108.8036 108.1696 )
lxe: parse_stringIO (UAX T3 ) 48.6735 msec/pass, best of ( 48.6735 48.8036 48.9602 )
cET: parse_stringIO (UAX T3 ) 39.7455 msec/pass, best of ( 40.0213 39.8740 39.7455 )
ET : parse_stringIO (UAX T3 ) 237.9971 msec/pass, best of ( 237.9971 238.7502 238.3795 )
lxe: parse_stringIO (S-X T4 ) 0.3790 msec/pass, best of ( 0.4000 0.3790 0.3791 )
cET: parse_stringIO (S-X T4 ) 0.2547 msec/pass, best of ( 0.2676 0.2568 0.2547 )
ET : parse_stringIO (S-X T4 ) 2.5935 msec/pass, best of ( 2.5935 2.5990 2.5970 )
lxe: parse_stringIO (SAX T4 ) 1.1852 msec/pass, best of ( 1.2173 1.1852 1.2512 )
cET: parse_stringIO (SAX T4 ) 0.8995 msec/pass, best of ( 0.9106 0.8995 0.9005 )
ET : parse_stringIO (SAX T4 ) 5.5938 msec/pass, best of ( 5.6073 5.6163 5.5938 )
lxe: parse_stringIO (U-X T4 ) 0.3752 msec/pass, best of ( 0.3799 0.3844 0.3752 )
cET: parse_stringIO (U-X T4 ) 0.2558 msec/pass, best of ( 0.2732 0.2599 0.2558 )
ET : parse_stringIO (U-X T4 ) 2.5950 msec/pass, best of ( 2.6326 2.6283 2.5950 )
lxe: parse_stringIO (UAX T4 ) 1.1881 msec/pass, best of ( 1.1881 1.2204 1.2204 )
cET: parse_stringIO (UAX T4 ) 0.8957 msec/pass, best of ( 0.9226 0.8992 0.8957 )
ET : parse_stringIO (UAX T4 ) 5.6234 msec/pass, best of ( 5.6234 5.6872 5.6961 )
lxe: write_utf8_parse_stringIO (S-T T1 ) 189.2201 msec/pass, best of ( 199.9139 189.2201 190.3712 )
cET: write_utf8_parse_stringIO (S-T T1 ) 795.5032 msec/pass, best of ( 844.4007 824.9086 795.5032 )
ET : write_utf8_parse_stringIO (S-T T1 ) 1138.0186 msec/pass, best of ( 1140.2423 1141.4423 1138.0186 )
lxe: write_utf8_parse_stringIO (SAT T1 ) 190.7312 msec/pass, best of ( 192.9438 190.7312 193.4001 )
cET: write_utf8_parse_stringIO (SAT T1 ) 801.8634 msec/pass, best of ( 807.4269 803.6557 801.8634 )
ET : write_utf8_parse_stringIO (SAT T1 ) 1149.8636 msec/pass, best of ( 1149.8636 1150.1960 1161.8435 )
lxe: write_utf8_parse_stringIO (U-T T1 ) 187.6989 msec/pass, best of ( 187.6989 190.4670 190.4346 )
cET: write_utf8_parse_stringIO (U-T T1 ) 797.0220 msec/pass, best of ( 799.1436 797.0220 797.8761 )
ET : write_utf8_parse_stringIO (U-T T1 ) 1139.1940 msec/pass, best of ( 1139.1940 1143.0199 1144.4894 )
lxe: write_utf8_parse_stringIO (UAT T1 ) 192.8713 msec/pass, best of ( 192.8713 194.3054 193.6350 )
cET: write_utf8_parse_stringIO (UAT T1 ) 801.4468 msec/pass, best of ( 801.4468 801.8691 803.9093 )
ET : write_utf8_parse_stringIO (UAT T1 ) 1149.7449 msec/pass, best of ( 1152.1680 1208.7331 1149.7449 )
lxe: write_utf8_parse_stringIO (S-T T2 ) 197.3249 msec/pass, best of ( 198.7984 197.3249 199.2573 )
cET: write_utf8_parse_stringIO (S-T T2 ) 836.6382 msec/pass, best of ( 836.6382 839.7354 838.3641 )
ET : write_utf8_parse_stringIO (S-T T2 ) 1198.1185 msec/pass, best of ( 1201.7853 1200.2708 1198.1185 )
lxe: write_utf8_parse_stringIO (SAT T2 ) 220.1455 msec/pass, best of ( 220.1455 221.5693 221.5681 )
cET: write_utf8_parse_stringIO (SAT T2 ) 933.0841 msec/pass, best of ( 933.7380 937.9234 933.0841 )
ET : write_utf8_parse_stringIO (SAT T2 ) 1311.6351 msec/pass, best of ( 1311.6351 1314.4752 1314.5881 )
lxe: write_utf8_parse_stringIO (U-T T2 ) 200.4080 msec/pass, best of ( 200.4080 203.1036 200.5778 )
cET: write_utf8_parse_stringIO (U-T T2 ) 836.8698 msec/pass, best of ( 836.8698 838.9036 840.1444 )
ET : write_utf8_parse_stringIO (U-T T2 ) 1198.3584 msec/pass, best of ( 1198.3584 1200.3080 1206.1083 )
lxe: write_utf8_parse_stringIO (UAT T2 ) 218.4633 msec/pass, best of ( 222.4844 218.4633 275.0107 )
cET: write_utf8_parse_stringIO (UAT T2 ) 926.8864 msec/pass, best of ( 934.6416 926.8864 930.0010 )
ET : write_utf8_parse_stringIO (UAT T2 ) 1310.0202 msec/pass, best of ( 1312.2715 1317.1640 1310.0202 )
lxe: write_utf8_parse_stringIO (S-T T3 ) 12.0866 msec/pass, best of ( 12.1740 12.0866 12.2349 )
cET: write_utf8_parse_stringIO (S-T T3 ) 109.3471 msec/pass, best of ( 109.3471 109.3881 109.5528 )
ET : write_utf8_parse_stringIO (S-T T3 ) 226.3420 msec/pass, best of ( 226.3420 226.8248 228.9372 )
lxe: write_utf8_parse_stringIO (SAT T3 ) 61.4568 msec/pass, best of ( 61.7457 61.9032 61.4568 )
cET: write_utf8_parse_stringIO (SAT T3 ) 608.1295 msec/pass, best of ( 608.6636 608.1295 613.7108 )
ET : write_utf8_parse_stringIO (SAT T3 ) 825.5513 msec/pass, best of ( 825.8673 827.0674 825.5513 )
lxe: write_utf8_parse_stringIO (U-T T3 ) 12.0974 msec/pass, best of ( 12.1823 12.0974 12.1191 )
cET: write_utf8_parse_stringIO (U-T T3 ) 109.5899 msec/pass, best of ( 109.6426 109.7568 109.5899 )
ET : write_utf8_parse_stringIO (U-T T3 ) 226.3300 msec/pass, best of ( 226.4436 227.5431 226.3300 )
lxe: write_utf8_parse_stringIO (UAT T3 ) 61.9726 msec/pass, best of ( 64.3755 61.9726 62.0165 )
cET: write_utf8_parse_stringIO (UAT T3 ) 609.6428 msec/pass, best of ( 609.6428 609.8895 615.0596 )
ET : write_utf8_parse_stringIO (UAT T3 ) 824.3621 msec/pass, best of ( 824.5392 826.6240 824.3621 )
lxe: write_utf8_parse_stringIO (S-T T4 ) 0.4812 msec/pass, best of ( 0.4814 0.4812 0.4846 )
cET: write_utf8_parse_stringIO (S-T T4 ) 5.3454 msec/pass, best of ( 5.4478 5.3454 5.3874 )
ET : write_utf8_parse_stringIO (S-T T4 ) 7.9776 msec/pass, best of ( 8.0112 7.9776 8.0064 )
lxe: write_utf8_parse_stringIO (SAT T4 ) 1.7338 msec/pass, best of ( 1.7576 1.7409 1.7338 )
cET: write_utf8_parse_stringIO (SAT T4 ) 17.3748 msec/pass, best of ( 17.3826 17.3748 17.3764 )
ET : write_utf8_parse_stringIO (SAT T4 ) 22.7056 msec/pass, best of ( 22.7056 22.7506 22.7812 )
lxe: write_utf8_parse_stringIO (U-T T4 ) 0.4791 msec/pass, best of ( 0.4839 0.4873 0.4791 )
cET: write_utf8_parse_stringIO (U-T T4 ) 5.3264 msec/pass, best of ( 5.3264 5.3404 5.3775 )
ET : write_utf8_parse_stringIO (U-T T4 ) 8.0147 msec/pass, best of ( 8.0221 8.0147 8.0596 )
lxe: write_utf8_parse_stringIO (UAT T4 ) 1.7334 msec/pass, best of ( 1.7661 1.7767 1.7334 )
cET: write_utf8_parse_stringIO (UAT T4 ) 17.3602 msec/pass, best of ( 17.3602 17.3674 17.3672 )
ET : write_utf8_parse_stringIO (UAT T4 ) 22.6152 msec/pass, best of ( 22.6621 22.6913 22.6152 )
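For readers who want to try the iterparse() pattern described in the message above (pulling a small amount of data out of a large stream and clearing elements on the fly), here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than the lxml and cET builds that were benchmarked, and it is not taken from bench.py. The helper names build_document and extract_names and the item/name document layout are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def build_document(n):
    """Serialise a small test tree with n <item> children (hypothetical layout)."""
    root = ET.Element("root")
    for i in range(n):
        item = ET.SubElement(root, "item", id=str(i))
        ET.SubElement(item, "name").text = "name-%d" % i
    return ET.tostring(root)


def extract_names(xml_bytes):
    """Collect the text of every <name> below an <item>, clearing as we go."""
    names = []
    # "end" events fire once an element and all of its children are complete.
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "item":
            names.append(elem.findtext("name"))
            # Drop the children and text of the finished element to bound memory.
            elem.clear()
    return names


if __name__ == "__main__":
    print(extract_names(build_document(3)))  # ['name-0', 'name-1', 'name-2']
```

The elem.clear() call corresponds to the "_clear" variants of the benchmarks: it costs a little extra time per element but keeps the in-memory tree small.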

Hi all,
this is my first post to this mailing list and I apologize for it being rather lengthy. First of all congrats for producing a great piece of software.
I am currently evaluating the usage of lxml trees as the program data model for messaging middleware interfacing applications. These applications (adapters) usually receive data as messages from some middleware component, perform certain operations on the data and then e.g. put out the transformed data as a file, write it to a database or publish it on the middleware, again. To abstract from the proprietary middleware message formats I plan to transport xml blobs in the messages and use lxml to represent the xml data in the adapter programs.
However, I'd like to add an (arguably :-) "even-more-pythonic" API layer on top of lxml, enabling the dot (.) operator syntax to navigate through the tree, similar to amara or gnosis.xml.objectify, plus the possibility to assign simple Python builtin types transparently.
I've been playing around a bit with lxml, renaming the _Element class to _ElementBase and defining a new class _Element that inherits from _ElementBase and implements the necessary __setattr__/__getattr__ magic, which works quite nicely. For my purposes, element.text is regarded as the element data, and ns-unqualified subelement access is allowed by simply using the parent ns-prefix, if no qualified name was given, i.e.
getattr(elt, 'foo') ---> returns children of elt with tagname {<ns-qualification of elt>}foo
getattr(elt, '{myURI}foo') ---> returns children of elt with tagname {myURI}foo
E.g.:
>>> tree
<etree._ElementTree object at 0x403170>
>>> tree.foo
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'etree._ElementTree' object has no attribute 'foo'
>>> tree.getroot().party
[<Element {http://www.fpml.org/2005/FpML-4-2}party at 401660>, <Element {http://www.fpml.org/2005/FpML-4-2}party at 401690>]
>>> tree.getroot().party[0]
<Element {http://www.fpml.org/2005/FpML-4-2}party at 401660>
>>> tree.getroot().party[0].partyId
<Element {http://www.fpml.org/2005/FpML-4-2}partyId at 401630>
>>> tree.getroot().party[0].partyId.foo = 187873
>>> tree.getroot().party[0].partyId.foo
<Element {http://www.fpml.org/2005/FpML-4-2}foo at 401690>
>>> tree.getroot().party[0].partyId.foo()
187873
>>> etree.tostring(tree.getroot().party[0])
'<party id="PartyA">\n\t\t<partyId>Party A<foo>187873</foo></partyId>\n\t</party>\n\t'
I am now wondering how to implement such an API-layer a bit more non-intrusively, without changing the lxml base classes themselves. Given the structure of lxml with the usage of _elementFactory, I'm unsure how to do so. Any hints? Anyone else interested in such an additional API layer?
Btw. please ignore the stupid disclaimer my employer adds to my emails.. otoh they let me use Python :-)
Best regards,
Holger
Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde, verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte den Inhalt der E-Mail als Hardcopy an.
The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.
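One way to get dot access without touching any base class is a thin proxy object around the element. The sketch below is only an illustration of that idea, not Holger's patch and not an lxml mechanism: it wraps standard-library ElementTree elements, the class name DotProxy and the urn:example namespace are hypothetical, and unlike the session shown above it always returns a list of children and stores assigned values as strings.

```python
import xml.etree.ElementTree as ET


class DotProxy:
    """Hypothetical wrapper giving dot access to the children of an element."""

    def __init__(self, element):
        # Bypass our own __setattr__, which is reserved for creating children.
        object.__setattr__(self, "_element", element)

    def _qualify(self, name):
        # Unqualified names inherit the namespace of the wrapped element.
        if name.startswith("{"):
            return name
        tag = self._element.tag
        if tag.startswith("{"):
            return tag[: tag.index("}") + 1] + name
        return name

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        children = self._element.findall(self._qualify(name))
        if not children:
            raise AttributeError(name)
        return [DotProxy(child) for child in children]

    def __setattr__(self, name, value):
        # Assigning a value appends a new child holding its string form.
        child = ET.SubElement(self._element, self._qualify(name))
        child.text = str(value)

    def __call__(self):
        # Calling the proxy returns element.text, i.e. the element data.
        return self._element.text


root = ET.fromstring(
    '<f:doc xmlns:f="urn:example">'
    "<f:party><f:partyId>Party A</f:partyId></f:party>"
    "</f:doc>"
)
doc = DotProxy(root)
print(doc.party[0].partyId[0]())  # Party A
doc.party[0].partyId[0].foo = 187873
print(doc.party[0].partyId[0].foo[0]())  # 187873
```

The trade-off of a proxy is that every access allocates new wrapper objects and that code receiving raw elements from elsewhere never sees the dot syntax, which is why a hook inside the element factory is attractive.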

Hi Holger,
Holger Joukl wrote:
I'd like to add an (arguably :-) "even-more-pythonic" API layer on top of lxml, enabling the dot (.) operator syntax to navigate through the tree, similar to amara or gnosis.xml.objectify, plus the possibility to assign simple Python builtin types transparently.
E.g.:
>>> tree
<etree._ElementTree object at 0x403170>
>>> tree.foo
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'etree._ElementTree' object has no attribute 'foo'
>>> tree.getroot().party
[<Element {http://www.fpml.org/2005/FpML-4-2}party at 401660>, <Element {http://www.fpml.org/2005/FpML-4-2}party at 401690>]
>>> tree.getroot().party[0]
<Element {http://www.fpml.org/2005/FpML-4-2}party at 401660>
>>> tree.getroot().party[0].partyId
<Element {http://www.fpml.org/2005/FpML-4-2}partyId at 401630>
>>> tree.getroot().party[0].partyId.foo = 187873
>>> tree.getroot().party[0].partyId.foo
<Element {http://www.fpml.org/2005/FpML-4-2}foo at 401690>
>>> tree.getroot().party[0].partyId.foo()
187873
>>> etree.tostring(tree.getroot().party[0])
'<party id="PartyA">\n\t\t<partyId>Party A<foo>187873</foo></partyId>\n\t</party>\n\t'
I am now wondering how to implement such an API-layer a bit more non-intrusively, without changing the lxml base classes themselves.
Have a look here, that should get you going: http://codespeak.net/lxml/namespace_extensions.html
The contents of this e-mail are confidential.
I don't think it is ... :)
Stefan

Hi,
Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> wrote on 22.06.2006 12:57:22:
Hi Holger,
Holger Joukl wrote:
I'd like to add an (arguably :-) "even-more-pythonic" API layer on top of lxml, enabling the dot (.) operator syntax to navigate through the tree, similar to amara or gnosis.xml.objectify, plus the possibility to assign simple Python builtin types transparently. [...]
I am now wondering how to implement such an API-layer a bit more non-intrusively, without changing the lxml base classes themselves.
Have a look here, that should get you going: http://codespeak.net/lxml/namespace_extensions.html
Thanks! Simply overlooked that part of the documentation. Can I get this mechanism to always return custom elements regardless of the namespace, without having to register the custom element class for every possible namespace? I guess some ns-uri wildcard would have to be introduced ('*' ?) and the custom element registered for the wildcard, with an additional check for a wildcard registry in _find_element_class. Right?
Thanks for the quick response btw.
Cheers, H.

Hi,
I think it would be nice to modify lxml's fine namespace extensions mechanisms to allow for using a custom element class for every element regardless of its namespace, if no specialized namespace registry has been defined for a namespace. Something along the lines of (diff based on lxml 1.0.1 nsclasses.pxi):
$ diff -c ../../lxml-1.0.1/src/lxml/nsclasses.pxi src/lxml/nsclasses.pxi
*** ../../lxml-1.0.1/src/lxml/nsclasses.pxi Thu Jun 8 16:18:04 2006
--- src/lxml/nsclasses.pxi Fri Jun 23 12:16:59 2006
***************
*** 33,38 ****
--- 33,43 ----
              _ClassNamespaceRegistry(ns_uri)
      return registry
  
+ def DefaultNamespace():
+     """Retrieve the namespace object associated with the default namespace URI
+     ('*'). Creates a new default if it does not yet exist."""
+     return Namespace('*')
+
  def FunctionNamespace(ns_uri):
      """Retrieve the function namespace object associated with the given URI.
      Creates a new one if it does not yet exist. A function namespace can
***************
*** 188,194 ****
          dict_result = python.PyDict_GetItem(
              __NAMESPACE_REGISTRIES, None)
      if dict_result is NULL:
!         return _Element
  
      registry = <_NamespaceRegistry>dict_result
      classes = registry._entries
--- 193,203 ----
          dict_result = python.PyDict_GetItem(
              __NAMESPACE_REGISTRIES, None)
      if dict_result is NULL:
!         dict_result = python.PyDict_GetItemString(
!             __NAMESPACE_REGISTRIES, '*')
!
!     if dict_result is NULL:
!         return _Element
  
      registry = <_NamespaceRegistry>dict_result
      classes = registry._entries
Of course one would have to think about what default registry key/URI to use for the fallback, I simply took '*'. Comments?
Holger
Note: This stems from a question I wrote in thread '"even-more-pythonic" API on top of lxml'
"Holger Joukl" <Holger.Joukl@LBBW.de> wrote on 22.06.2006 16:35:40:
Hi,
Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> wrote on 22.06.2006 12:57:22:
Have a look here, that should get you going: http://codespeak.net/lxml/namespace_extensions.html
Thanks! Simply overlooked that part of the documentation. Can I get this mechanism to always return custom elements regardless of the namespace, without having to register the custom element class for every possible namespace?
I guess some ns-uri wildcard would have to be introduced ('*' ?) and the custom element registered for the wildcard, with an additional check for a wildcard registry in _find_element_class. Right?
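The fallback proposed in the diff can be modelled in a few lines of plain Python, which may help when discussing its semantics. This is only a sketch under stated assumptions: plain dicts stand in for the C-level __NAMESPACE_REGISTRIES and the registry objects, the classes Element, CustomElement and PartyElement are hypothetical stand-ins, and the use of the key None for a per-namespace default class is an assumption of this model.

```python
class Element:
    """Stand-in for the built-in default element class."""


class CustomElement(Element):
    """Stand-in for a user-defined class meant for all namespaces."""


class PartyElement(Element):
    """Stand-in for a class registered for one specific tag."""


WILDCARD = "*"  # the fallback registry key chosen in the diff


def find_element_class(registries, ns_uri, tag):
    """Model of the patched lookup: namespace registry, then '*', then Element."""
    registry = registries.get(ns_uri)
    if registry is None:
        # New step from the diff: try the wildcard registry before giving up.
        registry = registries.get(WILDCARD)
    if registry is None:
        return Element
    if tag in registry:
        return registry[tag]
    # Assumed convention of this model: None holds the registry-wide default.
    return registry.get(None, Element)


registries = {
    "urn:a": {"party": PartyElement},
    WILDCARD: {None: CustomElement},
}
print(find_element_class(registries, "urn:a", "party").__name__)  # PartyElement
print(find_element_class(registries, "urn:b", "party").__name__)  # CustomElement
print(find_element_class(registries, "urn:a", "other").__name__)  # Element
```

The last call shows a consequence of the proposal as written: the wildcard is consulted only when a namespace has no registry at all, so an unregistered tag inside a registered namespace still falls back to the built-in class.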

Hi Holger,
Holger Joukl wrote:
I think it would be nice to modify lxml's fine namespace extensions mechanisms to allow for using a custom element class for every element regardless of its namespace, if no specialized namespace registry has been defined for a namespace.
Sure, why not.
+ def DefaultNamespace():
+     """Retrieve the namespace object associated with the default namespace URI
+     ('*'). Creates a new default if it does not yet exist."""
+     return Namespace('*')
+
I don't think DefaultNamespace is a good name, as this is not related to namespaces at all.
Of course one would have to think about what default registry key/URI to use for the fallback, I simply took '*'.
Not a bad choice, as it's already used in getiterator(), findall() etc. Although no-one can prevent you from using '*' as a namespace URI...
I think a better API would be an explicit module level function for setting the default element class.
Stefan
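To make the alternative concrete, a module-level setter could look roughly like the following. This is a plain-Python sketch of the idea only; the function names set_default_element_class, register_element_class and find_element_class are hypothetical and are not lxml API, and module-level dicts stand in for lxml's internal registries.

```python
class Element:
    """Stand-in for the built-in default element class."""


_default_element_class = Element
_namespace_registries = {}


def set_default_element_class(cls=None):
    """Hypothetical setter; passing None restores the built-in default."""
    global _default_element_class
    _default_element_class = Element if cls is None else cls


def register_element_class(ns_uri, tag, cls):
    """Register cls for one tag in one namespace."""
    _namespace_registries.setdefault(ns_uri, {})[tag] = cls


def find_element_class(ns_uri, tag):
    """Per-tag class if registered, otherwise the module-wide default."""
    return _namespace_registries.get(ns_uri, {}).get(tag, _default_element_class)


class MyElement(Element):
    pass


class StarElement(Element):
    pass


set_default_element_class(MyElement)
# '*' stays an ordinary namespace URI because no registry key is reserved.
register_element_class("*", "odd", StarElement)
print(find_element_class("urn:x", "anything").__name__)  # MyElement
print(find_element_class("*", "odd").__name__)  # StarElement
```

Keeping the default outside the registry avoids the question of which key to reserve, at the price of one more piece of global state.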

Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> wrote on 23.06.2006 12:54:26:
+ def DefaultNamespace():
+     """Retrieve the namespace object associated with the default namespace URI
+     ('*'). Creates a new default if it does not yet exist."""
+     return Namespace('*')
+
I don't think DefaultNamespace is a good name, as this is not related to namespaces at all.
Right, that's probably about as confusing as I found the heading "Implementing namespaces with the namespace class" in the lxml doc ;-) How about renaming this to "Associating custom element classes with namespaces"?
Of course one would have to think about what default registry key/URI to use for the fallback, I simply took '*'.
Not a bad choice, as it's already used in getiterator(), findall() etc. Although no-one can prevent you from using '*' as a namespace URI...
Is there some string literal that is actually invalid as a URI? Maybe something like this would be best. Or maybe remove the _utf8() conversion for the namespace registry keys and use some AnyNamespace object as the default dict key.
I think a better API would be an explicit module level function for setting the default element class.
The nice thing about staying close to the existing solution is that it is still possible to have per-tagname-registered custom element classes, if you wish.
Holger
The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.

Holger Joukl wrote:
Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> wrote on 23.06.2006 12:54:26:
+ def DefaultNamespace():
+     """Retrieve the namespace object associated with the default namespace URI
+     ('*'). Creates a new default if it does not yet exist."""
+     return Namespace('*')
+
I don't think DefaultNamespace is a good name, as this is not related to namespaces at all.
Right, that's probably similarly confusing as I found the heading "Implementing namespaces with the namespace class" in the lxml doc ;-) How about renaming this to "Associating custom element classes with namespaces"
+1 Doc patches are welcome, too, of course. :) Regards, Martijn

Hi Martijn,
Martijn Faassen <faassen@infrae.com> wrote on 23.06.2006 14:03:08:
Holger Joukl wrote:
How about renaming this to "Associating custom element classes with namespaces"
+1
Doc patches are welcome, too, of course. :)
How do you prefer patch proposals generally, simply as posts on this mailing list? Inline diff output or attached diff files, and based on which file versions/revisions? And doc patches, especially: I assume the doc sources are the .txt files in the doc directory (I've never used rest...)?
Holger

Holger Joukl wrote:
Hi Martijn,
Martijn Faassen <faassen@infrae.com> wrote on 23.06.2006 14:03:08:
Holger Joukl wrote:
How about renaming this to "Associating custom element classes with namespaces" +1
Doc patches are welcome, too, of course. :)
How do you prefer patch proposals generally, simply as posts on this mailing list?
If they're reasonably short, to the mailing list. If they don't seem to be getting through, mail them to me and Stefan.
Inline diff output or attached diff files, and based on which file versions/revisions?
Small attachments easiest, I think. Either to trunk or 1.0 branch, depending on whether you're doing a feature change or something that can be construed as a "bug fix" (documentation clarifications are bugfixes to me, even if they would add a few doctest examples). We'll do the synching to trunk where needed.
And doc patches, especially: I assume the doc sources are the .txt files in the doc directory (I've never used rest...)?
Yes, they're the .txt files; they're in ReST format but if you just follow the pattern in the files you should be fine, and we can fix up any small problems. If you go changing the doctests, please make sure you run the testsuite (python test.py) to see whether they actually work. If you're planning on doing *lots* of stuff we can get you a codespeak SVN account, of course, and you can work on a branch. :) Regards, Martijn

Hi Holger, Martijn Faassen wrote:
Holger Joukl wrote:
Inline diff output or attached diff files, and based on which file versions/revisions?
Small attachments easiest, I think.
Text attachments are always better than inline patches. Most mail programs make them directly visible and, being attached files, they do not lose any character encodings or suffer from line wrapping. BTW: we try to limit lines to 80 characters, which also tends to pass through some mail programs unwrapped.
Either to trunk or 1.0 branch, depending on whether you're doing a feature change or something that can be construed as a "bug fix" (documentation clarifications are bugfixes to me, even if they would add a few doctest examples). We'll do the synching to trunk where needed.
Right. The 1.0 branch is in stable maintenance mode, meaning: no new features, conservative changes, etc.
And doc patches, especially: I assume the doc sources are the .txt files in the doc directory (I've never used rest...)?
Yes, they're the .txt files; they're in ReST format but if you just follow the pattern in the files you should be fine, and we can fix up any small problems.
Note that there is an "html" target in the Makefile that regenerates the HTML pages. If you have the ReST tools installed, you can use it to check the result before sending in patches.
If you go changing the doctests, please make sure you run the testsuite (python test.py) to see whether they actually work.
Yep! And, BTW, it's much more likely that a patch will be accepted if it comes with doctests and/or unit test cases in src/lxml/test_*.py.
If you're planning on doing *lots* of stuff we can get you a codespeak SVN account, of course, and you can work on a branch. :)
Still, you should keep talking to the list before you go for bigger stuff. It's not always easy to understand the way things work, so we may have some pointers for you. Stefan
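For contributors who have not written doctests before, here is a minimal sketch of what a doctest-style documentation snippet looks like and how it can be checked. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree, and the names DOC and run_doc_examples are made up for this illustration; lxml's own test.py does its own test collection.

```python
import doctest
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# A piece of ReST-style documentation with embedded interpreter examples.
DOC = '''
Elements keep their tag name and know their children:

    >>> root = ET.fromstring("<a><b/></a>")
    >>> root.tag
    'a'
    >>> len(root)
    1
'''


def run_doc_examples(text):
    """Run all '>>>' examples in `text` and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # The globs dict supplies the names the examples may use.
    test = parser.get_doctest(text, {"ET": ET}, "example", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_doc_examples(DOC)
print(results.failed, results.attempted)
```

If an example's printed output differs from the expected text, the runner reports the mismatch and counts it as a failure, which is why running the test suite after touching the docs is worthwhile.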

Hi Holger, Holger Joukl wrote:
Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> wrote on 23.06.2006 12:54:26:
+ def DefaultNamespace():
+     """Retrieve the namespace object associated with the default namespace URI
+     ('*'). Creates a new default if it does not yet exist."""
+     return Namespace('*')
+
I don't think DefaultNamespace is a good name, as this is not related to namespaces at all.
Right, that's probably similarly confusing as I found the heading "Implementing namespaces with the namespace class" in the lxml doc ;-) How about renaming this to "Associating custom element classes with namespaces"
True. I just changed that and adapted the doctests to the new API. http://codespeak.net/svn/lxml/trunk/doc/namespace_extensions.txt
I think a better API would be an explicit module level function for setting the default element class.
The nice thing about staying close to the existing solution is that it is still possible to have per-tagname-registered custom element classes, if you wish.
That's not a problem. Namespace specific classes obviously override the global default class, just as they already do for _Element. Stefan
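The lookup order described here (a class registered for a qualified tag name wins, then the global default class, then the built-in element class) can be sketched in plain Python. This is only a toy model of the idea; ElementClassRegistry and its method names are hypothetical and are not lxml's actual API.

```python
class ElementClassRegistry:
    """Toy model of the class lookup order discussed above (not lxml code)."""

    def __init__(self, builtin_class):
        self._builtin = builtin_class   # plays the role of _Element
        self._default = None            # global default class, if any
        self._by_qname = {}             # (namespace URI, tag name) -> class

    def set_default_element_class(self, cls):
        """Hypothetical module-level setter for the global default class."""
        self._default = cls

    def register(self, namespace, tag, cls):
        """Associate a class with a namespace-qualified tag name."""
        self._by_qname[(namespace, tag)] = cls

    def lookup(self, namespace, tag):
        # 1. A namespace-specific registration overrides everything else.
        cls = self._by_qname.get((namespace, tag))
        if cls is not None:
            return cls
        # 2. Otherwise fall back to the global default class, if one is set.
        if self._default is not None:
            return self._default
        # 3. Otherwise use the built-in element class.
        return self._builtin
```

Keeping the default in its own slot, rather than under a magic key such as '*', avoids any clash with a namespace URI that happens to be spelled the same way.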

Hi,
Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> wrote on 23.06.2006 14:31:20:
True. I just changed that and adapted the doctests to the new API.
http://codespeak.net/svn/lxml/trunk/doc/namespace_extensions.txt
I think a better API would be an explicit module level function for setting the default element class.
The nice thing about staying close to the existing solution is that it is still possible to have per-tagname-registered custom element classes, if you wish.
That's not a problem. Namespace specific classes obviously override the global default class, just as they already do for _Element.
Stefan
I meant that it was possible to have namespace-agnostic, per-tagname custom default classes (as opposed to namespace-aware per-tagname custom classes), but I admit I don't really have any use case for that. It was more of a side effect of making just a minimal code change, to learn the workings. And admittedly your solution and API are nice and clean. You guys are really fast!
Thanks,
Holger

Hi Holger, Holger Joukl wrote:
I meant that it was possible to have namespace-agnostic, per-tagname custom default classes (as opposed to namespace-aware per-tagname custom classes), but admit I don't really have any usecase for that.
And admittedly your solution and API is nice and clean.
I already had such a thing in the back of my head for a while, so it didn't come as a surprise. I just wasn't quite sure how to get it straight. Your proposal assured me that a utility function is the way to go. :) I don't think it's a good idea to support tag->element mappings without namespaces. The way it works now ensures that element classes are associated only with qualified tag names, and I think that's the right thing to do.
Stefan

Hi,
lxml.objectify crashes under python2.3:
PYTHONPATH=/apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/ python2.3
Python 2.3.4 (#6, Jul 20 2004, 11:09:38)
[GCC 2.95.2 19991024 (release)] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.objectify
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ImportError: ld.so.1: python2.3: fatal: relocation error: file /apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/lxml/objectify.so: symbol PyDict_Contains: referenced symbol not found
It seems that PyDict_Contains is not available in python2.3:
$ elfdump /apps/pydev/gcc/3.4.4/bin/python2.4 |grep -i pydict_cont
[593] 0x0004c078 0x00000070 FUNC GLOB D 0 .text PyDict_Contains
[3487] 0x0004c078 0x00000070 FUNC GLOB D 0 .text PyDict_Contains
[593] PyDict_Contains
0 hjoukl@dev-a .../pytaf $ elfdump /apps/prod/bin/python2.3 |grep -i pydict_cont
1 hjoukl@dev-a .../pytaf $
Regards,
Holger

Hi Holger, Holger Joukl wrote:
lxml.objectify crashes under python2.3:
PYTHONPATH=/apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/ python2.3
Python 2.3.4 (#6, Jul 20 2004, 11:09:38)
[GCC 2.95.2 19991024 (release)] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.objectify
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ImportError: ld.so.1: python2.3: fatal: relocation error: file /apps/pydev/gcc/3.4.4/lib/python2.4/site-packages/lxml/objectify.so: symbol PyDict_Contains: referenced symbol not found
Besides the fact that you should not normally import modules that were compiled for a different Python version - you're right, thanks. That one slipped through accidentally. Here's the patch.
Stefan
Index: src/lxml/objectify.pyx
===================================================================
--- src/lxml/objectify.pyx (revision 31226)
+++ src/lxml/objectify.pyx (working copy)
@@ -184,7 +184,7 @@
         if c_ns is NULL and tree._getNs(child._c_node) is not NULL:
             continue
         name = child._c_node.name
-        if not python.PyDict_Contains(children, name):
+        if python.PyDict_GetItem(children, name) is NULL:
             python.PyDict_SetItem(children, name, child)
     return children
Index: src/lxml/python.pxd
===================================================================
--- src/lxml/python.pxd (revision 31212)
+++ src/lxml/python.pxd (working copy)
@@ -52,7 +52,6 @@
     cdef void PyDict_Clear(object d)
     cdef object PyDict_Copy(object d)
     cdef Py_ssize_t PyDict_Size(object d)
-    cdef int PyDict_Contains(object d, object key)
     cdef object PySequence_List(object o)
     cdef object PySequence_Tuple(object o)
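In pure Python terms, the patched lines keep only the first child seen for each tag name; the fix merely swaps the C-level membership test for one that also exists in Python 2.3. A rough Python-level equivalent, with the function name and the (name, child) input format made up for this illustration, might look like this:

```python
def first_child_by_name(named_children):
    """Map each tag name to the first child seen with that name.

    `named_children` is an iterable of (name, child) pairs; this input
    format is an assumption for the sketch, not objectify's real interface.
    """
    children = {}
    for name, child in named_children:
        # Same effect as the patched check: insert only if the key is absent.
        if name not in children:
            children[name] = child
    return children


print(first_child_by_name([("a", 1), ("b", 2), ("a", 3)]))
```

Later children with an already-seen name are skipped rather than overwriting the earlier entry.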

Hi Holger, Holger Joukl wrote:
I'd like to add an (arguably :-) "even-more-pythonic" API layer on top of lxml, enabling the dot (.) operator syntax to navigate through the tree, similar to amara or gnosis.xml.objectify, plus the possibility to assign simple Python builtin types transparently.
For my purposes, element.text is regarded as the element data, and ns-unqualified subelement access is allowed by simply using the parent ns-prefix, if no qualified name was given, i.e.
getattr(elt, 'foo') ---> returns children of elt with tagname {<ns-qualification of elt>}foo
getattr(elt, '{myURI}foo') --> returns children of elt with tagname {myURI}foo
E.g.:
>>> tree
<etree._ElementTree object at 0x403170>
>>> tree.foo
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'etree._ElementTree' object has no attribute 'foo'
>>> tree.getroot().party
[<Element {http://www.fpml.org/2005/FpML-4-2}party at 401660>, <Element {http://www.fpml.org/2005/FpML-4-2}party at 401690>]
>>> tree.getroot().party[0]
<Element {http://www.fpml.org/2005/FpML-4-2}party at 401660>
>>> tree.getroot().party[0].partyId
<Element {http://www.fpml.org/2005/FpML-4-2}partyId at 401630>
>>> tree.getroot().party[0].partyId.foo = 187873
>>> tree.getroot().party[0].partyId.foo
<Element {http://www.fpml.org/2005/FpML-4-2}foo at 401690>
>>> tree.getroot().party[0].partyId.foo()
187873
>>> etree.tostring(tree.getroot().party[0])
'<party id="PartyA">\n\t\t<partyId>Party A<foo>187873</foo></partyId>\n\t</party>\n\t'
I think that's an interesting API to have, especially since a lot of Python XML libraries support this. I could imagine having a package "lxml.elementlib" as a collection of generic Element classes that implement certain extended APIs. Before I start writing something like this myself, could you contribute your implementation for this purpose? It shouldn't be very complex anyway. If you do, please provide it in pure Python. And, if you want to be really helpful, you could come up with some test cases similar to what you find in src/lxml/tests/test_*.py or even some doctests to prove that it works as expected.
Stefan
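To make the proposed attribute lookup concrete, here is a small pure-Python sketch of the read side only. It wraps standard-library ElementTree elements instead of subclassing lxml's element class, and the DotAccess name is invented for this illustration. Unlike the session quoted above, it always returns a list of matches, and it does not cover assignment or calling an element to get its value.

```python
import xml.etree.ElementTree as ET


class DotAccess:
    """Hypothetical wrapper: attribute access returns matching child elements."""

    def __init__(self, element):
        self._element = element

    @property
    def text(self):
        return self._element.text

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        tag = name
        if not name.startswith("{"):
            # Unqualified name: reuse the namespace of the parent element.
            parent_tag = self._element.tag
            if parent_tag.startswith("{"):
                namespace = parent_tag[1:].split("}", 1)[0]
                tag = "{%s}%s" % (namespace, name)
        matches = [DotAccess(child) for child in self._element
                   if child.tag == tag]
        if not matches:
            raise AttributeError(name)
        return matches


root = DotAccess(ET.fromstring(
    '<root xmlns="urn:example"><party id="A"><partyId>Party A</partyId>'
    '</party><party id="B"/></root>'))
print(len(root.party), root.party[0].partyId[0].text)
```

A fully qualified name still works through getattr(root, '{urn:example}party'), which matches the second form described in the proposal.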

Hi Stefan,
Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> wrote on 25.06.2006 21:05:16:
Hi Holger,
I think that's an interesting API to have, especially since a lot of Python XML libraries support this. I could imagine having a package "lxml.elementlib" as a collection of generic Element classes that implement certain extended APIs.
Before I start writing something like this myself, could you contribute your implementation for this purpose? It shouldn't be very complex anyway. If you do, please provide it in pure Python. And, if you want to be really helpful, you could come up with some test cases similar to what you find in src/lxml/tests/test_*.py or even some doctests to prove that it works as expected.
Stefan
I'd be happy to contribute my implementation, but currently this is just at the evaluation stage. Many API issues are still open, e.g.:
- Implement the special math methods to allow things like rootElt.subElt.a + rootElt.subElt.b, delegating the actual operation to the underlying simple Python type?
- For rootElt.subElt.a, maybe even just return the simple Python value it contains instead of the ElementBase-derived object instance a, iff it does not have children itself?
- How to determine the simple Python value from elt.text? I'm thinking of using a pluggable "guesser" here that will be set by a module-level function and allows the user to implement the rules. This guesser will expect an Element and return the "simple python value of this element".
...
My motivation: We want to migrate a Python toolkit used for interfacing issues that is heavily based on the commercial TIB/Rendezvous messaging middleware. The internal data format is structured RvMsg data as provided by the TIB API. lxml would be a (hot!) candidate to come up with something more powerful, as e.g. RvMsg does not support element attributes, plus all the great lxml features like XPath, XSLT, ... However, there are downsides too: the RvMsg data can practically be used just like a simple Python class instance, making use of the simple Python builtin types. Also, it is very fast.
In short: If we decide for the lxml way (which is likely) I can come up with all you mention, though it will take some time with respect to testing, as this will become production code in a banking environment.
About pure Python, though: My first tests with naive pyrex code and naive Python code (practically just copy&paste) show the pyrex version about 3x faster than the pure Python version, and speed will be an issue.
If we drop lxml and go for another solution (currently unlikely) I can still give all my evaluation code to you. Would that be ok for you?
Holger
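The pluggable "guesser" and the delegation of arithmetic to the underlying Python value could look roughly like the sketch below. All names (default_guesser, set_value_guesser, pyval, ValueElement) are hypothetical, the int-then-float-then-string rule is only one possible default, and standard-library ElementTree elements stand in for lxml elements.

```python
import xml.etree.ElementTree as ET


def default_guesser(element):
    """Guess a simple Python value from element.text: int, float, else str."""
    text = element.text
    if text is None:
        return None
    text = text.strip()
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text


_guesser = default_guesser


def set_value_guesser(func):
    """Module-level hook: install a user-supplied guesser function."""
    global _guesser
    _guesser = func


def pyval(element):
    """Return the simple Python value of an element via the current guesser."""
    return _guesser(element)


class ValueElement:
    """Wrapper that delegates '+' to the guessed Python values."""

    def __init__(self, element):
        self.element = element

    def __add__(self, other):
        if isinstance(other, ValueElement):
            other = pyval(other.element)
        return pyval(self.element) + other


root = ET.fromstring("<r><a>187873</a><b>1.5</b></r>")
a, b = root
print(ValueElement(a) + ValueElement(b))
```

Routing every conversion through one replaceable function keeps the type rules out of the element class itself, so an application with stricter rules can swap them in with a single call.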
participants (3)
- Holger Joukl
- Martijn Faassen
- Stefan Behnel