Thanks Vasudev!

[1] xtopdf looks great! I'll check it out.

[2] I've faced similar issues w.r.t. junk characters, which may happen when the PDF contains an incorrect ToUnicode map, though I still have to dig deeper and I'm not 100% sure. I've also faced an issue where duplicate strings are assigned to the same cell. You can check it out on GitHub <https://github.com/socialcopsdev/camelot/issues/103>. I suspect that since PDF is a canvas-based model and not a text-based one, like you said, the same text is just drawn again at a slight offset to make it look bold, which would explain the duplicates.

I'll probably write a detailed blog post about the issues I faced during development :)

Thanks for checking it out!

On Sat, Sep 29, 2018 at 1:26 AM Vasudev Ram <vasudevram@gmail.com> wrote:
Very interesting, and congrats, Vinayak.
As a person interested in both PDF generation [1] and PDF text extraction [2], I'm interested to know what issues you faced w.r.t. accuracy of text extraction and also formatting.
[1] I'm the creator of xtopdf, a Python toolkit for PDF generation from other file formats;
http://slides.com/vasudevram/xtopdf
http://bitbucket.org/vasudevram/xtopdf
[2] I worked on a project to extract text from PDF files. It was done using a C library (xpdf), though, not a Python one. However, the text extraction accuracy issues (some of which are technical issues inherent in the PDF format, according to the vendor of xpdf, Glyph and Cog) are language-independent. There were things like characters getting transposed, missing characters, junk characters sometimes, etc. (I also wrote a heuristics program to detect some such issues, but that too could only reject the bad extracts, not make them correct.)
So the extraction was not 100% accurate, at least in my project. Also, like I said, that vendor said the issues are inherent in PDF, partly related to it being a canvas-based model, not a text-based one.
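To illustrate the kind of heuristic check described above, here is a minimal Python sketch. It is not the program from that project: the function names, the set of "junk" character categories, and the 5% threshold are all assumptions chosen for illustration. Like the original, it can only flag a suspicious extract, not correct it.

```python
import unicodedata

# Hypothetical threshold: reject an extract when more than 5% of its
# characters look like junk. Tune this against real data.
JUNK_RATIO_LIMIT = 0.05


def is_junk(ch):
    """Return True for characters that rarely belong in extracted text."""
    if ch in "\n\r\t":
        return False  # ordinary whitespace controls are fine
    if ch == "\ufffd":
        return True  # Unicode replacement character
    # Cc = control, Co = private use, Cn = unassigned
    return unicodedata.category(ch) in {"Cc", "Co", "Cn"}


def junk_ratio(text):
    """Fraction of characters in text that is_junk() flags."""
    if not text:
        return 0.0
    return sum(is_junk(ch) for ch in text) / len(text)


def looks_bad(text, limit=JUNK_RATIO_LIMIT):
    """Heuristically reject an extract; this cannot repair it."""
    return junk_ratio(text) > limit
```

Private-use and replacement characters are a plausible signal because a missing or wrong character mapping tends to surface as such code points, but a check like this will miss transposed or dropped characters entirely.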
I'll try to check out your project some time later.
Cheers,
Vasudev
--
vi quickstart: https://gumroad.com/l/vi_quick
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Products: https://gumroad.com/vasudevram
While Tabula either gives good output or fails miserably, Camelot gives you complete control over the extraction process through various configuration parameters! You can check out this section of the README <https://github.com/socialcopsdev/camelot#why-camelot> for more information. Camelot also lets you plot geometries such as detected lines, intersections, and tables in the PDF to debug and improve table extraction! You can check out this part of the documentation <https://camelot-py.readthedocs.io/en/latest/user/advanced.html#plot-geometry> for more information on that.
Hello everyone!
I recently released a Python library which lets users extract data tables out of PDF files, my first open source library! Here's the link: https://github.com/socialcopsdev/camelot
I've created a wiki page <https://github.com/socialcopsdev/camelot/wiki/Comparison-with-other-PDF-Tabl...> comparing it to other open source PDF table extraction tools. I'm currently working on porting it to Python 3!
I would be really grateful if you could check it out, see if it's useful to you, and give me any feedback that may help me improve it, by replying here or by opening an issue or a pull request!
Looking forward to hearing from you all!
Thanks for your time!
Vinayak