export sites/pages to PDF

norseman norseman at hughes.net
Wed Aug 13 02:11:22 CEST 2008

Nick Craig-Wood wrote:
> jvdb <streamservenl at gmail.com> wrote:
>>  My employer is asking for a solution that outputs the content of urls
>>  to pdf. It must be the content as seen within the browser.
>>  Can someone help me on this? It must be able to export several kind of
>>  pages with all kind of content (javascript, etc.)
> Sounds like you'd be best off scripting a browser.
> Eg under KDE you can print to PDF from Konqueror using dcop to remote
> control it.
> Here is a demo... start Konqueror, select the PDF printer manually
> before you start. (You can automate this I expect!)
> Run
>   dcop konq*
> to find the id of the running konqueror (in my case
> "konqueror-18286"), then open a URL
>   dcop konqueror-18286 konqueror-mainwindow#1 openURL http://www.google.com
> To print to a PDF file
>   dcop konqueror-18286 html-widget2 print 1
> Web site converted to PDF in ~/print.pdf ;-)
> Easy enough to script that with python.
> See here for some more info on dcop :-
>   http://www.ibm.com/developerworks/linux/library/l-dcop/

If you are running KDE - go with Nick's method.
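
For what it's worth, Nick's demo could be scripted from Python roughly like this. It is only a sketch: the object names "konqueror-mainwindow#1" and "html-widget2" are copied from his example and may differ on your session, the fixed wait is a guess at how long a page needs to load, and the function names are made up for illustration.

```python
import subprocess
import time

# Object names copied from the quoted demo; they are assumptions here and
# may differ between Konqueror sessions (check with `dcop <app_id>`).
MAIN_WINDOW = "konqueror-mainwindow#1"
HTML_WIDGET = "html-widget2"


def open_url_cmd(app_id, url):
    """Build the dcop call that loads `url` in the running Konqueror."""
    return ["dcop", app_id, MAIN_WINDOW, "openURL", url]


def print_cmd(app_id):
    """Build the dcop print call, with the argument 1 used in the demo."""
    return ["dcop", app_id, HTML_WIDGET, "print", "1"]


def export_to_pdf(app_id, url, wait=5.0, runner=subprocess.check_call):
    """Load `url`, wait a fixed time, then send the page to the printer."""
    runner(open_url_cmd(app_id, url))
    # Crude pause: assumes openURL may return before the page is rendered.
    time.sleep(wait)
    runner(print_cmd(app_id))
```

You would call it as export_to_pdf("konqueror-18286", "http://www.google.com") with the id that `dcop konq*` reports. The PDF printer still has to be selected by hand first, and since the demo always writes ~/print.pdf you would want to rename that file between URLs.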

If the project is what it sounds like, an in-house thing where the
web pages are created by "you", there is a simpler route.

IF (BIG IF) all of these hold:
   you have a limited number of URLs to deal with,
   the pages are NOT going to change shape via the print command
      (some sites use one .css for screen and another for print),
   you are using UNIX of some sort,
then:

Open the page and print the postscript output to a file.
   One file per page.


with this in a script:
# ps2pdf.scr
# converts a single ps file to a pdf file
# april 2000
# the output is written to <name>.pdf in the current directory
ofil=`basename "$1" .ps`
gs -sDEVICE=pdfwrite -q \
    -dBATCH -dNOPAUSE -r300 \
    -sOutputFile="$ofil.pdf" "$1"

ps2pdf.scr file.ps

If you have a number of .ps files to convert:

for f in *.ps; do ps2pdf.scr "$f"; done
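
Since this is the Python list, the same loop could be driven from Python. A minimal sketch, assuming `gs` is on the PATH; the function names are hypothetical, and the Ghostscript options are the ones from the script above:

```python
import glob
import os
import subprocess


def pdf_name(ps_path):
    """file.ps -> file.pdf, dropping any directory part like `basename`."""
    base = os.path.basename(ps_path)
    root, ext = os.path.splitext(base)
    return (root if ext == ".ps" else base) + ".pdf"


def gs_command(ps_path):
    """Build the Ghostscript call that converts one .ps file to a .pdf."""
    return ["gs", "-sDEVICE=pdfwrite", "-q", "-dBATCH", "-dNOPAUSE", "-r300",
            "-sOutputFile=" + pdf_name(ps_path), ps_path]


def convert_all(pattern="*.ps", runner=subprocess.check_call):
    """Convert every file matching `pattern`; return the PDF names written."""
    converted = []
    for ps_path in sorted(glob.glob(pattern)):
        runner(gs_command(ps_path))  # raises CalledProcessError if gs fails
        converted.append(pdf_name(ps_path))
    return converted
```

Calling convert_all() in the directory holding the .ps files does the same job as the shell loop, and stops at the first file Ghostscript rejects.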

In Windows - set the default printer to a PDF-to-file printer and just print.
              Don't expect to concatenate the PDFs into a single "book"
              without a third-party program.

   If (in UNIX) you want the whole lot in one file, set up the
printer section to ">>" (append) each output to the single file.
Depending on the browser you may need to do some header cleaning,
since each appended print job brings its own PostScript header.
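
An alternative to appending with ">>", if you kept one .ps file per page: Ghostscript accepts several input files in one run and writes them, in order, to a single output file. A minimal sketch of that, again assuming `gs` is on the PATH; the function names and file names are only illustrative:

```python
import subprocess


def merge_command(ps_files, out_pdf):
    """Build one Ghostscript call that writes all inputs into `out_pdf`."""
    if not ps_files:
        raise ValueError("need at least one PostScript file")
    return ["gs", "-sDEVICE=pdfwrite", "-q", "-dBATCH", "-dNOPAUSE",
            "-sOutputFile=" + out_pdf] + list(ps_files)


def merge_to_pdf(ps_files, out_pdf, runner=subprocess.check_call):
    """Run the merge and return the name of the PDF that was written."""
    runner(merge_command(ps_files, out_pdf))
    return out_pdf
```

For example, merge_to_pdf(["page1.ps", "page2.ps"], "book.pdf") would give one PDF with the pages in the order listed, with no header cleaning to do by hand.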

norseman at hughes.net
