From davidgshi at yahoo.co.uk  Sat Jan 10 18:16:54 2009
From: davidgshi at yahoo.co.uk (David Shi)
Date: Sat, 10 Jan 2009 17:16:54 +0000 (GMT)
Subject: [Web-SIG] Looking for an efficient Python script to download and save a .zip file programmatically
Message-ID: <513164.35651.qm@web26304.mail.ukl.yahoo.com>

I am looking for an efficient Python script to download and save a .zip
file programmatically (from an http or https call).

Regards.

David

From girish.redekar at gmail.com  Mon Jan 12 12:26:35 2009
From: girish.redekar at gmail.com (Girish Redekar)
Date: Mon, 12 Jan 2009 16:56:35 +0530
Subject: [Web-SIG] HTML parsing - get text position and font size
Message-ID: <94ece4190901120326p6a6d19c6w3e05a9b1a9c999e2@mail.gmail.com>

I'm trying to build a search engine in Python and am stuck at the place
where I parse HTML to get useful text. One should ideally be able to parse
the text (out of HTML tags) along with its position (for phrase searches)
and font-size (to weigh words appropriately).

However, this part gets very tedious (especially with bad HTML and CSS) and
my code is already unwieldy. It seems to me that this task should've been a
part of any Python-based semi-sophisticated screen scraper and that it would
be a commonly solved problem. Yet, no amount of googling has returned
anything useful.

Any ideas?

From noah.gift at gmail.com  Mon Jan 12 12:29:11 2009
From: noah.gift at gmail.com (Noah Gift)
Date: Tue, 13 Jan 2009 00:29:11 +1300
Subject: [Web-SIG] HTML parsing - get text position and font size
In-Reply-To: <94ece4190901120326p6a6d19c6w3e05a9b1a9c999e2@mail.gmail.com>
References: <94ece4190901120326p6a6d19c6w3e05a9b1a9c999e2@mail.gmail.com>
Message-ID:

2009/1/13 Girish Redekar :
> I'm trying to build a search engine in Python and am stuck at the place where I
> parse HTML to get useful text.
> One should ideally be able to parse the text
> (out of HTML tags) along with its position (for phrase searches) and
> font-size (to weigh words appropriately).
>
> However, this part gets very tedious (especially with bad HTML and CSS) and
> my code is already unwieldy. It seems to me that this task should've been a
> part of any Python-based semi-sophisticated screen scraper and that it would
> be a commonly solved problem. Yet, no amount of googling has returned
> anything useful.
>
> Any ideas?

I wrote this article a while back:

http://www.ibm.com/developerworks/aix/library/au-threadingpython/

I didn't fully explore it, but it seems like thread pools and
Beautiful Soup could work...

> _______________________________________________
> Web-SIG mailing list
> Web-SIG at python.org
> Web SIG: http://www.python.org/sigs/web-sig
> Unsubscribe:
> http://mail.python.org/mailman/options/web-sig/noah.gift%40gmail.com

From girish.redekar at gmail.com  Mon Jan 12 13:07:37 2009
From: girish.redekar at gmail.com (Girish Redekar)
Date: Mon, 12 Jan 2009 17:37:37 +0530
Subject: [Web-SIG] HTML parsing - get text position and font size
In-Reply-To:
References: <94ece4190901120326p6a6d19c6w3e05a9b1a9c999e2@mail.gmail.com>
Message-ID: <94ece4190901120407m3fc855e3le8b155ec2af937e2@mail.gmail.com>

Thanks Noah - Beautiful Soup does give a tree that can be used. However,
getting from the tree to the result I want is still a long way. I'm using
lxml (for speed concerns) and it also returns a tree similar to BS. I have
even got as far as parsing the CSS and getting the attributes for each text
element. However, getting from here to a simple list of the form:

[ (word1, fontsize1, position1), (word2, fontsize2, position2), (word3, fontsize3, position3) ... ]

is still tedious, as font sizes in HTML/CSS can be expressed in multiple
ways (like tags, sizes in pixels, relative sizes, default larger
size for headers etc).
One can get down and code each of these cases, but I was hoping someone
has already (and reliably) worked on the same.

Thanks,
Girish

On Mon, Jan 12, 2009 at 4:59 PM, Noah Gift wrote:
> 2009/1/13 Girish Redekar :
> > I'm trying to build a search engine in Python and am stuck at the place
> > where I parse HTML to get useful text. One should ideally be able to
> > parse the text (out of HTML tags) along with its position (for phrase
> > searches) and font-size (to weigh words appropriately).
> >
> > However, this part gets very tedious (especially with bad HTML and CSS)
> > and my code is already unwieldy. It seems to me that this task should've
> > been a part of any Python-based semi-sophisticated screen scraper and
> > that it would be a commonly solved problem. Yet, no amount of googling
> > has returned anything useful.
> >
> > Any ideas?
>
> I wrote this article a while back:
>
> http://www.ibm.com/developerworks/aix/library/au-threadingpython/
>
> I didn't fully explore it, but it seems like thread pools and
> Beautiful Soup could work...
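
The list of (word, fontsize, position) tuples described above can be
approximated without any CSS handling. What follows is only a sketch under
stated assumptions, not a solution to the CSS problem: it uses Python 3's
standard html.parser, and it replaces real font sizes with a hypothetical
per-tag weight table (TAG_WEIGHTS), so stylesheet rules, pixel sizes and
relative sizes are ignored. The names WeightedWords and extract_words are
made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical weights standing in for font size; unlisted tags count as 1.
TAG_WEIGHTS = {"h1": 6, "h2": 5, "h3": 4, "h4": 3, "h5": 2, "h6": 2,
               "b": 2, "strong": 2, "big": 2}
# Text inside these elements is not indexed.
SKIPPED = {"script", "style"}
# Elements that never have a closing tag, so they must not be pushed.
VOID = {"br", "hr", "img", "input", "meta", "link", "area", "base",
        "col", "embed", "source", "track", "wbr"}


class WeightedWords(HTMLParser):
    """Collect (word, weight, position) tuples from an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = [("", 1)]  # (tag, weight in effect); root entry stays
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        parent_weight = self.stack[-1][1]
        # A nested tag never lowers the weight inherited from its parent.
        weight = max(parent_weight, TAG_WEIGHTS.get(tag, 1))
        self.stack.append((tag, weight))

    def handle_endtag(self, tag):
        # Tolerate bad HTML: unwind to the nearest matching open tag,
        # and ignore a closing tag that was never opened.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if any(tag in SKIPPED for tag, _ in self.stack):
            return
        weight = self.stack[-1][1]
        for word in data.split():
            # Position is the running word index, usable for phrase search.
            self.words.append((word, weight, len(self.words)))


def extract_words(html):
    parser = WeightedWords()
    parser.feed(html)
    parser.close()
    return parser.words
```

With these assumed weights,
extract_words("<h1>Hello world</h1><p>plain <b>bold</b> text</p>") gives
("Hello", 6, 0), ("world", 6, 1), ("plain", 1, 2), ("bold", 2, 3) and
("text", 1, 4). Swapping html.parser for lxml or html5lib would change the
tree walking but not the idea of carrying a weight down the open-tag stack.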

From dirkjan at ochtman.nl  Mon Jan 12 13:16:00 2009
From: dirkjan at ochtman.nl (Dirkjan Ochtman)
Date: Mon, 12 Jan 2009 13:16:00 +0100
Subject: [Web-SIG] HTML parsing - get text position and font size
In-Reply-To: <94ece4190901120407m3fc855e3le8b155ec2af937e2@mail.gmail.com>
References: <94ece4190901120326p6a6d19c6w3e05a9b1a9c999e2@mail.gmail.com> <94ece4190901120407m3fc855e3le8b155ec2af937e2@mail.gmail.com>
Message-ID:

2009/1/12 Girish Redekar :
> is still tedious, as font sizes in HTML/CSS can be expressed in multiple
> ways (like tags, sizes in pixels, relative sizes, default larger
> size for headers etc). One can get down and code each of these cases, but I
> was hoping someone has already (and reliably) worked on the same.

So basically you want a full-on headless browser? Pretty non-trivial.
Your best bet would probably be to hook into a Mozilla instance somehow
(PyXPCOM, anyone?) and try to read the styles from the DOM there.

Cheers,

Dirkjan

From t.broyer at gmail.com  Mon Jan 12 14:51:01 2009
From: t.broyer at gmail.com (Thomas Broyer)
Date: Mon, 12 Jan 2009 14:51:01 +0100
Subject: [Web-SIG] HTML parsing - get text position and font size
In-Reply-To: <94ece4190901120326p6a6d19c6w3e05a9b1a9c999e2@mail.gmail.com>
References: <94ece4190901120326p6a6d19c6w3e05a9b1a9c999e2@mail.gmail.com>
Message-ID:

2009/1/12 Girish Redekar:
> I'm trying to build a search engine in Python and am stuck at the place where I
> parse HTML to get useful text. One should ideally be able to parse the text
> (out of HTML tags) along with its position (for phrase searches) and
> font-size (to weigh words appropriately).

Have a look at html5lib for HTML parsing:
http://code.google.com/p/html5lib

It builds on the HTML5 parsing rules, which are compatible with how the
four most used browsers (IE, Firefox, Safari and Opera) actually parse
HTML as of now (where those do not parse HTML exactly the same way, the
algorithm generally follows the "less illogical" behaviour).
The result can either be an html5lib-specific tree (SimpleTree) or a
BeautifulSoup, ElementTree/lxml or minidom tree. This means that, for
instance, you can replace your BeautifulSoup parsing code with html5lib
and keep the processing code as-is.

However, for font-size, you'd have to parse and "apply" CSS, and for
this I have no solution at hand (but I don't really understand the
use-case either, actually...)

--
Thomas Broyer

From manlio_perillo at libero.it  Mon Jan 12 15:11:04 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Mon, 12 Jan 2009 15:11:04 +0100
Subject: [Web-SIG] HTML parsing - get text position and font size
In-Reply-To: <94ece4190901120326p6a6d19c6w3e05a9b1a9c999e2@mail.gmail.com>
References: <94ece4190901120326p6a6d19c6w3e05a9b1a9c999e2@mail.gmail.com>
Message-ID: <496B4F78.6020806@libero.it>

Girish Redekar wrote:
> I'm trying to build a search engine in Python and am stuck at the place
> where I parse HTML to get useful text. One should ideally be able to
> parse the text (out of HTML tags) along with its position (for phrase
> searches) and font-size (to weigh words appropriately).
>

Word weighting should be based on semantics, not style.

However, if you really need it, for CSS parsing there is the cssutils
package.

I'm writing a CSS parser, too:
http://hg.mperillo.ath.cx/pdfimg/file/tip/pdfimg/style/css/
using PLY, so it should be easy to read/modify.

It is still at a very early stage.

> [...]

Regards
Manlio Perillo

From orsenthil at gmail.com  Thu Jan 15 20:46:02 2009
From: orsenthil at gmail.com (Senthil Kumaran)
Date: Fri, 16 Jan 2009 01:16:02 +0530
Subject: [Web-SIG] Looking for an efficient Python script to download and save a .zip file programmatically
Message-ID: <20090115194602.GD7200@goofy>

On Sat, Jan 10, 2009 at 05:16:54PM +0000, David Shi wrote:
>
> I am looking for an efficient Python script to download and save a .zip
> file programmatically (from an http or https call).
>

Does not

    import urllib
    zipfile = urllib.urlopen(url_to_zip_file_name).read()

do that?

--
Senthil

From davidgshi at yahoo.co.uk  Tue Jan 20 14:21:22 2009
From: davidgshi at yahoo.co.uk (David Shi)
Date: Tue, 20 Jan 2009 13:21:22 +0000 (GMT)
Subject: [Web-SIG] How to use IIS Session ID in Dojo and Python - any demo scripts?
Message-ID: <217689.12816.qm@web26302.mail.ukl.yahoo.com>

Hello,

I am using Dojo at the client-side and Python at the server-side on a
Windows server with IIS.

I am looking for concise instructions on how to use a session ID in Dojo
and Python. Working demo scripts in Dojo and Python are preferred.

At the front-end/client-side, Dojo needs to tell the user when a session
is done and print an innerHTML containing an href to a folder named with
the session ID.

Python receives variables/parameters from Dojo to do the job and produces
a folder named with the session ID.

Regards.

David

From wilk at flibuste.net  Fri Jan 30 20:52:49 2009
From: wilk at flibuste.net (William Dode)
Date: Fri, 30 Jan 2009 19:52:49 +0000 (UTC)
Subject: [Web-SIG] how to test hunging socket ?
Message-ID:

Hi,

I have a problem with a web app which freezes periodically. I monitored my
app and the hang doesn't seem to occur in it. So I think the problem is
before, or after, a socket problem I imagine... It happens with
wsgiref.simple_server and mod_wsgi. My app is not totally thread safe so
I didn't try a lot of servers...
When it freezes, I have to restart the app manually. With mod_wsgi it
freezes the whole server. It doesn't happen very often so it's difficult
for me to reproduce the problem.

So my question is, how can I simulate a hanging socket? Or how can I see
where exactly the app freezes?

In the python-paste server I read that Ian tried to handle some cases of
hanging sockets...

thx, and sorry for my English...

--
William Dodé
- http://flibuste.net
Informaticien Indépendant

From ianb at colorstudy.com  Fri Jan 30 21:32:49 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Fri, 30 Jan 2009 14:32:49 -0600
Subject: [Web-SIG] how to test hunging socket ?
In-Reply-To:
References:
Message-ID:

If you use the Paste HTTP server and Python 2.5 with ctypes installed, you
can install the watchthreads app:
http://svn.pythonpaste.org/Paste/trunk/paste/debug/watchthreads.py

That will let you see the hung threads, and get a traceback of their
current position.

On Fri, Jan 30, 2009 at 1:52 PM, William Dode wrote:
> Hi,
>
> I have a problem with a web app which freezes periodically. I monitored my
> app and the hang doesn't seem to occur in it. So I think the problem is
> before, or after, a socket problem I imagine... It happens with
> wsgiref.simple_server and mod_wsgi. My app is not totally thread safe so
> I didn't try a lot of servers...
> When it freezes, I have to restart the app manually. With mod_wsgi it
> freezes the whole server. It doesn't happen very often so it's difficult
> for me to reproduce the problem.
>
> So my question is, how can I simulate a hanging socket? Or how can I see
> where exactly the app freezes?
>
> In the python-paste server I read that Ian tried to handle some cases of
> hanging sockets...
>
> thx, and sorry for my English...
>
> --
> William Dodé - http://flibuste.net
> Informaticien Indépendant

--
Ian Bicking | http://blog.ianbicking.org

From wilk at flibuste.net  Fri Jan 30 22:48:04 2009
From: wilk at flibuste.net (William Dode)
Date: Fri, 30 Jan 2009 21:48:04 +0000 (UTC)
Subject: [Web-SIG] how to test hunging socket ?
References:
Message-ID:

On 30-01-2009, Ian Bicking wrote:
> If you use the Paste HTTP server and Python 2.5 with ctypes installed, you
> can install the watchthreads app:
> http://svn.pythonpaste.org/Paste/trunk/paste/debug/watchthreads.py
>
> that will let you see the hung threads, and get a traceback of their
> current position.

Fine, I should definitely give it a try.

If my app is not thread safe but responds in a decent time, can I benefit
from a multithreaded server (for a socket problem) if I use a lock for
every page, like this:

I use webob...

    from threading import Lock

    import webob

    lock = Lock()

    def my_application(environ, start_response):
        req = webob.Request(environ)
        res = webob.Response()
        res.content_type = 'text/html'
        # take the lock before the try block, so release() only runs
        # if acquire() succeeded
        lock.acquire()
        try:
            # my app...
            res.write('ok')
        finally:
            lock.release()
        return res(environ, start_response)

>
> On Fri, Jan 30, 2009 at 1:52 PM, William Dode wrote:
>
>> Hi,
>>
>> I have a problem with a web app which freezes periodically. I monitored my
>> app and the hang doesn't seem to occur in it. So I think the problem is
>> before, or after, a socket problem I imagine... It happens with
>> wsgiref.simple_server and mod_wsgi. My app is not totally thread safe so
>> I didn't try a lot of servers...
>> When it freezes, I have to restart the app manually. With mod_wsgi it
>> freezes the whole server. It doesn't happen very often so it's difficult
>> for me to reproduce the problem.
>>
>> So my question is, how can I simulate a hanging socket? Or how can I see
>> where exactly the app freezes?
>>
>> In the python-paste server I read that Ian tried to handle some cases of
>> hanging sockets...
>>
>> thx, and sorry for my English...

--
William Dodé - http://flibuste.net
Informaticien Indépendant

From ianb at colorstudy.com  Fri Jan 30 23:08:04 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Fri, 30 Jan 2009 16:08:04 -0600
Subject: [Web-SIG] how to test hunging socket ?
In-Reply-To:
References:
Message-ID:

On Fri, Jan 30, 2009 at 3:48 PM, William Dode wrote:
> Fine, I should definitely give it a try.
>
> If my app is not thread safe but responds in a decent time, can I benefit
> from a multithreaded server (for a socket problem) if I use a lock for
> every page, like this:
>
> I use webob...
>

If your app isn't threadsafe, you should use a multiprocess server.
mod_wsgi has options for this, and flup has forking options (you'd use flup
behind Apache or another server).

--
Ian Bicking | http://blog.ianbicking.org

From wilk at flibuste.net  Sat Jan 31 09:01:30 2009
From: wilk at flibuste.net (William Dode)
Date: Sat, 31 Jan 2009 08:01:30 +0000 (UTC)
Subject: [Web-SIG] how to test hunging socket ?
References:
Message-ID:

On 30-01-2009, Ian Bicking wrote:
> On Fri, Jan 30, 2009 at 3:48 PM, William Dode wrote:
>
>> Fine, I should definitely give it a try.
>>
>> If my app is not thread safe but responds in a decent time, can I benefit
>> from a multithreaded server (for a socket problem) if I use a lock for
>> every page, like this:
>>
>> I use webob...
>>
>
> If your app isn't threadsafe, you should use a multiprocess server.
> mod_wsgi has options for this, and flup has forking options (you'd use flup
> behind Apache or another server).

Yes, I could also use an async server. But I would like to identify (and
reproduce) the exact problem.
I also use a lot of cached data in my app. Anyway, I have to make it
thread-safe...

--
William Dodé - http://flibuste.net
Informaticien Indépendant

From fumanchu at aminus.org  Sat Jan 31 17:23:00 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Sat, 31 Jan 2009 08:23:00 -0800
Subject: [Web-SIG] how to test hunging socket ?
In-Reply-To:
References:
Message-ID:

William Dodé wrote:
> On 30-01-2009, Ian Bicking wrote:
> > On Fri, Jan 30, 2009 at 3:48 PM, William Dode wrote:
> >
> >> Fine, I should definitely give it a try.
> >>
> >> If my app is not thread safe but responds in a decent time, can I
> >> benefit from a multithreaded server (for a socket problem) if I use
> >> a lock for every page, like this:
> >>
> >> I use webob...
> >>
> >
> > If your app isn't threadsafe, you should use a multiprocess server.
> > mod_wsgi has options for this, and flup has forking options (you'd
> > use flup behind Apache or another server).
>
> Yes, I could also use an async server. But I would like to identify
> (and reproduce) the exact problem.
> I also use a lot of cached data in my app. Anyway, I have to make it
> thread-safe...

Try http://www.aminus.net/wiki/PyConquer to help identify the problem.

Robert Brewer
fumanchu at aminus.org

From wilk at flibuste.net  Sat Jan 31 20:53:35 2009
From: wilk at flibuste.net (William Dode)
Date: Sat, 31 Jan 2009 19:53:35 +0000 (UTC)
Subject: [Web-SIG] how to test hunging socket ?
References:
Message-ID:

I think I finally managed to catch the error...

With wsgiref simple_server (and Apache mod_proxy), I run an app without
problems most of the time, at about 100000 hits/day. Sometimes, not every
day, the app freezes and I need to restart it manually. If I don't, the app
stays like that and never answers any more requests.

The traceback shows a lot of broken pipes. This traceback is repeated for
each request until the restart.

Traceback (most recent call last):
  File "/usr/lib/python2.5/wsgiref/handlers.py", line 93, in run
    self.finish_response()
  File "/usr/lib/python2.5/wsgiref/handlers.py", line 134, in finish_response
    self.write(data)
  File "/usr/lib/python2.5/wsgiref/handlers.py", line 217, in write
    self.send_headers()
  File "/usr/lib/python2.5/wsgiref/handlers.py", line 273, in send_headers
    self.send_preamble()
  File "/usr/lib/python2.5/wsgiref/handlers.py", line 199, in send_preamble
    'Date: %s\r\n' % format_date_time(time.time())
  File "/usr/lib/python2.5/socket.py", line 274, in write
    self.flush()
  File "/usr/lib/python2.5/socket.py", line 261, in flush
    self._sock.sendall(buffer)
error: (32, 'Broken pipe')

I know that wsgiref should not be used in production, but I'm surprised
that a broken pipe can freeze the whole app...

--
William Dodé - http://flibuste.net
Informaticien Indépendant

From wilk at flibuste.net  Sat Jan 31 23:19:45 2009
From: wilk at flibuste.net (William Dode)
Date: Sat, 31 Jan 2009 22:19:45 +0000 (UTC)
Subject: [Web-SIG] how to test hunging socket ?
References:
Message-ID:

On 31-01-2009, William Dode wrote:
> I think I finally managed to catch the error...
>
> With wsgiref simple_server (and Apache mod_proxy), I run an app without
> problems most of the time, at about 100000 hits/day. Sometimes, not every
> day, the app freezes and I need to restart it manually. If I don't, the app
> stays like that and never answers any more requests.
>
> The traceback shows a lot of broken pipes. This traceback is repeated for
> each request until the restart.

I should say that I use mod_proxy with a timeout.
So maybe the broken pipe is because of that...

--
William Dodé - http://flibuste.net
Informaticien Indépendant
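
On the question asked earlier in this thread, how to simulate a hanging
socket, one possible way is sketched below. It is an illustration under
stated assumptions, not a diagnosis of the freeze reported above: it uses
Python 3's standard library rather than the Python 2.5 setup from the
thread, a trivial stand-in app, and made-up helper names (start_server,
fetch, open_stalled_client). The idea is that a client which connects and
then sends nothing keeps the single-threaded wsgiref.simple_server waiting
for a request line, so other requests queue up behind it.

```python
import http.client
import socket
import threading
from wsgiref.simple_server import WSGIRequestHandler, make_server


class QuietHandler(WSGIRequestHandler):
    """Request handler that skips the per-request access log line."""

    def log_message(self, format, *args):
        pass


def app(environ, start_response):
    # Trivial WSGI app standing in for the real one.
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", "2")])
    return [b"ok"]


def start_server():
    """Serve `app` on a free localhost port in a background thread."""
    httpd = make_server("127.0.0.1", 0, app, handler_class=QuietHandler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd


def fetch(port, timeout=1.0):
    """Return the response body, or None if no answer arrives in time."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=timeout)
    try:
        conn.request("GET", "/")
        return conn.getresponse().read()
    except OSError:  # socket timeouts and connection errors
        return None
    finally:
        conn.close()


def open_stalled_client(port):
    """Connect but never send a request line: the simulated hung socket."""
    return socket.create_connection(("127.0.0.1", port))


if __name__ == "__main__":
    httpd = start_server()
    port = httpd.server_port
    print(fetch(port))       # normal request
    stalled = open_stalled_client(port)
    print(fetch(port))       # server is busy waiting on the stalled client
    stalled.close()
    print(fetch(port))       # served again once the stalled client is gone
    httpd.shutdown()
```

In this sketch the second fetch is expected to time out while the stalled
connection is open, and requests are answered again once it is closed. A
threaded or forking server would not serialize requests behind one stalled
connection in this way, which matches the multiprocess advice given
earlier in the thread.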