urllib equivalent for HTTP requests

 

 


kevinkhaw at gmail

Oct 7, 2008, 11:51 PM

Post #1 of 3 (808 views)
urllib equivalent for HTTP requests

Hello everyone,

I understand that urllib and urllib2 serve as really simple page
request libraries. I was wondering if there is a library out there
that can get the HTTP requests for a given page.

Example:
URL: http://www.google.com/test.html

Something like: urllib.urlopen('http://www.google.com/test.html').files()

Lists HTTP Requests attached to that URL:
=> http://www.google.com/test.html
=> http://www.google.com/css/google.css
=> http://www.google.com/js/js.css

The other fun part is the inclusion of JS within <script> tags, e.g.
the new Google Analytics script
=> http://www.google-analytics.com/ga.js

or css, @imports
=> http://www.google.com/css/import.css

I would like to keep track of that, but I realize that Python does not
have a JS engine. :( Does anyone have ideas on how to track these items,
or am I out of luck?

Thanks,
K
--
http://mail.python.org/mailman/listinfo/python-list


deets at nospam

Oct 8, 2008, 12:34 AM

Post #2 of 3 (736 views)
Re: urllib equivalent for HTTP requests

K wrote:
> Hello everyone,
>
> I understand that urllib and urllib2 serve as really simple page
> request libraries. I was wondering if there is a library out there
> that can get the HTTP requests for a given page.
>
> Example:
> URL: http://www.google.com/test.html
>
> Something like: urllib.urlopen('http://www.google.com/test.html').files()
>
> Lists HTTP Requests attached to that URL:
> => http://www.google.com/test.html
> => http://www.google.com/css/google.css
> => http://www.google.com/js/js.css


There are no "requests attached" to a URL. There is an HTML document
behind it, which might contain further external references.
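
To illustrate that the references live in the HTML document itself, here
is a minimal sketch using only the standard library's html.parser, written
for Python 3 (the urllib.urlopen call in the question is the Python 2
spelling). The names ReferenceCollector and list_references and the sample
markup are made up for this example, and only <link>, <script> and <img>
tags are examined; a real page may reference resources in other ways.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReferenceCollector(HTMLParser):
    """Collect URLs of external resources referenced by an HTML document."""

    # Tag name -> attribute holding the resource URL (an illustrative subset).
    URL_ATTRS = {"link": "href", "script": "src", "img": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative paths against the URL of the page.
                self.references.append(urljoin(self.base_url, value))


def list_references(html, base_url):
    """Return the absolute URLs referenced by the given HTML text."""
    collector = ReferenceCollector(base_url)
    collector.feed(html)
    return collector.references


# Hypothetical page content; in practice it would be fetched with urllib.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/css/google.css">
<script src="http://www.google-analytics.com/ga.js"></script>
</head><body><img src="logo.png"></body></html>
"""

for url in list_references(SAMPLE_HTML, "http://www.google.com/test.html"):
    print(url)
```

This only sees references written into the markup. Anything a script adds
to the page at run time never reaches the parser.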

> The other fun part is the inclusion of JS within <script> tags, i.e.
> the new Google Analytics script
> => http://www.google-analytics.com/ga.js
>
> or css, @imports
> => http://www.google.com/css/import.css
>
> I would like to keep track of that, but I realize that Python does not
> have a JS engine. :( Does anyone have ideas on how to track these items,
> or am I out of luck?

You can use, e.g., BeautifulSoup to extract all links from the page.

What you can't do, though, is get the requests that are issued by
JavaScript that is *running*.

Diez
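
For the CSS @import case raised in the question, a rough sketch of pulling
the imported URLs out of a stylesheet's text might look like the following.
It is regex-based, which is an assumption made for brevity: it does not
skip CSS comments or handle every legal syntax variation. The function name
css_imports and the sample stylesheet are hypothetical.

```python
import re
from urllib.parse import urljoin

# Matches the common forms:
#   @import "x.css";   @import 'x.css';
#   @import url(x.css);   @import url("x.css");
IMPORT_RE = re.compile(
    r"""@import\s+(?:url\(\s*)?["']?([^"')\s;]+)["']?""",
    re.IGNORECASE,
)


def css_imports(css_text, stylesheet_url):
    """Return absolute URLs of stylesheets imported by css_text."""
    # Relative imports are resolved against the importing stylesheet's URL.
    return [urljoin(stylesheet_url, target)
            for target in IMPORT_RE.findall(css_text)]


# Hypothetical stylesheet content for illustration.
SAMPLE_CSS = """
@import url("import.css");
@import 'base.css';
body { color: black; }
"""

for url in css_imports(SAMPLE_CSS, "http://www.google.com/css/google.css"):
    print(url)
```

Each imported stylesheet can contain its own @import rules, so the same
function would have to be applied again to whatever it returns. Requests
made by running JavaScript are outside what this can see.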


luke.leighton at googlemail

Oct 13, 2008, 1:53 AM

Post #3 of 3 (719 views)
Re: urllib equivalent for HTTP requests

On Oct 8, 7:34 am, "Diez B. Roggisch" <de...@nospam.web.de> wrote:

> > I would like to keep track of that but I realize that py does not have
> > a JS engine. :( Anyone with ideas on how to track these items or

yep.

> What you can't do though is to get the requests that are issued by Javascript that is *running*.

yes you can. see recent post i made just a few minutes ago giving
some insights into each of the available options.

look up pyv8; pykhtml; spidermonkey; webkit with the python bindings
to its glib bindings - pywebkitgtk - use my patched version and see
http://pyjd.sf.net to get precompiled versions; pyxpcomext and pydom
on developer.mozilla.org; webkit's DumpRenderTree with the --html
option, to name but a few.

there are _tons_ of options. they're just an absolute bastard to
track down, because javascript is such a popular keyword to search
for, the results are almost useless.

l.
