Mailing List Archive: Python: Bugs

[issue2776] urllib2.urlopen() gets confused with path with // in it

 

 


report at bugs

May 6, 2008, 2:30 PM

Post #1 of 5 (267 views)
[issue2776] urllib2.urlopen() gets confused with path with // in it

New submission from Ambarish Malpani <ambarish [at] yahoo>:

Try the following code:
import urllib
import urllib2

url = 'http://features.us.reuters.com//autos/news/95ED98EE-A837-11DC-BCB3-4F218271.html'

data = urllib.urlopen(url).read()
data2 = urllib2.urlopen(url).read()

The attempt to get it with urllib works fine. With urllib2, the request
is malformed and I get back an HTTP 404.

Request in the 2nd case is:
GET //autos/news/95ED98EE-A837-11DC-BCB3-4F218271.html HTTP/1.1\r\n
Accept-Encoding: identity\r\n
Host: autos\r\n
Connection: close\r\n
....

The host line seems to be looking for the last // rather than the first.

----------
components: Extension Modules
messages: 66334
nosy: ambarish
severity: normal
status: open
title: urllib2.urlopen() gets confused with path with // in it
type: behavior
versions: Python 2.5

__________________________________
Tracker <report [at] bugs>
<http://bugs.python.org/issue2776>
__________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

May 6, 2008, 2:33 PM

Post #2 of 5 (262 views)
[issue2776] urllib2.urlopen() gets confused with path with // in it [In reply to]

Ambarish Malpani <ambarish [at] yahoo> added the comment:

Sorry, should have added another line:
The reason this is important to fix is that I am getting that URL with
a // in a Moved (HTTP 302) message, so I can't just get rid of the //.



report at bugs

May 6, 2008, 4:22 PM

Post #3 of 5 (259 views)
[issue2776] urllib2.urlopen() gets confused with path with // in it [In reply to]

Martin McNickle <mmcnickle [at] gmail> added the comment:

The problem lines are in AbstractHTTPHandler.do_request_():

    scheme, sel = splittype(request.get_selector())
    sel_host, sel_path = splithost(sel)
    if not request.has_header('Host'):
        request.add_unredirected_header('Host', sel_host or host)

When there is a double '/' sel is something like '//path/to/resource'.
splithost(sel) then gives ('path', '/to/resource'). Therefore the
header 'Host' gets set to 'path'.
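
To make the splithost() behaviour concrete, here is a minimal sketch. It
uses a simplified stand-in for urllib's splithost() (the regular expression
is an assumption written for illustration, not a copy of the library
source) and applies it to the selector from the original report:

```python
import re

# Simplified stand-in for urllib's splithost(); an assumption for
# illustration. It treats everything between a leading '//' and the
# next '/' or '?' as the host.
_HOST_RE = re.compile(r'^//([^/?]*)(.*)$')


def splithost(url):
    """Split '//host[:port]/path' into ('host[:port]', '/path')."""
    match = _HOST_RE.match(url)
    if match:
        return match.group(1, 2)
    # No leading '//': there is no host part to extract.
    return None, url


# Selector from the report: the path itself starts with '//'.
selector = '//autos/news/95ED98EE-A837-11DC-BCB3-4F218271.html'
sel_host, sel_path = splithost(selector)
print(sel_host)   # first path segment, mistaken for a host
print(sel_path)   # remainder of the path
```

With a selector that starts with a single '/', the same function returns
None for the host, so 'sel_host or host' falls back to the real host.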

I don't understand why sel_host is used in preference to host. host
holds the correct value, even with the double slashes. Could someone
explain why sel_host is used at all?

----------
nosy: +BitTorment



report at bugs

May 6, 2008, 4:22 PM

Post #4 of 5 (267 views)
[issue2776] urllib2.urlopen() gets confused with path with // in it [In reply to]

Changes by Martin McNickle <mmcnickle [at] gmail>:


----------
components: +Library (Lib) -Extension Modules
versions: +Python 2.6, Python 3.0



report at bugs

May 10, 2008, 8:50 AM

Post #5 of 5 (241 views)
[issue2776] urllib2.urlopen() gets confused with path with // in it [In reply to]

Adrianna Pinska <adrianna.pinska [at] gmail> added the comment:

Ordinarily, request.get_selector() returns the portion of the url after
the host, and sel_host is None. However, if a proxy is set on the
request, the request's host is set to the proxy host, get_selector()
returns the original full url, and sel_host is the host from the
original url (and different to the host set on the request).

This bug is only triggered if the double slash comes directly after the
host and there is no proxy set on the request. do_request_ does not
check what get_selector() is returning, so the output is passed through
splithost even when this is not necessary, and in this particular case
it causes undesirable behaviour.

My patch causes do_request_ only to attempt to extract the host from
get_selector() if the proxy has been set.
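
The idea can be sketched roughly as follows. This is not the attached
patch itself: choose_host_header() and its has_proxy flag are hypothetical
names used only for illustration, the proxy address is made up, and
splittype()/splithost() are simplified stand-ins for the urllib helpers.

```python
import re

# Simplified stand-ins for urllib's splittype()/splithost();
# assumptions for illustration only.
_TYPE_RE = re.compile(r'^([^/:]+):')
_HOST_RE = re.compile(r'^//([^/?]*)(.*)$')


def splittype(url):
    """Split 'scheme:rest' into ('scheme', 'rest')."""
    match = _TYPE_RE.match(url)
    if match:
        scheme = match.group(1)
        return scheme.lower(), url[len(scheme) + 1:]
    return None, url


def splithost(url):
    """Split '//host/path' into ('host', '/path')."""
    match = _HOST_RE.match(url)
    if match:
        return match.group(1, 2)
    return None, url


def choose_host_header(host, selector, has_proxy):
    """Hypothetical helper: pick the value for the Host header.

    Only parse the selector when a proxy is set, because only then does
    the selector hold the full original URL.
    """
    if has_proxy:
        scheme, sel = splittype(selector)
        sel_host, sel_path = splithost(sel)
        return sel_host or host
    return host


# No proxy: the selector is just the path, so it is never parsed.
print(choose_host_header('features.us.reuters.com',
                         '//autos/news/x.html', False))

# Proxy set: the request host is the proxy, the selector is the full URL.
print(choose_host_header('proxy.example:3128',
                         'http://features.us.reuters.com//autos/news/x.html',
                         True))
```

In both calls the Host value comes out as the origin server's name, which
is what the unpatched code fails to do in the no-proxy case.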

----------
keywords: +patch
nosy: +confluence
Added file: http://bugs.python.org/file10255/double_slash_after_host.patch

