The urlparse module

This module contains functions to process uniform resource locators (URLs), and to convert between URLs and platform-specific filenames.

 
Example: Using the urlparse module
# File: urlparse-example-1.py

import urlparse

print urlparse.urlparse("http://host/path;params?query#fragment")

('http', 'host', '/path', 'params', 'query', 'fragment')

A common use is to split an HTTP URLs into host and path components (an HTTP request involves asking the host to return data identified by the path):

 
Example: Using the urlparse module to parse HTTP locators
# File: urlparse-example-2.py

import urlparse

scheme, host, path, params, query, fragment =\
        urlparse.urlparse("http://host/path;params?query#fragment")

if scheme == "http":
    print "host", "=>", host
    if params:
        path = path + ";" + params
    if query:
        path = path + "?" + query
    print "path", "=>", path

host => host
path => /path;params?query

Alternatively, you can use the urlunparse function to put the URL back together again:

 
Example: Using the urlparse module to parse HTTP locators
# File: urlparse-example-3.py

import urlparse

scheme, host, path, params, query, fragment =\
        urlparse.urlparse("http://host/path;params?query#fragment")

if scheme == "http":
    print "host", "=>", host
    print "path", "=>", urlparse.urlunparse((None, None, path, params, query, None))

host => host
path => /path;params?query

The urljoin function is used to combine an absolute URL with a second, possibly relative URL:

Example: Using the urlparse module to combine relative locators
# File: urlparse-example-4.py

import urlparse

base = "http://spam.egg/my/little/pony"

for path in "/index", "goldfish", "../black/cat":
    print path, "=>", urlparse.urljoin(base, path)

/index => http://spam.egg/index
goldfish => http://spam.egg/my/little/goldfish
../black/cat => http://spam.egg/my/black/cat
 

[comment on/vote for this article]

Comment:

Really nifty module, thanks. I'm rolling my own URIs in addition to the standard http, ftp and file ones and one feature that would be really useful would be the ability to add and/or modify how specific schemes are parsed. To that end, insert the following code somewhere in urlparse.py

  def edit_protocol(scheme, relative=1, netloc=1, hierarchical=1, params=1, query=1, fragment=1):
    """Add, remove or update a specific scheme, and how it's handled"""
    clear_cache()
    
    global uses_relative, uses_netloc, non_hierarchical
    global uses_params, uses_query, uses_fragment
    
    if relative == 1:    
        uses_relative.append(scheme)
    elif relative == 0:
        uses_relative = [x for x in uses_relative if x != scheme]

    if netloc == 1:      
        uses_netloc.append(scheme)
    elif netloc == 0:
        uses_netloc = [x for x in uses_netloc if x != scheme]
    
    if hierarchical == 0: 
        non_hierarchical.append(scheme) 
    elif hierarchical == 1:
        non_hierarchical= [x for x in non_hierarchical if x != scheme]
        
    if params == 1:      
        uses_params.append(scheme)
    elif params == 0:
        uses_params = [x for x in uses_params if x != scheme]
        
    if query == 1:       
        uses_query.append(scheme)
    elif query == 0:
        uses_query = [x for x in uses_query if x != scheme]    
    
    if fragment == 1:    
        uses_fragment.append(scheme)
    elif fragment == 0:
        uses_fragment = [x for x in uses_fragment if x != scheme]

use it like so:

  from urlparse import urlparse, edit_protocol

  uris = [
        'data://tabledata/data/XML',
        'http://127.0.0.1/hello.html',
        'file:///C:/windows/system_file.txt',        
        ]

  for uri in uris:
    print urlparse(uri)

  edit_protocol("data", 1,1,1,1,1,1)

(or 

  edit_protocol("data",  relative=1, 
                       netloc=1, 
                       hierarchical=1, 
                       params=1, 
                       query=1, 
                       fragment=1)
)

for uri in uris:
    print urlparse(uri)

Output:

  ('data', '', '//tabledata/data/XML', '', '', '')
  ...
  ('data', 'tabledata', '/data/XML', '', '', '')

Posted by Robin Macharg (2007-03-01)

A Django site. this page was rendered by a django application in 0.03s 2010-09-02 14:50:01.958425. hosted by webfaction.