Web Surfing with Python: Creating Simple Web Clients

One thing to keep in mind is that a browser is only one type of Web client. Any application that makes a request for data from a Web server is considered a "client." Yes, it is possible to create other clients that retrieve documents or data from the Internet. One important reason to do this is that a browser provides only limited capability, i.e., it is used primarily for viewing and interacting with Web sites. A client program, on the other hand, can do much more: it can not only download data, but can also store it, manipulate it, or perhaps even transmit it to another location or application. Any application that uses the urllib module to download or access information on the Web [using either urllib.urlopen() or urllib.urlretrieve()] can be considered a simple Web client. All you need to do is provide a valid Web address.

Uniform Resource Locators

Simple Web surfing involves using Web addresses called Uniform Resource Locators (URLs). Such addresses are used to locate a document on the Web or to call a CGI program to generate a document for your client. URLs are part of a larger set of identifiers known as URIs (Uniform Resource Identifiers). This superset was created in anticipation of other naming conventions that have yet to be developed. A URL is simply a URI that uses an existing protocol or scheme (i.e., http, ftp, etc.) as part of its addressing. To complete this picture, we'll add that non-URL URIs are sometimes known as URNs (Uniform Resource Names), but because URLs are the only URIs in use today, you really don't hear much about URIs or URNs.

Like street addresses, Web addresses have some structure. An American street address is usually of the form "number street designation," i.e., 123 Main Street. Other countries have their own conventions. A URL has the format:

prot_sch://net_loc/path;params?query#frag

Table 19.1 describes each of the components.
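To make these pieces concrete, here is a hypothetical URL containing all six components, pulled apart with the urlparse module (introduced in detail below); the host and field names are invented purely for illustration:

>>> import urlparse
>>> urlparse.urlparse('http://host/a/b.html;ver=1.0?q=py#top')
('http', 'host', '/a/b.html', 'ver=1.0', 'q=py', 'top')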
net_loc can be broken down into several more components, some required, others optional. The net_loc string looks like this:

user:passwd@host:port

These individual components are described in Table 19.2.
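Note that urlparse (described below) returns net_loc as a single string, so if you need the individual pieces, you have to take it apart yourself. Here is a minimal sketch of such a helper; split_net_loc is our own function, not part of urllib or urlparse, and the sample address is made up:

def split_net_loc(net_loc):
    # split 'user:passwd@host:port' into its four parts;
    # any missing pieces come back as empty strings
    user = passwd = port = ''
    if '@' in net_loc:
        auth, net_loc = net_loc.split('@', 1)
        if ':' in auth:
            user, passwd = auth.split(':', 1)
        else:
            user = auth
    if ':' in net_loc:
        host, port = net_loc.split(':', 1)
    else:
        host = net_loc
    return user, passwd, host, port

# e.g., split_net_loc('anonymous:guest@ftp.python.org:21')
# returns ('anonymous', 'guest', 'ftp.python.org', '21')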
Of the four, the host name is the most important. The port number is necessary only if the Web server is running on a port other than the default. (If you aren't sure what a port number is, go back to Chapter 16.) User names and perhaps passwords are used only when making FTP connections, and even then they usually aren't necessary because the majority of such connections are "anonymous."

Python supplies two different modules, urlparse and urllib, which deal with URLs in completely different ways and capacities. We will briefly introduce some of their functions here.

urlparse Module

The urlparse module provides basic functionality with which to manipulate URL strings. These functions include urlparse(), urlunparse(), and urljoin().

urlparse.urlparse()

urlparse() breaks up a URL string into some of the major components described above and has the following syntax:

urlparse(urlstr, defProtSch=None, allowFrag=None)

urlparse() parses urlstr into a 6-tuple (prot_sch, net_loc, path, params, query, frag). Each of these components has been described above. defProtSch indicates a default network protocol or download scheme in case one is not provided in urlstr. allowFrag is a flag that signals whether or not a fragment part of a URL is allowed. Here is what urlparse() outputs when given a URL:

>>> urlparse.urlparse('http://www.python.org/doc/FAQ.html')
('http', 'www.python.org', '/doc/FAQ.html', '', '', '')

urlparse.urlunparse()

urlunparse() does the exact opposite of urlparse(): it merges a 6-tuple urltup of the form (prot_sch, net_loc, path, params, query, frag), which could be the output of urlparse(), into a single URL string and returns it. Accordingly, we state the following equivalence:

urlunparse(urlparse(urlstr)) ≡ urlstr

You may have already surmised that the syntax of urlunparse() is as follows:

urlunparse(urltup)

urlparse.urljoin()

The urljoin() function is useful in cases where many related URLs are needed, for example, the URLs for a set of pages to be generated for a Web site. The syntax for urljoin() is:

urljoin(baseurl, newurl, allowFrag=None)

urljoin() takes baseurl and joins its base path (net_loc plus the full path up to, but not including, a file at the end) with newurl. For example:

>>> urlparse.urljoin('http://www.python.org/doc/FAQ.html',
...     'current/lib/lib.html')
'http://www.python.org/doc/current/lib/lib.html'

A summary of the functions in urlparse can be found in Table 19.3.
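As a quick check of the round-trip equivalence stated above, here is a short interactive session using the same URL as the earlier examples:

>>> import urlparse
>>> tup = urlparse.urlparse('http://www.python.org/doc/FAQ.html')
>>> urlparse.urlunparse(tup)
'http://www.python.org/doc/FAQ.html'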
urllib Module

NOTE: Unless you are planning on writing a lower-level network client, the urllib module provides all the functionality you need.

urllib provides a high-level Web communication library, supporting the basic Web protocols HTTP, FTP, and Gopher, as well as providing access to local files. Specifically, the functions of the urllib module are designed to download data (from the Internet, a local network, or the local host) using the aforementioned protocols. Use of this module generally obviates the need for the httplib, ftplib, and gopherlib modules unless you desire their lower-level functionality. In those cases, such modules can be considered as alternatives. (Note: Most modules named *lib are generally for developing clients of the corresponding protocols. This is not always the case, however, as perhaps urllib should then be renamed "internetlib" or something similar!)

The urllib module provides functions to download data from given URLs as well as to encode and decode strings so that they are suitable for inclusion in valid URL strings. The functions we will be looking at in this section include: urlopen(), urlretrieve(), quote(), quote_plus(), unquote(), unquote_plus(), and urlencode(). We will also look at some of the methods available on the file-like object returned by urlopen().

urllib.urlopen()

urlopen() opens a Web connection to the given URL string and returns a file-like object. It has the following syntax:

urlopen(urlstr, postQueryData=None)

urlopen() opens the URL pointed to by urlstr. If no protocol or download scheme is given, or if a "file" scheme is passed in, urlopen() will open a local file.

For all HTTP requests, the normal request type is "GET." In these cases, the query string provided to the Web server (key-value pairs encoded or "quoted," such as the string output of the urlencode() function [see below]) should be given as part of urlstr. If the "POST" request method is desired, then the query string (again encoded) should be placed in the postQueryData variable. (For more information regarding the GET and POST request methods, refer to any general documentation or texts on programming CGI applications, which we will also discuss below.) GET and POST requests are the two ways to "upload" data to a Web server.

When a successful connection is made, urlopen() returns a file-like object as if the destination were a file opened in read mode. If our file object is f, for example, then our "handle" would support the expected read methods such as f.read(), f.readline(), f.readlines(), f.close(), and f.fileno(). In addition, an f.info() method is available which returns the MIME (Multipurpose Internet Mail Extension) headers. Such headers give the browser information regarding which application can view returned file types. For example, the browser itself can view HTML (Hypertext Markup Language) and plain text files, as well as GIF (Graphics Interchange Format) and JPEG (Joint Photographic Experts Group) graphics files. Other files, such as multimedia or specific document types, require external applications in order to view. Finally, a geturl() method exists to obtain the true URL of the final opened destination, taking into consideration any redirection which may have occurred. A summary of these file-like object methods is given in Table 19.4.
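Here is a brief sketch putting urlopen() and these file-object methods together; it assumes a live network connection to www.python.org:

import urllib

f = urllib.urlopen('http://www.python.org')   # returns a file-like object
data = f.read()        # entire document as a single string
print f.geturl()       # final URL, after any redirection
print f.info()         # MIME headers sent by the server
f.close()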
urllib.urlretrieve()

urlretrieve() will do some quick and dirty work for you if you are interested in working with a URL document as a whole. Here is the syntax for urlretrieve():

urlretrieve(urlstr, localfile=None, downloadStatusHook=None)

Rather than reading from the URL as urlopen() does, urlretrieve() will simply download the entire HTML file located at urlstr to your local disk. It will store the downloaded data in localfile if given, or a temporary file if not. If the file has already been copied from the Internet, or if the file is local, no subsequent downloading will occur. The downloadStatusHook, if provided, is a function that is called after each block of data has been downloaded and delivered. It is called with three arguments: the number of blocks read so far, the block size in bytes, and the total (byte) size of the file. This is very useful if you are implementing "download status" information for the user in a text-based or graphical display.

urlretrieve() returns a 2-tuple, (filename, mime_hdrs). filename is the name of the local file containing the downloaded data. mime_hdrs is the set of MIME headers returned by the responding Web server. For more information, see the Message class of the mimetools module. mime_hdrs is None for local files. For an example using urlretrieve(), take a look at Example 11.2 (grabweb.py).

urllib.quote() and urllib.quote_plus()

The quote*() functions take URL data and "encode" it so that it is "fit" for inclusion as part of a URL string. In particular, certain special characters that are unprintable, or that cannot be part of valid URLs acceptable to a Web server, must be converted. This is what the quote*() functions do for you. Both quote*() functions have the following syntax:

quote(urldata, safe='/')

Characters that are never converted include commas, underscores, periods, and dashes, as well as alphanumerics. All others are subject to conversion. In particular, the disallowed characters are changed to their hexadecimal ordinal equivalents prepended with a percent sign ( % ), i.e., "%xx" where "xx" is the hexadecimal representation of a character's ASCII value. When calling quote*(), the urldata string is converted to an equivalent string that can be part of a URL string. The safe string should contain a set of characters which should also not be converted; the default is the slash ( / ).

quote_plus() is similar to quote() except that it also encodes spaces as plus signs ( + ). Here is an example using quote() vs. quote_plus():

>>> name = 'joe mama'
>>> number = 6
>>> base = 'http://www/~foo/cgi-bin/s.py'
>>> final = '%s?name=%s&num=%d' % (base, name, number)
>>> final
'http://www/~foo/cgi-bin/s.py?name=joe mama&num=6'
>>>
>>> urllib.quote(final)
'http%3a//www/%7efoo/cgi-bin/s.py%3fname%3djoe%20mama%26num%3d6'
>>>
>>> urllib.quote_plus(final)
'http%3a//www/%7efoo/cgi-bin/s.py%3fname%3djoe+mama%26num%3d6'

urllib.unquote() and urllib.unquote_plus()

As you have probably guessed, the unquote*() functions do the exact opposite of the quote*() functions: they convert all characters encoded in the "%xx" fashion back to their ASCII equivalents. The syntax of unquote*() is as follows:

unquote*(urldata)

Calling unquote() will decode all URL-encoded characters in urldata and return the resulting string. unquote_plus() will also convert plus signs back to space characters.
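To illustrate the downloadStatusHook, here is a minimal sketch; the report_progress function, the target URL, and the local filename are our own choices for illustration, not anything mandated by urllib:

import urllib

def report_progress(blocks_read, block_size, total_size):
    # invoked by urlretrieve() after each block is delivered;
    # some servers do not report a size, in which case
    # total_size may not be a usable positive number
    if total_size > 0:
        received = min(blocks_read * block_size, total_size)
        print '%d of %d bytes received' % (received, total_size)
    else:
        print '%d blocks received' % blocks_read

fname, hdrs = urllib.urlretrieve('http://www.python.org/',
                    'pyhome.html', report_progress)
print 'saved as', fname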
urllib.urlencode()

urlencode(), added to Python relatively recently (as of version 1.5.2), takes a dictionary of key-value pairs and encodes them for inclusion as part of a query in a CGI request URL string. The pairs are put in "key=value" format and delimited by ampersands ( & ). Furthermore, the keys and their values are run through quote_plus() for proper encoding. Here is an example output from urlencode():

>>> aDict = { 'name': 'Georgina Garcia', 'hmdir': '~ggarcia' }
>>> urllib.urlencode(aDict)
'name=Georgina+Garcia&hmdir=%7eggarcia'

There are other functions in urllib and urlparse that we did not have the opportunity to cover here. Refer to the documentation for more information.

Secure Socket Layer Support

The urllib module has been modified for Python 1.6 so that it now supports opening HTTP connections using the Secure Socket Layer (SSL). The core change to add SSL is implemented in the socket module. Consequently, the urllib and httplib modules were updated to support URLs using the "https" connection scheme. Note, however, that as of the time of publication, only HTTP requests using SSL have been implemented. The future may see additional updates to the other protocols supported by the urllib module, such as FTP.

A summary of the urllib functions discussed in this section can be found in Table 19.5.
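Putting urlencode() together with urlopen(), here is a sketch of issuing the same request via GET and via POST; the CGI script address is the hypothetical one used earlier in this section:

import urllib

params = urllib.urlencode({'name': 'joe mama', 'num': 6})

# GET: the encoded pairs become part of the URL itself
f = urllib.urlopen('http://www/~foo/cgi-bin/s.py?' + params)

# POST: the encoded pairs are passed as postQueryData instead
f = urllib.urlopen('http://www/~foo/cgi-bin/s.py', params)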