The Python Web services developer: The real world, Part 1

The Google Web APIs



Level: Introductory

Scott Archer, Software Architect, GlowingOrb, Inc.
Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.

14 Oct 2003

This column has covered the major Python APIs available for Web services processing, demonstrating basic facilities and approaches through the use of simple clients and servers. All of this has laid the groundwork for utilizing 'real-world' Web services. The authors will now apply their tools and understanding to several real-world Web services applications. Their focus here is on the Google Web APIs -- to which they will connect over SOAP so that they can programmatically search the Web and fetch cached Web pages.

Setting up

For the examples below, we'll be using Python version 2.2 and SOAPpy version 0.10.2. Other versions of SOAPpy may work with the Google API, although you would need to adapt the examples listed here to match the recent changes to imports and method signatures in SOAPpy. Installation of SOAPpy is simple: untar or unzip the package (use WinZip or Cygwin on Windows), then use the standard Python package installation tools (the 'install' step is typically run as root). See Listing 1. See also the Resources section at the end of the column for download information.


Listing 1. Using the standard Python package installation tools
[scott@daedalus SOAPpy-0.10.2]$ python setup.py build
running build
...
[root@daedalus SOAPpy-0.10.2]# python setup.py install
running install
...





The Google API

In order to use Google's API, you must first register with Google to obtain a license key. You'll need to give a valid e-mail address, to which your key will be sent. This key costs nothing and gets you up to 1,000 queries per day. At the time of this writing, Google's API service is in beta, and no plans for commercial (free or otherwise) availability have been announced. For details on valid uses and limitations, read the license file that comes with the API download or the terms on the Google Web site (see Resources).

Download the Google API toolkit. It doesn't contain any Python, but there are good encapsulation libraries and examples for Java and .NET (in C# and Visual Basic). The parts of the toolkit download that are of immediate use to us are the GoogleSearch WSDL file and a Google API reference guide in HTML. As you can see from the WSDL in Listing 2, the Google API provides three operations:

  1. doGoogleSearch( ), for performing Web queries against Google (the service doesn't support searching images or groups at this time)
  2. doGetCachedPage( ), which will retrieve Web pages that are cached by Google as its Web-crawlers encounter them
  3. doSpellingSuggest( ), which will return a corrected spelling for a list of terms submitted.

Also note that the returned GoogleSearchResult is a complex type.





Let's go

For our first venture into the land of Google, we're going to write a Python client that directly constructs the SOAP request as an XML string and sends it as an HTTP message using Python's httplib module (see Listing 3). Our search will be for the two terms "spotted owl", requesting up to 10 result items ('maxResults') starting with the first ('start', zero-indexed) -- and we will not use any of the filters or restrictions. It is possible to restrict the search to specific languages ('lr') and to filter adult content from the results ('safeSearch'). See the Google Web APIs reference document for details on these other features (see Resources). Here you can begin to see the simplicity of connecting to the service.


Listing 3. Direct XML access to the API using Python's httplib
import sys, httplib 
 
_post = '/search/beta2'
_host = 'api.google.com'
_port = 80 

# envelope_template is a simple string template that matches the required
# Google API SOAP envelope as described in the WSDL
envelope_template = """<SOAP-ENV:Envelope 
 xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" 
 xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" 
 xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <SOAP-ENV:Body>
    <ns1:doGoogleSearch xmlns:ns1="urn:GoogleSearch" 
         SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
      <key xsi:type="xsd:string">%s</key>
      <q xsi:type="xsd:string">%s</q>
      <start xsi:type="xsd:int">%d</start>
      <maxResults xsi:type="xsd:int">%d</maxResults>
      <filter xsi:type="xsd:boolean">%s</filter>
      <restrict xsi:type="xsd:string">%s</restrict>
      <safeSearch xsi:type="xsd:boolean">%s</safeSearch>
      <lr xsi:type="xsd:string">%s</lr>
      <ie xsi:type="xsd:string"></ie>
      <oe xsi:type="xsd:string"></oe>
    </ns1:doGoogleSearch>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>""" 
 
# Google search options - for populating the envelope-template
_license_key = 'INSERT YOUR KEY HERE' 
_query = 'spotted owl'
_start = 0
_maxResults = 10
_filter = 'false'
_restrict = ''
_safeSearch = 'false'
_lang_restrict = ''

# populate the outbound SOAP envelope
envelope = envelope_template%( _license_key, _query,
                               _start, _maxResults,
                               _filter, _restrict, 
                               _safeSearch, _lang_restrict ) 

# now, we open an HTTP connection, set required headers, and send the SOAP envelope
envlen = len(envelope) 
http_conn = httplib.HTTP(_host, _port) 
http_conn.putrequest('POST', _post) 
http_conn.putheader('Host', _host) 
http_conn.putheader('Content-Type', 'text/xml; charset="utf-8"') 
http_conn.putheader('Content-Length', str(envlen)) 
http_conn.putheader('SOAPAction', '') 
http_conn.endheaders() 
http_conn.send(envelope) 

# fetch HTTP reply headers and the response
(status_code, message, reply_headers) = http_conn.getreply() 
response = http_conn.getfile().read() 

# dump raw xml; after getreply(), http_conn.headers holds the reply headers,
# so only the request body is shown for the outbound side
print "----------------------------------------"
print "send body:\n", envelope
print "----------------------------------------"
print "   status code: ", status_code 
print "status message: ", message 
print " reply headers:\n", reply_headers
print "----------------------------------------"
print "response body:\n", response

Comments in the source code in Listing 3 break the functionality into several steps. First, after some initial imports, we create a string that contains a template for a SOAP envelope that matches the structure defined in the WSDL from Listing 2. Next, for clarity, we set some variables that populate envelope_template using Python's string formatting. We then open an HTTP connection for a POST request, pass the normal headers, and send the request envelope. Finally, we read the response and dump the resulting XML to the screen. You can see in Listing 4 the XML SOAP response from the GoogleSearch service.
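
One caveat about the template approach before looking at the response: the query text is substituted verbatim, so a search containing characters such as & or < would yield malformed XML. The following is a minimal sketch of one way to guard against that, assuming the standard library's xml.sax.saxutils.escape is acceptable; the shortened template and the fill_envelope helper are hypothetical names used only for illustration and are not part of Listing 3.

```python
from xml.sax.saxutils import escape

# Shortened stand-in for envelope_template from Listing 3 (illustration only;
# a fragment, not a complete SOAP envelope)
mini_template = """<ns1:doGoogleSearch xmlns:ns1="urn:GoogleSearch">
  <key xsi:type="xsd:string">%s</key>
  <q xsi:type="xsd:string">%s</q>
</ns1:doGoogleSearch>"""

def fill_envelope(template, license_key, query):
    # escape() replaces &, < and > with XML entities so that user text
    # cannot be mistaken for markup
    return template % (escape(license_key), escape(query))

body = fill_envelope(mini_template, 'INSERT YOUR KEY HERE',
                     'owls & "old growth" <forest>')
```

The same escaping would apply to any other string parameter that comes from user input, such as 'restrict' or 'lr'.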


Listing 4. Partial output of Listing 3, showing only the first of ten result items (indentation added)
<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope 
 xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" 
 xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" 
 xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <SOAP-ENV:Body>
    <ns1:doGoogleSearchResponse xmlns:ns1="urn:GoogleSearch" 
      SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
      <return xsi:type="ns1:GoogleSearchResult">
        <documentFiltering xsi:type="xsd:boolean">false</documentFiltering>
        <estimatedTotalResultsCount xsi:type="xsd:int">
                 117000
        </estimatedTotalResultsCount>
        <directoryCategories 
         xmlns:ns2="http://schemas.xmlsoap.org/soap/encoding/" 
         xsi:type="ns2:Array" ns2:arrayType="ns1:DirectoryCategory[1]">
            <item xsi:type="ns1:DirectoryCategory">
              <specialEncoding xsi:type="xsd:string"></specialEncoding>
              <fullViewableName xsi:type="xsd:string">
              Top/Regional/North_America/United_States/Oregon/Localities/S/Sweet_Home
              </fullViewableName>
            </item>
        </directoryCategories>
        <searchTime xsi:type="xsd:double">0.074759</searchTime>
        <resultElements
         xmlns:ns3="http://schemas.xmlsoap.org/soap/encoding/" 
         xsi:type="ns3:Array" ns3:arrayType="ns1:ResultElement[10]">
            <item xsi:type="ns1:ResultElement">
              <cachedSize xsi:type="xsd:string">2k</cachedSize>
              <hostName xsi:type="xsd:string"></hostName>
              <snippet xsi:type="xsd:string">
                 A presentation of bird photographs, songs, identification tips, 
                 distribution maps, and life history information for North American 
                 birds, and a forum for <b>...</b> 
               </snippet>
               <directoryCategory xsi:type="ns1:DirectoryCategory">
                 <specialEncoding xsi:type="xsd:string"></specialEncoding>
                 <fullViewableName xsi:type="xsd:string"></fullViewableName>
               </directoryCategory>
               <relatedInformationPresent xsi:type="xsd:boolean">
                    true
               </relatedInformationPresent>
               <directoryTitle xsi:type="xsd:string"></directoryTitle>
               <summary xsi:type="xsd:string"></summary>
               <URL xsi:type="xsd:string">
               http://www.mbr-pwrc.usgs.gov/id/framlst/i3690id.html
               </URL>
               <title xsi:type="xsd:string">
                  <b>Spotted</b> <b>Owl</b>
               </title>
            </item>

                    ...

          </resultElements>
          <endIndex xsi:type="xsd:int">10</endIndex>
          <searchTips xsi:type="xsd:string"></searchTips>
          <searchComments xsi:type="xsd:string"></searchComments>
          <startIndex xsi:type="xsd:int">1</startIndex>
          <estimateIsExact xsi:type="xsd:boolean">false</estimateIsExact>
          <searchQuery xsi:type="xsd:string">spotted owl</searchQuery>
       </return>
     </ns1:doGoogleSearchResponse>
   </SOAP-ENV:Body>
  </SOAP-ENV:Envelope>
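
To do something useful with a raw response like this one, the XML has to be parsed. The sketch below shows one possible way to pull the estimated count and the result URLs out of such a response, assuming the standard library's xml.dom.minidom; the sample string is a trimmed stand-in for Listing 4, and the names text_of and summarize are hypothetical.

```python
from xml.dom import minidom

# Trimmed stand-in for the response in Listing 4 (one result item only)
sample_response = """<SOAP-ENV:Envelope
 xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
 xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
 xmlns:xsd="http://www.w3.org/1999/XMLSchema">
  <SOAP-ENV:Body>
    <ns1:doGoogleSearchResponse xmlns:ns1="urn:GoogleSearch">
      <return xsi:type="ns1:GoogleSearchResult">
        <estimatedTotalResultsCount xsi:type="xsd:int">
                 117000
        </estimatedTotalResultsCount>
        <resultElements>
          <item xsi:type="ns1:ResultElement">
            <URL xsi:type="xsd:string">
            http://www.mbr-pwrc.usgs.gov/id/framlst/i3690id.html
            </URL>
          </item>
        </resultElements>
      </return>
    </ns1:doGoogleSearchResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

def text_of(node):
    # Join the text nodes directly under an element and trim the
    # indentation whitespace around the value
    parts = [child.data for child in node.childNodes
             if child.nodeType == child.TEXT_NODE]
    return ''.join(parts).strip()

def summarize(soap_xml):
    doc = minidom.parseString(soap_xml)
    count_node = doc.getElementsByTagName('estimatedTotalResultsCount')[0]
    urls = []
    for item in doc.getElementsByTagName('item'):
        # 'item' elements without a URL child contribute nothing
        for url in item.getElementsByTagName('URL'):
            urls.append(text_of(url))
    return int(text_of(count_node)), urls

count, urls = summarize(sample_response)
```

Matching on unprefixed tag names works here because the result elements in the response carry no namespace prefix; a more defensive client would also check for a SOAP fault before reading results.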





There's a better way

Despite the relative simplicity of directly constructing SOAP requests using XML and subsequently parsing the XML SOAP responses, there are several higher-level libraries that offer much more powerful and intuitive interfaces. Several of these have been explored in previous issues of this column. Although the Google Web API is simple enough that any SOAP library supporting complex types will probably work, we'll use the SOAPpy library for Python to talk with the Google Web Service.

In this approach, we use SOAPProxy to create a proxy for the Web service (see Listing 5). Through this SOAP proxy, we can directly call the exposed methods on the Google Web APIs -- doGoogleSearch( ) in this case. The SOAPpy library transparently handles the marshalling of the string and integer parameters (key, q, start, maxResults, restrict, lr, ie, and oe), but we must explicitly marshall the boolean values (filter and safeSearch). Note that according to the Google Web APIs documentation, the last two parameters (ie and oe) are no longer used: they must still be supplied, but their values are ignored. Upon successfully calling the remote method, SOAPProxy automatically parses the SOAP response and creates a Python object that matches the structure described in the WSDL. The results are then accessed directly as Python attributes of that object.

Additionally, SOAPpy has a debug mode that can be turned on by changing the value of self.debug to 1 on line 68 (in the 0.10.2 version of SOAPpy) of Config.py in the SOAPpy installation (on *NIXes, this is located at: /usr/lib/python2.2/site-packages/SOAPpy). Unfortunately there's no other way to turn this feature on. With debug mode enabled, all HTTP headers and raw SOAP XML exchanged during invocation are dumped to stdout.


Listing 5. Using SOAPpy to access the Google API
from SOAPpy import SOAPProxy
from SOAPpy import Types

# CONSTANTS
_url = 'http://api.google.com/search/beta2'
_namespace = 'urn:GoogleSearch'

# need to marshall into SOAP types
SOAP_FALSE = Types.booleanType(0)
SOAP_TRUE = Types.booleanType(1)

# create SOAP proxy object
google = SOAPProxy(_url, _namespace)

# Google search options
_license_key = 'INSERT YOUR KEY HERE' 
_query = 'spotted owl'
_start = 0
_maxResults = 10
_filter = SOAP_FALSE
_restrict = ''
_safeSearch = SOAP_FALSE
_lang_restrict = ''

# call search method over SOAP proxy
results = google.doGoogleSearch( _license_key, _query, 
                                 _start, _maxResults, 
                                 _filter, _restrict,
                                 _safeSearch, _lang_restrict, '', '' )
           
# display results
print 'google search for  " ' + _query + ' "\n'
print 'estimated result count: ' + str(results.estimatedTotalResultsCount)
print '           search time: ' + str(results.searchTime) + '\n'
print 'results ' + str(_start + 1) + ' - ' + str(_start + _maxResults) +':\n'
                                                       
numresults = len(results.resultElements)
for i in range(numresults):
    title = results.resultElements[i].title
    noh_title = title.replace('<b>', '').replace('</b>', '')
    print 'title: ' + noh_title
    print '  url: ' + results.resultElements[i].URL + '\n'

This time, instead of explicitly creating a SOAP envelope and sending it over HTTP, we simply used the SOAPProxy object from SOAPpy. Given the callable URL and appropriate namespace for a Web service, SOAPpy uses Python's dynamic binding to marshall the call parameters and even method name, doGoogleSearch( ), into a remote procedure call on the proxied service object. The response is then accessible as an aggregate object that matches the GoogleSearchResult structure as specified in the WSDL. In displaying the results, we directly access the properties of the results object.
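
The dynamic binding mentioned above can be illustrated with a toy class. This is only a sketch of the general technique (intercepting unknown attribute names with __getattr__), not SOAPpy's actual implementation; the ToyProxy class, its _build_call method, and the arg0/arg1 element names are made up for illustration, and the toy returns the XML fragment instead of sending it.

```python
class ToyProxy:
    """Toy stand-in for a SOAP proxy; NOT how SOAPpy is implemented."""

    def __init__(self, namespace):
        self.namespace = namespace

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so any unknown
        # attribute name is treated as the name of a remote method
        if name.startswith('__'):
            raise AttributeError(name)

        def remote_call(*args):
            return self._build_call(name, args)
        return remote_call

    def _build_call(self, method, args):
        # Marshal each positional argument into a typed XML element
        lines = ['<ns1:%s xmlns:ns1="%s">' % (method, self.namespace)]
        for position, value in enumerate(args):
            if isinstance(value, int):
                xsd_type = 'xsd:int'
            else:
                xsd_type = 'xsd:string'
            lines.append('<arg%d xsi:type="%s">%s</arg%d>'
                         % (position, xsd_type, value, position))
        lines.append('</ns1:%s>' % method)
        return '\n'.join(lines)

proxy = ToyProxy('urn:GoogleSearch')
fragment = proxy.doGoogleSearch('spotted owl', 10)
```

The class never defines doGoogleSearch; the method name is captured at call time and written into the request, which is why a proxy of this kind needs no generated stubs.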


Listing 6. Output of Listing 5
[scott@daedalus google_test]$ python google_test.py
google search for  " spotted owl "
estimated result count: 117000
           search time: 0.070122
results 1 - 10:
title: Spotted Owl
  url: http://www.mbr-pwrc.usgs.gov/id/framlst/i3690id.html
title: Mexican Spotted Owl - Home
  url: http://mso.fws.gov/
title: EO Study: Spotting the Spotted Owl
  url: http://earthobservatory.nasa.gov/Study/SpottedOwls/
title: Western Spotted Owl Printout- EnchantedLearning.com
  url: http://www.enchantedlearning.com/subjects/birds/printouts/Spottedowlprintout.shtml
title: Northern Spotted Owl
  url: http://biology.usgs.gov/s+t/SNT/noframe/pn172.htm
title: AMNH - Expedition : Endangered
  url: http://www.amnh.org/nationalcenter/Endangered/owl/owl.html
title: Northern Spotted Owl - USFS History - Forest History Society
  url: http://www.lib.duke.edu/forest/usfscoll/policy/northern_spotted_owl/
title: North American Owl Identification Guide
  url: http://www.owlinstitute.org/owls/spotted.html
title: Spotted Owls - Strix occidentalis
  url: http://www.owlpages.com/species/strix/occidentalis/Default.htm
title: The Northern Spotted Owl Debate
  url: http://www.spa3.k12.sc.us/WebQuests/endangeredanimals/endangered.htm

Listing 6 is the formatted text output from running the code in Listing 5. After showing some statistics from the search, we iterate through the results, displaying the title and URL of each. The Google Web APIs limit us to 10 results per request. Note the search time, returned as part of the SOAP response -- less than 1/10th of a second to locate over 100,000 (estimated) records out of over 3.3 billion Web pages. The Google Web API offers an easy-to-use interface to a very powerful Web service.
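
Because each request returns at most 10 items, collecting more means issuing several requests and advancing the zero-indexed 'start' parameter each time. The following sketch shows that loop under stated assumptions: fetch_pages is a hypothetical helper, and fake_search is a stand-in for a real call such as the doGoogleSearch( ) invocation in Listing 5.

```python
def fetch_pages(search, total, page_size=10):
    """Collect up to `total` results by calling search(start, max_results)
    repeatedly, advancing the zero-based start index each time."""
    collected = []
    start = 0
    while len(collected) < total:
        wanted = min(page_size, total - len(collected))
        batch = search(start, wanted)
        if not batch:
            break  # the service has no more results to return
        collected.extend(batch)
        start += len(batch)
    return collected

# Stand-in for a real search call; pretends the service holds 25 results
def fake_search(start, max_results):
    data = ['result %d' % n for n in range(25)]
    return data[start:start + max_results]

first_25 = fetch_pages(fake_search, 25)
```

Keep in mind that every request in such a loop counts against the daily query limit attached to the license key.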





Google is more than search

As we mentioned earlier, the Google Web APIs offer more than search -- spelling suggestions and retrieval of cached Web pages can also be performed through the SOAP interface. Continuing on our previous 'owl' theme, in Listing 7 we use the SOAPProxy from SOAPpy to retrieve a cached version of 'www.owls.org'. Notice that both the request and response (doGetCachedPage() and results, respectively) are much simpler for fetching cached Web pages than the request and response for searching the Web. We simply call doGetCachedPage() through the proxy, passing our license key and the requested Web page.


Listing 7. Using SOAPpy to fetch a cached Web page through the Google API
from SOAPpy import SOAPProxy
from SOAPpy import Types

# CONSTANTS
_url = 'http://api.google.com/search/beta2'
_namespace = 'urn:GoogleSearch'

# create SOAP proxy object
google = SOAPProxy(_url, _namespace)

# license key and the page to fetch from Google's cache
_license_key = 'INSERT YOUR KEY HERE' 
_query = 'www.owls.org'

# call cached-page method over SOAP proxy
results = google.doGetCachedPage( _license_key, _query )
           
# store results
of = open('cached_page_response.html', 'w')
of.write(results)
of.close()

The Google API returns the cached page base64-encoded. SOAPpy conveniently decodes this into HTML upon receiving the response, which we then save to the filesystem and can view with a browser. This is a simple interface to what is probably the world's largest (and most complete) cache of the entire Web. Figure 1 shows the returned page, loaded into a browser. Note that Google attaches a header identifying the page as cached by Google.
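
For readers curious about the decoding step that the SOAP library performs on our behalf, here is a minimal sketch using the standard library's base64 module. It is written for current Python versions rather than Python 2.2, and the page content is an invented stand-in, not a real Google response.

```python
import base64

# Invented stand-in for a cached page (not a real Google response)
page = b'<html><body>Cached owl page</body></html>'

# What would travel inside the SOAP response: base64 text
wire_text = base64.b64encode(page)

# What a SOAP library does for us: turn the wire text back into bytes
decoded = base64.b64decode(wire_text)
```

On current Python versions the decoded value is a bytes object, so a file receiving it would be opened in binary mode ('wb') rather than the text mode used in Listing 7.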


Figure 1. Output of Listing 7


Resources



About the authors

Scott Archer is a software architect and co-founder of GlowingOrb, Inc., a software tools developer focusing on model-driven solutions and their integration into core business processes. Mr. Archer holds an M.Phil in Computational Molecular Biology from the University of Hong Kong. You can contact Mr. Archer at scott.archer at glowingorb.com.


Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, open source platforms for XML middleware. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.



