Browsing webpages in Python with Mechanize

posted on November 01 2013 - 14:07
python python-mechanize mechanize web-scraping

So, you know a page that contains some data you want, and there is no other way for you to obtain it. There are some tools that can help you get that data by doing some web scraping.

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox. (Wikipedia)

In this post, I'll teach you how to obtain data from a webpage using Mechanize. It gives you a browser-like object to interact with web pages.

First of all, create your Python project; it can be a single folder called mechanize_example containing a file called main.py. You also need to get the Mechanize module from its download page. There are several ways to install it: with easy_install, with git, or by downloading the source code. Once you have it, copy the Mechanize module folder into your project directory.
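To make sure the module is reachable from your project directory, a quick sanity check (assuming Mechanize is in place) is to import it and print its version:

# quick sanity check: import mechanize and print its version
import mechanize
print mechanize.__version__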

So let's work on our Python script. In your main.py file, import the Browser class from the Mechanize module. mechanize.Browser implements the urllib2.OpenerDirector interface, so any URL can be opened, not just http. In this post we will perform a search for python on the Stack Overflow website.

This is how the main.py script should look in the end.

from mechanize import Browser

# create a new instance of Browser class
br = Browser()
# Ignore robots.txt. Do not do this without thought and consideration.
br.set_handle_robots(False)

# fetch the stackoverflow.com url
br.open("http://www.stackoverflow.com")

# now we need to get the search form, which is the first form
# on the page, at position 0
br.select_form(nr=0)
# set the value of the search field, whose name attribute is 'q'
br.form['q'] = 'python'

# submit the request
response1 = br.submit()

# print the response, by calling the read() method
print response1.read()

OK. Explaining the code above: after creating a Browser instance, we call set_handle_robots(False).

Why did I mention:

Do not do this without thought and consideration.

Well, site owners use the robots.txt file to give instructions about their site to web robots. It defines which parts of a website a robot may or may not access, so ignoring robots.txt deserves some thought. To learn more about robots.txt, see the robots.txt website.
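If you prefer to respect robots.txt, you can leave the default handling on and catch the error mechanize raises when a page is disallowed. A minimal sketch (the URL is just a placeholder):

import mechanize

br = mechanize.Browser()
# robots.txt handling is on by default; mechanize raises RobotExclusionError
# when the site's robots.txt disallows the requested page
try:
    br.open("http://www.example.com/some-disallowed-page")
except mechanize.RobotExclusionError:
    print "robots.txt does not allow fetching this page"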

OK, back to the subject. When we call br.open("http://www.stackoverflow.com"), we are asking mechanize to open the URL. Mechanize exports the complete interface of urllib2, so when using mechanize, anything you would normally import from urllib2 should be imported from mechanize instead.
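For example, the urllib2-style Request and urlopen are available straight from mechanize. A small sketch of that (not needed for the rest of this post):

# urllib2-style usage, imported from mechanize instead of urllib2
from mechanize import Request, urlopen

req = Request("http://www.stackoverflow.com")
response = urlopen(req)
print response.code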

After that, we use the methods of the Browser instance to interact with the page. When we call br.select_form(nr=0), we are telling the Browser object to select the first form in the list of forms obtained from the page. As I mentioned in the beginning, you must know the page you want to extract the data from; in the case of Stack Overflow, I know that the first form is the search form, which is why I told mechanize to select the first one. You could also look a form up by its name with br.select_form('form-name'), but in our case the form didn't have a name attribute, so we selected it via its position.
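If you're not sure which position the form you want is at, you can list the forms mechanize found on the page. A quick sketch:

# list the forms mechanize found on the page, with their index, name and action
for i, form in enumerate(br.forms()):
    print i, form.name, form.action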

OK, once the form is selected, we also have access to its input fields. You can't work with an input field without selecting its parent form first. Since we want to perform a search, we need to set the value we want to search for on the search field. So when we call br.form['q'] = 'python', we are telling mechanize to fill the input field whose name attribute is 'q'.
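If you don't know the field names, you can inspect the controls of the selected form. A small sketch:

# print the type and name of every control in the selected form
for control in br.form.controls:
    print control.type, control.name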

After that, we just submit the form by calling br.submit(). We assign the response to the response1 variable, so if we want to read the data, we call response1.read(). This returns the retrieved HTML page.
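Besides read(), the response object gives you a couple of other useful things:

# the final URL after any redirects, and the HTTP headers of the response
print response1.geturl()
print response1.info()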

In our case, the HTML is too long, so I'm not going to print the entire page here, but if you check the result on your console, you should see the full HTML there.

<!DOCTYPE html>
<html>
<head>

    <title>Unanswered 'python' Questions - Stack Overflow</title>
    <link rel="shortcut icon" href="//cdn.sstatic.net/stackoverflow/img/favicon.ico">
    <link rel="apple-touch-icon image_src" href="//cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png">
    <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
    <meta name="twitter:card" content="summary">
    <meta name="twitter:domain" content="stackoverflow.com"/>
    <meta name="og:type" content="website" />
    <meta name="og:image" content="http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6"/>
    <meta name="og:title" content="Unanswered 'python' Questions" />
    <meta name="og:description" content="Q&A for professional and enthusiast programmers" />
    <meta name="og:url" content="http://stackoverflow.com/questions/tagged/python"/>

    <script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>
    <script src="//cdn.sstatic.net/Js/stub.en.js?v=f7b42019ec56" type="text/javascript"></script>
    <link rel="stylesheet" type="text/css" href="//cdn.sstatic.net/stackoverflow/all.css?v=6ea2c7349565">

        <link rel="alternate" type="application/atom+xml" title="Feed of questions tagged python" href="/feeds/tag/python" />

    <script type="text/javascript">
        StackExchange.ready(function () {
            StackExchange.realtime.init('wss://qa.sockets.stackexchange.com,ws://qa.sockets.stackexchange.com');
            StackExchange.realtime.subscribeToInboxNotifications();
                    StackExchange.realtime.subscribeToReputationNotifications('1');
                });
    </script>
    <script type="text/javascript">
        StackExchange.init({"locale":"en","stackAuthUrl":"https://stackauth.com","serverTime":1383314061,"styleCode":true,"enableUserHovercards":true,"site":{"name":"Stack Overflow","description":"Q&A for professional and enthusiast programmers","isNoticesTabEnabled":true,"recaptchaPublicKey":"6LdchgIAAAAAAJwGpIzRQSOFaO0pU6s44Xt8aTwc","enableSocialMediaInSharePopup":true},"user":{"fkey":"a0a9574243208bc0f0737b40eaca5707","isRegistered":true,"userType":3,"userId":399459,"accountId":171801,"gravatar":"<div class=\"\"><img src=\"https://www.gravatar.com/avatar/356e94abf22a9f5ddeee7910f6c232cd?s=32&d=identicon&r=PG\" alt=\"\" width=\"32\" height=\"32\"></div>","profileUrl":"http://stackoverflow.com/users/399459/rogcg","notificationsUnviewedCount":0,"inboxUnviewedCount":0}});
        StackExchange.using.setCacheBreakers({"js/prettify-full.en.js":"e0bbd4760e83","js/moderator.en.js":"1a411fd265fe","js/full-anon.en.js":"236d9835907d","js/full.en.js":"0164124d6fc4","js/wmd.en.js":"080b03871ae9","js/third-party/jquery.autocomplete.min.js":"e5f01e97f7c3","js/third-party/jquery.autocomplete.min.en.js":"","js/mobile.en.js":"40ac412781cb","js/help.en.js":"d3cc74d8a93a","js/tageditor.en.js":"ecd9cbf86481","js/tageditornew.en.js":"1e8db6b7af9d","js/inline-tag-editing.en.js":"f951bd09dc69","js/revisions.en.js":"1dead817b481","js/review.en.js":"428132870f9d","js/tagsuggestions.en.js":"a7d0f3ff530a","js/post-validation.en.js":"0927fa1bae70","js/explore-qlist.en.js":"73825bd006fc","js/events.en.js":"130d4e07b47b"});
    </script>
    <script type="text/javascript">
        StackExchange.using("gps", function() {
             StackExchange.gps.init(true);
        });
    </script>

</head>
<body class="tagged-questions-page">
    <noscript><div id="noscript-padding"></div></noscript>
    <div id="notify-container"></div>
    <div id="overlay-header"></div>
    <div id="custom-header"></div>
    <div class="container">
        <div id="header" >
...

An important note: if you want to extract data from the HTML string, you should use a library for that. Two modules you could use are lxml and Beautiful Soup. I personally use Beautiful Soup, but you can use whichever fits you best. Maybe in another post, I'll show you how to retrieve data from an HTML string like the one we received in this post.
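Just to give you a quick taste, here is a minimal sketch using Beautiful Soup 4 (an assumption: you have it installed, and you store the HTML from response1.read() in a variable instead of printing it, since read() consumes the response):

from bs4 import BeautifulSoup

html = response1.read()    # read once and keep the HTML around
soup = BeautifulSoup(html)
# the page title, as seen in the HTML above
print soup.title.string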

I hope this post helped you understand how to use mechanize.

Download Source Code

Please share and comment. Also if you want to contribute to this blog, please click the Flattr button on top of this page.