Browsing webpages in Python with Mechanize
Tags: python, python-mechanize, mechanize, web-scrapping
November 01 2013 - 14:07
So, you know a page that contains some data you want, and there is no other way for you to obtain it. There are tools that can help you get that data through web scraping.
Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox. (wikipedia)
In this post, I'll teach you how to obtain data from a webpage using Mechanize. It gives you a browser-like object to interact with web pages.
First of all, just create your Python project. It can be a single folder called mechanize_example with a file called main.py. You must also download the Mechanize module from its download page. There are many ways to get it: with easy_install, with git, or by downloading the source code. After you have downloaded it, paste the Mechanize module folder inside your project directory.
So let's work on our Python script. In your main.py file, import the Browser class from the Mechanize module. mechanize.Browser implements the urllib2.OpenerDirector interface, so any URL can be opened, not just http.
In this post we will try to perform a search for python on the Stack Overflow website.
This is how the main.py script should look in the end.
    from mechanize import Browser

    # create a new instance of Browser class
    br = Browser()

    # Ignore robots.txt. Do not do this without thought and consideration.
    br.set_handle_robots(False)

    # fetch the stackoverflow.com url
    br.open("http://www.stackoverflow.com")

    # now we need to get the search form, so we get the first form
    # in page, which is the search form, in position 0
    br.select_form(nr=0)

    # now we set value of the search field with name property as 'q'.
    br.form['q'] = 'python'

    # submit the request
    response1 = br.submit()

    # print the response, by calling the read() method
    print(response1.read())
OK. So, explaining the code above: after we create a Browser instance, we call the method br.set_handle_robots(False), which tells mechanize to ignore robots.txt.
Why did I mention:
Do not do this without thought and consideration.
Well, website owners use the robots.txt file to give instructions about their site to web robots. It defines whether a web robot may access a website or not, so ignoring robots.txt is something you should think about carefully. To learn more about robots.txt see the website.
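If you would rather check the rules yourself before deciding to ignore them, here is a minimal sketch. It assumes Python 3, where the standard library ships urllib.robotparser. The robots.txt content, the example.com URLs, the "my-scraper" user agent, and the is_allowed helper are all made up for illustration; a real site will have its own rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Parse the rules from memory instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-scraper"):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("http://example.com/questions"))
print(is_allowed("http://example.com/private/data"))
```

With these sample rules, the first URL is allowed and the second one is not, because it falls under the disallowed /private/ path.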
OK, back to the subject. When we call the method br.open("http://www.stackoverflow.com"), we are asking mechanize to open the URL. Mechanize exports the complete interface of urllib2, so when using mechanize, anything you would normally import from urllib2 should be imported from mechanize instead.
After that, we must use the methods of the Browser instance to handle the request on the page. When we call br.select_form(nr=0), we are telling the Browser object to select the first form in the list of forms obtained from the page. As I mentioned in the beginning, you must know the page you want to extract the data from. In the case of Stack Overflow, I know that the first form is the search form, which is why I told mechanize to get the first form. You could also select a form by its name with br.select_form('form-name'). In our case, the form didn't have a name property, so we got it via its position.
OK, after we have the form selected, we also have its input fields. You can't access an input field without selecting its parent form first. Now we want to perform a search, so we need to set the value we want to search for on the search field. When we call br.form['q'] = 'python', we are telling mechanize to fill the input field whose name property is 'q' with the value 'python'.
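To know which position a form has and what its fields are called, you need to look at the page's HTML first. The sketch below is one rough way to do that with Python 3's standard html.parser module instead of mechanize. The sample HTML and the FormLister class are hypothetical, written only for this example, and the real Stack Overflow markup will differ.

```python
from html.parser import HTMLParser

# Hypothetical page, a made-up stand-in for the HTML you would inspect.
SAMPLE_HTML = """
<html><body>
  <form action="/search" method="get">
    <input type="text" name="q">
  </form>
  <form name="login" action="/login" method="post">
    <input type="text" name="user">
    <input type="password" name="password">
  </form>
</body></html>
"""


class FormLister(HTMLParser):
    """Collect each form's name and the names of its input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []  # one dict per <form>, in page order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({"name": attrs.get("name"), "fields": []})
        elif tag == "input" and self.forms and "name" in attrs:
            # Attach the named input to the most recently opened form.
            self.forms[-1]["fields"].append(attrs["name"])


lister = FormLister()
lister.feed(SAMPLE_HTML)
for position, form in enumerate(lister.forms):
    print(position, form["name"], form["fields"])
```

In this sample, the form at position 0 has no name and a single field called 'q', which is the kind of form you would select by position rather than by name.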
After that, we just submit the form by calling br.submit(). We assign the response to the response1 variable, so if we want to read the data, we call response1.read(). This will return the retrieved HTML page.
In our case, the HTML is too long, so I'm not gonna print the entire page here, but if you check the result on your console, you should see the HTML printed there.
An important note: if you want to extract data from the HTML string, you should use a library to do that. Two modules you could use are lxml and Beautiful Soup. I personally use Beautiful Soup, but you can use whatever fits you best. Maybe in another post, I'll teach you how to retrieve data from an HTML string like the one we received in this post.
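Just to give a rough idea of what that extraction step looks like, here is a small sketch. Note that it uses Python 3's built-in html.parser as a stand-in, not lxml or Beautiful Soup, so it runs without installing anything. The sample HTML, the "question" class name, and the QuestionLinkExtractor class are assumptions made up for this example and do not reflect the real Stack Overflow page.

```python
from html.parser import HTMLParser

# Hypothetical search results, made up for this example.
SAMPLE_RESULTS_HTML = """
<div class="results">
  <a class="question" href="/questions/1">How do I parse HTML?</a>
  <a class="question" href="/questions/2">What is a decorator?</a>
  <a class="nav" href="/help">Help</a>
</div>
"""


class QuestionLinkExtractor(HTMLParser):
    """Collect (href, text) pairs for links with class="question"."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_question = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "question":
            self._in_question = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a matching link.
        if self._in_question:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_question:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_question = False


extractor = QuestionLinkExtractor()
extractor.feed(SAMPLE_RESULTS_HTML)
for href, title in extractor.links:
    print(href, title)
```

A dedicated library like Beautiful Soup or lxml handles messy real-world HTML much better than this, so treat the sketch only as an illustration of the idea.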
I hope this post helped you understand how to use mechanize.