Ruby Mechanize – Web Scraping Made Easy

by gerald on June 25, 2012

In this guide, I’ll show you the basics of using the Mechanize library for scraping websites.

The Mechanize library is used for automating interaction with websites. Unlike browser-driving automation tools such as Watir, it does not render pages, which makes it a lot faster for scraping. Be aware, though, that Mechanize has limitations: it does not execute JavaScript, so content that a page builds with scripts will not be available to it.

If you’re new to Ruby, I recommend you check out some Ruby tutorials first. Don’t worry, Ruby is easy to learn, and a few searches on Google will be enough. If you already know a bit of Ruby, that’s great.

Before we start, I’m assuming you’ve already installed Ruby on your machine. If you’re on Windows, this can be done using the one-click installer. If you’re using another operating system such as Mac OS or Ubuntu, I recommend using RVM.

Installing the Mechanize Gem

To install the mechanize gem, enter this line at the command prompt.

gem install mechanize

This command will install the mechanize gem into your system, which might take a few minutes.

Now that we have Ruby and Mechanize set up, let’s get started!

In this example, we’ll search Google for ‘Mechanize’.

Open up your editor. Notepad or Notepad++ will do, or whatever you like. Create a new file and place these lines at the top to require our libraries.

require 'rubygems'
require 'mechanize'

First, we create a Mechanize object. Think of it as the agent we’re going to use to fetch pages.

agent = Mechanize.new

Now, we can use this agent to fetch a page. We then store this page in a variable, which we can work with later.

page = agent.get('http://www.google.com/')

Working with Forms

Submitting forms is important, especially if you’re working with a lot of search functionality. To use a form, first fetch the form from the page, then fill in the fields and submit it. Add the following to your code.

google_form = page.form('f')
google_form.q = 'mechanize'
page = agent.submit(google_form, google_form.buttons.first)

Let’s go over what these lines do. The first fetches the form element with the name ‘f’. The next sets the input field named ‘q’ to the value ‘mechanize’. The last submits the form and stores the resulting page in the page variable.

Collecting Links

Now, how do we know it really searched for ‘Mechanize’? Let’s check the results page and look at the result links. To find links on a page, call links on the page object like this.

page.links.each do |link|
  puts link.text
end

This loops over all the links on the page and prints each link’s text to the screen. You can now run the code from the terminal with ‘ruby filename.rb’. Make sure you’re in the same directory as your script file. After running the script, you will see a list of link texts, including the links from the results page.

Parsing Data

When you are scraping data, you don’t necessarily want to get the entire page. You’ll usually just want a particular piece of text from one section of the page. To do this, you need to parse the page.

Mechanize uses Nokogiri to parse HTML. This means you can use Nokogiri methods on the page object to search for something.

puts page.search('#ires ol').to_html

This line prints only what’s inside the results list itself, excluding items outside that area, such as the navigation links and footer. The search method accepts both CSS and XPath selectors.

There’s a lot more you can do with Mechanize; check out the Mechanize Guide for more.

You can also check out a script I made that scrapes the email addresses from a job site. It’s quick and dirty, but if you’re working on something similar, it might help you out.


