In this guide, I’ll show you the basics of using the Mechanize library for scraping websites.
If you’re new to Ruby, I recommend checking out a few Ruby tutorials first. Don’t worry, Ruby is easy to pick up, and a quick search will turn up plenty of beginner material. If you already know just a bit of Ruby, you’re good to go.
Before we start, I’m assuming you’ve already installed Ruby on your machine. If you’re on Windows, this can be done using the one-click installer. If you’re on another operating system such as macOS or Ubuntu, I recommend using RVM.
Installing the Mechanize Gem
To install the Mechanize gem, simply enter this line at the command prompt.
gem install mechanize
This command will install the mechanize gem into your system, which might take a few minutes.
Now that we have Ruby and Mechanize set up, let’s get started!
In this example, we’ll search Google for ‘Mechanize’.
Open up your editor; Notepad, Notepad++, or whatever you like will do. Create a new file and place this line at the top to require the library.

require 'mechanize'
First, we create a Mechanize object. Think of it as the agent we’re going to use to fetch pages.
agent = Mechanize.new
Now we can use this agent to fetch a page. We store the page in a variable so we can work with it later.

page = agent.get('http://www.google.com/')
Working with Forms
Submitting forms is important, especially if you’re working with a lot of search functionality. To use a form, first fetch it from the page, fill in its fields, and submit it. Add the following to your code.
google_form = page.form('f')
google_form.q = 'mechanize'
page = agent.submit(google_form, google_form.buttons.first)
Let’s go over what these lines do. The first fetches the form element named 'f'. The second sets the value 'mechanize' on the input field named 'q'. The last submits the form by clicking its first button and stores the resulting page back in the page variable.
Now, how do we know it really searched for ‘Mechanize’? Let’s check the results page and look at the result links. To find the links on a page, call links on the page object, like this.
page.links.each do |link|
  puts link.text
end
This loops over all the links on the page and prints each link’s text to the screen. You can now run the code from the terminal with ‘ruby filename.rb’; make sure you’re in the same directory as your script file. After running the script, you’ll see a bunch of items, including the links from the results page.
When you are scraping data, you don’t necessarily want the entire page. You’ll usually just want a certain piece of text under a particular section of the page. To do this, you need to parse the page.
Mechanize uses Nokogiri to parse HTML. This means you can use Nokogiri methods on the page object to search for content.
puts page.search('#ires ol').to_html
This line prints only what’s inside the results list itself, excluding items outside that area such as the navigation links and the footer. The search method accepts either CSS or XPath selectors.
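To see what an XPath version of that selection looks like without fetching a live page, here is a self-contained sketch using Ruby’s bundled REXML library on a hypothetical, made-up results fragment. Mechanize itself delegates to Nokogiri, but the XPath expression is the same idea as the CSS selector '#ires ol' above.

```ruby
require 'rexml/document'

# A made-up HTML fragment shaped like the results list in the example.
html = <<~HTML
  <div id="ires"><ol>
    <li><a href="https://example.com/1">First result</a></li>
    <li><a href="https://example.com/2">Second result</a></li>
  </ol></div>
HTML

doc = REXML::Document.new(html)

# XPath equivalent of the CSS selector '#ires ol li a':
# every link inside the ordered list within the div with id "ires".
links = REXML::XPath.match(doc, "//div[@id='ires']/ol/li/a")
links.each { |a| puts a.text }
```

Running this prints the text of the two result links, confirming the selector only matches content inside the results area.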
There’s a lot more you can do with Mechanize; check out the Mechanize Guide for more.
You can also check out a script I made that scrapes email addresses from a job site. It’s quick and dirty, but if you’re working on something similar, it might help you out.