Extracting information from a website in Ruby

We want to integrate news items from third-party news websites into the news feed of our own website (with proper attribution of the source, of course) and link them to the original source. For this we need to save at least the following attributes of an item:

  - Title
  - Short description
  - Link to the original source
  - (optional) Image

This information extraction should be done in Ruby. The process should be explained for every generic news website, but for a first solution the news website http://www.nachrichtenleicht.de/ suffices.
1 answer

Information Extraction Process for News Websites in Ruby

To extract and save the information the challenge asks for, we follow three high-level steps, each covered in detail below:

  1. Analyze the website's source code and identify the elements that need to be extracted
  2. In our Ruby code, use an XML/HTML parser that can extract information based on suitable selectors (either XPath or CSS selectors)
  3. Save the extracted information in a data store
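
Before choosing a parser or a storage layer, the attributes listed in the question can be written down as a small value object. The following is only a sketch: NewsItem and its valid? method are hypothetical names that do not appear in the original answer, and the example URL is a placeholder.

```ruby
# NewsItem is a hypothetical value object holding the attributes from the question.
NewsItem = Struct.new(:title, :short_description, :external_url, :image_url,
                      keyword_init: true) do
  # Title, short description and link are mandatory; image_url may stay nil.
  def valid?
    [title, short_description, external_url].none? { |v| v.nil? || v.strip.empty? }
  end
end

item = NewsItem.new(
  title: 'Example headline',
  short_description: 'One-sentence summary.',
  external_url: 'http://example.com/article.html' # placeholder URL
)

puts item.valid?                           # true, the image is optional
puts NewsItem.new(title: 'No link').valid? # false, link and description are missing
```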

In the following we will apply these steps to the news website: http://www.nachrichtenleicht.de/uebersicht/nachrichten/

1. Analyze source code, identify tags
Looking at the source code of our example page, we can quickly see that all news items share the same class, namely entry-teaser (as a CSS selector: .entry-teaser).

From there, we can extract the child elements of each element with the class entry-teaser.
The title is in the h2 element with the class entry-title, and the short description is in the p elements below it.

2. Extract information based on the tags identified in step 1

For this we use the Ruby library Nokogiri[1], an HTML, XML, SAX, and Reader parser. Among Nokogiri's many features is the ability to search documents via XPath or CSS3 selectors. In this particular example, we will use CSS3 selectors:

    require 'nokogiri'
    require 'open-uri'

    url = 'http://www.nachrichtenleicht.de/uebersicht/nachrichten/'
    image_url_prefix = 'http://www.nachrichtenleicht.de'

    # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
    doc = Nokogiri::HTML(URI.open(url))

    # Every news item is wrapped in an element with the class entry-teaser
    doc.css('.entry-teaser').each do |item|
      link = item.at_css('.image').at_css('a')
      href = link[:href]
      # Prefix the img src with the host to get an absolute image URL
      image_url = image_url_prefix + link.at_css('img')[:src]
      title = item.at_css('h2.entry-title').text
      short_description = item.at_css('p a').text.strip
    end

3. Save extracted information in the database

For this we create a new model class[2], News, with the attributes title, image_url, short_description, external_url, and source.

We also extend the loop body of our code example from before so that each news item is saved to the database:

news = News.new
news.title = title
news.image_url = image_url
news.short_description = short_description
news.external_url = href
news.source = url
news.save!

[1] http://nokogiri.org/
[2] This model class uses the Ruby Gem for Active Record: http://rubygems.org/gems/activerecord
