To extract and save the information required by the challenge, we follow three high-level steps, each covered in detail below.
We apply these steps to the news website: http://www.nachrichtenleicht.de/uebersicht/nachrichten/
1. Analyse the source code and identify tags
Looking at the source code of our example page, we can see that all news items share the same class, entry-teaser (CSS selector: '.entry-teaser').
We can therefore select every element matching '.entry-teaser' and extract the information we need from its child elements.
The title is in the h2 element with class entry-title (selector: 'h2.entry-title'), and the short description is in the p elements inside the same teaser.
2. Extract information based on the tags identified in step 1
For this we use the Ruby library Nokogiri[1], an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors. In this particular example, we will use CSS3 selectors:
require 'nokogiri'
require 'open-uri'

url = 'http://www.nachrichtenleicht.de/uebersicht/nachrichten/'
image_url_prefix = 'http://www.nachrichtenleicht.de'

# URI.open is provided by open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open(url))

# Each news item is an element with class "entry-teaser"
doc.css('.entry-teaser').each do |item|
  link = item.at_css('.image').at_css('a')
  href = link[:href]                                       # link to the full article
  image_url = image_url_prefix + link.at_css('img')[:src]  # prefix the relative image path
  title = item.at_css('h2.entry-title').text
  short_description = item.at_css('p a').text.strip
end
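The snippet above builds image_url by string concatenation, which assumes that every src attribute is a path relative to the site root. A slightly more defensive sketch, shown here as an alternative rather than as the approach used above, resolves the value with URI.join from Ruby's standard library. The helper name absolute_url, the image path, and the cdn.example.com host are hypothetical.

```ruby
require 'uri'

# Resolve a src or href value against the URL of the page it was found on.
# Relative paths are resolved against the base; absolute URLs are returned as they are.
def absolute_url(base, path)
  URI.join(base, path).to_s
end

page_url = 'http://www.nachrichtenleicht.de/uebersicht/nachrichten/'

# Hypothetical root-relative image path
puts absolute_url(page_url, '/media/thumbs/example.jpg')
# prints http://www.nachrichtenleicht.de/media/thumbs/example.jpg

# Hypothetical image hosted on another domain
puts absolute_url(page_url, 'http://cdn.example.com/example.jpg')
# prints http://cdn.example.com/example.jpg
```

With this helper, the same call could be used for both href and the image src, so no separate prefix variable would be needed.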
3. Save the extracted information in the database
For this we create a new model class[2], News, with the attributes title, image_url, short_description, external_url, and source.
We also extend the body of the loop from step 2 so that each news item is saved to the database:
news = News.new
news.title = title
news.image_url = image_url
news.short_description = short_description
news.external_url = href
news.source = url
news.save!
[1] http://nokogiri.org/
[2] This model class uses the Ruby Gem for Active Record: http://rubygems.org/gems/activerecord