Content crawling - Amazon

What do I need to think about when programming an application that will download content from Amazon and store it in a database? I have a program with its own web crawler for downloading web content. The problem is that this program cannot recognize the important part of every webpage, and we need to help it with that recognition somehow. This issue needs to be solved for the Amazon website.
1 answer

Custom crawler for Amazon

First of all, we need to know how a web crawler works in general:

Every web crawler needs a start page. The crawler begins there and then crawls all other webpages it can reach through the links on the start page and on every subsequently visited page. In general, a web crawler downloads the whole webpage as content, plus a few pieces of metadata from the page header.
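To make this loop concrete, here is a minimal breadth-first crawler sketch in Python. It assumes the "requests" library is available and extracts links with a naive regular expression; a real crawler would use a proper HTML parser and respect robots.txt.

    import re
    from collections import deque

    import requests  # assumed available; any HTTP client would do

    def crawl(start_url, max_pages=100):
        """Breadth-first crawl: start from one page, then follow links
        found on every visited page, skipping URLs already seen."""
        seen = {start_url}
        queue = deque([start_url])
        downloaded = 0
        while queue and downloaded < max_pages:
            url = queue.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # skip pages that fail to download
            downloaded += 1
            yield url, html  # the caller decides what to store
            # naive link extraction; a real crawler would parse the HTML
            for link in re.findall(r'href="(https?://[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)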

Now that we know how it works in general, we can discuss the problems and solutions for downloading information about books from Amazon.

1) Which links do we need to visit?

There are three types of links that are important for us:
- the start page
- paths (links to webpages that can lead us to a webpage about some book)
- book pages (webpages with content about some book)

For the start page we can simply use
http://www.amazon.com

The paths are all pages connected with the Amazon content:
http://www.amazon.com/*
This restricts our web crawler to webpages from the amazon.com domain, but that is still a lot of pages without any book content. The problem with Amazon's URL paths is that they are all very similar to one another, so there is no much cleverer heuristic for reducing the number of visited webpages.
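As a sketch, this domain restriction can be expressed as a simple URL filter. The pattern below just anchors the rule above on the host name; it is an illustration, not Amazon-specific knowledge:

    import re

    # follow only links that stay inside the amazon.com domain
    AMAZON_DOMAIN = re.compile(r'^https?://www\.amazon\.com/')

    def should_follow(url):
        """True when a link stays on www.amazon.com."""
        return AMAZON_DOMAIN.match(url) is not None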

Webpages that are visited merely as paths do not need to be downloaded. We can write an exception for this, because the webpages about books have URLs like this:
http://www.amazon.com/*/dp/[0-9]?*qid=*
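The glob above can be translated into a regular expression that decides which visited pages are actually stored. The exact pattern is an assumption based on the URL shape shown here, so it may need adjusting to real Amazon URLs:

    import re

    # a /dp/ segment starting with a digit, followed later by a qid= parameter
    BOOK_PAGE = re.compile(r'^https?://www\.amazon\.com/.+/dp/[0-9].*qid=')

    def should_download(url):
        """True when the URL looks like a book page worth storing."""
        return BOOK_PAGE.match(url) is not None

    # e.g. should_download("http://www.amazon.com/Some-Title/dp/0123456789?qid=123")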

2) Which parts of the webpage are important for us?
Now we are downloading only the webpages about books, but we know that we do not need the whole webpage content, and it would be useful to have more valuable metadata about the book on each page.

We can take this content, extract all the valuable information from it, and turn that into metadata. Which parts are important for us, and how can we find them? (A combined extraction sketch follows the list.)

  • body (content)
    For the body or content information about the book, we can extract the part called DESCRIPTION, which is easy to find on every webpage with book information.

  • author (metadata)
    You can find the author's name after the title of the webpage, ending with the text "(Author)".

  • date (metadata)
    A web crawler usually assigns the download date to a downloaded document, but that is not good for us: the download date is not the release date of the book, and using it would make searching the books by date problematic. Instead, we can find the release date in the webpage content in the format "Release Date: November 13, 2012", extract it, convert the format according to the web crawler's settings, and replace the download date with this value.

  • all other information
    The webpage contains a section called "Product details" from which you can easily extract many other important pieces of information about the book and store them as metadata in your database.
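Putting the list together, here is a rough extraction sketch. Every pattern in it (the "(Author)" suffix, the "Release Date:" label, the section names used as boundaries) is an assumption taken from the descriptions above; Amazon's real markup changes often, so they would need adapting to the live page:

    import re
    from datetime import datetime

    def extract_book_metadata(page_text):
        """Pull the fields described above out of a book page's text."""
        meta = {}

        # description: text between two section headings; the boundaries
        # ("Book Description", "Product details") are guesses about layout
        m = re.search(r'Book Description\s*(.+?)\s*Product details', page_text, re.S)
        if m:
            meta['description'] = m.group(1).strip()

        # author: the name printed after the title, ending with "(Author)"
        m = re.search(r'([\w.\- ]+?)\s*\(Author\)', page_text)
        if m:
            meta['author'] = m.group(1).strip()

        # release date in the form "Release Date: November 13, 2012",
        # reformatted to ISO so the stored documents can be searched by date
        m = re.search(r'Release Date:\s*(\w+ \d{1,2}, \d{4})', page_text)
        if m:
            meta['date'] = datetime.strptime(m.group(1), '%B %d, %Y').date().isoformat()

        return meta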
