crawler

Custom crawler for Amazon

First of all, we need to know how a web crawler works in general:

For every web crawler we need to set a first page. This is the crawler's start page; the crawler then reaches all other webpages through the links on this starting page and through the links on every page it subsequently visits. In general the web crawler downloads the whole webpage as content, plus a few metadata fields from the page header.
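To make this concrete, a minimal crawler loop might look like the sketch below. The use of requests and BeautifulSoup, the function name and the stored fields are my own assumptions for illustration, not part of any particular crawler:

    # Minimal sketch of a generic crawler loop (assumed libraries: requests, beautifulsoup4).
    import requests
    from bs4 import BeautifulSoup
    from collections import deque
    from urllib.parse import urljoin

    def crawl(start_url, max_pages=100):
        """Breadth-first crawl from start_url, storing page content and a bit of header metadata."""
        queue = deque([start_url])
        visited = set()
        documents = []

        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)

            response = requests.get(url, timeout=10)
            soup = BeautifulSoup(response.text, "html.parser")

            # Store the whole page content plus a few metadata fields.
            documents.append({
                "url": url,
                "content": response.text,
                "title": soup.title.string if soup.title else None,
                "content_type": response.headers.get("Content-Type"),
            })

            # Follow every link found on the visited page.
            for anchor in soup.find_all("a", href=True):
                queue.append(urljoin(url, anchor["href"]))

        return documents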

Now that we know how it works in general, we can discuss the problems and solutions for downloading information about books from Amazon.

1) Which links do we need to visit?

There are three types of links which are important for us:
- start page
- paths (links to webpages which can lead us to a webpage about some book)
- book pages (webpages with content about some book)

For the start page we can simply use
http://www.amazon.com

The paths are all pages connected with the Amazon content:
http://www.amazon.com/*
This restricts our web crawler to webpages from the amazon.com domain, but that still leaves a lot of pages without any book content. The problem with Amazon URL paths is that they are very similar to each other, so you cannot use a much cleverer heuristic to reduce the number of visited webpages.

Webpages that are visited only as paths do not need to be downloaded. We can write an exception for this, because the webpages about books have URLs like this:
http://www.amazon.com/*/dp/[0-9]?*qid=*
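As a hedged sketch, these two URL rules could be applied as follows. The translation of the wildcard patterns into regular expressions and the helper names are my own assumptions:

    # Sketch: decide which URLs to visit (paths) and which to actually download (book pages).
    import re

    # Only URLs inside the amazon.com domain are visited at all.
    PATH_PATTERN = re.compile(r"^https?://www\.amazon\.com/.*")

    # Only URLs that look like book pages (".../dp/<digit>...qid=...") are downloaded and stored.
    BOOK_PATTERN = re.compile(r"^https?://www\.amazon\.com/.*/dp/[0-9].*qid=.*")

    def should_visit(url):
        return PATH_PATTERN.match(url) is not None

    def should_download(url):
        return BOOK_PATTERN.match(url) is not None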

2) Which parts of the webpage are important for us?
Now we are downloading all the webpages about books, but we know that we do not need the whole webpage content, and it would be nice to have more valuable metadata about the book on each page.

We can take this content, extract all valuable information from it and turn it into metadata. These are the parts that are important for us and how we can find them (a small extraction sketch follows this list):

  • body (content)
  • For the body or content information about the book we can extract the part called DESCRIPTION, which is easy to find on every webpage with book information.

  • author (metadata)
  • You can find the author's name after the title of the webpage, ending with the text "(Author)".

  • date (metadata)
  • Web crawlers usually use the download date for the downloaded document, but that is not good for us. The date we downloaded this content is not the release date of the book, and it would make searching the books by date problematic. We can find the release date in the content of the webpage in this format: "Release Date: November 13, 2012". We can extract it, change the format depending on the web crawler settings, and replace the stored date value with this one.

  • all other information
  • There is a part of the webpage content called "Product details" from which you can very easily extract many other important pieces of information about the book and store them as metadata in your database.
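As a rough illustration, extracting these fields from a downloaded book page could look like the sketch below. The selectors, label texts and date format are assumptions based on the descriptions above; a real Amazon page may use different markup:

    # Sketch: extract author, release date and description from a book page.
    # Selectors and label texts are assumptions; adjust them to the real page markup.
    import re
    from datetime import datetime
    from bs4 import BeautifulSoup

    def extract_book_metadata(html):
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(" ", strip=True)
        metadata = {}

        # Author: the name printed after the title, ending with "(Author)".
        author_match = re.search(r"([\w .'-]+)\s*\(Author\)", text)
        if author_match:
            metadata["author"] = author_match.group(1).strip()

        # Release date: printed as "Release Date: November 13, 2012"; reformat it
        # to whatever the crawler expects (ISO 8601 here, as an example).
        date_match = re.search(r"Release Date:\s*([A-Z][a-z]+ \d{1,2}, \d{4})", text)
        if date_match:
            release = datetime.strptime(date_match.group(1), "%B %d, %Y")
            metadata["date"] = release.strftime("%Y-%m-%d")

        # Body/content: the DESCRIPTION section (here assumed to live in an
        # element with id "productDescription").
        description = soup.find(id="productDescription")
        if description:
            metadata["body"] = description.get_text(" ", strip=True)

        return metadata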

Taggings:

Content crawling - Amazon

What do I need to think about when programming an application that downloads content from Amazon and stores it in a database? I have a program that has its own web crawler for downloading web content. The problem is that this program cannot recognize the important part of every webpage, so we need to somehow help it with important-part recognition. This issue is important to solve for the Amazon website.

Utilize Google Alerts to track content thieves

Compared to the flagships Google Mail, Calendar or Search, Google Alerts is a really simple tool. Nevertheless, it's extremely useful. The core usage of Google Alerts is pretty straightforward and implied by its name: you get an alert when a specific search term is found by the Google web crawler. Typically for Google, they don't offer too many details on the technical aspects, reliability or internals of this service, but anyhow: at least my practical experiments were rather successful.

Google Alerts is quite easy to set up: you go to http://www.google.at/alerts, enter a search term and the details about how you want to get notified, and that's it. You will then receive a mail (or an RSS item) at the selected interval for your specific inquiry.

Google Alerts can easily be used to detect duplicated content on the web. You just enter some keyword combinations from your original content and let Google Alerts do the rest. Note, however, that this will only work for content that is actually found by the Google crawler, which typically excludes intranets, password-protected sites and file formats other than HTML, CHM, PDF, etc.

Creating a database for text classification

The Internet seems the best choice because we are interested in choosing different types of data. The restriction for my test is that I want the positive data to contain only one-liners (short jokes). In order to get a good classification, the negative data has to have the same structure (short sentences).
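A minimal sketch of how such a labelled dataset could be assembled; the file names and the CSV layout are assumptions for illustration:

    # Sketch: build a labelled dataset of positive one-liners (short jokes) and
    # negative short sentences for text classification. File names are assumed.
    import csv

    def build_dataset(jokes_path, sentences_path, output_path):
        rows = []
        with open(jokes_path, encoding="utf-8") as f:
            rows.extend((line.strip(), 1) for line in f if line.strip())   # positive class
        with open(sentences_path, encoding="utf-8") as f:
            rows.extend((line.strip(), 0) for line in f if line.strip())   # negative class

        with open(output_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["text", "label"])
            writer.writerows(rows)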