Track content usage

Putting content on the internet means that other people (or, more commonly, other people's programs) will copy it. That can be bad if you are worried about copyright or lost revenue, and good if you want to spread information. Either way, you may want to know who is copying your content. Some sites hotlink or embed your content in their own pages (images are a common case). You can typically detect this in your web server logs or your web analytics tool. It gets more complicated when somebody simply copies and pastes content from your site. For images you can use watermarking, but for plain text I see no option (at least none I can think of) for a visual or digital signature, since it's just ASCII. I'm looking for a method or tool that crawls the internet for specific keyword combinations, so that I can find people who copy textual content that I originally wrote.
1 answer

Utilize Google Alerts to track content thieves

Compared to flagship products such as Google Mail, Calendar or Search, Google Alerts is a really simple tool. Nevertheless, it's extremely useful. Its core usage is straightforward and implied by its name: you get an alert when the Google web crawler finds a specific search term. As is typical for Google, they don't offer much detail on the technical aspects, reliability or internals of this service, but my own practical experiments were rather successful.

Google Alerts is quite easy to set up: go to http://www.google.at/alerts, enter a search term, choose how you want to be notified, and that's it. You will then receive an email (or an RSS item) at the selected interval for your query.
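If you choose feed delivery, you can also read the results with a small script instead of a feed reader. The following is only a rough sketch: it assumes the feed is served in Atom format, and the function name `parse_alert_feed` and the sample document are made up for illustration. It parses a string, so fetching your own feed URL is left to you.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds (assumption: the alert feed is Atom).
ATOM = "{http://www.w3.org/2005/Atom}"

def parse_alert_feed(xml_text):
    """Return a list of (title, link) tuples, one per feed entry."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        link_el = entry.find(ATOM + "link")
        link = link_el.get("href", "") if link_el is not None else ""
        results.append((title, link))
    return results

# Hypothetical sample document, not real alert output.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Alert sample</title>
  <entry>
    <title>Copied article</title>
    <link href="https://example.com/copy"/>
  </entry>
</feed>"""

for title, link in parse_alert_feed(SAMPLE):
    print(title, link)
```

From there you could filter out your own domain and look only at the remaining links.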

Google Alerts can easily be used to detect duplicated content on the web. Just enter some keyword combinations from your original content and let Google Alerts do the rest. Note, however, that this only works for content the Google crawler actually finds, which typically excludes intranets, password-protected sites and file formats other than HTML, CHM, PDF, etc.
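Picking the keyword combinations by hand gets tedious for longer texts, so here is a minimal sketch of how you might generate them. It takes the opening words of the longest sentences and wraps them in quotes, on the assumption that a quoted phrase is matched as an exact phrase. The function name `phrase_queries` and its parameters are my own invention, and the "longest sentence" heuristic is just one simple choice.

```python
import re

def phrase_queries(text, words_per_phrase=8, max_queries=3):
    """Build quoted phrase queries from the longest sentences in text."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = []
    for sentence in sentences:
        words = re.findall(r"[\w'-]+", sentence)
        # Skip sentences too short to yield a full phrase.
        if len(words) >= words_per_phrase:
            candidates.append(words)
    # Prefer longer sentences; they tend to be more distinctive.
    candidates.sort(key=len, reverse=True)
    return ['"' + " ".join(words[:words_per_phrase]) + '"'
            for words in candidates[:max_queries]]

article = "Short one. This sentence has exactly eight words in it. Tiny."
for query in phrase_queries(article):
    print(query)
```

Each printed line can then be pasted into Google Alerts as its own search term. Phrases that are too generic will produce false hits, so it is worth checking the first few results by hand.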