Quick & Dirty website parsing

In many situations one might want to collect information from a website without opening a browser, going to the site, and looking up the desired piece of information by hand. Another challenge is that a program might need some information from a website. The difficulty is that the content provider might change the layout, and then a simple parser no longer works. However, I'd assume that the content provider does not change the tags (e.g. div) where the desired information is stored.


1 answer

Python lxml

lxml is a simple Python library for this. With it, a sample script for looking up information on a website could look like this:

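from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
from lxml import etree

url = "http://example.com/"  # placeholder: the page to fetch
fn = "page.html"             # placeholder: local file for the downloaded copy

# Download the page
sock = urlopen(url)
html = sock.read()
sock.close()

# Save the raw bytes so the parser can read them from disk
with open(fn, "wb") as f:
    f.write(html)

# HTMLParser tolerates the malformed markup found on real websites
htmlParser = etree.HTMLParser()
tree = etree.parse(fn, htmlParser)

# Replace " path " with the XPath expression for the desired element
info = tree.xpath(" path ")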
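To cope with layout changes, the XPath expression can select the element by its id rather than by its position in the page. Here is a minimal sketch of that idea. The sample HTML, the id "location", and the helper find_text_by_id are made up for illustration; a real page will use its own ids.

```python
from lxml import etree

# Hypothetical page snippet; the id "location" is an assumption for illustration.
SAMPLE_HTML = """
<html><body>
  <div class="sidebar"><div id="location">Berlin</div></div>
</body></html>
"""

def find_text_by_id(html, element_id):
    """Return the stripped text of the first element with the given id, or None."""
    tree = etree.fromstring(html, etree.HTMLParser())
    # //*[@id=...] matches the element wherever it sits in the document;
    # $wanted is an XPath variable filled in from the keyword argument.
    matches = tree.xpath("//*[@id=$wanted]", wanted=element_id)
    if not matches:
        return None
    return "".join(matches[0].itertext()).strip()

location = find_text_by_id(SAMPLE_HTML, "location")
print(location)
```

Because the expression starts with // and tests only the id attribute, it should keep working if the provider moves the div elsewhere in the page, as long as the id itself stays the same.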