How To Extract Certain Parts Of A Web Page In Python

Target web page: http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm

The section I want to extract: Skilled –

Solution 1:

"Beau--ootiful Soo--oop!

Beau--ootiful Soo--oop!

Soo--oop of the e--e--evening,

Beautiful, beauti--FUL SOUP!"

--Lewis Carroll, Alice's Adventures in Wonderland

I think this is exactly what he had in mind!

The Mock Turtle would probably do something like this:

>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> url = 'http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm'
>>> page = urllib2.urlopen(url)
>>> soup = BeautifulSoup(page)
>>> for row in soup.html.body.findAll('tr'):
...     data = row.findAll('td')
...     if data and 'subclass 885online' in data[0].text:
...         print data[4].text
...
15 May 2011

But I'm not sure it would help, since that date has already passed!

Good luck with the application!
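
The session above is Python 2 (`urllib2`, the `print` statement) with the old `BeautifulSoup` 3 package. A rough Python 3 equivalent, assuming the `bs4` package (Beautiful Soup 4) is installed, might look like the sketch below. `SAMPLE_HTML` is an invented stand-in for the page's table so the example runs offline (the real markup may differ and the URL may no longer resolve), and `find_date_for` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the allocation table; the real page's markup may differ.
SAMPLE_HTML = """
<html><body><table>
  <tr><th>Visa</th><th>A</th><th>B</th><th>C</th><th>Date</th></tr>
  <tr><td>Skilled - Independent <b>subclass 885</b><i>online</i></td>
      <td>-</td><td>-</td><td>-</td><td>15 May 2011</td></tr>
  <tr><td>Placeholder row <b>subclass 000</b><i>online</i></td>
      <td>-</td><td>-</td><td>-</td><td>1 Jan 2000</td></tr>
</table></body></html>
"""


def find_date_for(html, label, column=4):
    """Return the text of cell `column` in the first row whose first cell contains `label`."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")  # header rows use <th>, so they yield no cells
        if len(cells) > column and label in cells[0].get_text():
            return cells[column].get_text(strip=True)
    return None


# To try a live page instead (assumption: the page is still reachable):
#   from urllib.request import urlopen
#   html = urlopen(url).read()
print(find_date_for(SAMPLE_HTML, "subclass 885online"))
```

As in the original loop, the label has no space before "online" because `get_text()` joins the text of adjacent tags without a separator.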

Solution 2:

You might want to use this as a starting point:

Python 2.6.7 (r267:88850, Jun 13 2011, 22:03:32) 
[GCC 4.6.1 20110608 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2, re
>>> from BeautifulSoup import BeautifulSoup
>>> urllib2.urlopen('http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm')
<addinfourl at 139158380 whose fp = <socket._fileobject object at 0x84aa2ac>>
>>> html = _.read()
>>> soup = BeautifulSoup(html)
>>> soup.find(text = re.compile('\\bsubclass 885\\b')).parent.parent.find('td', text = re.compile(' [0-9]{4}$'))
u'15 May 2011'
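
If Beautiful Soup is not available, the same two patterns can be applied using only the Python 3 standard library. Note that this swaps in a different technique from the session above: an `html.parser.HTMLParser` subclass collects the text of each table cell, and the regular expressions are then run over those strings. `RowCollector`, `find_cell`, and the `SAMPLE` markup are hypothetical names and data chosen for illustration.

```python
import re
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def find_cell(html, row_pattern, cell_pattern):
    """Return the first cell matching `cell_pattern` in the first row matching `row_pattern`."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    for row in parser.rows:
        if any(re.search(row_pattern, cell) for cell in row):
            for cell in row:
                if re.search(cell_pattern, cell):
                    return cell
    return None


# Invented sample row; the real page's markup may differ.
SAMPLE = (
    "<table><tr><td>Skilled <b>subclass 885</b> online</td>"
    "<td>n/a</td><td>15 May 2011</td></tr></table>"
)
print(find_cell(SAMPLE, r"\bsubclass 885\b", r" [0-9]{4}$"))
```

This is more code than the Beautiful Soup version and copes less well with badly formed HTML, so it is probably only worth it when installing a third-party package is not an option.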

Solution 3:

There is a library called Beautiful Soup that does what you are asking for: http://www.crummy.com/software/BeautifulSoup/
