Have HTMLParser Differentiate Between Link-text And Other Data?
Stuff I don't want
Using HTMLParser's handle_data doesn't di...
Solution 1:
Basically, you have to write a handle_starttag() method as well. Just save off every tag you see as self.lasttag or something. Then, in your handle_data() method, just check self.lasttag and see if it's 'a' (indicating that the last tag you saw was an HTML anchor tag and therefore you're in a link).
Something like this (untested) should work:
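from HTMLParser import HTMLParser  # Python 2; on Python 3: from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    lasttag = None

    def handle_starttag(self, tag, attr):
        self.lasttag = tag.lower()

    def handle_endtag(self, tag):
        # Forget the tag once it closes, so text after </a> is not treated as link text.
        self.lasttag = None

    def handle_data(self, data):
        if self.lasttag == "a" and data.strip():
            print(data)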
In fact, it's permissible in HTML to have other tags inside an <a...> ... </a> container. And there can also be anchors that contain text but aren't links (no href= attribute). These cases can both be handled if desired. Again, this code is untested:
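from HTMLParser import HTMLParser  # Python 2; on Python 3: from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    inlink = False
    data = []

    def handle_starttag(self, tag, attr):
        # Only anchors with an href attribute count as links.
        if tag.lower() == "a" and "href" in (k.lower() for k, v in attr):
            self.inlink = True
            self.data = []

    def handle_endtag(self, tag):
        # Print only when closing a real link, not an href-less anchor.
        if tag.lower() == "a" and self.inlink:
            self.inlink = False
            print("".join(self.data))

    def handle_data(self, data):
        if self.inlink:
            self.data.append(data)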
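For anyone on Python 3, the same idea might look roughly like the sketch below. It assumes the module's Python 3 location (html.parser) and collects the link texts in a list instead of printing them. The names LinkTextParser and link_texts are made up for this illustration and are not part of the library.

```python
from html.parser import HTMLParser  # Python 3 location of the module


class LinkTextParser(HTMLParser):
    """Collect the text of every <a href=...> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.inlink = False  # True while inside an <a> that has an href
        self.parts = []      # text chunks of the link currently being read
        self.links = []      # finished link texts, in document order

    def handle_starttag(self, tag, attrs):
        # html.parser already lowercases tag and attribute names.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.inlink = True
            self.parts = []

    def handle_endtag(self, tag):
        # Only close out a link we actually opened (skips href-less anchors).
        if tag == "a" and self.inlink:
            self.inlink = False
            self.links.append("".join(self.parts))

    def handle_data(self, data):
        if self.inlink:
            self.parts.append(data)


def link_texts(html):
    """Return the text of each link in the given HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling link_texts() on a page fragment returns one string per link, with text from nested tags such as <b> joined in. Anchors without href and text outside links are ignored.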
HTMLParser is what you'd call a SAX-style parser, which notifies you of the tags going by but makes you keep track of the tag hierarchy yourself. You can see how complicated this can get just by the differences between the first and second versions here.
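To give a feel for what "keeping track of the tag hierarchy yourself" involves, here is a rough Python 3 sketch that maintains a stack of open tags and records each piece of text along with the path it was found under. The StackParser class and the VOID_TAGS set are names invented for this example, and the recovery rule for mismatched end tags is a simplifying assumption, not how browsers do it.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StackParser(HTMLParser):
    """Track open tags by hand and remember where each text chunk appeared."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.seen = []   # (path, text) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.seen.append(("/".join(self.stack), text))
```

With this bookkeeping in place, "am I inside a link?" becomes a check for "a" in self.stack, at the cost of handling void elements and unbalanced markup yourself.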
DOM-style parsers are easier to work with for these kinds of tasks because they read the whole document into memory and produce a tree that is easily navigated and searched. DOM-style parsers tend to use more memory and be slower than SAX-style parsers, but this is much less important now than it was ten years ago.
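As a rough illustration of the tree-based approach, the Python 3 sketch below uses the standard library's xml.etree.ElementTree. It assumes the markup is well-formed XML, which real-world HTML often is not, so in practice a more forgiving third-party HTML parser is usually needed. The function name link_texts_dom is made up for this example.

```python
import xml.etree.ElementTree as ET


def link_texts_dom(xhtml):
    """Return the text of every <a> with an href in well-formed markup."""
    root = ET.fromstring(xhtml)  # parse the whole document into a tree
    return [
        "".join(a.itertext())    # all text inside the element, nested tags included
        for a in root.iter("a")  # every <a> element, at any depth
        if "href" in a.attrib
    ]
```

There are no flags or buffers to maintain here: the tree already knows which text belongs to which element, so the query is a single loop over the anchors.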