
Have HTMLParser Differentiate Between Link-text And Other Data?

Say I have html code similar to this: Stuff I do want

Stuff I don't want

Using HTMLParser's handle_data doesn't di

Solution 1:

Basically you have to write a handle_starttag() method as well. Just save off every tag you see as self.lasttag or something. Then, in your handle_data() method, just check self.lasttag and see if it's 'a' (indicating that the last tag you saw was an HTML anchor tag and therefore you're in a link).

Something like this (untested) should work:

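from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    lasttag = None

    def handle_starttag(self, tag, attrs):
        # Remember the most recently opened tag.
        self.lasttag = tag.lower()

    def handle_data(self, data):
        # Print non-blank text whose most recent start tag was <a>.
        if self.lasttag == "a" and data.strip():
            print(data)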
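To try the last-tag idea on a concrete input, here is a small sketch written for Python 3 (where the module is html.parser). The class name LastTagParser, the found list, and the sample markup are made up for illustration; the only change from the approach described above is that matches are collected in a list instead of printed.

```python
from html.parser import HTMLParser

class LastTagParser(HTMLParser):
    """Hypothetical variant that collects text instead of printing it."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Remember the most recently opened tag.
        self.lasttag = tag

    def handle_data(self, data):
        # Keep non-blank text whose most recent start tag was <a>.
        if self.lasttag == "a" and data.strip():
            self.found.append(data.strip())

parser = LastTagParser()
parser.feed("<p>Intro</p><a href='/x'>Stuff I do want</a> trailing text"
            "<p>Stuff I don't want</p>")
parser.close()
print(parser.found)  # ['Stuff I do want', 'trailing text']
```

Note that "trailing text" is picked up too: it comes after the closing </a>, but the last start tag seen is still a, so this approach only works when some other tag opens before the unwanted text.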

In fact it's permissible in HTML to have other tags inside an <a ...> ... </a> container, and the version above misses any text that follows such a nested tag, because the last tag seen is then no longer 'a'. There can also be anchors that contain text but aren't links (no href attribute). Both cases can be handled if desired. Again, this code is untested:

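from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    inlink = False
    data = []

    def handle_starttag(self, tag, attrs):
        # Only anchors that carry an href attribute count as links.
        if tag.lower() == "a" and "href" in (k.lower() for k, v in attrs):
            self.inlink = True
            self.data = []

    def handle_endtag(self, tag):
        # Print the collected text once, and only if a link was actually open.
        if tag.lower() == "a" and self.inlink:
            self.inlink = False
            print("".join(self.data))

    def handle_data(self, data):
        # Collect every text chunk seen while inside a link.
        if self.inlink:
            self.data.append(data)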
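As a rough check of the in-link flag idea, the sketch below (Python 3) runs it on a snippet that has a nested tag inside a link and an anchor without href. LinkTextParser, link_texts, and the sample markup are hypothetical names chosen for this example, and the link texts are returned in a list rather than printed.

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inlink = False
        self.parts = []   # text chunks of the link currently open
        self.links = []   # finished link texts

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.inlink = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "a" and self.inlink:
            self.inlink = False
            self.links.append("".join(self.parts))

    def handle_data(self, data):
        if self.inlink:
            self.parts.append(data)

def link_texts(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links

sample = ("<p>Stuff I don't want</p>"
          "<a href='/x'>Stuff I <b>do</b> want</a>"
          "<a name='top'>Not a link</a>")
print(link_texts(sample))  # ['Stuff I do want']
```

The text around the nested <b> tag arrives in three separate handle_data() calls, which is why the pieces are joined only when the closing </a> is seen.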

HTMLParser is what you'd call a SAX-style parser, which notifies you of the tags going by but makes you keep track of the tag hierarchy yourself. You can see how complicated this can get just by the differences between the first and second versions here.
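To make "keeping track of the tag hierarchy yourself" concrete, here is one possible sketch (Python 3) that maintains a stack of open tags and records the path each piece of text was found under. PathParser, VOID_TAGS, and the sample markup are assumptions for illustration, and the void-tag set is deliberately incomplete.

```python
from html.parser import HTMLParser

# A few elements that never get an end tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class PathParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.seen = []    # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.seen.append(("/".join(self.stack), data.strip()))

parser = PathParser()
parser.feed("<div><a href='/x'>Stuff I <b>do</b> want</a><br>"
            "Stuff I don't want</div>")
parser.close()
for path, text in parser.seen:
    print(path, "->", text)
# div/a -> Stuff I
# div/a/b -> do
# div/a -> want
# div -> Stuff I don't want
```

With the stack in place, "is this text inside a link?" becomes a check for "a" in the current path, at the cost of bookkeeping that the parser does not do for you.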

DOM-style parsers are easier to work with for these kinds of tasks because they read the whole document into memory and produce a tree that is easily navigated and searched. DOM-style parsers tend to use more memory and be slower than SAX-style parsers, but this is much less important now than it was ten years ago.
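For comparison, here is what the same task can look like with a tree-building parser. This sketch uses xml.etree.ElementTree from the Python 3 standard library, which is an XML parser and therefore assumes well-formed markup; for real-world HTML a lenient third-party tree builder would usually be needed instead. The sample document is made up.

```python
import xml.etree.ElementTree as ET

doc = """<div>
  <p>Stuff I don't want</p>
  <a href="/x">Stuff I <b>do</b> want</a>
  <a name="top">Not a link</a>
</div>"""

# Parse the whole document into an in-memory tree.
root = ET.fromstring(doc)

# Walk every <a> element, keep those with an href, and join all nested text.
links = ["".join(a.itertext()) for a in root.iter("a") if "href" in a.attrib]
print(links)  # ['Stuff I do want']
```

No state flags are needed here: the nesting is already in the tree, so the nested <b> text and the href check are each handled by a single expression.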
