
Extract Tags From A Html File Using Jsoup

I am doing a structural analysis of web documents. For this I need to extract only the structure of a web document (only the tags). I found an HTML parser for Java called Jsoup. But

Solution 1:

Sounds like a depth-first traversal:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDepthFirst {

    // Returns the tag names in visit order, separated by commas.
    private static String htmlTags(Document doc) {
        StringBuilder sb = new StringBuilder();
        htmlTags(doc.children(), sb);
        return sb.toString();
    }

    // Appends each tag name once on entry (opening tag) and once on exit (closing tag).
    private static void htmlTags(Elements elements, StringBuilder sb) {
        for (Element el : elements) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(el.nodeName());
            htmlTags(el.children(), sb);
            sb.append(",").append(el.nodeName());
        }
    }

    public static void main(String... args) {
        String s = "<html><head>this is head </head><body>this is body</body></html>";
        Document doc = Jsoup.parse(s);
        System.out.println(htmlTags(doc));
    }
}
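To make the visit order concrete without any jsoup dependency, here is a minimal sketch of the same enter/exit traversal on a hand-built tree. The `TagNode` type and the `TagOrderSketch` class are hypothetical stand-ins for illustration only and are not part of jsoup. For a tree shaped like the sample input (`html` containing `head` and `body`), it yields the same kind of sequence the answer's code builds:

```java
import java.util.ArrayList;
import java.util.List;

class TagOrderSketch {

    // Hypothetical minimal tree node, standing in for a jsoup Element.
    static final class TagNode {
        final String name;
        final List<TagNode> children = new ArrayList<>();

        TagNode(String name, TagNode... kids) {
            this.name = name;
            for (TagNode kid : kids) {
                children.add(kid);
            }
        }
    }

    // Depth-first walk: record the name on entry and again on exit.
    static void walk(TagNode node, List<String> out) {
        out.add(node.name);
        for (TagNode child : node.children) {
            walk(child, out);
        }
        out.add(node.name);
    }

    // Joins the recorded names with commas, as the answer above does.
    static String tagOrder(TagNode root) {
        List<String> out = new ArrayList<>();
        walk(root, out);
        return String.join(",", out);
    }

    public static void main(String[] args) {
        TagNode html = new TagNode("html", new TagNode("head"), new TagNode("body"));
        System.out.println(tagOrder(html)); // prints html,head,head,body,body,html
    }
}
```

Each name appears twice, once for the opening tag and once for the closing tag, so the nesting can be recovered from the flat list.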

Another solution is to use jsoup's NodeVisitor, as follows:

   SecondSolution ss = new SecondSolution();
   doc.traverse(ss);
   System.out.println(ss.sb.toString());

The visitor class:

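// Needs these imports at the top of the enclosing file:
// org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.nodes.Node, org.jsoup.select.NodeVisitor
public static class SecondSolution implements NodeVisitor {

    StringBuilder sb = new StringBuilder();

    // Called when a node is first reached (opening tag).
    @Override
    public void head(Node node, int depth) {
        // Skip text nodes and the Document root itself.
        if (node instanceof Element && !(node instanceof Document)) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(node.nodeName());
        }
    }

    // Called after all of the node's descendants were visited (closing tag).
    @Override
    public void tail(Node node, int depth) {
        if (node instanceof Element && !(node instanceof Document)) {
            sb.append(",").append(node.nodeName());
        }
    }
}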
