
Extracting Title From Html Not Working

I'm performing some text analytics on a large number of novels downloaded from Gutenberg. I want to keep as much metadata as I can, so I'm downloading as HTML then later converti

Solution 1:

You can use other BeautifulSoup (bs4) methods, such as find() combined with get_text():

title_data = soup.find('title').get_text()
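
One caveat with the line above: find() returns None when the document has no <title> tag, so calling get_text() on the result raises AttributeError. A minimal sketch of a defensive wrapper follows; the helper name extract_title and the sample markup are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def extract_title(markup, default=None):
    """Return the stripped <title> text, or `default` if no <title> exists."""
    # html.parser is the parser bundled with Python, so no extra install is needed
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.find("title")
    if tag is None:
        # No <title> tag: avoid calling get_text() on None
        return default
    return tag.get_text(strip=True)


# Hypothetical sample input, not a real Gutenberg file
sample = "<html><head><title> Jane Eyre </title></head><body></body></html>"
print(extract_title(sample))
print(extract_title("<p>no head</p>", default="(untitled)"))
```

Returning a default instead of raising makes it easier to run over a large batch of files and log the ones that lack a title.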

Solution 2:

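Try this, if the document was parsed with lxml or ElementTree rather than BeautifulSoup. The './/title' path is ElementTree syntax; BeautifulSoup's find() expects a plain tag name and returns None for a path like this.

title_data = tree.find(".//title").text

or

title_data = tree.findtext('.//title')

Here tree stands for the root element returned by the lxml or ElementTree parser.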
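
To show where the './/title' path expression comes from, here is a small sketch using the standard library's xml.etree.ElementTree. The sample document is invented, and it is deliberately well-formed XML without a namespace; real Gutenberg HTML may not parse this way, in which case a more tolerant parser would be needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (ElementTree needs valid XML)
xhtml = (
    "<html><head><title>Jane Eyre</title></head>"
    "<body><p>text</p></body></html>"
)

root = ET.fromstring(xhtml)

# find() returns the first matching element; .text is its text content
title_elem = root.find(".//title")

# findtext() does both steps at once and returns None if nothing matches
title_data = root.findtext(".//title")

print(title_elem.text)
print(title_data)
```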

Solution 3:

Try using html.parser instead of lxml.

For example:

from bs4 import BeautifulSoup

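# Open the HTML file and parse it with Python's built-in parser;
# the with-block closes the file once parsing is done
with open("filepath/Jane_Eyre.htm") as html:
    soup = BeautifulSoup(html, 'html.parser')

# Read the text of the <title> tag
title_data = soup.title.string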

Your <html> tag declares a namespace, so if you parse the file with lxml you have to take that namespace into account when looking up tags.
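
The namespace point can be illustrated with the standard library's xml.etree.ElementTree, which is used here in place of lxml so the sketch has no dependencies. The sample document and the XHTML namespace URI are assumptions, since the original file is not shown.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: the standard XHTML one; the real file may differ
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample with a namespace declared on <html>
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Jane Eyre</title></head><body/></html>"
)
root = ET.fromstring(doc)

# An unqualified lookup misses, because the tag is really {namespace}title
plain = root.findtext(".//title")

# Option 1: spell out the namespace in Clark notation
qualified = root.findtext(".//{%s}title" % XHTML_NS)

# Option 2: map a prefix of your choosing to the namespace
via_map = root.findtext(".//x:title", namespaces={"x": XHTML_NS})

print(plain, qualified, via_map)
```

An unqualified lookup that silently returns None is a typical symptom of this problem, which is why switching to a parser that ignores namespaces, such as html.parser, also makes it go away.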

Solution 4:

Why not simply use lxml?

from lxml import html

page = html.fromstring(source_string)
# '//title' finds <title> at any depth; '/title' only matches a root element named title
title = page.xpath("//title/text()")[0]

Solution 5:

The following approach extracts titles from a Project Gutenberg listing page (here, the page for subject 99) rather than from a downloaded ebook file.

>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.gutenberg.org/ebooks/subject/99'
>>> req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required = soup.find_all("span", {"class": "title"})
>>> x1 = []
>>> for i in required:
...     x1.append(i.get_text())
...
>>> for i in x1:
...     print(i)
...
Sort Alphabetically
Sort by Release Date
Great Expectations
Jane Eyre: An Autobiography
Les Misérables
Oliver Twist
Anne of Green Gables
David Copperfield
The Secret Garden
Anne of the Island
Anne of Avonlea
A Little Princess
Kim
Anne's House of Dreams
Heidi
The Mysteries of Udolpho
Of Human Bondage
The Secret Garden
Daddy-Long-Legs
Les misérables Tome I: Fantine (French)
Jane Eyre
Rose in Bloom
Further Chronicles of Avonlea
The Children of the New Forest
Oliver Twist; or, The Parish Boy's Progress. Illustrated
The Personal History of David Copperfield
Heidi
>>>
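
The first two lines of the output above, "Sort Alphabetically" and "Sort by Release Date", are not book titles; they appear to be page controls that happen to use the same "title" class. A minimal sketch for dropping them is shown below. The helper name clean_titles and the NAV_LABELS set are illustrative, and the labels are taken only from the output shown here, so they may need adjusting if the page changes.

```python
# Labels seen in the output above that are not book titles (assumed to be
# sorting links on the listing page)
NAV_LABELS = {"Sort Alphabetically", "Sort by Release Date"}


def clean_titles(raw_titles):
    """Strip whitespace and drop empty entries and known navigation labels."""
    cleaned = []
    for text in raw_titles:
        title = text.strip()
        if title and title not in NAV_LABELS:
            cleaned.append(title)
    return cleaned


# A few entries copied from the output above
x1 = ["Sort Alphabetically", "Sort by Release Date", "Great Expectations", "Kim"]
print(clean_titles(x1))
```

Repeated titles such as "Heidi" are kept on purpose, since the listing seems to contain separate editions of the same book.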
