
Python Beautiful Soup Can't Find Specific Table

I'm having issues with scraping basketball-reference.com. I'm trying to access the 'Team Per Game Stats' table but can't seem to target the correct div/table. I'm trying to capture

Solution 1:

As Jarett mentioned above, BeautifulSoup can't find your tag. In this case it's because the table is commented out in the page source. Splitting the raw HTML on the surrounding markup is admittedly an amateurish approach, but it works for your data.

table_src = html.text.split('<div class="overthrow table_container" id="div_team-stats-per_game">')[1].split('</table>')[0] + '</table>'
table = BeautifulSoup(table_src, 'lxml')
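
The same string-slicing idea can be wrapped in a small helper so that a missing marker fails with a clear error instead of an IndexError. This is only a sketch: the function name `extract_table_html` and the sample markup are hypothetical, and it assumes the table's opening tag carries the id written exactly as `id="team-stats-per_game"` and that tables are not nested.

```python
def extract_table_html(page_html, table_id):
    """Return the '<table ...>...</table>' substring for the given id.

    Works on raw text, so it also finds tables inside HTML comments.
    Assumes the id is written as id="..." on a <table> tag and that
    tables are not nested.
    """
    marker = 'id="{}"'.format(table_id)
    id_pos = page_html.find(marker)
    if id_pos == -1:
        raise ValueError("no element with id {!r} found".format(table_id))
    # Walk back to the start of the <table> tag that carries the id.
    start = page_html.rfind("<table", 0, id_pos)
    # Walk forward to the first closing tag after the id.
    end = page_html.find("</table>", id_pos)
    if start == -1 or end == -1:
        raise ValueError("element {!r} is not a complete table".format(table_id))
    return page_html[start:end + len("</table>")]


# Hypothetical, heavily simplified stand-in for the real page source.
sample = (
    '<div id="all_team-stats-per_game"><!--\n'
    '<div class="overthrow table_container" id="div_team-stats-per_game">\n'
    '<table class="stats_table" id="team-stats-per_game">'
    '<tr><th>Team</th><th>PTS</th></tr>'
    '<tr><td>Milwaukee Bucks*</td><td>9686</td></tr>'
    '</table>\n</div>\n--></div>'
)
table_src = extract_table_html(sample, "team-stats-per_game")
```

The returned string can be passed to BeautifulSoup(table_src, 'lxml') in the same way as the snippet above.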

Solution 2:

The tables are rendered after the page loads, so you could use Selenium to let them render, or use the approach mentioned above. Neither is strictly necessary, though, because most of the tables sit inside HTML comments. You can use BeautifulSoup to pull out the comments and then search those for table tags.

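from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

# NBA season
year = 2019

# The {} placeholder is filled in with the season year
url = 'https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base'.format(year)
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

# Collect every HTML comment node in the page
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            # Wrap in StringIO: recent pandas versions deprecate passing literal HTML strings
            tables.append(pd.read_html(StringIO(each))[0])
        except ValueError:
            # Raised when the comment holds no parsable table
            continue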

This returns a list of DataFrames, so pull out the table you want by its index position in the list:

Output:

print(tables[3])

      Rk  Team                      G     MP    FG  ...  STL  BLK   TOV    PF   PTS
0    1.0  Milwaukee Bucks*         82  19780  3555  ...  615  486  1137  1608  9686
1    2.0  Golden State Warriors*   82  19805  3612  ...  625  525  1169  1757  9650
2    3.0  New Orleans Pelicans     82  19755  3581  ...  610  441  1215  1732  9466
3    4.0  Philadelphia 76ers*      82  19805  3407  ...  606  432  1223  1745  9445
4    5.0  Los Angeles Clippers*    82  19830  3384  ...  561  385  1193  1913  9442
5    6.0  Portland Trail Blazers*  82  19855  3470  ...  546  413  1135  1669  9402
6    7.0  Oklahoma City Thunder*   82  19855  3497  ...  766  425  1145  1839  9387
7    8.0  Toronto Raptors*         82  19880  3460  ...  680  437  1150  1724  9384
8    9.0  Sacramento Kings         82  19730  3541  ...  679  363  1095  1751  9363
9   10.0  Washington Wizards       82  19930  3456  ...  683  379  1154  1701  9350
10  11.0  Houston Rockets*         82  19830  3218  ...  700  405  1094  1803  9341
11  12.0  Atlanta Hawks            82  19855  3392  ...  675  419  1397  1932  9294
12  13.0  Minnesota Timberwolves   82  19830  3413  ...  683  411  1074  1664  9223
13  14.0  Boston Celtics*          82  19780  3451  ...  706  435  1052  1670  9216
14  15.0  Brooklyn Nets*           82  19980  3301  ...  539  339  1236  1763  9204
15  16.0  Los Angeles Lakers       82  19780  3491  ...  618  440  1284  1701  9165
16  17.0  Utah Jazz*               82  19755  3314  ...  663  483  1240  1728  9161
17  18.0  San Antonio Spurs*       82  19805  3468  ...  501  386   992  1487  9156
18  19.0  Charlotte Hornets        82  19830  3297  ...  591  405  1001  1550  9081
19  20.0  Denver Nuggets*          82  19730  3439  ...  634  363  1102  1644  9075
20  21.0  Dallas Mavericks         82  19780  3182  ...  533  351  1167  1650  8927
21  22.0  Indiana Pacers*          82  19705  3390  ...  713  404  1122  1594  8857
22  23.0  Phoenix Suns             82  19880  3289  ...  735  418  1279  1932  8815
23  24.0  Orlando Magic*           82  19780  3316  ...  543  445  1082  1526  8800
24  25.0  Detroit Pistons*         82  19855  3185  ...  569  331  1135  1811  8778
25  26.0  Miami Heat               82  19730  3251  ...  627  448  1208  1712  8668
26  27.0  Chicago Bulls            82  19905  3266  ...  603  351  1159  1663  8605
27  28.0  New York Knicks          82  19780  3134  ...  557  422  1151  1713  8575
28  29.0  Cleveland Cavaliers      82  19755  3189  ...  534  195  1106  1642  8567
29  30.0  Memphis Grizzlies        82  19880  3113  ...  684  448  1147  1801  8490
30   NaN  League Average           82  19815  3369  ...  626  406  1155  1714  9119

[31 rows x 25 columns]
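
Picking a table by list position is brittle if the page layout changes, so it may be safer to look the table up by its id inside the comments. The following is a minimal sketch of that idea, not a tested scraper for the live site: the helper name `find_commented_table` and the sample markup are hypothetical, and the id `team-stats-per_game` is the one used elsewhere in this thread.

```python
from bs4 import BeautifulSoup, Comment


def find_commented_table(page_html, table_id):
    """Return the <table> Tag with the given id, even if it is commented out."""
    soup = BeautifulSoup(page_html, "html.parser")
    # A table present in the live markup is returned directly.
    table = soup.find("table", id=table_id)
    if table is not None:
        return table
    # Otherwise re-parse each comment that mentions the id.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if table_id not in comment:
            continue
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


# Hypothetical, heavily simplified stand-in for the real page source.
sample = (
    '<div id="all_team-stats-per_game"><!-- '
    '<table id="team-stats-per_game">'
    '<tr><th>Team</th></tr>'
    '<tr><td>Milwaukee Bucks*</td></tr>'
    '</table> --></div>'
)
table = find_commented_table(sample, "team-stats-per_game")
print(table.find("td").get_text())
```

If a DataFrame is needed, the returned tag can be converted with pd.read_html(io.StringIO(str(table)))[0].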

Solution 3:

As other answers mentioned, this happens because the page content is loaded with the help of JavaScript, so fetching the source with urlopen or requests will not include that dynamic part.

A workaround is to use Selenium to let the dynamic content load, then take the page source from the browser and search it for the table. The code below gives the result you expected, but you will need to set up a Selenium web driver first.

from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver


def parse(url):
    # Open the page in a real browser so that any JavaScript runs
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        sleep(3)  # crude fixed wait for the page to finish rendering
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


year = 2019
# The {} placeholder is filled in with the season year
url = "https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base".format(year)
soup = BeautifulSoup(parse(url), 'lxml')
x = soup.find("table", id="team-stats-per_game")
print(x)

Hope this helps with your problem; feel free to ask if you have further questions.

Happy Coding:)
