Table Web Scraping Issues With Python
Solution 1:
You can use webdriver, pandas and BeautifulSoup to get all the table data.
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url = "https://fantasy.premierleague.com/player-list"

driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'html.parser')  # specify a parser to avoid a warning
# note: these class names are auto-generated and may change over time
table = soup.find_all('table', {'class': 'Table-ziussd-1 fVnGhl'})
df = pd.read_html(str(table))  # list of DataFrames, one per table
print(df)
Output will be:
[ Player Team Points Cost
0 Alisson Liverpool 99 £6.2
1 Ederson Man City 89 £6.0
2 Kepa Chelsea 72 £5.4
3 Schmeichel Leicester 122 £5.4
4 de Gea Man Utd 105 £5.3
5 Lloris Spurs 56 £5.3
6 Henderson Sheffield Utd 135 £5.3
7 Pickford Everton 93 £5.2
8 Patrício Wolves 122 £5.2
9 Dubravka Newcastle 124 £5.1
10 Leno Arsenal 114 £5.0
11 Guaita Crystal Palace 122 £5.0
12 Pope Burnley 129 £4.9
13 Foster Watford 113 £4.9
14 Fabianski West Ham 61 £4.9
15 Caballero Chelsea 7 £4.8
16 Ryan Brighton 105 £4.7
17 Bravo Man City 11 £4.7
18 Grant Man Utd 0 £4.7
19 Romero Man Utd 0 £4.6
20 Krul Norwich 94 £4.6
21 Mignolet Liverpool 0 £4.5
22 McCarthy Southampton 74 £4.5
23 Ramsdale Bournemouth 97 £4.5
24 Fahrmann Norwich 1 £4.4
and so on........................................]
Solution 2:
The table you want to scrape is generated with JavaScript, which is not executed when you do html = urlopen(url), so it is not in the soup either.
There are many methods for getting dynamically generated data. Check here for an example.
Solution 3:
https://fantasy.premierleague.com/player-list uses Javascript to generate data to html. BeautifulSoup cannot scrape Javascript so we need to emulate real browser to load data. To do this you can use Selenium - In below code I user Firefox but you can use Chrome for example. Please check Selenium's documentation on how to get it running.
Script opens Firefox browser, pauses for 1 second ( to make sure that all Javascript data has loaded) and passes html to BeautifulSoup. You might need to pip install lxml
parser for script to run.
Then we look for all div', {'class':'Layout__Main-eg6k6r-1 cSyfD'
as those contain all 4 tables on the website. You may want to use Inspect Element
tool in your browser to check names of tables, div's to target your search.
Then you can call any of 4 divs and search for tr
in each.
from selenium import webdriver
import time
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.set_window_size(700, 900)

url = 'https://fantasy.premierleague.com/player-list'
browser.get(url)
time.sleep(1)  # wait for the JavaScript-generated content to load

html = browser.execute_script('return document.documentElement.outerHTML')
all_html = BeautifulSoup(html, 'lxml')
all_tables = all_html.find_all('div', {'class': 'Layout__Main-eg6k6r-1 cSyfD'})
print('Found ' + str(len(all_tables)) + ' tables')

table1_goalkeepers = all_tables[0]
rows_goalkeepers = table1_goalkeepers.tbody
print('Goalkeepers: \n')
print(rows_goalkeepers)

table2_defenders = all_tables[1]
print('Defenders \n')
rows_defenders = table2_defenders.tbody
print(rows_defenders)

browser.quit()
Sample output:
Goalkeepers:
<tbody><tr><td>Alisson</td><td>Liverpool</td><td>99</td><td>£6.2</td></tr><tr><td>Ederson</td><td>Man City</td><td>88</td><td>£6.0</td></tr><tr><td>Kepa</td><td>Chelsea</td><td>72</td><td>£5.4</td></tr><tr><td>Schmeichel</td><td>Leicester</td><td>122</td><td>£5.4</td></tr><tr><td>de Gea</td><td>Man Utd</td><td>105</td><td>£5.3</td></tr><tr><td>Lloris</td><td>Spurs</td><td>56</td><td>£5.3</td></tr><tr><td>Henderson</td><td>Sheffield Utd</td><td>135</td><td>£5.3</td></tr><tr><td>Pickford</td><td>Everton</td><td>93</td><td>£5.2</td></tr><tr><td>Patrício</td><td>Wolves</td><td>122</td><td>£5.2</td></tr><tr><td>Dubravka</td><td>Newcastle</td><td>124</td><td>£5.1</td></tr><tr><td>Leno</td><td>Arsenal</td><td>114</td><td>£5.0</td></tr><tr><td>Guaita</td><td>Crystal Palace</td><td>122</td><td>£5.0</td></tr><tr><td>Pope</td><td>Burnley</td><td>128</td><td>£4.9</td></tr><tr><td>Foster</td><td>Watford</td><td>113</td><td>£4.9</td></tr><tr><td>Fabianski</td><td>West Ham</td><td>61</td><td>£4.9</td></tr><tr><td>Caballero</td><td>Chelsea</td><td>7</td><td>£4.8</td></tr><tr><td>Ryan</td><td>Brighton</td><td>105</td><td>£4.7</td></tr><tr><td>Bravo</td><td>Man City</td><td>11</td><td>£4.7</td></tr><tr><td>Grant</td><td>Man Utd</td><td>0</td><td>£4.7</td></tr><tr><td>Romero</td><td>Man Utd</td><td>0</td><td>£4.6</td></tr><tr><td>Krul</td><td>Norwich</td><td>94</td><td>£4.6</td></tr><tr><td>Mignolet</td><td>Liverpool</td><td>0</td><td>£4.5</td></tr><tr><td>McCarthy</td><td>Southampton</td><td>74</td><td>£4.5</td></tr><tr><td>Ramsdale</td><td>Bournemouth</td><td>97</td><td>£4.5</td></tr><tr><td>Fahrmann</td><td>Norwich</td><td>1</td><td>£4.4</td></tr><tr><td>Roberto</td><td>West Ham</td><td>18</td><td>£4.4</td></tr><tr><td>Verrips</td><td>Sheffield Utd</td><td>0</td><td>£4.4</td></tr><tr><td>Kelleher</td><td>Liverpool</td><td>0</td><td>£4.4</td></tr><tr><td>Reina</td><td>Aston Villa</td><td>19</td><td>£4.4</td></tr><tr><td>Nyland</td><td>Aston 
Villa</td><td>11</td><td>£4.3</td></tr><tr><td>Heaton</td><td>Aston Villa</td><td>59</td><td>£4.3</td></tr><tr><td>Darlow</td><td>Newcastle</td><td>0</td><td>£4.3</td></tr><tr><td>Eastwood</td><td>Sheffield Utd</td><td>0</td><td>£4.3</td></tr><tr><td>Steer</td><td>Aston Villa</td><td>1</td><td>£4.3</td></tr><tr><td>Moore</td><td>Sheffield Utd</td><td>1</td><td>£4.3</td></tr><tr><td>Peacock-Farrell</td><td>Burnley</td><td>0</td><td>£4.3</td></tr></tbody>
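If you prefer to post-process a tbody like the one above without BeautifulSoup, the standard library's html.parser can extract the tr/td text. A minimal sketch over a made-up two-row sample in the same shape as the output above:

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])   # start a new row
        elif tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.rows[-1].append(data)

# sample data copied from the output above, truncated to two rows
sample = ('<tbody><tr><td>Alisson</td><td>Liverpool</td><td>99</td><td>£6.2</td></tr>'
          '<tr><td>Ederson</td><td>Man City</td><td>88</td><td>£6.0</td></tr></tbody>')

parser = RowExtractor()
parser.feed(sample)
print(parser.rows)
# → [['Alisson', 'Liverpool', '99', '£6.2'], ['Ederson', 'Man City', '88', '£6.0']]
```

Each row comes back as a plain list of strings, which is easy to feed into csv.writer or pandas.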
Solution 4:
This page uses JavaScript to add the data, but BeautifulSoup can't run JavaScript.
You can use Selenium to control a web browser, which can run JavaScript.
Or you can check in DevTools in Firefox/Chrome (tab: Network) what URL JavaScript uses to get the data from the server, and fetch that URL with urllib.
I chose the second method (manually searching in DevTools).
I found that JavaScript gets this data in JSON format from
https://fantasy.premierleague.com/api/bootstrap-static/
Because I get the data as JSON, I can convert it to a Python list/dictionary using the module json, and I don't need BeautifulSoup.
It needs more manual work to recognize the structure of the data, but it gives more data than the table on the page.
Here is all the data about the first player on the list, Alisson:
chance_of_playing_next_round = 100
chance_of_playing_this_round = 100
code = 116535
cost_change_event = 0
cost_change_event_fall = 0
cost_change_start = 2
cost_change_start_fall = -2
dreamteam_count = 1
element_type = 1
ep_next = 11.0
ep_this = 11.0
event_points = 10
first_name = Alisson
form = 10.0
id = 189
in_dreamteam = False
news =
news_added = 2020-03-06T14:00:17.901193Z
now_cost = 62
photo = 116535.jpg
points_per_game = 4.7
second_name = Ramses Becker
selected_by_percent = 9.2
special = False
squad_number = None
status = a
team = 10
team_code = 14
total_points = 99
transfers_in = 767780
transfers_in_event = 9339
transfers_out = 2033680
transfers_out_event = 2757
value_form = 1.6
value_season = 16.0
web_name = Alisson
minutes = 1823
goals_scored = 0
assists = 1
clean_sheets = 11
goals_conceded = 12
own_goals = 0
penalties_saved = 0
penalties_missed = 0
yellow_cards = 0
red_cards = 1
saves = 48
bonus = 9
bps = 439
influence = 406.2
creativity = 10.0
threat = 0.0
ict_index = 41.7
influence_rank = 135
influence_rank_type = 18
creativity_rank = 411
creativity_rank_type = 8
threat_rank = 630
threat_rank_type = 71
ict_index_rank = 294
ict_index_rank_type = 18
There is also information about teams, etc.
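The json.loads step can be tried offline on a made-up fragment shaped like the API response (the field names below match the real response, the values are invented for illustration):

```python
import json

# made-up fragment in the same shape as the bootstrap-static response
text = '''{
  "element_types": [{"id": 1, "plural_name": "Goalkeepers"}],
  "elements": [{"first_name": "Alisson", "element_type": 1,
                "total_points": 99, "now_cost": 62}]
}'''

data = json.loads(text)  # JSON text -> Python dict/list

player = data['elements'][0]
# now_cost is in tenths of a million, so divide by 10 for the displayed price
print(player['first_name'], player['total_points'], player['now_cost'] / 10)
# → Alisson 99 6.2
```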
Code:
from urllib.request import urlopen
import json

#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'

text = urlopen(url).read().decode()
data = json.loads(text)

print('\n--- element type ---\n')

#print(data['element_types'][0])
for item in data['element_types']:
    print(item['id'], item['plural_name'])

print('\n--- Goalkeepers ---\n')

number = 0
for item in data['elements']:
    if item['element_type'] == 1:  # Goalkeepers
        number += 1
        print('---', number, '---')
        print('type :', data['element_types'][item['element_type']-1]['plural_name'])
        print('first_name :', item['first_name'])
        print('second_name :', item['second_name'])
        print('total_points:', item['total_points'])
        print('team :', data['teams'][item['team']-1]['name'])
        print('cost :', item['now_cost']/10)
        if item['first_name'] == 'Alisson':
            for key, value in item.items():
                print('  ', key, '=', value)
Result:
--- element type ---
1 Goalkeepers
2 Defenders
3 Midfielders
4 Forwards
--- Goalkeepers ---
--- 1 ---
type : Goalkeepers
first_name : Bernd
second_name : Leno
total_points: 114
team : Arsenal
cost : 5.0
--- 2 ---
type : Goalkeepers
first_name : Emiliano
second_name : Martínez
total_points: 1
team : Arsenal
cost : 4.2
--- 3 ---
type : Goalkeepers
first_name : Ørjan
second_name : Nyland
total_points: 11
team : Aston Villa
cost : 4.3
--- 4 ---
type : Goalkeepers
first_name : Tom
second_name : Heaton
total_points: 59
team : Aston Villa
cost : 4.3
The code gives the data in a different order than the table, but if you put it all in a list, or better in a pandas DataFrame, then you can sort it in different orders.
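A minimal stdlib sketch of that idea, sorting a list of dicts (rows made up to match the shape printed above):

```python
# made-up rows in the shape of the goalkeeper data above
goalkeepers = [
    {'second_name': 'Leno', 'team': 'Arsenal', 'now_cost': 5.0},
    {'second_name': 'Alisson', 'team': 'Liverpool', 'now_cost': 6.2},
    {'second_name': 'Heaton', 'team': 'Aston Villa', 'now_cost': 4.3},
]

# sort the same list two different ways
by_cost = sorted(goalkeepers, key=lambda p: p['now_cost'], reverse=True)
by_name = sorted(goalkeepers, key=lambda p: p['second_name'])

print([p['second_name'] for p in by_cost])  # most expensive first
# → ['Alisson', 'Leno', 'Heaton']
```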
EDIT:
You can use pandas to get the data from the JSON.
from urllib.request import urlopen
import json
import pandas as pd
#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'
# read data from url and convert to Python's list/dictionary
text = urlopen(url).read().decode()
data = json.loads(text)
# create DataFrames
players = pd.DataFrame.from_dict(data['elements'])
teams = pd.DataFrame.from_dict(data['teams'])
# divide by 10 to get `6.2` instead of `62`
players['now_cost'] = players['now_cost'] / 10
# convert team's number to its name
players['team'] = players['team'].apply(lambda x: teams.iloc[x-1]['name'])
# filter players
goalkeepers = players[ players['element_type'] == 1 ]
defenders = players[ players['element_type'] == 2 ]
# etc.
# some information
print('\n--- goalkeepers columns ---\n')
print(goalkeepers.columns)
print('\n--- goalkeepers sorted by name ---\n')
sorted_data = goalkeepers.sort_values(['first_name'])
print(sorted_data[['first_name', 'team', 'now_cost']].head())
print('\n--- goalkeepers sorted by cost ---\n')
sorted_data = goalkeepers.sort_values(['now_cost'], ascending=False)
print(sorted_data[['first_name', 'team', 'now_cost']].head())
print('\n--- teams columns ---\n')
print(teams.columns)
print('\n--- teams ---\n')
print(teams['name'].head())
# etc.
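The iloc lookup above assumes team ids are consecutive; an arguably more robust alternative is to join the team name on the team id with merge. A small sketch with made-up frames (column names follow the API fields):

```python
import pandas as pd

# tiny made-up frames shaped like data['elements'] and data['teams']
players = pd.DataFrame({'web_name': ['Alisson', 'Leno'],
                        'team': [10, 1],
                        'now_cost': [62, 50]})
teams = pd.DataFrame({'id': [1, 10],
                      'name': ['Arsenal', 'Liverpool']})

# join the team name on the team id instead of indexing by position
merged = players.merge(teams, left_on='team', right_on='id', how='left')
merged['now_cost'] = merged['now_cost'] / 10  # displayed price

print(merged[['web_name', 'name', 'now_cost']])
```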
Results:
--- goalkeepers columns ---
Index(['chance_of_playing_next_round', 'chance_of_playing_this_round', 'code',
'cost_change_event', 'cost_change_event_fall', 'cost_change_start',
'cost_change_start_fall', 'dreamteam_count', 'element_type', 'ep_next',
'ep_this', 'event_points', 'first_name', 'form', 'id', 'in_dreamteam',
'news', 'news_added', 'now_cost', 'photo', 'points_per_game',
'second_name', 'selected_by_percent', 'special', 'squad_number',
'status', 'team', 'team_code', 'total_points', 'transfers_in',
'transfers_in_event', 'transfers_out', 'transfers_out_event',
'value_form', 'value_season', 'web_name', 'minutes', 'goals_scored',
'assists', 'clean_sheets', 'goals_conceded', 'own_goals',
'penalties_saved', 'penalties_missed', 'yellow_cards', 'red_cards',
'saves', 'bonus', 'bps', 'influence', 'creativity', 'threat',
'ict_index', 'influence_rank', 'influence_rank_type', 'creativity_rank',
'creativity_rank_type', 'threat_rank', 'threat_rank_type',
'ict_index_rank', 'ict_index_rank_type'],
dtype='object')
--- goalkeepers sorted by name ---
first_name team now_cost
94 Aaron Bournemouth 4.5
305 Adrián Liverpool 4.0
485 Alex Southampton 4.5
533 Alfie Spurs 4.0
291 Alisson Liverpool 6.2
--- goalkeepers sorted by cost ---
first_name team now_cost
291 Alisson Liverpool 6.2
323 Ederson Man City 6.0
263 Kasper Leicester 5.4
169 Kepa Chelsea 5.4
515 Hugo Spurs 5.3
--- teams columns ---
Index(['code', 'draw', 'form', 'id', 'loss', 'name', 'played', 'points',
'position', 'short_name', 'strength', 'team_division', 'unavailable',
'win', 'strength_overall_home', 'strength_overall_away',
'strength_attack_home', 'strength_attack_away', 'strength_defence_home',
'strength_defence_away', 'pulse_id'],
dtype='object')
--- teams ---
0 Arsenal
1 Aston Villa
2 Bournemouth
3 Brighton
4 Burnley
Name: name, dtype: object