Table Web Scraping Issues With Python
Solution 1:
You can use webdriver, pandas and BeautifulSoup to get all the table data.
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url = "https://fantasy.premierleague.com/player-list"

driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'html.parser')  # specify a parser to avoid a warning
# note: these class names are auto-generated and may change over time
table = soup.find_all('table', {'class': 'Table-ziussd-1 fVnGhl'})
df = pd.read_html(str(table))  # list of DataFrames, one per table
print(df)
Output will be:
[ Player Team Points Cost
0 Alisson Liverpool 99 £6.2
1 Ederson Man City 89 £6.0
2 Kepa Chelsea 72 £5.4
3 Schmeichel Leicester 122 £5.4
4 de Gea Man Utd 105 £5.3
5 Lloris Spurs 56 £5.3
6 Henderson Sheffield Utd 135 £5.3
7 Pickford Everton 93 £5.2
8 Patrício Wolves 122 £5.2
9 Dubravka Newcastle 124 £5.1
10 Leno Arsenal 114 £5.0
11 Guaita Crystal Palace 122 £5.0
12 Pope Burnley 129 £4.9
13 Foster Watford 113 £4.9
14 Fabianski West Ham 61 £4.9
15 Caballero Chelsea 7 £4.8
16 Ryan Brighton 105 £4.7
17 Bravo Man City 11 £4.7
18 Grant Man Utd 0 £4.7
19 Romero Man Utd 0 £4.6
20 Krul Norwich 94 £4.6
21 Mignolet Liverpool 0 £4.5
22 McCarthy Southampton 74 £4.5
23 Ramsdale Bournemouth 97 £4.5
24 Fahrmann Norwich 1 £4.4
and so on........................................]
Solution 2:
The table you want to scrape is generated with JavaScript, which is not executed when you do html = urlopen(url), so it is not in the soup either.
There are many methods for getting dynamically generated data. Check here for an example.
Solution 3:
https://fantasy.premierleague.com/player-list uses Javascript to generate data to html. BeautifulSoup cannot scrape Javascript so we need to emulate real browser to load data. To do this you can use Selenium - In below code I user Firefox but you can use Chrome for example. Please check Selenium's documentation on how to get it running.
Script opens Firefox browser, pauses for 1 second ( to make sure that all Javascript data has loaded) and passes html to BeautifulSoup. You might need to pip install lxml
parser for script to run.
Then we look for all div', {'class':'Layout__Main-eg6k6r-1 cSyfD'
as those contain all 4 tables on the website. You may want to use Inspect Element
tool in your browser to check names of tables, div's to target your search.
Then you can call any of 4 divs and search for tr
in each.
from selenium import webdriver
import time
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.set_window_size(700, 900)

url = 'https://fantasy.premierleague.com/player-list'
browser.get(url)
time.sleep(1)  # wait for the JavaScript-generated content to load

html = browser.execute_script('return document.documentElement.outerHTML')
all_html = BeautifulSoup(html, 'lxml')
all_tables = all_html.find_all('div', {'class': 'Layout__Main-eg6k6r-1 cSyfD'})
print('Found ' + str(len(all_tables)) + ' tables')

table1_goalkeepers = all_tables[0]
rows_goalkeepers = table1_goalkeepers.tbody
print('Goalkeepers: \n')
print(rows_goalkeepers)

table2_defenders = all_tables[1]
print('Defenders \n')
rows_defenders = table2_defenders.tbody
print(rows_defenders)

browser.quit()
Sample output:
Goalkeepers:
<tbody><tr><td>Alisson</td><td>Liverpool</td><td>99</td><td>£6.2</td></tr><tr><td>Ederson</td><td>Man City</td><td>88</td><td>£6.0</td></tr><tr><td>Kepa</td><td>Chelsea</td><td>72</td><td>£5.4</td></tr><tr><td>Schmeichel</td><td>Leicester</td><td>122</td><td>£5.4</td></tr><tr><td>de Gea</td><td>Man Utd</td><td>105</td><td>£5.3</td></tr><tr><td>Lloris</td><td>Spurs</td><td>56</td><td>£5.3</td></tr><tr><td>Henderson</td><td>Sheffield Utd</td><td>135</td><td>£5.3</td></tr><tr><td>Pickford</td><td>Everton</td><td>93</td><td>£5.2</td></tr><tr><td>Patrício</td><td>Wolves</td><td>122</td><td>£5.2</td></tr><tr><td>Dubravka</td><td>Newcastle</td><td>124</td><td>£5.1</td></tr><tr><td>Leno</td><td>Arsenal</td><td>114</td><td>£5.0</td></tr><tr><td>Guaita</td><td>Crystal Palace</td><td>122</td><td>£5.0</td></tr><tr><td>Pope</td><td>Burnley</td><td>128</td><td>£4.9</td></tr><tr><td>Foster</td><td>Watford</td><td>113</td><td>£4.9</td></tr><tr><td>Fabianski</td><td>West Ham</td><td>61</td><td>£4.9</td></tr><tr><td>Caballero</td><td>Chelsea</td><td>7</td><td>£4.8</td></tr><tr><td>Ryan</td><td>Brighton</td><td>105</td><td>£4.7</td></tr><tr><td>Bravo</td><td>Man City</td><td>11</td><td>£4.7</td></tr><tr><td>Grant</td><td>Man Utd</td><td>0</td><td>£4.7</td></tr><tr><td>Romero</td><td>Man Utd</td><td>0</td><td>£4.6</td></tr><tr><td>Krul</td><td>Norwich</td><td>94</td><td>£4.6</td></tr><tr><td>Mignolet</td><td>Liverpool</td><td>0</td><td>£4.5</td></tr><tr><td>McCarthy</td><td>Southampton</td><td>74</td><td>£4.5</td></tr><tr><td>Ramsdale</td><td>Bournemouth</td><td>97</td><td>£4.5</td></tr><tr><td>Fahrmann</td><td>Norwich</td><td>1</td><td>£4.4</td></tr><tr><td>Roberto</td><td>West Ham</td><td>18</td><td>£4.4</td></tr><tr><td>Verrips</td><td>Sheffield Utd</td><td>0</td><td>£4.4</td></tr><tr><td>Kelleher</td><td>Liverpool</td><td>0</td><td>£4.4</td></tr><tr><td>Reina</td><td>Aston Villa</td><td>19</td><td>£4.4</td></tr><tr><td>Nyland</td><td>Aston 
Villa</td><td>11</td><td>£4.3</td></tr><tr><td>Heaton</td><td>Aston Villa</td><td>59</td><td>£4.3</td></tr><tr><td>Darlow</td><td>Newcastle</td><td>0</td><td>£4.3</td></tr><tr><td>Eastwood</td><td>Sheffield Utd</td><td>0</td><td>£4.3</td></tr><tr><td>Steer</td><td>Aston Villa</td><td>1</td><td>£4.3</td></tr><tr><td>Moore</td><td>Sheffield Utd</td><td>1</td><td>£4.3</td></tr><tr><td>Peacock-Farrell</td><td>Burnley</td><td>0</td><td>£4.3</td></tr></tbody>
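If you prefer to post-process a tbody like the one above without BeautifulSoup, the standard library's html.parser can extract the tr/td text. A minimal sketch over a made-up two-row sample in the same shape as the output above:

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])   # start a new row
        elif tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.rows[-1].append(data)

# sample data copied from the output above, truncated to two rows
sample = ('<tbody><tr><td>Alisson</td><td>Liverpool</td><td>99</td><td>£6.2</td></tr>'
          '<tr><td>Ederson</td><td>Man City</td><td>88</td><td>£6.0</td></tr></tbody>')

parser = RowExtractor()
parser.feed(sample)
print(parser.rows)
# → [['Alisson', 'Liverpool', '99', '£6.2'], ['Ederson', 'Man City', '88', '£6.0']]
```

Each row comes back as a plain list of strings, which is easy to feed into csv.writer or pandas.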
Solution 4:
This page uses JavaScript to add the data, but BeautifulSoup can't run JavaScript.
You can use Selenium to control a web browser, which can run JavaScript.
Or you can check in DevTools in Firefox/Chrome (tab: Network) what URL JavaScript uses to get the data from the server, and fetch that URL with urllib.
I chose the second method (manually searching in DevTools).
I found that JavaScript gets this data in JSON format from
https://fantasy.premierleague.com/api/bootstrap-static/
Because I get the data as JSON, I can convert it to a Python list/dictionary using the module json, and I don't need BeautifulSoup.
It needs more manual work to recognize the structure of the data, but it gives more data than the table on the page.
Here is all the data about the first player on the list, Alisson:
chance_of_playing_next_round = 100
chance_of_playing_this_round = 100
code = 116535
cost_change_event = 0
cost_change_event_fall = 0
cost_change_start = 2
cost_change_start_fall = -2
dreamteam_count = 1
element_type = 1
ep_next = 11.0
ep_this = 11.0
event_points = 10
first_name = Alisson
form = 10.0
id = 189
in_dreamteam = False
news =
news_added = 2020-03-06T14:00:17.901193Z
now_cost = 62
photo = 116535.jpg
points_per_game = 4.7
second_name = Ramses Becker
selected_by_percent = 9.2
special = False
squad_number = None
status = a
team = 10
team_code = 14
total_points = 99
transfers_in = 767780
transfers_in_event = 9339
transfers_out = 2033680
transfers_out_event = 2757
value_form = 1.6
value_season = 16.0
web_name = Alisson
minutes = 1823
goals_scored = 0
assists = 1
clean_sheets = 11
goals_conceded = 12
own_goals = 0
penalties_saved = 0
penalties_missed = 0
yellow_cards = 0
red_cards = 1
saves = 48
bonus = 9
bps = 439
influence = 406.2
creativity = 10.0
threat = 0.0
ict_index = 41.7
influence_rank = 135
influence_rank_type = 18
creativity_rank = 411
creativity_rank_type = 8
threat_rank = 630
threat_rank_type = 71
ict_index_rank = 294
ict_index_rank_type = 18
There is also information about teams, etc.
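The json.loads step can be tried offline on a made-up fragment shaped like the API response (the field names below match the real response, the values are invented for illustration):

```python
import json

# made-up fragment in the same shape as the bootstrap-static response
text = '''{
  "element_types": [{"id": 1, "plural_name": "Goalkeepers"}],
  "elements": [{"first_name": "Alisson", "element_type": 1,
                "total_points": 99, "now_cost": 62}]
}'''

data = json.loads(text)  # JSON text -> Python dict/list

player = data['elements'][0]
# now_cost is in tenths of a million, so divide by 10 for the displayed price
print(player['first_name'], player['total_points'], player['now_cost'] / 10)
# → Alisson 99 6.2
```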
Code:
from urllib.request import urlopen
import json

#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'

text = urlopen(url).read().decode()
data = json.loads(text)

print('\n--- element type ---\n')

#print(data['element_types'][0])
for item in data['element_types']:
    print(item['id'], item['plural_name'])

print('\n--- Goalkeepers ---\n')

number = 0
for item in data['elements']:
    if item['element_type'] == 1:  # Goalkeepers
        number += 1
        print('---', number, '---')
        print('type :', data['element_types'][item['element_type']-1]['plural_name'])
        print('first_name :', item['first_name'])
        print('second_name :', item['second_name'])
        print('total_points:', item['total_points'])
        print('team :', data['teams'][item['team']-1]['name'])
        print('cost :', item['now_cost']/10)
        if item['first_name'] == 'Alisson':
            for key, value in item.items():
                print('  ', key, '=', value)
Result:
--- element type ---
1 Goalkeepers
2 Defenders
3 Midfielders
4 Forwards
--- Goalkeepers ---
--- 1 ---
type : Goalkeepers
first_name : Bernd
second_name : Leno
total_points: 114
team : Arsenal
cost : 5.0
--- 2 ---
type : Goalkeepers
first_name : Emiliano
second_name : Martínez
total_points: 1
team : Arsenal
cost : 4.2
--- 3 ---
type : Goalkeepers
first_name : Ørjan
second_name : Nyland
total_points: 11
team : Aston Villa
cost : 4.3
--- 4 ---
type : Goalkeepers
first_name : Tom
second_name : Heaton
total_points: 59
team : Aston Villa
cost : 4.3
The code gives the data in a different order than the table, but if you put it all in a list, or better in a pandas DataFrame, then you can sort it in different orders.
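A minimal stdlib sketch of that idea, sorting a list of dicts (rows made up to match the shape printed above):

```python
# made-up rows in the shape of the goalkeeper data above
goalkeepers = [
    {'second_name': 'Leno', 'team': 'Arsenal', 'now_cost': 5.0},
    {'second_name': 'Alisson', 'team': 'Liverpool', 'now_cost': 6.2},
    {'second_name': 'Heaton', 'team': 'Aston Villa', 'now_cost': 4.3},
]

# sort the same list two different ways
by_cost = sorted(goalkeepers, key=lambda p: p['now_cost'], reverse=True)
by_name = sorted(goalkeepers, key=lambda p: p['second_name'])

print([p['second_name'] for p in by_cost])  # most expensive first
# → ['Alisson', 'Leno', 'Heaton']
```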
EDIT:
You can use pandas to get the data from the JSON.
from urllib.request import urlopen
import json
import pandas as pd
#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'
# read data from url and convert to Python's list/dictionary
text = urlopen(url).read().decode()
data = json.loads(text)
# create DataFrames
players = pd.DataFrame.from_dict(data['elements'])
teams = pd.DataFrame.from_dict(data['teams'])
# divide by 10 to get `6.2` instead of `62`
players['now_cost'] = players['now_cost'] / 10
# convert team's number to its name
players['team'] = players['team'].apply(lambda x: teams.iloc[x-1]['name'])
# filter players
goalkeepers = players[ players['element_type'] == 1 ]
defenders = players[ players['element_type'] == 2 ]
# etc.
# some information
print('\n--- goalkeepers columns ---\n')
print(goalkeepers.columns)
print('\n--- goalkeepers sorted by name ---\n')
sorted_data = goalkeepers.sort_values(['first_name'])
print(sorted_data[['first_name', 'team', 'now_cost']].head())
print('\n--- goalkeepers sorted by cost ---\n')
sorted_data = goalkeepers.sort_values(['now_cost'], ascending=False)
print(sorted_data[['first_name', 'team', 'now_cost']].head())
print('\n--- teams columns ---\n')
print(teams.columns)
print('\n--- teams ---\n')
print(teams['name'].head())
# etc.
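The iloc lookup above assumes team ids are consecutive; an arguably more robust alternative is to join the team name on the team id with merge. A small sketch with made-up frames (column names follow the API fields):

```python
import pandas as pd

# tiny made-up frames shaped like data['elements'] and data['teams']
players = pd.DataFrame({'web_name': ['Alisson', 'Leno'],
                        'team': [10, 1],
                        'now_cost': [62, 50]})
teams = pd.DataFrame({'id': [1, 10],
                      'name': ['Arsenal', 'Liverpool']})

# join the team name on the team id instead of indexing by position
merged = players.merge(teams, left_on='team', right_on='id', how='left')
merged['now_cost'] = merged['now_cost'] / 10  # displayed price

print(merged[['web_name', 'name', 'now_cost']])
```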
Results:
--- goalkeepers columns ---
Index(['chance_of_playing_next_round', 'chance_of_playing_this_round', 'code',
'cost_change_event', 'cost_change_event_fall', 'cost_change_start',
'cost_change_start_fall', 'dreamteam_count', 'element_type', 'ep_next',
'ep_this', 'event_points', 'first_name', 'form', 'id', 'in_dreamteam',
'news', 'news_added', 'now_cost', 'photo', 'points_per_game',
'second_name', 'selected_by_percent', 'special', 'squad_number',
'status', 'team', 'team_code', 'total_points', 'transfers_in',
'transfers_in_event', 'transfers_out', 'transfers_out_event',
'value_form', 'value_season', 'web_name', 'minutes', 'goals_scored',
'assists', 'clean_sheets', 'goals_conceded', 'own_goals',
'penalties_saved', 'penalties_missed', 'yellow_cards', 'red_cards',
'saves', 'bonus', 'bps', 'influence', 'creativity', 'threat',
'ict_index', 'influence_rank', 'influence_rank_type', 'creativity_rank',
'creativity_rank_type', 'threat_rank', 'threat_rank_type',
'ict_index_rank', 'ict_index_rank_type'],
dtype='object')
--- goalkeepers sorted by name ---
first_name team now_cost
94 Aaron Bournemouth 4.5
305 Adrián Liverpool 4.0
485 Alex Southampton 4.5
533 Alfie Spurs 4.0
291 Alisson Liverpool 6.2
--- goalkeepers sorted by cost ---
first_name team now_cost
291 Alisson Liverpool 6.2
323 Ederson Man City 6.0
263 Kasper Leicester 5.4
169 Kepa Chelsea 5.4
515 Hugo Spurs 5.3
--- teams columns ---
Index(['code', 'draw', 'form', 'id', 'loss', 'name', 'played', 'points',
'position', 'short_name', 'strength', 'team_division', 'unavailable',
'win', 'strength_overall_home', 'strength_overall_away',
'strength_attack_home', 'strength_attack_away', 'strength_defence_home',
'strength_defence_away', 'pulse_id'],
dtype='object')
--- teams ---
0 Arsenal
1 Aston Villa
2 Bournemouth
3 Brighton
4 Burnley
Name: name, dtype: object