[FBref] Handle missing Scores & Fixtures page on the Big 5 European Leagues Stats page #284

lorenzodb1 · 2023-07-07T08:15:12Z

For every n rows, the website adds a row in a table that replicates the table header. This caused read_schedule to fail as the number of rows in df_table would be higher than the one of the list of match URLs obtained (see
[FBref] Non-data rows in the table body should be removed #277). I added the logic to remove those replicated headers when found.
The website has no specific Scores & Fixtures on the Big 5 European Leagues Stats page. Thus it'd go to the generic Scores & Fixtures page, which shows games currently being played. Because of this, I had to move the optimisation that combines the top five leagues under that label in read_leagues, as read_schedule necessarily needs the five top leagues separately rather than in their combined form.

… in read_leagues

lorenzodb1 · 2023-07-11T05:43:43Z

~~Tests seem to be failing due to the issue fixed in #286~~

EDIT: found the bug causing the failures. Pushing a fix in a bit.

…ined" is not present

lorenzodb1 · 2023-07-18T07:04:14Z

@probberechts is there a way we can get this and #286 merged and released any time soon? I'd really appreciate that as those bugs are blocking me in my personal project :)

probberechts · 2023-07-18T07:39:26Z

Hi @lorenzodb1

I'll have to take a closer look at these and I'll try to make some time during the weekend. What's blocking me from merging them immediately is:

[FBref] Handle missing Scores & Fixtures page on the Big 5 European Leagues Stats page #284 complicates the API of the read_leagues and read_seasons methods. I would like to check whether there isn't another way to implement this that does not require adding these vague parameters to public methods.
The example given in Changed logic in _concat to fix error thrown on forfeited matches #286 refers to an unsupported league. If it is something specific to that league, I might not even fix it.
I am worried that the changes made in Changed logic in _concat to fix error thrown on forfeited matches #286 would break other things. It might not throw an error anymore, but it looks like it will no longer be able to merge columns anymore in some cases.

lorenzodb1 · 2023-07-18T16:17:42Z

Hi @probberechts

At the moment, read_leagues is only called by read_seasons, so having to pass a list of leagues shouldn't be an issue per se. The parameter optimise in read_seasons defaults as True, so you can call read_seasons just as you always do and obtain the same behaviour. Also, fwiw, I don't think read_seasons and read_leagues should be public methods anyway, considering they're not even described in the API documentation.
The same issue fixed in Changed logic in _concat to fix error thrown on forfeited matches #286 arises for Hellas Verona-Roma. I don't know if Serie A is one of the supported leagues (I suppose it is), but that issue affects any game awarded with a default 3-0 score, regardless of the competition.
I'm happy to discuss any of these cases and adjust the PR accordingly.

probberechts · 2023-07-23T20:50:59Z

For dealing with the missing Scores & Fixtures on the Big 5 European Leagues Stats page, I would be ok with two alternative approaches:

raise an exception when read_schedule is called for the 'Big 5 European Leagues Combined', suggesting creating a new instance of the FBref scraper with leagues = ['ENG-Premier League', 'FRA-Ligue 1', ...]
follow your approach of modifying the read_leagues method, but I would add a more informative "split_up_big5 = False" parameter to both read_leagues and read_seasons

lorenzodb1 · 2023-07-24T02:58:50Z

follow your approach of modifying the read_leagues method, but I would add a more informative "split_up_big5 = False" parameter to both read_leagues and read_seasons

This was my first approach to fixing that issue. I changed it to the current implementation because it would mean adding the same logic of splitting the Big5 leagues in both methods. Since read_leagues is only called by read_seasons, it could make sense to change the signature to just passing the leagues that need to be considered. If you have strong opinions against that, though, I'll be more than happy to change the approach to what you described.

Out of curiosity, what's the reason behind keeping both read_leagues and read_seasons methods public?

probberechts · 2023-07-24T09:00:02Z

Unless I am missing something, the only thing that would have to be changed in read_seasons are L162 to df_leagues = self.read_leagues(split_up_big_5) and drop the self.leagues filter in L196.

Out of curiosity, what's the reason behind keeping both read_leagues and read_seasons methods public?

These contain relevant data as well, such as the season's champion and top scorer. Additionaly, making them private would be a breaking change.

lorenzodb1 · 2023-07-24T17:09:41Z

Unless I am missing something, the only thing that would have to be changed in read_seasons are L162 to df_leagues = self.read_leagues(split_up_big_5) and drop the self.leagues filter in L196.

If the filter in L196 can be dropped then that'd definitely be a better solution.

These contain relevant data as well, such as the season's champion and top scorer. Additionaly, making them private would be a breaking change.

I see. Maybe it'd be worth adding them to the API docs then.

probberechts · 2023-07-25T12:05:32Z

I see. Maybe it'd be worth adding them to the API docs then.

Yes, that makes sense. I've added them in #307.

lorenzodb1 · 2023-07-27T04:18:49Z

@probberechts I applied your suggestion to these changes. I couldn't drop the self.leagues filter in L196 as it'd make all tests fail, but I changed it so I could still stop using self.leagues in it.

soccerdata/fbref.py

codecov-commenter · 2023-07-27T17:08:01Z

Codecov Report

Patch coverage: 92.85% and project coverage change: +0.12% 🎉

Comparison is base (d5043e6) 61.28% compared to head (2ef428c) 61.40%.
Report is 2 commits behind head on master.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #284      +/-   ##
==========================================
+ Coverage   61.28%   61.40%   +0.12%     
==========================================
  Files          10       10              
  Lines        1431     1433       +2     
  Branches      290      292       +2     
==========================================
+ Hits          877      880       +3     
  Misses        508      508              
+ Partials       46       45       -1

Files Changed	Coverage Δ
soccerdata/fbref.py	`86.22% <92.85%> (+0.28%)`	⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

probberechts · 2023-07-27T18:35:21Z

The current implementation would be a breaking change and it still does not work as I would expect. Now it would scrape the data from the "Big 5 European Leagues Combined" page by default. The code below would combine the top-5 leagues, but it should return 5 leagues:

fbref = sd.FBref(["ENG-Premier League", "FRA-Ligue 1", "ITA-Serie A", "ESP-La Liga", "GER-Bundesliga"], "2223")
fbref.read_leagues()

lorenzodb1 · 2023-07-27T19:49:34Z

@probberechts I ended up changing it to split_up_big5: bool = False like you recommended. Hopefully, that will be enough to merge and release this fix. Let me know if you have any other issues with these changes.

probberechts · 2023-07-28T09:04:40Z

Looks good now. Thanks for all your work on improving the FBref scraper!

lorenzodb1 added 2 commits July 7, 2023 01:10

Fixed issue in read_schedule by moving the Top 5 Leagues optimisation…

a41adba

… in read_leagues

Fixed issue in read_schedule by moving the Top 5 Leagues optimisation…

99c0dcb

… in read_leagues

This was referenced Jul 7, 2023

Fixes for issues affecting the FBref scraper #281

Closed

[FBref] Non-data rows in the table body should be removed #277

Closed

lorenzodb1 added 3 commits July 10, 2023 16:06

Merge branch 'master' into lorenzodb1-read-schedule-fix

af93780

Fixed bug affecting read_leagues when not optimised

88ed9a9

Changed logic to extend fix to read_schedule

dc392b1

lorenzodb1 mentioned this pull request Jul 13, 2023

[FBref] Summary stats shows error when concatenating different matches #292

Closed

Fixed bug affecting list of leagues when "Big 5 European Leagues Comb…

e9c22af

…ined" is not present

probberechts changed the title ~~Fixed issues in read_schedule related to scraping long tables and scores~~ [FBref] Handle missing Scores & Fixtures page on the Big 5 European Leagues Stats page Jul 23, 2023

probberechts added bug Something isn't working FBref Issue or pull request related to the FBref scraper labels Jul 23, 2023

lorenzodb1 added 3 commits July 26, 2023 21:06

Changed parameter to bool as per suggestion

7cd9ae5

Merge branch 'master' into lorenzodb1-read-schedule-fix

70a61fb

Empty commit

e7f3dfe

probberechts requested changes Jul 27, 2023

View reviewed changes

soccerdata/fbref.py Outdated Show resolved Hide resolved

soccerdata/fbref.py Outdated Show resolved Hide resolved

Added logic to cover missed case when optimising the big 5 leagues

d659a27

lorenzodb1 requested a review from probberechts July 27, 2023 16:48

Changed optimise_big5 to split_up_big5 and default behaviour

2ef428c

probberechts approved these changes Jul 28, 2023

View reviewed changes

probberechts merged commit 11ac472 into probberechts:master Jul 28, 2023
10 checks passed

lorenzodb1 deleted the lorenzodb1-read-schedule-fix branch July 28, 2023 15:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FBref] Handle missing Scores & Fixtures page on the Big 5 European Leagues Stats page #284

[FBref] Handle missing Scores & Fixtures page on the Big 5 European Leagues Stats page #284

lorenzodb1 commented Jul 7, 2023

lorenzodb1 commented Jul 11, 2023 •

edited

Loading

lorenzodb1 commented Jul 18, 2023

probberechts commented Jul 18, 2023

lorenzodb1 commented Jul 18, 2023

probberechts commented Jul 23, 2023

lorenzodb1 commented Jul 24, 2023

probberechts commented Jul 24, 2023

lorenzodb1 commented Jul 24, 2023 •

edited

Loading

probberechts commented Jul 25, 2023

lorenzodb1 commented Jul 27, 2023

codecov-commenter commented Jul 27, 2023 •

edited

Loading

probberechts commented Jul 27, 2023

lorenzodb1 commented Jul 27, 2023 •

edited

Loading

probberechts commented Jul 28, 2023

[FBref] Handle missing Scores & Fixtures page on the Big 5 European Leagues Stats page #284

[FBref] Handle missing Scores & Fixtures page on the Big 5 European Leagues Stats page #284

Conversation

lorenzodb1 commented Jul 7, 2023

lorenzodb1 commented Jul 11, 2023 • edited Loading

lorenzodb1 commented Jul 18, 2023

probberechts commented Jul 18, 2023

lorenzodb1 commented Jul 18, 2023

probberechts commented Jul 23, 2023

lorenzodb1 commented Jul 24, 2023

probberechts commented Jul 24, 2023

lorenzodb1 commented Jul 24, 2023 • edited Loading

probberechts commented Jul 25, 2023

lorenzodb1 commented Jul 27, 2023

codecov-commenter commented Jul 27, 2023 • edited Loading

Codecov Report

probberechts commented Jul 27, 2023

lorenzodb1 commented Jul 27, 2023 • edited Loading

probberechts commented Jul 28, 2023

lorenzodb1 commented Jul 11, 2023 •

edited

Loading

lorenzodb1 commented Jul 24, 2023 •

edited

Loading

codecov-commenter commented Jul 27, 2023 •

edited

Loading

lorenzodb1 commented Jul 27, 2023 •

edited

Loading