Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FBref] Handle missing Scores & Fixtures page on the Big 5 European Leagues Stats page #284

Merged
merged 11 commits into from
Jul 28, 2023
Merged

[FBref] Handle missing Scores & Fixtures page on the Big 5 European Leagues Stats page #284

merged 11 commits into from
Jul 28, 2023

Conversation

lorenzodb1
Copy link
Contributor

  • For every n rows, the website adds a row in a table that replicates the table header. This caused read_schedule to fail as the number of rows in df_table would be higher than the one of the list of match URLs obtained (see
    [FBref] Non-data rows in the table body should be removed #277). I added the logic to remove those replicated headers when found.
  • The website has no specific Scores & Fixtures on the Big 5 European Leagues Stats page. Thus it'd go to the generic Scores & Fixtures page, which shows games currently being played. Because of this, I had to move the optimisation that combines the top five leagues under that label in read_leagues, as read_schedule necessarily needs the five top leagues separately rather than in their combined form.

@lorenzodb1
Copy link
Contributor Author

lorenzodb1 commented Jul 11, 2023

Tests seem to be failing due to the issue fixed in #286

EDIT: found the bug causing the failures. Pushing a fix in a bit.

@lorenzodb1
Copy link
Contributor Author

@probberechts is there a way we can get this and #286 merged and released any time soon? I'd really appreciate that as those bugs are blocking me in my personal project :)

@probberechts
Copy link
Owner

Hi @lorenzodb1

I'll have to take a closer look at these and I'll try to make some time during the weekend. What's blocking me from merging them immediately is:

@lorenzodb1
Copy link
Contributor Author

Hi @probberechts

  • At the moment, read_leagues is only called by read_seasons, so having to pass a list of leagues shouldn't be an issue per se. The parameter optimise in read_seasons defaults as True, so you can call read_seasons just as you always do and obtain the same behaviour. Also, fwiw, I don't think read_seasons and read_leagues should be public methods anyway, considering they're not even described in the API documentation.
  • The same issue fixed in Changed logic in _concat to fix error thrown on forfeited matches #286 arises for Hellas Verona-Roma. I don't know if Serie A is one of the supported leagues (I suppose it is), but that issue affects any game awarded with a default 3-0 score, regardless of the competition.
  • I'm happy to discuss any of these cases and adjust the PR accordingly.

@probberechts
Copy link
Owner

For dealing with the missing Scores & Fixtures on the Big 5 European Leagues Stats page, I would be ok with two alternative approaches:

  • raise an exception when read_schedule is called for the 'Big 5 European Leagues Combined', suggesting creating a new instance of the FBref scraper with leagues = ['ENG-Premier League', 'FRA-Ligue 1', ...]
  • follow your approach of modifying the read_leagues method, but I would add a more informative "split_up_big5 = False" parameter to both read_leagues and read_seasons

@probberechts probberechts changed the title Fixed issues in read_schedule related to scraping long tables and scores [FBref] Handle missing Scores & Fixtures page on the Big 5 European Leagues Stats page Jul 23, 2023
@probberechts probberechts added bug Something isn't working FBref Issue or pull request related to the FBref scraper labels Jul 23, 2023
@lorenzodb1
Copy link
Contributor Author

follow your approach of modifying the read_leagues method, but I would add a more informative "split_up_big5 = False" parameter to both read_leagues and read_seasons

This was my first approach to fixing that issue. I changed it to the current implementation because it would mean adding the same logic of splitting the Big5 leagues in both methods. Since read_leagues is only called by read_seasons, it could make sense to change the signature to just passing the leagues that need to be considered. If you have strong opinions against that, though, I'll be more than happy to change the approach to what you described.

Out of curiosity, what's the reason behind keeping both read_leagues and read_seasons methods public?

@probberechts
Copy link
Owner

Unless I am missing something, the only thing that would have to be changed in read_seasons are L162 to df_leagues = self.read_leagues(split_up_big_5) and drop the self.leagues filter in L196.

Out of curiosity, what's the reason behind keeping both read_leagues and read_seasons methods public?

These contain relevant data as well, such as the season's champion and top scorer. Additionaly, making them private would be a breaking change.

@lorenzodb1
Copy link
Contributor Author

lorenzodb1 commented Jul 24, 2023

Unless I am missing something, the only thing that would have to be changed in read_seasons are L162 to df_leagues = self.read_leagues(split_up_big_5) and drop the self.leagues filter in L196.

If the filter in L196 can be dropped then that'd definitely be a better solution.

These contain relevant data as well, such as the season's champion and top scorer. Additionaly, making them private would be a breaking change.

I see. Maybe it'd be worth adding them to the API docs then.

@probberechts
Copy link
Owner

I see. Maybe it'd be worth adding them to the API docs then.

Yes, that makes sense. I've added them in #307.

@lorenzodb1
Copy link
Contributor Author

@probberechts I applied your suggestion to these changes. I couldn't drop the self.leagues filter in L196 as it'd make all tests fail, but I changed it so I could still stop using self.leagues in it.

soccerdata/fbref.py Outdated Show resolved Hide resolved
soccerdata/fbref.py Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented Jul 27, 2023

Codecov Report

Patch coverage: 92.85% and project coverage change: +0.12% 🎉

Comparison is base (d5043e6) 61.28% compared to head (2ef428c) 61.40%.
Report is 2 commits behind head on master.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #284      +/-   ##
==========================================
+ Coverage   61.28%   61.40%   +0.12%     
==========================================
  Files          10       10              
  Lines        1431     1433       +2     
  Branches      290      292       +2     
==========================================
+ Hits          877      880       +3     
  Misses        508      508              
+ Partials       46       45       -1     
Files Changed Coverage Δ
soccerdata/fbref.py 86.22% <92.85%> (+0.28%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@probberechts
Copy link
Owner

The current implementation would be a breaking change and it still does not work as I would expect. Now it would scrape the data from the "Big 5 European Leagues Combined" page by default. The code below would combine the top-5 leagues, but it should return 5 leagues:

fbref = sd.FBref(["ENG-Premier League", "FRA-Ligue 1", "ITA-Serie A", "ESP-La Liga", "GER-Bundesliga"], "2223")
fbref.read_leagues()

@lorenzodb1
Copy link
Contributor Author

lorenzodb1 commented Jul 27, 2023

@probberechts I ended up changing it to split_up_big5: bool = False like you recommended. Hopefully, that will be enough to merge and release this fix. Let me know if you have any other issues with these changes.

@probberechts
Copy link
Owner

Looks good now. Thanks for all your work on improving the FBref scraper!

@probberechts probberechts merged commit 11ac472 into probberechts:master Jul 28, 2023
10 checks passed
@lorenzodb1 lorenzodb1 deleted the lorenzodb1-read-schedule-fix branch July 28, 2023 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working FBref Issue or pull request related to the FBref scraper
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants