Fixes for issues affecting the FBref scraper #281

lorenzodb1 · 2023-07-05T22:28:14Z

This PR fixes the following issues:

The filename of the cached team season stats always ended in "all", regardless of the type of stats queried. As a result, the cached file might not contain the table that was requested. Each type of stats is now cached in its own file.
Every n rows, the website inserts a row into the table body that repeats the table header. This caused read_schedule to fail because df_table had more rows than the list of match URLs obtained (see [FBref] Non-data rows in the table body should be removed #277). I added logic to remove these repeated header rows when they are found.
The website has no dedicated Scores & Fixtures page for the Big 5 European Leagues Stats page, so the scraper would end up on the generic Scores & Fixtures page, which shows games currently being played. Because of this, I had to move the optimisation that combines the top five leagues under that label in read_leagues, as read_schedule needs the top five leagues separately rather than in their combined form.
The method _fix_nation_col threw an IndexError, presumably when no flag is present. I fixed this by switching the logic to regular expressions, so that no error is thrown when the flag is missing.
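
For illustration, here is a minimal standard-library sketch of the ideas behind the repeated-header fix and the regex-based nation fix. It is not the code in this PR. The helper names `drop_repeated_headers` and `extract_nation` are hypothetical, and the cell format ("eng ENG", i.e. a lowercase flag code followed by a three-letter country code) is an assumption about how the scraped text looks.

```python
import re

# Assumption: a nation cell looks like "eng ENG" (flag code, then country
# code), or just "ENG" / "" when the flag or the whole value is missing.
NATION_RE = re.compile(r"\b([A-Z]{3})\b")


def extract_nation(cell):
    """Return the three-letter country code, or None if none is found.

    Searching with a regex avoids the IndexError that splitting on
    whitespace and indexing would raise when the flag part is absent.
    """
    match = NATION_RE.search(cell or "")
    return match.group(1) if match else None


def drop_repeated_headers(rows, header):
    """Drop body rows that merely repeat the table header.

    `rows` is a list of rows, each a list of cell strings; `header` is the
    list of column names. Data rows keep their original order.
    """
    return [row for row in rows if row != header]


if __name__ == "__main__":
    header = ["Date", "Home", "Away"]
    rows = [
        ["2023-07-01", "Team A", "Team B"],
        ["Date", "Home", "Away"],  # repeated header row inside the body
        ["2023-07-02", "Team C", "Team D"],
    ]
    print(drop_repeated_headers(rows, header))
    print(extract_nation("eng ENG"), extract_nation("ENG"), extract_nation(""))
```

Filtering the repeated rows before pairing the table with the match URLs keeps both sequences the same length, which is the condition read_schedule relies on.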

Additionally, this PR moves pretty-error to the dev dependencies group, as it would otherwise be installed in repositories importing this library (which should not be the case). I'm not sure I've done this correctly, and I had to remove some imports, so please let me know if this breaks previous behaviour and advise me on what I should do instead. The PR also updates pandas to v2.0.

probberechts · 2023-07-07T07:26:52Z

Thanks a lot. These are some great contributions. Could you just split them up into multiple pull requests? That makes them easier to review, and pull requests are used to automatically create the release notes.

lorenzodb1 · 2023-07-07T07:52:06Z

Closing this PR as I'll break it down into individual PRs for each issue fixed.

#282
#283
#284
#285

lorenzodb1 added 4 commits July 5, 2023 11:46

Moved pretty-error to dev dependency and fixed bug making FBref tests fail

7494536

Fixed issue affecting cached team season stats

383d904

Fixed issue in read_schedule by moving the Top 5 Leagues optimisation in read_leagues

6c1e69a

Fixed IndexError in _fix_nation_col

ec15e08

lorenzodb1 closed this Jul 7, 2023

lorenzodb1 deleted the lorenzodb1-fixes branch July 7, 2023 08:42
