Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes for issues affecting the FBref scraper #281

Closed
wants to merge 4 commits into from
Closed

Fixes for issues affecting the FBref scraper #281

wants to merge 4 commits into from

Conversation

lorenzodb1
Copy link
Contributor

@lorenzodb1 lorenzodb1 commented Jul 5, 2023

This PR fixes the following issues:

  1. The filename for the cached team season stats was appended with "all" regardless of the type of stats queried. This caused an issue as the cache might not have contained the table needed. It now caches these tables in different files.
  2. For every n rows, the website adds a row in a table that replicates the table header. This caused read_schedule to fail as the number of rows in df_table would be higher than the one of the list of match URLs obtained (see [FBref] Non-data rows in the table body should be removed #277). I added the logic to remove those replicated headers when found.
  3. The website has no specific Scores & Fixtures on the Big 5 European Leagues Stats page. Thus it'd go to the generic Scores & Fixtures page, which shows games currently being played. Because of this, I had to move the optimisation that combines the top five leagues under that label in read_leagues, as read_schedule necessarily needs the five top leagues separately rather than in their combined form.
  4. The method _fix_nation_col throws an IndexError, supposedly when no flag is present. I fixed this by changing the logic to use regular expressions instead so that when the flag is missing no error is thrown.

Additionally, it moves pretty-error to the dev dependencies group, as it would otherwise be installed in repositories importing this library (which should not be the case). I'm not sure I've done this correctly, and I had to remove some imports, so please let me know if this breaks previous behaviour and advise me on what I should do instead. It also updates pandas to v2.0.

@probberechts
Copy link
Owner

Thanks a lot. These are some great contributions. Could you just split them up into multiple pull requests? That makes it easier to review them + pull requests are used to automatically create the release notes.

@lorenzodb1
Copy link
Contributor Author

lorenzodb1 commented Jul 7, 2023

Closing this PR as I'll break it down into individual PRs for each issue fixed.

#282
#283
#284
#285

@lorenzodb1 lorenzodb1 closed this Jul 7, 2023
@lorenzodb1 lorenzodb1 deleted the lorenzodb1-fixes branch July 7, 2023 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants