Scraping always pauses and doesn't finish #51
Comments
Will look into this, thanks! |
It's sure looking like a bot detector - starting fresh (no attempts in the last 12-24h) it will scrape 100-250 games, then stop. I removed the joblib parallel loop, making it sequential, then ran the debugger and eventually got a request returning a 503. When I open that URL in a browser, it also shows an error. But when I browse some other pages and try that page again later, it will start working both in the browser and the scraper (presumably it identifies my IP address as being a human browsing again?). Some mitigations might be
I could possibly contribute if time allows. In the meantime is this data downloadable in bulk anywhere (at least 2010-2024 seasons)? I've looked and haven't yet found a free source with that whole time span and including pbp. |
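One possible mitigation for the intermittent 503s described above is to retry with exponential backoff instead of failing or hanging on the first bad response. The sketch below is only an illustration, not CBBpy's actual code: `fetch_with_backoff`, its parameters, and the `fetch` callable (anything that returns a `(status_code, body)` tuple) are hypothetical names made up for this example.

```python
import random
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],  # hypothetical: returns (status_code, body)
    url: str,
    max_retries: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying on 429/503 with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        # Only "slow down" style responses are retried; anything else fails fast.
        if status not in (429, 503) or attempt == max_retries:
            raise RuntimeError(f"giving up on {url}: HTTP {status}")
        # Delay doubles on each attempt: base, 2*base, 4*base, ... plus up to 1s of jitter.
        delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
        sleep(delay)
    raise RuntimeError("unreachable")
```

With `requests`, the `fetch` argument could be a small wrapper around `requests.get(url, timeout=10)` that returns the response's `status_code` and `text`. Passing a `timeout` matters here because `requests` does not time out by default, so a stalled connection can otherwise block indefinitely.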
@jnmiller interesting... the scraper uses rotating headers that have helped with the bot detection to the point where I've never had it block any of my scrapes. I haven't had the chance to run it since you raised this issue, so it's definitely possible that they've added more robust bot detection, but I don't see any issues raised on the cousin package for R (ncaahoopR), so I'm thinking this might be something different. let me try scraping a season when I get a second, but in the meantime I do have some data I can send you. what's your email? |
Thanks, that would be great! G-mail: |
@jnmiller sent. I scraped the 23-24 season last night without issue, so I'm not sure what could be causing this issue. I'll still add some of these mitigations, but I'll have to do some more digging to figure out what might be causing this issue to pop up selectively |
I am having the same issue unfortunately. Any chance you'd have data from 2017 to 2023 handy? |
@Mstolte02 @jnmiller could you both tell me what versions of python as well as the packages cbbpy, pandas, numpy, python-dateutil, pytz, tqdm, lxml, joblib, beautifulsoup4, and requests you're using? want to see if I can replicate this issue
@Mstolte02 what's your email? I can send you data |
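For anyone gathering the versions requested above, a short script along these lines should print them in one go. It is a rough sketch using only the standard library (`importlib.metadata`, available since Python 3.8); the helper name `collect_versions` is made up for this example, and the package list is simply the one named in the comment.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Package names as listed in the request above.
PACKAGES = [
    "cbbpy", "pandas", "numpy", "python-dateutil", "pytz",
    "tqdm", "lxml", "joblib", "beautifulsoup4", "requests",
]


def collect_versions(packages=PACKAGES):
    """Return a dict mapping 'python' and each package name to its installed version."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name}=={ver}")
```

Run it in the same environment used for scraping so the output reflects what the scraper actually imports.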
***@***.***
|
@Mstolte02 github obfuscates email addresses - send me an email (you can find mine at the bottom of CBBpy's README) and I'll reply with the data |
I'm on date 11/13/23, looks like it just takes a long **s (not an email) time. @dcstats could you open & assign me the issue of speeding up the method? I could use multi-threading and a rate-limiter. Once you do, I'll email you on the CBB.py email. Thanks for making this repo! Looking forward to working together :) |
@dgilmore33 could you tell me what versions of python and the required packages you're using? I want to replicate this issue first, because locally I'm able to scrape entire seasons in around 30 minutes |
@dcstats honestly I don't have an "issue", I'm used to long times to load data. I'll live. Also, the more I think about it, better to keep a full season scrape at the current timeframe.
version : 3.9.6 pip
conda==23.7.4
|
@dgilmore33 how long is scraping taking for you? if it's anything longer than 30 seconds per day, I think it's worth looking into speeding it up. I could also do something as simple as increasing the number of concurrently running jobs. I'm using multiprocessing, but you mentioned multithreading - would multithreading be better for this than multiprocessing? |
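On the multithreading question above: scraping is mostly waiting on network I/O, and Python threads release the GIL while blocked on I/O, so a thread pool can overlap requests without the overhead of separate processes. Whether it is actually faster for CBBpy is untested here. The sketch below shows what a thread pool combined with a simple rate limiter might look like; `RateLimiter`, `scrape_all`, and `scrape_one` are hypothetical names, not part of CBBpy.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Space out calls so that starts are at least `min_interval` seconds apart across threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            wait_for = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if wait_for > 0:
            time.sleep(wait_for)


def scrape_all(game_ids, scrape_one, max_workers=4, min_interval=0.5):
    """Run scrape_one(game_id) over a thread pool, throttled by a shared limiter."""
    limiter = RateLimiter(min_interval)

    def task(game_id):
        limiter.wait()
        return scrape_one(game_id)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of game_ids in the results.
        return list(pool.map(task, game_ids))
```

The limiter caps the overall request rate regardless of the worker count, so raising `max_workers` mainly helps when individual responses are slow rather than letting the scraper hit the site harder.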
Have you updated to the latest version? This might be the issue that was
corrected with one of the recent bug fixes related to the requests package
update.
|
@crdarlin I'll update requests now, thx for the tip
@dcstats multithreading hasn't provided a performance boost w/ multiprocessing in my experience, so I wouldn't expect it to. I forked the repo so I can just change the # of workers. Ultimately, I got my game_data for the regular season, so I should be fine updating it day-by-day until the end of the tourney. Thanks for the RE's! |
For now, I'm gonna mark this as an issue for a future release so I can push some other fixes. @jnmiller if you're still experiencing hanging on the latest version of CBBpy, let me know what versions of python and the required packages you're using so I can try to replicate. |
Every time I try to scrape a season (men's), the process gets stuck and hangs. Ctrl-C always gives the same stack trace:
Is the source site just detecting the scraping and blocking my IP address? Or is something else going on?
I can sometimes successfully scrape a very short date range (like a weekend) but immediately after a success, it stops working and hangs again.