Scraping always pauses and doesn't finish #51

jnmiller · 2024-03-13T02:09:58Z

Every time I try to scrape a season (men's), the process gets stuck and hangs. Ctrl-C always gives the same stack trace:

Getting data for season 2022
No games on 11/08/21:   4%|███▌                                                                              | 8 of 182 days scraped in 3.3 sec
Scraping 184 games on 11/09/21:   4%|███                                                                   | 8 of 182 days scraped in 204.8 sec
Traceback (most recent call last):
  File "<SNIP>/./scrape.py", line 30, in <module>
    infos, box_scores, pbps = scraper.get_games_season(season, info=True, box=False, pbp=False)
  File "<SNIP>/lib/python3.10/site-packages/cbbpy/mens_scraper.py", line 80, in get_games_season
    return _get_games_season(season, "mens", info, box, pbp)
  File "<SNIP>/lib/python3.10/site-packages/cbbpy/cbbpy_utils.py", line 233, in _get_games_season
    info = _get_games_range(
  File "<SNIP>/lib/python3.10/site-packages/cbbpy/cbbpy_utils.py", line 186, in _get_games_range
    result = Parallel(n_jobs=cpus)(
  File "<SNIP>/lib/python3.10/site-packages/joblib/parallel.py", line 1952, in __call__
    return output if self.return_generator else list(output)
  File "<SNIP>/lib/python3.10/site-packages/joblib/parallel.py", line 1595, in _get_outputs
    yield from self._retrieve()
  File "<SNIP>/lib/python3.10/site-packages/joblib/parallel.py", line 1707, in _retrieve
    time.sleep(0.01)
KeyboardInterrupt

Is the source site just detecting the scraping and blocking my IP address? Or is something else going on?

I can sometimes successfully scrape a very short date range (like a weekend) but immediately after a success, it stops working and hangs again.


dcstats · 2024-03-14T17:26:19Z

Will look into this, thanks!

jnmiller · 2024-03-15T19:10:00Z

It's sure looking like a bot detector: starting fresh (no attempts in the last 12-24h), it will scrape 100-250 games, then stop. I removed the joblib parallel loop, making it sequential, then ran the debugger and eventually got a request returning a 503. When I open that URL in a browser, it also shows an error. But when I browse some other pages and try that page again later, it starts working in both the browser and the scraper (presumably it identifies my IP address as a human browsing again?).

Some mitigations might be

Add timeouts so it doesn't hang as long. Longer backoff if getting a 503
Configurable slowdown: use lower concurrency in joblib, or add random sleeps after reading each page
Add support for rotating proxies
Try adding received cookies into the requests, maybe it's detecting scraping based on cookies being absent (tricky if combined with proxies)
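
A rough sketch of the timeout, 503 backoff, and random-sleep ideas from the list above, not CBBpy's actual code. The function name `fetch_with_backoff`, its parameters, and the delay values are all made up for illustration; `get` is assumed to be any callable shaped like `requests.get` that returns an object with a `status_code` attribute.

```python
import random
import time


def fetch_with_backoff(get, url, max_tries=5, base_delay=2.0, timeout=10.0,
                       sleep=time.sleep):
    """Call get(url, timeout=...) and retry with exponential backoff on a 503.

    `get` is a hypothetical stand-in for something like requests.get.
    """
    for attempt in range(max_tries):
        # The timeout keeps a single stalled request from hanging the whole run.
        response = get(url, timeout=timeout)
        if response.status_code != 503:
            # Short random pause so successive page loads are not evenly spaced.
            sleep(random.uniform(0.5, 1.5))
            return response
        if attempt < max_tries - 1:
            # 503: wait base_delay * 2**attempt seconds plus jitter, then retry.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    raise RuntimeError(f"still getting 503 after {max_tries} tries: {url}")
```

With requests installed, something like `fetch_with_backoff(requests.Session().get, url)` should also carry received cookies from one call to the next, which would cover the cookie idea as well.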

I could possibly contribute if time allows. In the meantime is this data downloadable in bulk anywhere (at least 2010-2024 seasons)? I've looked and haven't yet found a free source with that whole time span and including pbp.

dcstats · 2024-03-15T19:16:37Z

@jnmiller interesting... the scraper uses rotating headers that have helped with the bot detection to the point where I've never had it block any of my scrapes. I haven't had the chance to run it since you raised this issue, so it's definitely possible that they've added more robust bot detection, but I don't see any issues raised on the cousin package for R (ncaahoopR), so I'm thinking this might be something different. let me try scraping a season when I get a second, but in the meantime I do have some data I can send you. what's your email?

jnmiller · 2024-03-16T01:12:14Z

Thanks, that would be great! G-mail: jarednmiller

dcstats · 2024-03-19T21:24:46Z

@jnmiller sent. I scraped the 23-24 season last night without issue, so I'm not sure what could be causing this issue. I'll still add some of these mitigations, but I'll have to do some more digging to figure out what might be causing this issue to pop up selectively

Mstolte02 · 2024-03-19T23:49:30Z

I am having the same issue unfortunately. Any chance you'd have data from 2017 to 2023 handy?

dcstats · 2024-03-20T00:57:59Z

@Mstolte02 @jnmiller could you both tell me what versions of python as well as the packages cbbpy, pandas, numpy, python-dateutil, pytz, tqdm, lxml, joblib, beautifulsoup4, and requests you're using? want to see if I can replicate this issue
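
One possible way to gather those versions in one go, using only the standard library. This is a sketch, and `collect_versions` is a hypothetical helper name, not something CBBpy provides; the package names are the ones listed above.

```python
import platform
from importlib import metadata

PACKAGES = ["cbbpy", "pandas", "numpy", "python-dateutil", "pytz", "tqdm",
            "lxml", "joblib", "beautifulsoup4", "requests"]


def collect_versions(packages=PACKAGES):
    """Return {name: version} for Python and each package; None if not installed."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter that runs the scraper, since a pip install and a conda environment on the same machine can report different versions.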

@Mstolte02 what's your email? I can send you data

Mstolte02 · 2024-03-20T03:24:46Z

***@***.***


dcstats · 2024-03-20T04:24:14Z

@Mstolte02 github obfuscates email addresses - send me an email (you can find mine at the bottom of CBBpy's README) and I'll reply with the data

dgilmore33 · 2024-03-21T22:02:15Z

I'm on date 11/13/23, looks like it just takes a long **s (not an email) time.

@dcstats could you open & assign me the issue of speeding up the method? I could use multi-threading and a rate-limiter. Once you do, I'll email you on the CBB.py email.

Thanks for making this repo! Looking forward to working together :)

dcstats · 2024-03-21T22:05:58Z

@dgilmore33 could you tell me what versions of python and the required packages you're using? I want to replicate this issue first, because locally I'm able to scrape entire seasons in around 30 minutes

dgilmore33 · 2024-03-21T22:15:32Z

@dcstats honestly I don't have an "issue", I'm used to long times to load data. I'll live.

Also, the more I think about it, better to keep a full season scrape at the current timeframe

version : 3.9.6
packages :

pip

altgraph 0.17.2
appnope 0.1.4
asttokens 2.4.1
attrs 23.2.0
bleach 6.1.0
certifi 2023.7.22
charset-normalizer 3.3.1
comm 0.2.2
cssselect 1.2.0
debugpy 1.8.1
decorator 5.1.1
exceptiongroup 1.2.0
executing 2.0.1
fastjsonschema 2.19.1
future 0.18.2
GDAL 3.8.4
idna 3.4
importlib_metadata 7.0.2
ipykernel 6.29.3
ipython 8.18.1
jedi 0.19.1
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter_client 8.6.1
jupyter_core 5.7.2
kaggle 1.5.16
lxml 5.1.0
macholib 1.15.2
matplotlib-inline 0.1.6
nbformat 5.10.3
nest-asyncio 1.6.0
numpy 1.26.2
packaging 24.0
pandas 2.1.4
parso 0.8.3
pexpect 4.9.0
pip 21.2.4
platformdirs 4.2.0
prompt-toolkit 3.0.43
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
Pygments 2.17.2
pyquery 2.0.0
python-dateutil 2.8.2
python-slugify 8.0.1
pytz 2023.3.post1
pyzmq 25.1.2
referencing 0.34.0
requests 2.31.0
rpds-py 0.18.0
setuptools 58.0.4
six 1.15.0
stack-data 0.6.3
text-unidecode 1.3
tornado 6.4
tqdm 4.66.1
traitlets 5.14.2
typing_extensions 4.10.0
tzdata 2023.3
urllib3 2.0.7
wcwidth 0.2.13
webencodings 0.5.1
wheel 0.37.0
zipp 3.18.1

conda==23.7.4

appnope 0.1.3 pyhd8ed1ab_0 conda-forge
asttokens 2.4.1 pyhd8ed1ab_0 conda-forge
attrs 23.2.0 pyh71513ae_0 conda-forge
beautifulsoup4 4.12.3 pypi_0 pypi
blas 1.0 mkl
blinker 1.7.0 pyhd8ed1ab_0 conda-forge
bottleneck 1.3.5 py311hb9e55a9_0
brotli 1.0.9 hca72f7f_7
brotli-bin 1.0.9 hca72f7f_7
brotli-python 1.0.9 py311h814d153_8 conda-forge
bs4 0.0.2 pypi_0 pypi
bzip2 1.0.8 h1de35cc_0
ca-certificates 2024.2.2 h8857fd0_0 conda-forge
cbbpy 2.0.2 pypi_0 pypi
certifi 2023.11.17 pypi_0 pypi
charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge
click 8.1.7 unix_pyh707e725_0 conda-forge
comm 0.1.4 pyhd8ed1ab_0 conda-forge
contourpy 1.2.0 py311ha357a0b_0
cssselect 1.2.0 pypi_0 pypi
cycler 0.11.0 pyhd3eb1b0_0
dash 2.16.1 pyhd8ed1ab_0 conda-forge
debugpy 1.6.7 py311hcec6c5f_0
decorator 5.1.1 pyhd8ed1ab_0 conda-forge
exceptiongroup 1.2.0 pyhd8ed1ab_0 conda-forge
executing 2.0.1 pyhd8ed1ab_0 conda-forge
flask 3.0.2 pyhd8ed1ab_0 conda-forge
fonttools 4.25.0 pyhd3eb1b0_0
freetype 2.12.1 hd8bbffd_0
giflib 5.2.1 h6c40b1e_3
idna 3.6 pyhd8ed1ab_0 conda-forge
importlib-metadata 7.0.0 pyha770c72_0 conda-forge
importlib_metadata 7.0.0 hd8ed1ab_0 conda-forge
importlib_resources 6.3.2 pyhd8ed1ab_0 conda-forge
intel-openmp 2023.1.0 ha357a0b_43548
ipykernel 6.26.0 pyh3cd1d5f_0 conda-forge
ipython 8.18.1 pyh707e725_3 conda-forge
itsdangerous 2.1.2 pyhd8ed1ab_0 conda-forge
jedi 0.19.1 pyhd8ed1ab_0 conda-forge
jinja2 3.1.3 pyhd8ed1ab_0 conda-forge
joblib 1.3.2 pypi_0 pypi
jpeg 9e h6c40b1e_1
jsonschema 4.21.1 pyhd8ed1ab_0 conda-forge
jsonschema-specifications 2023.12.1 pyhd8ed1ab_0 conda-forge
jupyter_client 8.6.0 pyhd8ed1ab_0 conda-forge
jupyter_core 5.5.1 py311h6eed73b_0 conda-forge
kiwisolver 1.4.4 py311hcec6c5f_0
lcms2 2.12 hf1fd2bf_0
lerc 3.0 he9d5cce_0
libbrotlicommon 1.0.9 hca72f7f_7
libbrotlidec 1.0.9 hca72f7f_7
libbrotlienc 1.0.9 hca72f7f_7
libcxx 14.0.6 h9765a3e_0
libdeflate 1.17 hb664fd8_1
libffi 3.4.4 hecd8cb5_0
libpng 1.6.39 h6c40b1e_0
libsodium 1.0.18 hbcb3906_1 conda-forge
libtiff 4.5.1 hcec6c5f_0
libwebp 1.3.2 hf6ce154_0
libwebp-base 1.3.2 h6c40b1e_0
lxml 5.1.0 pypi_0 pypi
lz4-c 1.9.4 hcec6c5f_0
markupsafe 2.1.5 py311he705e18_0 conda-forge
matplotlib 3.8.0 py311hecd8cb5_0
matplotlib-base 3.8.0 py311h41a4f6b_0
matplotlib-inline 0.1.6 pyhd8ed1ab_0 conda-forge
mkl 2023.1.0 h8e150cf_43560
mkl-service 2.4.0 py311h6c40b1e_1
mkl_fft 1.3.8 py311h6c40b1e_0
mkl_random 1.2.4 py311ha357a0b_0
munkres 1.1.4 py_0
nba-api 1.4.1 pypi_0 pypi
nbformat 5.10.3 pyhd8ed1ab_0 conda-forge
ncurses 6.4 hcec6c5f_0
nest-asyncio 1.5.8 pyhd8ed1ab_0 conda-forge
numexpr 2.8.7 py311h728a8a3_0
numpy 1.26.2 py311h728a8a3_0
numpy-base 1.26.2 py311h53bf9ac_0
openjpeg 2.4.0 h66ea3da_0
openssl 3.2.1 hd75f5a5_1 conda-forge
packaging 23.1 py311hecd8cb5_0
pandas 2.1.4 py311hdb55bb0_0
parso 0.8.3 pyhd8ed1ab_0 conda-forge
pexpect 4.8.0 pyh1a96a4e_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 10.0.1 py311h7d39338_0
pip 23.3.1 py311hecd8cb5_0
pkgutil-resolve-name 1.3.10 pyhd8ed1ab_1 conda-forge
platformdirs 4.1.0 pyhd8ed1ab_0 conda-forge
plotly 5.19.0 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.42 pyha770c72_0 conda-forge
psutil 5.9.7 py311he705e18_0 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge
pygments 2.17.2 pyhd8ed1ab_0 conda-forge
pyparsing 3.0.9 py311hecd8cb5_0
pyquery 2.0.0 pypi_0 pypi
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.11.5 hf27a42d_0
python-dateutil 2.8.2 pyhd3eb1b0_0
python-fastjsonschema 2.19.1 pyhd8ed1ab_0 conda-forge
python-tzdata 2023.3 pyhd3eb1b0_0
python_abi 3.11 2_cp311 conda-forge
pytz 2023.3.post1 py311hecd8cb5_0
pyzmq 24.0.1 py311habfacb3_1 conda-forge
readline 8.2 hca72f7f_0
referencing 0.34.0 pyhd8ed1ab_0 conda-forge
requests 2.31.0 pyhd8ed1ab_0 conda-forge
retrying 1.3.3 py_2 conda-forge
rpds-py 0.18.0 py311hd64b9fd_0 conda-forge
setuptools 68.2.2 py311hecd8cb5_0
six 1.16.0 pyhd3eb1b0_1
soupsieve 2.5 pypi_0 pypi
sportsipy 0.6.0 pypi_0 pypi
sportsreference 0.5.2 pypi_0 pypi
sqlite 3.41.2 h6c40b1e_0
stack_data 0.6.2 pyhd8ed1ab_0 conda-forge
tbb 2021.8.0 ha357a0b_0
tenacity 8.2.3 pyhd8ed1ab_0 conda-forge
tk 8.6.12 h5d9f67b_0
tornado 6.3.3 py311h6c40b1e_0
tqdm 4.66.2 pypi_0 pypi
traitlets 5.14.0 pyhd8ed1ab_0 conda-forge
typing-extensions 4.9.0 hd8ed1ab_0 conda-forge
typing_extensions 4.9.0 pyha770c72_0 conda-forge
tzdata 2023c h04d1e81_0
urllib3 2.1.0 pypi_0 pypi
wcwidth 0.2.12 pyhd8ed1ab_0 conda-forge
werkzeug 3.0.1 pyhd8ed1ab_0 conda-forge
wheel 0.41.2 py311hecd8cb5_0
xz 5.4.5 h6c40b1e_0
zeromq 4.3.4 h23ab428_0
zipp 3.17.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.13 h4dc903c_0
zstd 1.5.5 hc035e20_0

dcstats · 2024-03-21T22:37:58Z

@dgilmore33 how long is scraping taking for you? if it's anything longer than 30 seconds per day, I think it's worth looking into speeding it up. I could also do something as simple as increasing the number of concurrently running jobs. I'm using multiprocessing, but you mentioned multithreading - would multithreading be better for this than multiprocessing?
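
On the threads-versus-processes question: fetching pages is mostly waiting on the network, and Python threads release the GIL during blocking I/O, so a thread pool is a common fit for this kind of workload. The sketch below uses the standard library's `ThreadPoolExecutor`. `fetch_game` and `fetch_all` are hypothetical names, `fetch_game` only sleeps instead of making a request, and none of this is CBBpy's actual code. joblib has a comparable switch in `Parallel(n_jobs=..., prefer="threads")`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_game(game_id):
    """Hypothetical stand-in for one page request: just waits, like network I/O."""
    time.sleep(0.05)
    return game_id, "ok"


def fetch_all(game_ids, max_workers=4):
    """Fetch every game with a bounded pool of threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_game, game_ids))


if __name__ == "__main__":
    start = time.perf_counter()
    results = fetch_all(range(8))
    print(len(results), "games in", round(time.perf_counter() - start, 2), "sec")
```

Whether this beats multiprocessing here would need measuring on a real scrape. Lowering `max_workers` is also a simple way to slow the scraper down if the site is throttling requests.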

crdarlin · 2024-03-21T23:01:53Z

Have you updated to the latest version? This might be the issue that was corrected with one of the recent bug fixes related to the requests package update.


dgilmore33 · 2024-03-21T23:12:09Z

@crdarlin I'll update requests now, thx for the tip

@dcstats multithreading hasn't provided a performance boost w/ multiprocessing in my experience, so I wouldn't expect it to. I forked the repo so I can just change the # of workers.

Ultimately, I got my game_data for the regular season, so I should be fine updating it day-by-day until the end of the tourney. Thanks for the RE's!

dcstats · 2024-03-22T03:36:21Z

For now, I'm gonna mark this as an issue for a future release so I can push some other fixes. @jnmiller if you're still experiencing hanging on the latest version of CBBpy, let me know what versions of python and the required packages you're using so I can try to replicate.

dcstats self-assigned this Mar 15, 2024

dcstats added the bug Something isn't working label Mar 15, 2024

dcstats added this to the 2.0.3 milestone Mar 15, 2024

dcstats added a commit that referenced this issue Mar 19, 2024

random sleeps #51

3fd4da9

dcstats modified the milestones: 2.0.3, 2.1.0 Mar 22, 2024

dcstats removed this from the 2.1.0 milestone Sep 13, 2024
