Increase scraping performance and reliability #9
@stevenleeg Thanks for the feedback and suggestions. The library could definitely use these improvements. I'll try to find some time to incorporate them unless you want to open a PR yourself.
🐐
You're the bomb.com, this works great. And of course thanks to @bvlaicu. I'll upload my edits on my fork, but I have no idea how to merge/however that works. One thing I'm confused about is all the event_loop stuff. I think I understand at a super high level why asyncio is used (if you were running multiple 'apis' on the same server, to avoid race conditions?), and the await syntax is great (not sure if this is built into Python or part of asyncio?), but I basically removed some of it to avoid the double Chromium launch. Anyway, thanks all. I'm big into Selenium and hadn't played with pyppeteer until now; this solves so many JS headaches.
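For what it's worth, `async`/`await` is core Python syntax (since 3.5); asyncio is the standard-library package that supplies the event loop those coroutines run on. The loop lets several slow browser interactions wait concurrently without threads, which is why pyppeteer-based code threads an event loop through everything. A minimal sketch (the `fetch_reading` coroutine and its return shape are made up for illustration):

```python
import asyncio

async def fetch_reading(meter_id: str) -> dict:
    # Stand-in for a slow browser/network round-trip; in the real
    # scraper this would await a pyppeteer page interaction.
    await asyncio.sleep(0.01)
    return {"meter": meter_id, "kwh": 1.5}

async def main() -> list:
    # gather() runs both coroutines concurrently on one event loop;
    # each 'await' yields control so the other can make progress.
    return await asyncio.gather(fetch_reading("a"), fetch_reading("b"))

results = asyncio.run(main())
print(results)
```

Only one Chromium should ever be launched per loop; if two appear, it usually means two separate loops (or two `launch()` calls) are being created.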
Hi there,
First off, I just want to say a huge thanks for building this; it helped me build a nice scraper to track and store my electricity usage, which is something I've been putting off for a very long time.
While trying to get this library to work, I noticed that I would successfully fetch reading data on only about 25% of my attempts. The other 75% would get caught up on authentication issues or some element not showing up as it should. I had a feeling these issues were due to various race conditions associated with trying to scrape a JS-heavy webpage, and wanted to do some refactoring, taking some lessons from how Cypress approaches this kind of work. Namely, I restructured the scraper to watch for and respond to elements appearing/disappearing on the page rather than waiting arbitrary amounts of time and hoping requests finished.
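The pattern above can be sketched with a small polling helper. In pyppeteer the browser-side equivalent is `page.waitForSelector(selector, {'timeout': ...})`, which resolves as soon as the element exists; the `wait_for` helper below is hypothetical and just illustrates "return when the condition holds, fail loudly on timeout" instead of a fixed sleep:

```python
import asyncio
import time

async def wait_for(predicate, timeout=30.0, poll_interval=0.25):
    # Hypothetical helper: poll until predicate() is truthy rather than
    # sleeping a fixed amount and hoping the page has settled.
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        await asyncio.sleep(poll_interval)

# Toy demonstration: a background task "renders" an element after 50 ms,
# and wait_for returns as soon as it appears.
state = {"ready": False}

async def render_later():
    await asyncio.sleep(0.05)
    state["ready"] = True

async def demo():
    task = asyncio.ensure_future(render_later())
    found = await wait_for(lambda: state["ready"], timeout=2.0, poll_interval=0.01)
    await task
    return found

print(asyncio.run(demo()))
```

The payoff is exactly the trade-off described: the scraper is both faster (no over-long sleeps) and more reliable (no under-long ones).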
Here's the code, which is built pretty specifically for the context where I'm using it, but you can see the general ideas:
The results have been promising so far: in my testing this method has been both more reliable and faster than the current implementation, since it doesn't have to wait as long for data. I figured I'd share it here in case you'd like to incorporate the changes into your library (or, if you're open to a PR, I can see if I can make the changes myself), or for others to use as a reference.
One thing I'd also like to add: I would recommend returning all of the reading results rather than just the latest one. AMI data can be lagged or be updated as time goes on (utilities are bad at computers), so if you're trying to scrape and store your meter's data you'll likely want to fetch the whole set of readings and insert/replace each interval in the database you're storing them in. Since running this scraper I've noticed that the latest reading is usually `null` for an hour or so before it starts getting populated with a kWh value. Hope this is helpful!