fingerprint avoidance: getting started with browser emulators and changing viewports, and user-agents #362
Secret Agent will automatically rotate your user-agent within the installed browsers, and will move your browser window around the screen a little. It uses the most popular viewport from https://gs.statcounter.com/. We chose not to change the viewport size by default because it can cause scripts to behave differently on different runs. You can do so, but just make sure to test it out. Many checks are bot-blocker and website specific. Some sites check that your user agent matches a browser and OS exactly; others just put you in a bucket based on statistics they grab from things like fingerprintjs. Your IP is the most frequently checked piece of information beyond the browser itself. You can rotate it using VPNs or proxies, but note that some sites check that your IP generally matches the timezone and language settings you use.
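To illustrate the point about keeping IP, timezone, and language consistent, here is a minimal sketch. The option names `upstreamProxyUrl`, `timezoneId`, and `locale` are what I believe the Agent constructor accepts, so check them against the docs for your version. The `ProxyExit` type, the `REGION_DEFAULTS` table, and the `optionsForProxy` helper are hypothetical names made up for this example.

```typescript
// Hypothetical helper: derive timezone and locale from the proxy's exit country
// so that the three values do not contradict each other.
interface ProxyExit {
  url: string;     // e.g. "http://user:pass@proxy.example:8080" (placeholder)
  country: string; // ISO country code of the proxy's exit IP
}

// Assumed option names; verify against the secret-agent docs.
interface AgentOptionsSketch {
  upstreamProxyUrl: string;
  timezoneId: string;
  locale: string;
}

// Illustrative mapping only; extend it for the regions your proxies use.
const REGION_DEFAULTS: Record<string, { timezoneId: string; locale: string }> = {
  US: { timezoneId: "America/New_York", locale: "en-US" },
  GB: { timezoneId: "Europe/London", locale: "en-GB" },
  DE: { timezoneId: "Europe/Berlin", locale: "de-DE" },
};

function optionsForProxy(proxy: ProxyExit): AgentOptionsSketch {
  const region = REGION_DEFAULTS[proxy.country];
  if (!region) {
    // Fail loudly rather than silently mixing a proxy with a mismatched timezone.
    throw new Error(`No timezone/locale mapping for country ${proxy.country}`);
  }
  return { upstreamProxyUrl: proxy.url, ...region };
}
```

The returned object would then be passed to the agent when it is created, for example `new Agent(optionsForProxy(proxy))`, assuming those option names hold.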
This is very dependent on your site. Some will want to see cookies and local storage in use, so will be aggressive if they think you're "new" to their site. In this case, you'll want to use the
You can configure both every time you create an agent, but you likely only want to change the viewport if you're seeing issues with it, and you likely do not want to change the user agent unless you're seeing issues. It will automatically have some entropy applied by SecretAgent. Playwright is likely just not concerned in the slightest with the size of your screen or the position of your browser; those don't matter for testing sites for the most part. They're part of the DevTools API and part of what can be checked by the browser though, so we vary them.
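If you do decide to set the viewport yourself, the window size (width/height) has to stay plausible relative to the screen size (screenWidth/screenHeight) and the window position. A rough sketch of that constraint follows. The field names are my recollection of the viewport option, so treat them as assumptions, and `viewportWithin` is a hypothetical helper, not part of the library.

```typescript
// Assumed shape of the viewport option; verify field names against the docs.
interface ViewportSketch {
  width: number;        // browser window width
  height: number;       // browser window height
  screenWidth: number;  // emulated monitor width
  screenHeight: number; // emulated monitor height
  positionX: number;    // window offset from the left edge of the screen
  positionY: number;    // window offset from the top edge of the screen
}

// Hypothetical helper: place a window of the given size at a random position
// that still fits entirely on the emulated screen.
function viewportWithin(
  screenWidth: number,
  screenHeight: number,
  width: number,
  height: number,
  rand: () => number = Math.random,
): ViewportSketch {
  if (width > screenWidth || height > screenHeight) {
    throw new Error("Window cannot be larger than the screen");
  }
  // rand() is in [0, 1), so each offset ranges from 0 to (screen - window) inclusive.
  const positionX = Math.floor(rand() * (screenWidth - width + 1));
  const positionY = Math.floor(rand() * (screenHeight - height + 1));
  return { width, height, screenWidth, screenHeight, positionX, positionY };
}
```

As noted above, changing the viewport can make scripts behave differently between runs, so test it against your target sites before relying on it.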
There are many font packages out there you can install. Just varying them should be enough in general - I don't know of bot blockers using OS specific fonts too heavily.
CPU detection is very rare in scraping detection (from what I know), but there are a few bot detectors doing red pills for virtual machines. There's some research into this that the very aggressive bot blockers tried a bit, but it's not 100% reliable, so I think it's still somewhat rare. |
Thanks for the answers above. I will study them some more, but before I ask any followup questions: ... tucked in my questions above was the idea of using a single agent to scrape several urls before closing the agent. This is especially important for me because I have set a resource listener on my agent's Tab. So I tried implementing the following technique for 10+ consecutive urls:
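The problem is that I am getting a timeout on the tab.goto(url) call (step 4), making me think that my design pattern is not even possible because waitForNewTab() does not in fact act like playwright's browser.newPage() function. Is there a way in secret-agent to accomplish the encapsulation that I am trying to do, allowing some kind of async close() call prior to each new url scraped for a given agent? |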
WaitForNewTab is currently more about popups or Control clicking a link to force it to a new tab. I think your approach will be simpler to just create a new agent for each of your scrapes and/or just skip the tab close portion. You can always use a userProfile to restore any state you want between the agents.
… On Oct 22, 2021, at 6:36 PM, andynuss ***@***.***> wrote:
Thanks for the answers above. I will study them some more, but before I ask any followup questions:
... tucked in my questions above was the idea of using a single agent to scrape several urls before closing the agent.
This is similar to the idea of creating a playwright page from a browser, and then closing the Page before going on to the next url to scrape for the "browser" instance.
This is especially important for me because I have set a resource listener on my agent's Tab.
So I tried implementing the following technique for 10+ consecutive urls:
1. optionally call the agent.configure() function before the next scrape (i.e. to maybe change the viewport)
2. call agent.waitForNewTab to get a closeable tab when I am done with the scrape session
3. use tab.on('resource').then() to view the resources that are loaded when I visit the page
4. use tab.goto(url)
5. inject javascript to get various things in the dom and its frames that I am interested in
6. close the Tab created in 2. above so that I am making a clean demarcation for the next goto and its resources
The problem is that I am getting a timeout on the tab.goto(url) call (step 4), making me think
that my design pattern is not even possible because waitForNewTab() does not in fact act like
playwright browser.newPage() function.
Is there a way in secret-agent to accomplish the encapsulation that I am trying to do, allowing some
kind of async close() call prior to each new url scraped for a given agent?
|
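The "new agent per scrape, restore state with a userProfile" suggestion could look roughly like the sketch below. It assumes the agent exposes `exportUserProfile()` and accepts a `userProfile` option at creation, which matches my reading of the docs but should be verified. `AgentLike`, `AgentFactory`, and `scrapeWithFreshAgents` are hypothetical names used only to keep the example self-contained.

```typescript
// Minimal structural view of the agent methods this sketch relies on (assumed API).
interface AgentLike {
  goto(url: string): Promise<unknown>;
  exportUserProfile(): Promise<unknown>;
  close(): Promise<void>;
}

// Hypothetical factory; with the real library this would wrap the Agent constructor.
type AgentFactory = (options: { userProfile?: unknown }) => AgentLike;

// Scrape each url with a fresh agent, carrying cookies/storage forward
// through the exported profile of the previous agent.
async function scrapeWithFreshAgents(
  urls: string[],
  createAgent: AgentFactory,
  scrape: (agent: AgentLike, url: string) => Promise<void>,
): Promise<unknown> {
  let userProfile: unknown; // undefined for the first agent
  for (const url of urls) {
    const agent = createAgent({ userProfile });
    try {
      await agent.goto(url);
      await scrape(agent, url);
      // Only updated after a successful scrape; an error propagates after close().
      userProfile = await agent.exportUserProfile();
    } finally {
      await agent.close();
    }
  }
  return userProfile; // the last exported profile, e.g. to persist elsewhere
}
```

With the real library, `createAgent` would presumably be something like `(options) => new Agent(options)`. The returned profile could be serialized and stored so a later agent can pick it up.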
Sounds good. |
I was wondering how long it took on average to create an agent (the default way) on my ec2 instance, and after the vm warms up, it appears to fluctuate between 2.0 and 4.5 seconds. This does seem significant, and it seems like it would be good to avoid creating a new agent for each page. I assume that if it wasn't for the fact that it is important for me to listen to which web-fonts are being requested for a given goto event/page, then I could simply navigate to another page with the same agent. So what I was thinking was that I could navigate (with the agent.goto method) to my own empty localhost page url (on that ec2 instance) after I decide I am done with the information I need for page N, then reset the fontList for the next page, and then immediately navigate to page N+1. Will this allow me to reuse an agent for several agent.goto() calls in a row without being confused as to which resources go to which? |
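The bookkeeping I have in mind for this could be sketched as follows. `ResourceLedger` is a hypothetical helper of my own, not part of secret-agent, and it only tracks which page is "current" when a resource url arrives.

```typescript
// Hypothetical helper: attribute resource urls to whichever page is currently active.
class ResourceLedger {
  private current: string | null = null;
  private readonly byPage = new Map<string, string[]>();

  // Call right before navigating to page N. Revisiting a url resets its list.
  startPage(pageUrl: string): void {
    this.current = pageUrl;
    this.byPage.set(pageUrl, []);
  }

  // Call once done with page N, e.g. right before navigating to the blank page.
  endPage(): void {
    this.current = null;
  }

  // Call from the resource listener. Resources seen between pages are ignored.
  record(resourceUrl: string): void {
    if (this.current === null) return;
    this.byPage.get(this.current)!.push(resourceUrl);
  }

  resourcesFor(pageUrl: string): string[] {
    return this.byPage.get(pageUrl) ?? [];
  }
}
```

The listener would be wired up once per agent, something like `agent.activeTab.on('resource', r => ledger.record(r.url))`, assuming the resource object exposes a `url`. One caveat: a request from page N that completes after `endPage()` is dropped rather than attributed, which is the trade-off of using the blank page as a separator.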
Seems reasonable. We could probably also surface information about which "document" is requesting a given resource. Is that 2-4.5 seconds under load? Our unit test suite runs on Linux on small-ish machines, and I'm under the impression it's launching new "browser contexts" far faster than that, given the number of tests and the time it takes to complete. That said, I haven't done any real analysis, so maybe I'm wrong... |
Concerning my test on the reported slowness of creating an agent:
noting that generally there are two simultaneous scrape sessions sharing the same 1 cpu m3.medium,
NOTE: the goto time is the actual call to agent.goto, plus the call to agent.activeTab.waitForLoad, plus the call to my custom plugin function to wait an additional time for all the iframes up to two levels to load. And the eval time is the time to do my scraping of html and stylesheets from the main frame and other iframes. |
In other words, the agent.close() is slow enough that it will make sense to re-use the same agent for successive request urls as I talked about above. But I noticed that you mentioned the importance for some sites to do something even stronger than re-use an agent for say 20 consecutive scrapes. That somehow I need to ensure that when I scrape medium.com or guardian.com, I somehow keep cookies and local and idb storage accumulating from the last The problem is, a given ec2 instance may only scrape 300 total urls before being rotated. And in this issue, I was talking about rotating an "agent" after just 20 scrapes. I guess I can figure out some way to load and save profiles from my own central server, but how exactly do I apply a given saved profile to a newly opened agent? And how |
My understanding of secret-agent is that one should let secret-agent do most of the heavy lifting for
browser emulation with the goal of avoiding fingerprinting.
For a newbie relative to avoiding bot-detection, it seems like the most important things are the user-agent that
"seems" to be in use, the viewport that seems to be in use, the fonts that seem to be installed, and various device
properties of the os that "seems" to be in use, versus the real OS our scraper is running on
(in our case, ideally an ec2 instance).
If we simply create a new Agent without any browser emulator plugin guidance, and specifying some reasonably random viewport of our choice, are we getting secret-agent's anti-fingerprinting by default? I.e. what user-agent
will we get for each Agent constructed, and what viewport?
Assuming that you are fairly aggressive in using secret-agent to create a unique fingerprint for
your constructed agent, is it advisable to reuse the agent for several consecutive scrapes (i.e. for
10 or so page urls, waiting for a new tab, using that tab to scrape the url, and then closing the tab)?
(This is on the assumption that it can take significant time to create a new agent that is "bot-proof".)
Is there a proper way to guide secret-agent with the use of the DefaultBrowserEmulator to select our
own reasonable user-agent and viewport? Also, in terms of viewport specifically, I noticed that secret-agent
distinguishes between width/height and screenWidth/screenHeight, but playwright only has a single
width and height. Can you explain that?
Is there anything we can do to help avoid being fingerprinted based on the lack of fonts installed on
our linux ec2 instance?
Is it important not to use a really slow (cheap) aws ec2 machine whose cpu is only a fraction of a full
cpu (such as m3.medium, or even cheaper instances, or even lambda)? I.e. this would lead to the
page load event (followed by "scrape") taking quite a long time compared to what is "normal", such
as inflating 10 seconds to as many as 60 for a slow-loading page.