Using Generator2 leads to fetcher failure #32
Hi @whjshj, could you share your configuration (at least all custom-set properties related to Generator2, Generator, URLPartitioner and Fetcher)? Sharing the log files (job client stdout and hadoop.log or task logs) would also help to debug the issue, thanks! Two comments so far:
Hello @sebastian-nagel, I'm using the settings from https://github.com/commoncrawl/cc-nutch-example. I only changed the number of threads to 1. In the initial fetching phase, it runs normally. However, after some time, only one thread is alive, and it is just waiting, even though there is still data in the queue. The data isn't being selected because doing so would exceed the maximum number of threads, and the only active thread isn't processing the data. This is quite strange. Then a timeout is triggered, which is the map task timeout you mentioned. Have you encountered this situation before?
Then the queue is blocked because the host of this queue responded (repeatedly) with an HTTP status code indicating a server error. This is quite common for wider web crawls, but it shouldn't happen if you crawl your intranet or own server. There are two options to ensure that the fetcher is fetching:
If either the time limit or the throughput threshold is hit, the current fetching cycle is stopped and the output is written to disk/HDFS. The script will then continue. In order to figure out the reason for the slow fetching, I need the log file.
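A minimal sketch of the two stop conditions mentioned above (a time limit and a throughput threshold), in plain Java. The class, method and parameter names are hypothetical and chosen for illustration only; they are not Nutch's property names or its implementation, and the real fetcher may measure throughput differently (for example over a shorter window rather than as an average since the start).

```java
public class FetchStopSketch {

    /** True once the fetch cycle has run for at least timeLimitMins minutes. */
    static boolean timeLimitHit(long startMillis, long nowMillis, int timeLimitMins) {
        if (timeLimitMins <= 0) {
            return false; // assumption: a non-positive limit disables the check
        }
        return nowMillis - startMillis >= timeLimitMins * 60_000L;
    }

    /** True if the average rate since the start is below minPagesPerSec. */
    static boolean throughputTooLow(long pagesFetched, long elapsedSecs, int minPagesPerSec) {
        if (minPagesPerSec <= 0 || elapsedSecs <= 0) {
            return false; // assumption: check disabled, or no time has passed yet
        }
        return (double) pagesFetched / elapsedSecs < minPagesPerSec;
    }

    /** Stop the current cycle if either condition holds. */
    static boolean shouldStop(long startMillis, long nowMillis, long pagesFetched,
                              int timeLimitMins, int minPagesPerSec) {
        long elapsedSecs = (nowMillis - startMillis) / 1000;
        return timeLimitHit(startMillis, nowMillis, timeLimitMins)
                || throughputTooLow(pagesFetched, elapsedSecs, minPagesPerSec);
    }

    public static void main(String[] args) {
        // 30 minutes in, 900 pages fetched, 180-minute limit, threshold of 1 page/s.
        System.out.println(shouldStop(0L, 30 * 60_000L, 900L, 180, 1));
    }
}
```

With a single thread and one slow or blocked host, the average rate can fall below such a threshold long before a time limit is reached, which is why both checks are useful together.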
hadoop.log
Hi @whjshj, according to the hadoop.log, the fetch job fails in the reduce phase when writing the WARC files. The native libraries for the language detector are not installed:
Hello @sebastian-nagel, I have identified the cause of the error when writing the WARC file, and I've already resolved the issue. Please take a look at the section of the log before the WARC file is written. I have taken a screenshot. Can you see it?
Great! Ok, I see:
Thank you for your response; my confusion has been resolved. May I ask how Nutch-cc currently performs when crawling web pages? For example, in one fetch iteration, if a total of 1000 web pages need to be downloaded, how many of them are successfully downloaded in the end?
This totally depends on the fetch list:
Thank you very much for your response. In the fetch stage, the map phase does the downloading. Do you have any recommendations for the map-related configuration? Currently, I have set each map task to 1 core and 2GB of memory. Is this configuration reasonable?
Yes, possible, under the assumption that
If you want to scale up, it's more efficient to parallelize first using threads (up to several hundred threads). Of course, more threads mean higher memory requirements to buffer the incoming data. Scaling up also requires adjusting many more parameters to your setup and context: connection pools, timeouts, etc.
Thank you for your response. I am currently looking to deploy Nutch-cc on a large scale. Could you suggest some recommended configurations? For example, how many CPU cores and how much memory should be allocated to each map task and each reduce task? Additionally, during the fetch phase, what would be an appropriate number of concurrent download threads to set?
It is difficult to recommend a final cluster configuration, because it depends on the kind of crawl and the Hadoop cluster setup. A few tips:
Also important: choose a unique agent name together with contact information. You'll receive feedback from angry webmasters! Scaling up and staying polite is a challenge, but can be mastered.
Thank you very much for your response. When you mentioned that a 4-core, 32GB machine can crawl 500,000 web pages in an hour, does that refer to successfully downloaded pages, or does it include 404 errors or pages blocked by robots.txt? When I successfully download 250,000 pages per hour, my machine starts to become unstable and begins to report errors.
Yes, it means 500k successfully fetched web pages, given a success ratio of about 70%. 250k pages per hour isn't that bad already! But as said: every setup is different and requires tuning, which takes time.
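To make the numbers above concrete: if 500k pages per hour are fetched successfully at a success ratio of about 70%, the fetcher has to attempt roughly 714k URLs per hour, or about 198 per second. The small Java sketch below does this arithmetic; the class and method names are hypothetical, and the 70% ratio is the approximate figure from the comment above, not a measured value.

```java
public class FetchRateSketch {

    /** Number of fetch attempts needed to obtain the given number of successes. */
    static long attemptsNeeded(long successfulPages, double successRatio) {
        if (successRatio <= 0.0 || successRatio > 1.0) {
            throw new IllegalArgumentException("successRatio must be in (0, 1]");
        }
        return Math.round(successfulPages / successRatio);
    }

    /** Converts an hourly count into an average per-second rate. */
    static double perSecond(long perHour) {
        return perHour / 3600.0;
    }

    public static void main(String[] args) {
        long attempts = attemptsNeeded(500_000L, 0.70);
        System.out.printf("attempts per hour: %d%n", attempts);
        System.out.printf("attempts per second: %.1f%n", perSecond(attempts));
        // The 250k successful pages per hour reported in the question, for comparison.
        System.out.printf("250k/h as pages per second: %.1f%n", perSecond(250_000L));
    }
}
```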
For a machine with 4 cores and 32GB of memory, how many map tasks would you allocate in the fetch phase? According to the ratio you provided, is it 1 core per 8GB of memory, or is one map task allocated 4 cores and 32GB of memory?
But, as said, try it out - different cluster hardware, crawl configuration or distribution of crawled hosts may work better with other setups.
Thank you for your response. Could you recommend memory and CPU configurations for the reduce phase of the fetch job? I always encounter failures during that stage. Also, could you tell me how many web pages are typically downloaded in one batch during a round of scheduling?
When I use Generator2 to generate the fetch list and set the number of fetcher threads to 1, a timeout is triggered (long timeout = conf.getInt("mapreduce.task.timeout", 10 * 60 * 1000)), causing the download task to terminate.
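A minimal sketch of how a timeout like the one quoted above might be read and checked, assuming the fetcher compares the start time of the last request against it. `java.util.Properties` stands in for the Hadoop configuration object so the example runs without Hadoop on the classpath; the class name and the `isHung` helper are hypothetical and not taken from the Nutch source.

```java
import java.util.Properties;

public class FetchTimeoutSketch {

    /** Reads the timeout in milliseconds, defaulting to 10 minutes if unset. */
    static long timeoutMillis(Properties conf) {
        String value = conf.getProperty("mapreduce.task.timeout");
        return value == null ? 10 * 60 * 1000 : Integer.parseInt(value.trim());
    }

    /** True if the last request started longer ago than the timeout allows. */
    static boolean isHung(long lastRequestStartMillis, long nowMillis, long timeoutMillis) {
        return nowMillis - lastRequestStartMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // nothing set, so the default applies
        long timeout = timeoutMillis(conf);
        long lastRequestStart = System.currentTimeMillis() - 11 * 60 * 1000L;
        // A single thread that has not started a request for 11 minutes
        // exceeds the default timeout and would be treated as hung.
        System.out.println(isHung(lastRequestStart, System.currentTimeMillis(), timeout));
    }
}
```

Under this assumption, a single waiting thread is enough to trip the check: with only one thread there is no other request activity, so the whole task looks hung once the timeout has passed.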