To avoid issues like ukwa/ukwa-heritrix#76, we could:

- When storing in Kafka, key the launch requests by distinct seed/host configuration.
- Switch to a compacted Kafka topic so only the latest crawl spec is kept per unique site key.
- Ensure Heritrix always reads the whole Kafka topic, thus reconstructing the up-to-date configuration.

The cleanest option would be to gather seeds together and assemble an overall launch spec per host, then change how Heritrix processes that spec to set up the queue config and enqueue all the seeds, with each seed's `launchTimestamp` reflecting its most recent launch.
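As a rough illustration of the per-host launch spec idea, here is a minimal Python sketch using only the standard library. It does not talk to Kafka or Heritrix: `compact` merely mimics the keep-latest-per-key behaviour of a compacted topic, and the names `HostLaunchSpec`, `host_key`, `merge_launch`, `seed` and `queue_config` are hypothetical, chosen for illustration. Only `launchTimestamp` comes from the proposal itself, and it is assumed here to be a fixed-width string (for example 14 digits) so that string comparison matches time order.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class HostLaunchSpec:
    """Hypothetical per-host launch spec: one queue config plus all seeds."""
    host: str
    queue_config: dict = field(default_factory=dict)
    # Maps seed URL -> most recent launchTimestamp seen for that seed.
    seeds: dict = field(default_factory=dict)


def host_key(seed_url: str) -> str:
    """Derive the message key from the seed's host name (lower-cased)."""
    return (urlsplit(seed_url).hostname or "").lower()


def merge_launch(specs: dict, request: dict) -> HostLaunchSpec:
    """Fold one launch request into the spec for its host.

    Assumes launchTimestamp is a fixed-width string, so max() on strings
    picks the most recent launch.
    """
    key = host_key(request["seed"])
    spec = specs.setdefault(key, HostLaunchSpec(host=key))
    # Later requests override earlier queue settings (arrival order).
    spec.queue_config.update(request.get("queue_config", {}))
    previous = spec.seeds.get(request["seed"], "")
    spec.seeds[request["seed"]] = max(previous, request["launchTimestamp"])
    return spec


def compact(messages):
    """Mimic log compaction: keep only the last value seen for each key."""
    latest = {}
    for key, value in messages:
        latest[key] = value
    return latest
```

In this sketch, each call to `merge_launch` would be followed by publishing the returned spec under its `host` key. A consumer that reads the whole topic from the start then ends up with the same view that `compact` produces: one up-to-date spec per host, carrying every seed and its latest `launchTimestamp`.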