Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better handling of extracted URIs that are "data URIs" (base64 encoded media) #422

Open
kris-sigur opened this issue Jul 29, 2021 · 1 comment

Comments

@kris-sigur
Copy link
Collaborator

Data URIs seem to trip up the Extractor module. Excerpt from log follows with truncated URI:

Jun 18, 2021 6:47:08 PM org.archive.modules.extractor.ExtractorSitemap recordOutlink
WARNING: URIException when recording outlink http://lindaskoli.is/--18K truncated-- (in thread 'ToeThread #54: http://lindaskoli.is/post-sitemap.xml'; in processor 'extractorSitemap')
org.apache.commons.httpclient.URIException: URI length > 2083: http://lindaskoli.is/-- 18K truncated again--
        at org.archive.url.UsableURIFactory.fixup(UsableURIFactory.java:357)
        at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:301)
        at org.archive.net.UURIFactory.getInstance(UURIFactory.java:55)
        at org.archive.modules.extractor.Extractor.addRelativeToBase(Extractor.java:190)
        at org.archive.modules.extractor.ExtractorSitemap.recordOutlink(ExtractorSitemap.java:163)
        at org.archive.modules.extractor.ExtractorSitemap.innerExtract(ExtractorSitemap.java:105)
        at org.archive.modules.extractor.ContentExtractor.extract(ContentExtractor.java:37)
        at org.archive.modules.extractor.Extractor.innerProcess(Extractor.java:102)
        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

These can be quite large and, as seen above, are output into the log twice. In addition to any other remedies it might be best to modify the logging code to truncate offending URI in the log to avoid these excessive log spams. The above was only about 18K. The log of my last large scale crawl contains much larger examples.

More specifically useful would be for the Extractor class to detect data uris and either ignore them or pass them to a service that decodes them and does link extraction if the mimetype warrants it. Although I think data uris are rarely (ever?) used for mimetypes that might contain further URIs. My only experience of them has been images.

Some of this logic may belong in the underlying url libraries in webarchives-common.

@ato
Copy link
Collaborator

ato commented Jul 30, 2021

We probably need to make all the extractors use the outlink helper methods in the Extractor base classes consistently as there's a number of them that call curi.getOutlinks().add(link) directly. Then we can change the helper methods to ignore data URIs.

Might be nice to also change the outlink helpers to take a CharSequence instead of a String that way when possible they can filter out large URIs before they get copied to the heap as Strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants