Commas in srcset-URLs are not handled correctly #458

Open
grob opened this issue Jan 15, 2022 · 1 comment
Labels
archive.org archive.org services not (just) Heritrix

Comments

grob commented Jan 15, 2022

Although #243 is merged, srcset URLs that contain commas are still not parsed and rewritten correctly; see https://web.archive.org/web/*/https://orf.at/ for an example.

The original URLs used in srcset attributes look like this: https://assets.orf.at/mims/2022/03/26/crops/w=875,q=90/1204287_opener_429226_coronavirus_schule_tests_vorschau_v1_a.jpg?s=bad56ac4b6df02892d3bd744c8e9494d4fd72b50.

A complete srcset example used on this site:

<source media="(max-width: 600px)" srcset="https://assets.orf.at/mims/2022/03/26/crops/w=800,h=450,q=70/1204282_master_429226_coronavirus_schule_tests_vorschau_v1_a.jpg?s=baff281a0ee94f81ed19d576f7eff4f0ed6e44c9 800w, https://assets.orf.at/mims/2022/03/26/crops/w=1280,h=720,q=60/1204282_master_429226_coronavirus_schule_tests_vorschau_v1_a.jpg?s=735e42760bcc348a2afed7dde20a17bf2857caaf 1280w">

results in (see here):

<source media="(max-width: 600px)" srcset="https://web.archive.org/web/20220114214021im_/https://assets.orf.at/mims/2022/03/26/crops/w=800, /web/20220114214021im_/https://orf.at/stories/3243632/h=450, /web/20220114214021im_/https://orf.at/stories/3243632/q=70/1204282_master_429226_coronavirus_schule_tests_vorschau_v1_a.jpg?s=baff281a0ee94f81ed19d576f7eff4f0ed6e44c9 800w, https://web.archive.org/web/20220114214021im_/https://assets.orf.at/mims/2022/03/26/crops/w=1280, /web/20220114214021im_/https://orf.at/stories/3243632/h=720, /web/20220114214021im_/https://orf.at/stories/3243632/q=60/1204282_master_429226_coronavirus_schule_tests_vorschau_v1_a.jpg?s=735e42760bcc348a2afed7dde20a17bf2857caaf 1280w">
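The broken output above has the shape you would get from splitting the srcset value on every comma and then resolving each fragment against the page URL. The sketch below reproduces that pattern. It is only a guess at the failure mode, not the actual replay code: `naive_rewrite` is a hypothetical function, the page URL and prefix are read off the rewritten markup above, and the sample srcset is a shortened stand-in for the real one.

```python
from urllib.parse import urljoin

# Assumptions: both values are read off the rewritten output in this issue.
PAGE_URL = "https://orf.at/stories/3243632/"
PREFIX = "/web/20220114214021im_/"


def naive_rewrite(srcset: str) -> str:
    """Hypothetical rewriter that treats every comma as a candidate separator."""
    rewritten = []
    for part in srcset.split(","):
        tokens = part.split()
        if not tokens:
            continue
        url, descriptors = tokens[0], tokens[1:]
        # Fragments such as "h=450" have no scheme, so they are resolved
        # as relative paths against the page URL.
        absolute = urljoin(PAGE_URL, url)
        rewritten.append(" ".join([PREFIX + absolute] + descriptors))
    return ", ".join(rewritten)


# Shortened, hypothetical stand-in for one candidate from the srcset above.
sample = "https://assets.orf.at/mims/crops/w=800,h=450,q=70/a.jpg?s=1 800w"
print(naive_rewrite(sample))
```

The single candidate comes out as three, with `h=450` and `q=70/...` turned into paths under the page URL, which matches the pattern seen in the archived markup.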
ato added the archive.org label (archive.org services not (just) Heritrix) on Jan 17, 2022
Collaborator

ato commented Jan 17, 2022

As this is about rewriting, this is likely an issue with the (closed-source) Wayback replay software, not with the Heritrix web crawler.
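For reference, a rewriter can keep commas inside URLs intact by tokenizing the way the HTML standard's srcset parsing algorithm does: a URL is a run of non-whitespace characters, and a comma only ends a candidate when it trails the URL or follows the descriptors. The following is a simplified sketch under stated assumptions, not the Wayback implementation: `parse_srcset` and `rewrite_srcset` are hypothetical names, parenthesized descriptors are not handled, the URLs are assumed to be absolute already, and the sample input is a shortened stand-in.

```python
from typing import List, Tuple

WHITESPACE = " \t\n\f\r"


def parse_srcset(srcset: str) -> List[Tuple[str, str]]:
    """Split a srcset value into (url, descriptor) pairs."""
    candidates = []
    pos, end = 0, len(srcset)
    while pos < end:
        # Skip whitespace and commas between candidates.
        while pos < end and (srcset[pos] in WHITESPACE or srcset[pos] == ","):
            pos += 1
        if pos >= end:
            break
        # The URL runs up to the next whitespace; embedded commas are kept.
        start = pos
        while pos < end and srcset[pos] not in WHITESPACE:
            pos += 1
        url = srcset[start:pos]
        descriptor = ""
        if url.endswith(","):
            # A trailing comma ends the candidate without a descriptor.
            url = url.rstrip(",")
        else:
            # Descriptors run up to the next comma (simplification: no parentheses).
            start = pos
            while pos < end and srcset[pos] != ",":
                pos += 1
            descriptor = srcset[start:pos].strip(WHITESPACE)
        candidates.append((url, descriptor))
    return candidates


def rewrite_srcset(srcset: str, prefix: str) -> str:
    """Prefix every candidate URL, assuming the URLs are already absolute."""
    parts = [f"{prefix}{url} {descriptor}".strip()
             for url, descriptor in parse_srcset(srcset)]
    return ", ".join(parts)


# Shortened, hypothetical stand-in for the srcset quoted in this issue.
sample = ("https://assets.example/crops/w=800,h=450,q=70/a.jpg?s=1 800w, "
          "https://assets.example/crops/w=1280,h=720,q=60/a.jpg?s=2 1280w")
print(rewrite_srcset(sample, "https://web.archive.org/web/20220114214021im_/"))
```

With this tokenization the sample yields exactly two candidates, each keeping its `w=...,h=...,q=...` path segment in one piece.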
