
Youtube videos are not working anymore #323

Closed
benoit74 opened this issue Jun 20, 2024 · 19 comments · Fixed by #326

@benoit74
Collaborator

Again, Youtube videos are not working anymore.

Problem is different than previous times.

Test has been done with current codebase (2.0.2) and a single page at https://100r.co/site/orca.html

  • with an old WARC (from May 28), Youtube videos are working
  • with a recent WARC (from today, June 20), Youtube videos are not working

The issue is not limited to the 100r.co website: the same problem has been observed on the test website, for instance.

Youtube video is correctly playing a bit better on replayweb.page even with recent WARC, but only for the first 3 or 4 secs of the video. And fuzzy rule seems to not be properly applied at all here, it loads URLs like https://replayweb.page/w/id-91fd65096030/20240620091901mp_/https://rr3---sn-ixh7yn7e.googlevideo.com/videoplayback?expire=1718896744&ei=CPRzZpLcFfWPv_IPjvSygAo&ip=135.181.181.97&id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0&itag=243&aitags=133,134,135,160,242,243,244,278&source=youtube&requiressl=yes&xpc=EgVo2aDSNQ==&mh=kw&mm=31,26&mn=sn-ixh7yn7e,sn-ugo7dn76&ms=au,onr&mv=m&mvi=3&pl=17&initcwndbps=458750&bui=AbKP-1Mo9GjiRwRL0ikYYxkrsoWyL4xu850dxEbOX2h0DN0GFKnazM15XVrw7NHp2_4QcfFEvMbTM4Kp&spc=UWF9f5U2kJlHxL16IKZBjf75XUuEd1D1UAZ7taixAMll90pghSMp&vprv=1&svpuc=1&mime=video/webm&ns=0J7ZUEUnZavF_9CYyLqdyIoQ&rqh=1&gir=yes&clen=1005408&dur=75.242&lmt=1636152001175039&mt=1718874771&fvip=4&keepalive=yes&c=WEB_EMBEDDED_PLAYER&sefc=1&txp=5311222&n=hMSqp81HO-ISEA&sparams=expire,ei,ip,id,aitags,source,requiressl,xpc,bui,spc,vprv,svpuc,mime,ns,rqh,gir,clen,dur,lmt&sig=AJfQdSswRQIgRtqyjdlrY2MlyHCQKOG0GqLCG8_iwSmp9RmCzLI8dZ4CIQCXgGClxMhvR1jhejK7i9fG3E6dsjNSmb3WX88lva7fNQ==&lsparams=mh,mm,mn,ms,mv,mvi,pl,initcwndbps&lsig=AHlkHjAwRQIgPb5Nsm9l8o1yBGPu3Y9OIr-UIZvJTY09UNOyLDM8-70CIQCty0yVKGQsCtS7GBE5FeDwqkiWEh3nSbVkrGY3y4bftw==&alr=yes&cpn=M9Gnq1A5TBTMVkJn&cver=1.20240616.00.00&range=197056-262591&rn=15&rbuf=3032&pot=KoEQBFBNRDpVbmRlZmluZWQ6Ok9xQGh0dHBzOi8vcmVwbGF5d2ViLnBhZ2Uvdy9pZC05MWZkNjUwOTYwMzAvMjAyNDA2MjAwOTE5MDFtcF8vaHR0cHM6Ly93d3cueW91dHViZS5jb20vcy9wbGF5ZXIvODQzMTRiZWYvcGxheWVyX2lhcy52ZmxzZXQvZW5fVVMvYmFzZS5qczo5OTc6MTMxCldnYUBodHRwczovL3JlcGxheXdlYi5wYWdlL3cvaWQtOTFmZDY1MDk2MDMwLzIwMjQwNjIwMDkxOTAxbXBfL2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3MvcGxheWVyLzg0MzE0YmVmL3BsYXllcl9pYXMudmZsc2V0L2VuX1VTL2Jhc2UuanM6MTAwOTozNQpZZ2EvPEBodHRwczovL3JlcGxheXdlYi5wYWdlL3cvaWQtOTFmZDY1MDk2MDMwLzIwMjQwNjIwMDkxOTAxbXBfL2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3MvcGxheWVyLzg0MzE0YmVmL3BsYXllcl9p
YXMudmZsc2V0L2VuX1VTL2Jhc2UuanM6MTAxMjozODkKSWFAaHR0cHM6Ly9yZXBsYXl3ZWIucGFnZS93L2lkLTkxZmQ2NTA5NjAzMC8yMDI0MDYyMDA5MTkwMW1wXy9odHRwczovL3d3dy55b3V0dWJlLmNvbS9zL3BsYXllci84NDMxNGJlZi9wbGF5ZXJfaWFzLnZmbHNldC9lbl9VUy9iYXNlLmpzOjE2OTo0MAppYWEvdGhpcy5uZXh0QGh0dHBzOi8vcmVwbGF5d2ViLnBhZ2Uvdy9pZC05MWZkNjUwOTYwMzAvMjAyNDA2MjAwOTE5MDFtcF8vaHR0cHM6Ly93d3cueW91dHViZS5jb20vcy9wbGF5ZXIvODQzMTRiZWYvcGxheWVyX2lhcy52ZmxzZXQvZW5fVVMvYmFzZS5qczoxNzA6OTIKYkBodHRwczovL3JlcGxheXdlYi5wYWdlL3cvaWQtOTFmZDY1MDk2MDMwLzIwMjQwNjIwMDkxOTAxbXBfL2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3MvcGxheWVyLzg0MzE0YmVmL3BsYXllcl9pYXMudmZsc2V0L2VuX1VTL2Jhc2UuanM6MTc0OjQwCnByb21pc2UgY2FsbGJhY2sqZkBodHRwczovL3JlcGxheXdlYi5wYWdlL3cvaWQtOTFmZDY1MDk2MDMwLzIwMjQwNjIwMDkxOTAxbXBfL2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3MvcGxheWVyLzg0MzE0YmVmL3BsYXllcl9pYXMudmZsc2V0L2VuX1VTL2Jhc2UuanM6MTc2OjkxCnByb21pc2UgY2FsbGJhY2sqZkBodHRwczovL3JlcGxheXdlYi5wYWdlL3cvaWQtOTFmZDY1MDk2MDMwLzIwMjQwNjIwMDkxOTAxbXBfL2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3MvcGxheWVyLzg0MzE0YmVmL3BsYXllcl9pYXMudmZsc2V0L2VuX1VTL2Jhc2UuanM6MTc2OjEwMQpwcm9taXNlIGNhbGxiYWNrKmZAaHR0cHM6Ly9yZXBsYXl3ZWIucGFnZS93L2lkLTkxZmQ2NTA5NjAzMC8yMDI0MDYyMDA5MTkwMW1wXy9odHRwczovL3d3dy55b3V0dWJlLmNvbS9zL3BsYXllci84NDMxNGJlZi9wbGF5ZXJfaWFzLnZmbHNldC9lbl9VUy9iYXNlLmpzOjE3NjoxMDEKcHJvbWlzZSBjYWxsYmFjaypmQGh0dHBzOi8vcmVwbGF5d2ViLnBhZ2Uvdy9pZC05MWZkNjUwOTYwMzAvMjAyNDA2MjAwOTE5MDFtcF8vaHR0cHM6Ly93d3cueW91dHViZS5jb20vcy9wbGF5ZXIvODQzMTRiZWYvcGxheWVyX2lhcy52ZmxzZXQvZW5fVVMvYmFzZS5qczoxNzY6MTAxCnByb21pc2UgY2FsbGJhY2sqZkBodHRwczovL3JlcGxheXdlYi5wYWdlL3cvaWQtOTFmZDY1MDk2MDMwLzIwMjQwNjIwMDkxOTAxbXBfL2h0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3MvcGxheWVyLzg0MzE0YmVmL3BsYXllcl9pYXMudmZsc2V0L2VuX1VTL2Jhc2UuanM6MTc2OjEwMQpwcm9taXNlIGNhbGxiYWNrKmZAaHR0cHM6Ly9yZXBsYXl3ZWIucGFnZS93L2lkLTkxZmQ2NTA5NjAzMC8yMDI0MDYyMDA5MTkwMW1wXy9odHRwczovL3d3dy55b3V0dWJlLmNvbS9zL3BsYXllci84NDMxNGJlZi9wbGF5ZXJfaWFzLnZmbHNldC9lbl9VUy9iYXNlLmpzOjE3NjoxMDEKcHJvbWlzZSBjYWxsYmFjaypmQGh0dHBzOi8vcmVwbGF5d2ViLnBhZ2Uvdy9pZC05MWZkNjUwOTYwMzAvMjAyNDA2MjAw
OTE5MDFtcF8vaHR0cHM6Ly93d3cueW91dHViZS5jb20vcy9wbGF5ZXIvODQzMTRiZWYvcGxheWVyX2lhcy52ZmxzZXQvZW5fVVMvYmFzZS5qczoxNzY6MTAxCnByb21pc2UgY2FsbGJhY2sqZkBodHRwczovL3JlcGxheXdlYi5w&ump=1&srfvp=1

URLs of this kind are supposed to be rewritten by a fuzzy rule.

@benoit74 benoit74 added the bug label Jun 20, 2024
@benoit74 benoit74 self-assigned this Jun 20, 2024
@benoit74
Collaborator Author

It looks like Youtube player has changed and is now issuing range requests.

Old URLs looked like this:

https://rr2---sn-ixh7rn76.googlevideo.com/videoplayback?expire=1716941385&ei=6R1WZr7zGNS0v_IPsaWx2A8&ip=135.181.181.97&id=o-AMQ5gJcc4hzVvltA_AxK-5jieAaNVp8x1Jp8W3bwSPt2&itag=18&source=youtube&requiressl=yes&xpc=EgVo2aDSNQ%3D%3D&mh=kw&mm=31%2C26&mn=sn-ixh7rn76%2Csn-1gi7znek&ms=au%2Conr&mv=u&mvi=2&pl=27&bui=AWRWj2TlLa5pzu-H0spsOJskhSGRCiGgN5pyVMgdO1U8wOsFvsNJlwk3ySH0ElNPr-aOB9_400a8IjvD&spc=UWF9f4NgKOXsm-k0DpPwr_fmdZmsq9_t4kuBex926Kb8QREhvA&vprv=1&svpuc=1&mime=video%2Fmp4&ns=67R9uHg6kRTcz5ozGSdZSWoQ&rqh=1&cnr=14&ratebypass=yes&dur=75.302&lmt=1636994742695961&mt=1716918653&fvip=3&c=WEB_EMBEDDED_PLAYER&sefc=1&n=5Lz9sO9f89D-tw&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cxpc%2Cbui%2Cspc%2Cvprv%2Csvpuc%2Cmime%2Cns%2Crqh%2Ccnr%2Cratebypass%2Cdur%2Clmt&sig=AJfQdSswRAIgW5oyf2pC4rs2s_nheYbxLF0o6jNi9wZSDb7FuJxYv5oCID0kcQOpuoRKq2QUtI5XRP_sRBuYGgmwaJ9OLX3al68W&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl&lsig=AHWaYeowRAIgOL8_fRasw_iPu4sexptlJSIQsuTQhAOvAI2GtCsyHwkCIAiRV_DcVrELIULGAl2BOeFZt9URoXOq3dKFksQBIZcv&cpn=UXKtM1Qrg0jfBqvv&cver=1.20240521.01.00&ptk=youtube_none&pltype=contentugc

New URLs look like this:

https://rr3---sn-25ge7nzz.googlevideo.com/videoplayback?expire=1718898847&ei=P_xzZqadKJetvdIP6qi2uA8&ip=80.13.74.183&id=o-AGyWYKHSQoZfJNnvIfpTU45pJ79H1moYLOON7xwTNmPa&itag=251&source=youtube&requiressl=yes&xpc=EgVo2aDSNQ%3D%3D&mh=kw&mm=31%2C26&mn=sn-25ge7nzz%2Csn-4g5edndd&ms=au%2Conr&mv=m&mvi=3&pl=21&initcwndbps=1740000&bui=AbKP-1Pv3j5FUCWLXcxtOobPNahk2O1F8gh6aVcFM3iRCxk3fPLkcN4zFdAKZ7WbxXrV1d2zA7bNQqYI&spc=UWF9fygdXdYF7fToehBP0rMMxtn5Uzx6lgmyKJNRVUNqwLKJ6Q&vprv=1&svpuc=1&mime=audio%2Fwebm&ns=y6uUbIjsYhYVKeSXoOsIv_oQ&rqh=1&gir=yes&clen=1156734&dur=75.281&lmt=1636151730851517&mt=1718876935&fvip=2&keepalive=yes&c=WEB_EMBEDDED_PLAYER&sefc=1&txp=5311222&n=ibiK4YPELiUEVA&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cxpc%2Cbui%2Cspc%2Cvprv%2Csvpuc%2Cmime%2Cns%2Crqh%2Cgir%2Cclen%2Cdur%2Clmt&sig=AJfQdSswRQIhAKd6_lqslaogMTFOVKLzWZR-7KQqk8HOL6TpftGSrf0FAiBYHljIG1Px8tzJ-Nmv8Q6A6AY0H66fmlVHhNsvgWDaGg%3D%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=AHlkHjAwRAIgcFloFYIj4lEUfeUfAaqoZnaKqYlL1qfyIC1hOiQTVgUCIEpUwofap4VBVOlRwz6mwRXPtdyHrZgtzqH4zU3u5U54&alr=yes&cpn=yVb_ifi1u09V23DH&cver=1.20240616.00.00&range=141775-305530&rn=6&rbuf=9232&pot=MowBNubCFPefIBM23uEgMhXc8-OaRQGErA-6pHMr2UocmwZrimSLoxvqVFgE0Au7QHXEuZQ9t7vYhZJ1GAls13KKM8OfJNwP6stlfnOg2BAB8AzE1CqsNqY38ZfB_Ksq0N7ZZ0ZqdUS4Wp7kdLEgaeLfGSJwzIqEKtCMRdLAjquT1IhA1V6qfBYHuh8pLU4=&ump=1&srfvp=1
https://rr3---sn-25ge7nzz.googlevideo.com/videoplayback?expire=1718898847&ei=P_xzZqadKJetvdIP6qi2uA8&ip=80.13.74.183&id=o-AGyWYKHSQoZfJNnvIfpTU45pJ79H1moYLOON7xwTNmPa&itag=244&aitags=133%2C134%2C135%2C160%2C242%2C243%2C244%2C278&source=youtube&requiressl=yes&xpc=EgVo2aDSNQ%3D%3D&mh=kw&mm=31%2C26&mn=sn-25ge7nzz%2Csn-4g5edndd&ms=au%2Conr&mv=m&mvi=3&pl=21&initcwndbps=1740000&bui=AbKP-1Pv3j5FUCWLXcxtOobPNahk2O1F8gh6aVcFM3iRCxk3fPLkcN4zFdAKZ7WbxXrV1d2zA7bNQqYI&spc=UWF9fygdXdYF7fToehBP0rMMxtn5Uzx6lgmyKJNRVUNqwLKJ6Q&vprv=1&svpuc=1&mime=video%2Fwebm&ns=y6uUbIjsYhYVKeSXoOsIv_oQ&rqh=1&gir=yes&clen=1520075&dur=75.242&lmt=1636152001190484&mt=1718876935&fvip=2&keepalive=yes&c=WEB_EMBEDDED_PLAYER&sefc=1&txp=5311222&n=ibiK4YPELiUEVA&sparams=expire%2Cei%2Cip%2Cid%2Caitags%2Csource%2Crequiressl%2Cxpc%2Cbui%2Cspc%2Cvprv%2Csvpuc%2Cmime%2Cns%2Crqh%2Cgir%2Cclen%2Cdur%2Clmt&sig=AJfQdSswRgIhAMwB6Pq8xwQxi-QjY4Si_kvmx--hpaZQyPB6CuGn-NIpAiEAqc9ArL5rX4aVVU1ygGlq73Uy9-oSosCd0lwPg3oq7QY%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=AHlkHjAwRAIgcFloFYIj4lEUfeUfAaqoZnaKqYlL1qfyIC1hOiQTVgUCIEpUwofap4VBVOlRwz6mwRXPtdyHrZgtzqH4zU3u5U54&alr=yes&cpn=yVb_ifi1u09V23DH&cver=1.20240616.00.00&range=262765-510353&rn=7&rbuf=10541&pot=MowBNubCFPefIBM23uEgMhXc8-OaRQGErA-6pHMr2UocmwZrimSLoxvqVFgE0Au7QHXEuZQ9t7vYhZJ1GAls13KKM8OfJNwP6stlfnOg2BAB8AzE1CqsNqY38ZfB_Ksq0N7ZZ0ZqdUS4Wp7kdLEgaeLfGSJwzIqEKtCMRdLAjquT1IhA1V6qfBYHuh8pLU4=&ump=1&srfvp=1
https://rr3---sn-25ge7nzz.googlevideo.com/videoplayback?expire=1718898847&ei=P_xzZqadKJetvdIP6qi2uA8&ip=80.13.74.183&id=o-AGyWYKHSQoZfJNnvIfpTU45pJ79H1moYLOON7xwTNmPa&itag=251&source=youtube&requiressl=yes&xpc=EgVo2aDSNQ%3D%3D&mh=kw&mm=31%2C26&mn=sn-25ge7nzz%2Csn-4g5edndd&ms=au%2Conr&mv=m&mvi=3&pl=21&initcwndbps=1740000&bui=AbKP-1Pv3j5FUCWLXcxtOobPNahk2O1F8gh6aVcFM3iRCxk3fPLkcN4zFdAKZ7WbxXrV1d2zA7bNQqYI&spc=UWF9fygdXdYF7fToehBP0rMMxtn5Uzx6lgmyKJNRVUNqwLKJ6Q&vprv=1&svpuc=1&mime=audio%2Fwebm&ns=y6uUbIjsYhYVKeSXoOsIv_oQ&rqh=1&gir=yes&clen=1156734&dur=75.281&lmt=1636151730851517&mt=1718876935&fvip=2&keepalive=yes&c=WEB_EMBEDDED_PLAYER&sefc=1&txp=5311222&n=ibiK4YPELiUEVA&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cxpc%2Cbui%2Cspc%2Cvprv%2Csvpuc%2Cmime%2Cns%2Crqh%2Cgir%2Cclen%2Cdur%2Clmt&sig=AJfQdSswRQIhAKd6_lqslaogMTFOVKLzWZR-7KQqk8HOL6TpftGSrf0FAiBYHljIG1Px8tzJ-Nmv8Q6A6AY0H66fmlVHhNsvgWDaGg%3D%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=AHlkHjAwRAIgcFloFYIj4lEUfeUfAaqoZnaKqYlL1qfyIC1hOiQTVgUCIEpUwofap4VBVOlRwz6mwRXPtdyHrZgtzqH4zU3u5U54&alr=yes&cpn=yVb_ifi1u09V23DH&cver=1.20240616.00.00&range=305531-500945&rn=8&rbuf=13760&pot=MowBNubCFPefIBM23uEgMhXc8-OaRQGErA-6pHMr2UocmwZrimSLoxvqVFgE0Au7QHXEuZQ9t7vYhZJ1GAls13KKM8OfJNwP6stlfnOg2BAB8AzE1CqsNqY38ZfB_Ksq0N7ZZ0ZqdUS4Wp7kdLEgaeLfGSJwzIqEKtCMRdLAjquT1IhA1V6qfBYHuh8pLU4=&ump=1&srfvp=1
https://rr3---sn-25ge7nzz.googlevideo.com/videoplayback?expire=1718898847&ei=P_xzZqadKJetvdIP6qi2uA8&ip=80.13.74.183&id=o-AGyWYKHSQoZfJNnvIfpTU45pJ79H1moYLOON7xwTNmPa&itag=244&aitags=133%2C134%2C135%2C160%2C242%2C243%2C244%2C278&source=youtube&requiressl=yes&xpc=EgVo2aDSNQ%3D%3D&mh=kw&mm=31%2C26&mn=sn-25ge7nzz%2Csn-4g5edndd&ms=au%2Conr&mv=m&mvi=3&pl=21&initcwndbps=1740000&bui=AbKP-1Pv3j5FUCWLXcxtOobPNahk2O1F8gh6aVcFM3iRCxk3fPLkcN4zFdAKZ7WbxXrV1d2zA7bNQqYI&spc=UWF9fygdXdYF7fToehBP0rMMxtn5Uzx6lgmyKJNRVUNqwLKJ6Q&vprv=1&svpuc=1&mime=video%2Fwebm&ns=y6uUbIjsYhYVKeSXoOsIv_oQ&rqh=1&gir=yes&clen=1520075&dur=75.242&lmt=1636152001190484&mt=1718876935&fvip=2&keepalive=yes&c=WEB_EMBEDDED_PLAYER&sefc=1&txp=5311222&n=ibiK4YPELiUEVA&sparams=expire%2Cei%2Cip%2Cid%2Caitags%2Csource%2Crequiressl%2Cxpc%2Cbui%2Cspc%2Cvprv%2Csvpuc%2Cmime%2Cns%2Crqh%2Cgir%2Cclen%2Cdur%2Clmt&sig=AJfQdSswRgIhAMwB6Pq8xwQxi-QjY4Si_kvmx--hpaZQyPB6CuGn-NIpAiEAqc9ArL5rX4aVVU1ygGlq73Uy9-oSosCd0lwPg3oq7QY%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=AHlkHjAwRAIgcFloFYIj4lEUfeUfAaqoZnaKqYlL1qfyIC1hOiQTVgUCIEpUwofap4VBVOlRwz6mwRXPtdyHrZgtzqH4zU3u5U54&alr=yes&cpn=yVb_ifi1u09V23DH&cver=1.20240616.00.00&range=510354-774346&rn=9&rbuf=15163&pot=MowBNubCFPefIBM23uEgMhXc8-OaRQGErA-6pHMr2UocmwZrimSLoxvqVFgE0Au7QHXEuZQ9t7vYhZJ1GAls13KKM8OfJNwP6stlfnOg2BAB8AzE1CqsNqY38ZfB_Ksq0N7ZZ0ZqdUS4Wp7kdLEgaeLfGSJwzIqEKtCMRdLAjquT1IhA1V6qfBYHuh8pLU4=&ump=1&srfvp=1

Among other changes, there is a new range query parameter, which quite obviously needs to be handled by the fuzzy rule as well.

Transferring this issue to warc2zim repo.

@benoit74 benoit74 transferred this issue from openzim/zimit Jun 20, 2024
@benoit74
Collaborator Author

Unfortunately, the new range query parameter seems to be highly dynamic, computed client-side depending on ... "something".

It also looks like we should take into account two different kinds of datasets: one with an aitags query parameter, and another one without.

What we have in the ZIM:

id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=0-65934
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=0-65983
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=65984-131519
id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=65935-141906
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=131520-262764
id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=141907-305751
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=262765-506566
id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=305752-500945
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=506567-774346
id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=500946-846308
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=774347-1159628
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=1159629-1520074
id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=846309-1156733
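
For reference, the compact listing above boils each videoplayback URL down to three things: whether aitags is present, the id, and the range. Below is a minimal Python sketch of that reduction. The function name `summarize_videoplayback` is hypothetical, the ids in the usage note are shortened placeholders, and this is not the actual warc2zim fuzzy rule, only an illustration of the same idea.

```python
from urllib.parse import urlsplit, parse_qs


def summarize_videoplayback(url: str) -> str:
    """Reduce a videoplayback URL to "[aitags ]id=... range=..." (hypothetical helper)."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    parts = []
    # Only the presence of aitags matters here, not its value.
    if "aitags" in query:
        parts.append("aitags")
    parts.append("id=" + query["id"][0])
    # Old-style URLs have no range parameter at all.
    if "range" in query:
        parts.append("range=" + query["range"][0])
    return " ".join(parts)
```

Called on a URL with `id=abc`, `aitags=133%2C134` and `range=262765-510353`, this returns `aitags id=abc range=262765-510353`; all other query parameters (expire, sig, pot, ...) are ignored.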

First requests issued are ok:

aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=0-65983
id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=0-65934 
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=65984-131519
id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=65935-141906

But then, for some reason, the player seems to try to fetch fewer bytes than on the real website, and then, realizing these calls do not work, it tries to fetch even fewer bytes (which does not work either):

aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=131520-217507
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=131520-132638
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=0-445
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=131520-132638
aitags id=o-AL0BCMRMuaikI70Sw1cEvsUP0llpXaRbt2RUG1K9pUl0 range=0-445

And it stays stuck trying to fetch these bytes, cycling forever.
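
To make the failure concrete, here is a small Python sketch using the aitags ranges copied from the listings above. It shows that the ranges the player asks for during replay have no exact match among the stored entries, even though the requested bytes lie inside a stored range. The helper names are hypothetical. Whether serving a slice of a stored response would actually satisfy the player is not established here; this only illustrates why an exact lookup on the range fails.

```python
from typing import List, Optional, Tuple

ByteRange = Tuple[int, int]

# Ranges stored in the ZIM for the "aitags" entries, copied from the listing above.
STORED_VIDEO_RANGES: List[ByteRange] = [
    (0, 65983),
    (65984, 131519),
    (131520, 262764),
    (262765, 506566),
    (506567, 774346),
    (774347, 1159628),
    (1159629, 1520074),
]


def exact_match(requested: ByteRange, stored: List[ByteRange]) -> bool:
    """True only if the very same (start, end) pair was recorded."""
    return requested in stored


def covering_range(requested: ByteRange, stored: List[ByteRange]) -> Optional[ByteRange]:
    """Return the stored range that fully contains the requested bytes, if any."""
    start, end = requested
    for stored_start, stored_end in stored:
        if stored_start <= start and end <= stored_end:
            return (stored_start, stored_end)
    return None


# Ranges requested by the player during replay, from the failing sequence above.
for requested in [(131520, 217507), (131520, 132638), (0, 445)]:
    print(requested, exact_match(requested, STORED_VIDEO_RANGES),
          covering_range(requested, STORED_VIDEO_RANGES))
```

Each of the three failing requests has no exact match but falls entirely within one stored range, which is consistent with the data being present in the ZIM while the lookup still misses.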

It might be noteworthy that just before these "changed" ranges, two requests fail with a 404, while in the real situation they return a 204. This is expected because warc2zim does not support 204. I don't think the two are related, because these calls do not happen at exactly the same moment on the real website.

www.youtube.com/api/stats/playback%3Fns%3Dyt%26el%3Dembedded%26cpn%3DyqsCyjsNQbmf6n6y%26ver%3D2%26cmt%3D0.042%26fmt%3D243%26fs%3D0%26rt%3D0.698%26euri%3Dhttps%3A//100r.co/%26lact%3D704%26cl%3D643853961%26mos%3D0%26volume%3D100%26cbrand%3Dgoogle%26cbr%3DChrome%20Mobile%26cbrver%3D75.0.3765.0%26c%3DWEB_EMBEDDED_PLAYER%26cver%3D1.20240616.00.00%26cplayer%3DUNIPLAYER%26cmodel%3Dpixel%202%26cos%3DAndroid%26cosver%3D8.0%26cplatform%3DMOBILE%26epm%3D1%26hl%3Den_US%26cr%3DDE%26len%3D76%26fexp%3Dv1%2C24004644%2C434717%2C121055%2C6271%2C26443548%2C7111%2C36343%2C9954%2C1192%2C26496%2C6966%2C2%2C6689%2C2007%2C9072%2C29152%2C2196%2C9996%2C1103%2C4080%2C2872%2C101%2C7717%2C1190%2C2037%2C502%2C1969%2C5084%2C2462%2C713%2C1816%2C769%2C6933%2C1201%2C236%2C997%2C1376%2C3540%2C220%2C254%2C1548%2C113%2C1242%2C381%2C141%2C5724%2C210%2C83%2C19%26rtn%3D11%26afmt%3D251%26size%3D560%3A315%26muted%3D0%26docid%3DhQXa6TkSeH0%26ei%3DCPRzZpLcFfWPv_IPjvSygAo%26plid%3DAAYbTs5lBSGM3gZQ%26referrer%3Dhttps%3A//youtube.fuzzy.replayweb.page/embed/hQXa6TkSeH0%26of%3D-_xhI4eL4MjOL53E0nwGhA%26vm%3DCAIQABgEOjJBSHFpSlRJTHZoRUEtdjY1V09NOXZBTzNUbGREZTFrZmlnM0J4b0tJNFBNSTY1OEphQWJfQVBta0tETHFmLWlnRHNrVTR4bDBOUlZibk1vemZIREZaeHhGSUZ0dnRLWHhLdjIteEdQM2V1MUtTMlV5cVhlSFRvZk5fYVVqbFEtMmxhc2pWZDJZdGxuU3phTEdZdlk
www.youtube.com/ptracking%3Fhtml5%3D1%26video_id%3DhQXa6TkSeH0%26cpn%3DyqsCyjsNQbmf6n6y%26ei%3DCPRzZpLcFfWPv_IPjvSygAo%26ptk%3Dyoutube_none%26pltype%3Dcontentugc

Other calls also occur after the failures, but they are probably less interesting to analyze.

@benoit74
Collaborator Author

Returning an empty 200 response on calls to www.youtube.com/api/stats/* and www.youtube.com/ptracking* does not help at all (originally it is supposed to be a 204, maybe that's the problem).

@rgaudin
Member

rgaudin commented Jun 20, 2024

Can we "force" YT to behave the previous way with a user agent maybe ?

@rgaudin
Member

rgaudin commented Jun 20, 2024

would be a workaround not a fix obviously

@benoit74
Collaborator Author

I spent some time trying to figure out if there is a new switch in the player configuration which might be used to disable the new behavior and go back to the previous one, but I have failed so far.

Can we "force" YT to behave the previous way with a user agent maybe ?

This seems to be a promising track: on iOS 13 we still have the "old" mode while on iOS 17 we have the "new" mode. Trying to figure out if I can find the "magic switch".

@benoit74
Collaborator Author

benoit74 commented Jun 20, 2024

Player code is composed mostly of 1 HTML file and 3 JS files (edit: and 1 JSON file).

The 3 JS files are strictly identical between iOS v13 and v17. The other two files differ:

https://www.youtube.com/youtubei/v1/player?prettyPrint=false : (edit: JSON with) lots of differences, but nothing looking like a switch; it is mostly URLs which differ due to tracking

https://www.youtube.com/embed/hQXa6TkSeH0 : this is the HTML, and it seemed to be the most promising file, with lots of differences in the ytcfg.set(...) call, where a big JSON config is passed. Replacing the file inside the WARC with the v13 content (and doing the same rewriting) did not work (we probably have signatures or something similar in the JSON config). I then modified the following keys one by one with specific DS rules: dropped ab_det_apb_b, dropped ab_fk_sk_cl, dropped enable_temp_fix_for_url_redirection, replaced osName, replaced os, replaced osVersion, replaced userAgent, replaced serializedExperimentIds, replaced serializedExperimentFlags, replaced embedded_player_response.

All these modifications were applied successfully ... but the player behavior is still the same: it adds the range query parameter in the calls to googlevideo.com/videoplayback.

So the most likely explanation is that the decision on how to call the googlevideo.com/videoplayback endpoint is made client-side. So we are a bit doomed: even tweaking the user agent while browsing will probably not help. I will give it a try anyway, in case it works "by chance".

@benoit74
Collaborator Author

I am struggling to reproduce the old behavior with the crawler.

For the record:

  • I manage to get the old behavior consistently in Browserstack with an iPad 7th on Chrome (which happens to be Chrome v92.0)
  • when using the crawler, every iPad profile leads to a broken WARC missing images and/or the video
  • when using the crawler, an old Android mobile device like the Nexus 7 with the latest Brave still uses the new behavior
  • I also tried using older versions of the crawler to get old Chromium versions: v6.0.0 (Chromium 91) and v0.9.1 (Chromium 101) with a Pixel 2 or Nexus 7 still exhibit the new behavior; v6.0.0 with iPad 7th still produces a broken ZIM

With all this, I'm now mostly stuck: I cannot reproduce the old behavior inside the crawler, and I fail to find which switch / modification is needed in the Youtube player configuration / codebase to restore the old behavior. Of course, I may have missed something (that would be good news indeed).

@Jaifroid

@benoit74 If the whole mp4 has actually been scraped and is in the ZIM, then I think there is hope.

A bit of brainstorming, in order of viability:

  1. Check whether the issue is in the WARC. Can it be replayed with Replay software? If not working there, then it's an upstream issue.
  2. RegEx needs modifying? It could simply be a slight change in format that is causing the fuzzy matching regex to fail (if it is failing). This could either be upstream, or it could be in our own version of these regexes.
  3. Need to generalize Range Requests? AFAIK, the fuzzy matching should find the full archive no matter what specific range was requested. However, in Zimit1, this was the job of the Service Worker and was done client-side. In Zimit2, this transformation is done in advance for video requests received during warc2zim conversion. How do we match up these requests? When it was working, were we relying on the browser making an initial full request, so that subsequent range requests wouldn't be necessary?
  4. Is it possible to refuse Range Requests? If so, it might be enough to add Accept-Ranges: none to the HTTP Response Header. The difficulty is that this would possibly need to be done at scrape time, i.e. in the Replay recorder software, because we've eliminated Headers from Zimit2. Client-side, it might be possible for Kiwix Serve to refuse range requests when the user is replaying the archive, but that's not a solution that is easily replicable in other apps that use a Webview.
  5. In extremis, something like the hack tried here could work, though that would be a last resort. It involves replacing the player with a simple HTML video tag when converting warc to zim. It's the nuclear option. There may be more subtle versions of that, using the YouTube player itself rather than replacing it.

As Range Requests are also involved in solving kiwix/kiwix-js#1256, I'll do some investigating as well. I think the test ZIM has disappeared, though. Do we have one to work on that is displaying the 4-second bug?

@benoit74
Collaborator Author

Thank you!

If the whole mp4 has actually been scraped and is in the ZIM, then I think there is hope.

There is always hope in IT ^^ Sometimes it demands big efforts. The whole video is indeed present inside the WARC. It is not yet correctly transferred to the ZIM (but this is mostly a detail).

Check whether the issue is in the WARC. Can it be replayed with Replay software? If not working there, then it's an upstream issue.

The WARC is valid (it represents with high fidelity what happens in the browser) BUT it is not replayable with replayweb.page because wabac suffers from the same problem as warc2zim.

RegEx needs modifying? It could simply be a slight change in format that is causing the fuzzy matching regex to fail (if it is failing). This could either be upstream, or it could be in our own version of these regexes.

We do not use upstream RegEx / fuzzy rules. I've modified the fuzzy rules (by adding the range + aitags distinction) so that everything is inside the ZIM. Unfortunately this is not sufficient because the player dynamically adjusts the requested range.

Need to generalize Range Requests?

There are no more real Range Requests like a few days ago (with Range header, 206 return code, ...). There are just "normal" GET requests with a 200 return code. The range is specified as a query parameter, so the browser and all other systems have no idea it is in fact a range request.
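
As an illustration of the difference, here is a minimal Python sketch contrasting the two places where a byte range can live: the classic Range request header versus the new range query parameter. Both helper names are hypothetical, and only fully specified single ranges (start-end) are handled.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlsplit, parse_qs


def _parse_span(value: str) -> Optional[Tuple[int, int]]:
    """Parse "start-end" into integers; return None for anything else."""
    start, sep, end = value.partition("-")
    if not sep or not start.isdigit() or not end.isdigit():
        return None
    return int(start), int(end)


def range_from_header(headers: Dict[str, str]) -> Optional[Tuple[int, int]]:
    """Old mode: the range is in a "Range: bytes=start-end" request header."""
    value = headers.get("Range", "")
    if not value.startswith("bytes="):
        return None
    return _parse_span(value[len("bytes="):])


def range_from_query(url: str) -> Optional[Tuple[int, int]]:
    """New mode: the range is a plain "range=start-end" query parameter."""
    values = parse_qs(urlsplit(url).query).get("range")
    if not values:
        return None
    return _parse_span(values[0])
```

For a new-style request, `range_from_header` finds nothing while `range_from_query` returns the byte span, which is why generic HTTP range handling (206 responses, Accept-Ranges) never comes into play.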

Is it possible to refuse Range Requests?

Again, no more range requests. I tried to find a magic switch in the player configuration to restore the Range requests which were working a few days ago, no luck so far.

In extremis, something like the hack tried in openzim/warc2zim#173 could work

We have always refused to go this way and I don't think this is going to change, but it is good to mention as a nuclear solution.

As Range Requests are also involved in solving kiwix/kiwix-js#1256, I'll do some investigating as well. I think the test ZIM has disappeared, though. Do we have one to work on that is displaying the 4-second bug?

I will update the issue with links in a few minutes.

@Jaifroid

WARC is valid (it represents with high fidelity what happens in the browser) BUT not replayable with replayweb.page because wabac suffers from the same problem as warc2zim

So this issue is highly relevant for @ikreymer too, as a fix for YouTube video replay in Replayweb.page might well be applicable or adaptable here?

@benoit74
Collaborator Author

I've uploaded ZIM and WARC to https://tmp.kiwix.org/ci/test-warc/100r.co/

Both ZIMs have just been built with the same warc2zim 2.0.2.

  • 20240528 version is a WARC with the "old" Youtube player
  • 20240620 version is a WARC with the "new" Youtube player

So this issue is highly relevant for @ikreymer too, as a fix for YouTube video replay in Replayweb.page might well be applicable or adaptable here?

Yep, I just waited a bit to have enough information before opening an issue in their repo, to confirm I wasn't reporting false information and to avoid duplicating effort if the fix was "achievable" in a few hours by me (they put in plenty of effort for us, I don't mind helping them a bit when possible). It's now time to open the ticket.

@benoit74
Collaborator Author

Note: the test WARC and ZIM contain only the https://100r.co/site/orca.html page, to be as lightweight as possible, so do not try to navigate outside of the page ^^

@Jaifroid

I've downloaded those files, and I corroborate (via testing in the Kiwix PWA and with replayweb.page) that the "new" versions of both the ZIM and the WARC are unable to play the Orca video, while the "old" versions of both can play the Orca video.

@benoit74
Collaborator Author

Problem is fixed in Browsertrix Crawler 1.2.0, see webrecorder/wabac.js#182

Transferring this back to zimit, nothing needs to be done in warc2zim

@benoit74 benoit74 transferred this issue from openzim/warc2zim Jun 24, 2024
@benoit74 benoit74 added this to the 2.0.3 milestone Jun 24, 2024
@Jaifroid

That's good news. I'm glad a simple upgrade resolves it.

@benoit74
Collaborator Author

Yep, me too.

But it is important to note that this is only because the webrecorder team was able to quickly consider the issue and find and implement a workaround (and it is only a workaround: not sure it will work in the medium term, we still have a pretty big concern with this new player, they just forced it to fall back to the previous "mode")

@Jaifroid

Ah, thanks for the explanation. Well at least it buys some time to find a longer-term solution...

@ikreymer
Collaborator

ikreymer commented Jun 24, 2024

Yep, me too.

But it is important to note that this is only because the webrecorder team was able to quickly consider the issue and find and implement a workaround (and it is only a workaround: not sure it will work in the medium term, we still have a pretty big concern with this new player, they just forced it to fall back to the previous "mode")

Well, it was always a workaround; it is doing the same thing, just in a different way: it was explicitly disabling DASH via rewriting in JS, now it's done by disabling MediaSource.isTypeSupported. But the player did change, and it is now perhaps using some other streaming approach (not sure which).
