Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect more ZIMs for active content warning #889

Merged
merged 4 commits into from
Sep 25, 2022

Conversation

Jaifroid
Copy link
Member

New nautilus-based ZIMs don't trigger the active content warning in jQuery mode. This adds detection. There is also an incorrect instruction in the warning to "type a space". This doesn't do anything in Kiwix JS (in Kiwix JS Windows, typing a space opens a special Archive Index UI, and I think this instruction is a left-over from that).

We could make it so that typing a space displays the title index starting from the first entry within the current UI, but that would be a different PR.

@Jaifroid Jaifroid added this to the v3.6 milestone Sep 23, 2022
@Jaifroid Jaifroid self-assigned this Sep 23, 2022
@Jaifroid Jaifroid requested a review from mossroy September 23, 2022 07:03
Copy link
Contributor

@mossroy mossroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@Jaifroid
Copy link
Member Author

I'm wondering if we need to check some of the other ZIM types: we were to some extent relying on the fact that some of these ZIMs had inline script to detect active content, and now some of those have been addressed (not all yet). I should perhaps check the ZIM types that have been fixed, in case we need to add to this PR.

@mossroy
Copy link
Contributor

mossroy commented Sep 24, 2022

You're right.
Hopefully there would be an easy way to detect that

@Jaifroid Jaifroid changed the title Detect nautilus ZIMs for active content warning Detect more ZIMs for active content warning Sep 24, 2022
@Jaifroid
Copy link
Member Author

Jaifroid commented Sep 24, 2022

OK, here is a list of ZIMs that should be tested (mirroring the list in #865):

  • Gutenberg (still uses inline JS, but would be good to future-proof) - done in this PR;
  • Nautilus - fixed in this PR;
  • PhET (still uses inline JS - and last ZIM published in 2021);
  • YouTube - no need to fix as of August 2022 (tested with mali-pour-les-nuls_fr_all_2022-08.zim);
  • OpenEdx (still uses inline JS) - exempted from warning, because runs better in jQuery mode than in SW mode;
  • Sotoki (doesn't show content warning, and it is not needed, as Stackexchange is mostly supported in jQuery mode);
  • Wikihow - no longer shows active content warning, but seems to run well enough in jQuery mode, so perhaps doesn't need one - tested with wikihow_fa_maxi_2022-09.zim;
  • Ted Talks - no need to fix, uses a script called app.js which we already detect (tested with ted_en_playlist-what-are-we-really-teaching-ai_2022-08.zim);
  • Kolibri (still uses inline JS) - uses app.js.

@Jaifroid
Copy link
Member Author

Jaifroid commented Sep 24, 2022

To support detecting Gutenberg in the future, once inline JS is removed, we could detect this script: languages.js, or more generically tools.jsor l109.js. I guess the more generic the better, so that one detection might work for more than one ZIM type? In that case l109.js might be best? Though I have seen l18n used too (not in this ZIM).

@Jaifroid
Copy link
Member Author

There are just two left to test: OpenEdx and Kolibri. Unfortunately, I have no idea which ZIMs actually represent these ZIM types! Any ideas @mossroy, @kelson42? It's not crucial to test them, as these are scrapers that haven't been updated yet, so they will still show the active content warning. However, we could future proof them, if they are sitll active projects, by including them in this PR.

@mossroy
Copy link
Contributor

mossroy commented Sep 25, 2022

OpenEdx generates some moocs, see #789 (comment) and openzim/openedx#168
Kolibri generates some Khan Academy videos, see openzim/kolibri#39

@mossroy
Copy link
Contributor

mossroy commented Sep 25, 2022

It's true that we should not rely on inline javascript detection any more (as we're making scrapers get rid of it).
This can not be checked in a very reliable way, but what you do looks like a good compromise to me : detect some javascript libraries that would "imply" some dynamic content (i.e the content would not be readable without executing them)

@Jaifroid
Copy link
Member Author

OK thanks -- I should have followed up the PRs.

Khan Academy (specifically the one you linked) contains app.js, so we're covered already.

The OpenEdx one is a curious case: it actually runs better in jQuery mode than in Service Worker mode. All assets are browsable in jQuery mode, but in SW mode, the assets "break out" of the iframe, and so destroy the Kiwix JS reader. As it looks like this scraper won't get fixed (developer(s) have abandoned it??), I'll exempt this ZIM (via a unique JS it contains) so that we don't put up a misleading message to jQuery users. We might want to investigate if we can prevent the breakout from happening in SW mode, but only if it can be done in a generic way, which I doubt.

@kelson42
Copy link
Collaborator

Both scrapers are not abandoned, but openedx is pretty recent an kolibri still a work in progress.

@Jaifroid
Copy link
Member Author

Jaifroid commented Sep 25, 2022

Ah, sorry, it was rgaudin saying "chances of a fix are low" for one of them that made me think this! Very glad to hear they are in progress.

@Jaifroid
Copy link
Member Author

OK, I've finished testing all ZIMs. There is a bit of new code you should look at, @mossroy , because in order to exempt ZIMs containing the mooc.js script, I need to do a separate search of the HTML. This is because you can't prove a negative in a positive regular expression search. The active content regex stops searching as soon as it has matched, so it might never see that the html further down loads mooc.js. Hence an extra search is needed for that. The search is only run if active content was found, so it won't impact on load times (and regexes are very fast anyway).

Copy link
Contributor

@mossroy mossroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not test, but it looks good to me.

@Jaifroid Jaifroid merged commit 77da9be into master Sep 25, 2022
@Jaifroid Jaifroid deleted the Update-active-content-warning-detection branch September 25, 2022 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants