Missing images in Wikipedia articles #141

WolfgangDpunkt · 2022-10-13T08:34:55Z

Environment

Operating System: debian (aarch64)
node --version: v17.9.0
npm --version: 8.18.0
yarn --version, if using Yarn:
percollate --version: v2.2.0

Description

When I convert Wikipedia articles to epubs with this otherwise great and very useful tool, some of the images get lost. An adblocker is not used in this environment.

Here is my command line:
percollate epub --individual --output /home/Perco-Epubs/ https://en.wikipedia.org/wiki/Canada --debug

And here is the resulting epub. I had to zip it, as GitHub does not accept epub files:
-Canada.epub.zip

And here's the direct comparison: in the "British North America" section, the web version has two images and the epub version has none.

The epub does contain some images, so percollate does not drop all of them, only most.
What could be the reason? Thanks a lot!

Here comes the debug log:

~# percollate epub --individual --output /home/_Perco-Epubs/ https://en.wikipedia.org/wiki/Canada --debug
{
  command: 'epub',
  operands: [ 'https://en.wikipedia.org/wiki/Canada' ],
  opts: {
    individual: true,
    output: '/home/_Perco-Epubs/',
    debug: true
  }
}
Fetching: https://en.wikipedia.org/wiki/Canada ✓
Enhancing web page: https://en.wikipedia.org/wiki/Canada ✓
Saving EPUB...
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Canada_%28Pantone%29.svg/125px-Flag_of_Canada_%28Pantone%29.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/en/thumb/4/4f/Coat_of_arms_of_Canada.svg/85px-Coat_of_arms_of_Canada.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/6/67/CAN_orthographic.svg/220px-CAN_orthographic.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/Decrease_Positive.svg/11px-Decrease_Positive.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Nouvelle-France_map-en.svg/260px-Nouvelle-France_map-en.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/Canada_WWI_l%27Emprunt_de_la_Victoire2.jpg/135px-Canada_WWI_l%27Emprunt_de_la_Victoire2.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Canada_WWI_Victory_Bonds2.jpg/136px-Canada_WWI_Victory_Bonds2.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Canada_topo.jpg/260px-Canada_topo.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Canada_K%C3%B6ppen.svg/260px-Canada_K%C3%B6ppen.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Toronto_from_above_at_night.jpg/240px-Toronto_from_above_at_night.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/FTAs_with_Canada.svg/260px-FTAs_with_Canada.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/2/2d/STS-116_-_P5_Truss_hand-off_to_ISS_%28NASA_S116-E-05765%29.jpg/220px-STS-116_-_P5_Truss_hand-off_to_ISS_%28NASA_S116-E-05765%29.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/Censusdivisions-ethnic.png/240px-Censusdivisions-ethnic.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Statue_outside_Union_Station.jpg/170px-Statue_outside_Union_Station.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/CBC_Radio_Canada_Chevrolet_Express_02.jpg/220px-CBC_Radio_Canada_Chevrolet_Express_02.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/O-Canada-1908.pdf/page1-170px-O-Canada-1908.pdf.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Canada2010WinterOlympicsOTcelebration.jpg/220px-Canada2010WinterOlympicsOTcelebration.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/Sound-icon.svg/45px-Sound-icon.svg.png
1141364 total bytes, archive closed
Saved EPUB: /home/_Perco-Epubs/-Canada.epub

danburzo · 2022-10-14T21:43:49Z

Thanks @WolfgangDpunkt for the report, the issue should be fixed in version 2.2.1

WolfgangDpunkt · 2022-10-17T07:35:18Z

Thank you very much! I have completed the update and progress is noticeable. Indeed, it now works with the example article "Canada" from the English Wikipedia.
But, I'm afraid, the problem is not yet completely solved.

If you can find the patience to work on this problem further, I would be happy. Since there are hardly any other reliable tools to convert wiki articles to epub books via command line, I think the bug has a high relevance.

In this article, for example, almost all the pictures are missing:
https://de.wikipedia.org/wiki/Wien

However, the problem does not seem to be specific to non-English versions of Wikipedia. The epub of the English article "Vienna" includes a lot of pictures, but not all of them:
https://en.wikipedia.org/wiki/Vienna#Culinary_specialities

For example, the photo "Sachertorte" is missing from the epub.

In fact, the debug log does not mention the filename of this photo either; for whatever reason, this photo is skipped during the download (https://upload.wikimedia.org/wikipedia/commons/b/b8/Sachertorte_DSC03027.JPG)

danburzo · 2022-10-17T10:33:04Z

Thanks for pointing out the broken pages, it will help out with debugging. This is mostly Readability removing the images; I will investigate how to prevent that from happening.

danburzo · 2022-11-29T12:38:07Z

Seems that the HTML markup for images in Wikipedia is going to change soon: https://diff.wikimedia.org/2022/11/28/tech-news-2022-48/ (via @simevidas), so that may make handling them a bit easier.

danburzo · 2023-03-07T22:38:54Z

It turns out that there was more than one issue at play preventing one image or the other from being properly fetched/bundled:

- On non-English Wikipedia pages, URLs pointing to what look like images but are in fact HTML pages were not excluded, due to the assumption that they'd match wiki/File:. In fact, the File: part of the URL is localized, so you could have Fișier: or Datei:. Thanks @vongrad for investigating and submitting a patch!
- Additionally, regexes for matching image URLs were scattered across the codebase, and one of them was unintentionally case-sensitive, meaning it didn't match uppercase filenames such as Sachertorte_DSC03027.JPG.
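
To illustrate the two problems described above, here is a minimal sketch in JavaScript. It is not percollate's actual code: the helper names (`looksLikeImageUrl`, `isWikiFilePage`, `shouldFetchAsImage`), the list of extensions, and the `Datei:Wien.jpg` example URL are assumptions made for illustration. It keeps one shared, case-insensitive regex for image extensions, and it treats any `/wiki/<Namespace>:` path on a Wikipedia host as an HTML page instead of hard-coding `File:`.

```javascript
// Hypothetical helpers for illustration only; not taken from percollate.

// One shared regex for image extensions. The `i` flag makes the match
// case-insensitive, so `.JPG` is accepted as well as `.jpg`.
const IMAGE_EXT_RE = /\.(?:jpe?g|png|gif|svg|webp)$/i;

function looksLikeImageUrl(url) {
  const { pathname } = new URL(url);
  return IMAGE_EXT_RE.test(pathname);
}

// File description pages on Wikipedia are HTML, even though their URL
// ends in an image extension. The namespace prefix is localized
// (File:, Datei:, Fișier:, ...), so this sketch matches any
// `/wiki/<Namespace>:` path rather than the literal `File:`.
function isWikiFilePage(url) {
  const { hostname, pathname } = new URL(url);
  if (!hostname.endsWith('.wikipedia.org')) return false;
  return /^\/wiki\/[^/:]+:/.test(pathname);
}

// A URL is fetched as an image only if it has an image extension
// and is not a wiki page in some namespace.
function shouldFetchAsImage(url) {
  return looksLikeImageUrl(url) && !isWikiFilePage(url);
}

// Uppercase extension on the upload host: treated as an image.
console.log(shouldFetchAsImage(
  'https://upload.wikimedia.org/wikipedia/commons/b/b8/Sachertorte_DSC03027.JPG'
)); // true

// Localized file description page (made-up example URL): skipped.
console.log(shouldFetchAsImage('https://de.wikipedia.org/wiki/Datei:Wien.jpg')); // false
```

The namespace check is deliberately broad: it also matches article titles that contain a colon, which is harmless here because the result is only used together with the image-extension test.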

There may be additional issues with Readability as mentioned in earlier comments, but I'm confident upgrading to [email protected] will fix a lot of Wikipedia images.

WolfgangDpunkt · 2023-03-08T09:18:55Z

Dear @danburzo and @vongrad ,

I am very grateful for your work and your attention to my questions. This will help me a lot. I will test the new version as soon as possible. I appreciate your dedication very much.
Kudos for how patiently you troubleshot this issue.
Thank you very much!

danburzo closed this as completed in 5d23c61 Oct 14, 2022

danburzo reopened this Oct 17, 2022

danburzo added a commit that referenced this issue Mar 7, 2023

Unify regexes matching image URLs, fixes case sensitivity causing some images to not be fetched for EPUB archive (Re: #141)

2667b44
