feat: include original ZIM
This puts the original ZIM file in the root directory:

- makes the unpacked version easier to audit
- mitigates the problem of links pointing at the source ZIM archive
  disappearing after 3 months (Kiwix is unable to pay for unlimited
  hosting)
- enables us to start experimenting with ZIMs read from IPFS without
  unpacking them, and to compare them with the unpacked version

License: MIT
Signed-off-by: Marcin Rataj <[email protected]>
lidel committed Feb 12, 2021
1 parent e4f5e18 commit 93dc652
Showing 10 changed files with 170 additions and 43 deletions.
7 changes: 2 additions & 5 deletions Dockerfile
@@ -5,10 +5,7 @@ ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get -y install --no-install-recommends git ca-certificates make build-essential curl wget apt-utils golang-go

RUN curl https://sh.rustup.rs -sSf | sh -s -- -y --profile default \
&& cat $HOME/.cargo/env >> ~/.bashrc

RUN curl -sL https://deb.nodesource.com/setup_12.x -o nodesource_setup.sh \
RUN curl -sL https://deb.nodesource.com/setup_14.x -o nodesource_setup.sh \
&& bash nodesource_setup.sh \
&& apt-get -y install --no-install-recommends nodejs \
&& npm install -g yarn http-server
@@ -17,7 +14,7 @@ RUN wget -nv https://dist.ipfs.io/go-ipfs/v0.7.0/go-ipfs_v0.7.0_linux-amd64.tar.
&& tar xvfz go-ipfs_v0.7.0_linux-amd64.tar.gz \
&& mv go-ipfs/ipfs /usr/local/bin/ipfs \
&& rm -r go-ipfs && rm go-ipfs_v0.7.0_linux-amd64.tar.gz \
&& ipfs init --profile badgerds \
&& ipfs init --profile badgerds --empty-repo \
&& ipfs config --json 'Experimental.ShardingEnabled' true

# #
134 changes: 120 additions & 14 deletions README.md
@@ -10,6 +10,14 @@ Putting Wikipedia Snapshots on IPFS and working towards making it fully read-wri
Existing Mirrors: https://en.wikipedia-on-ipfs.org, https://tr.wikipedia-on-ipfs.org
</p>

- [Purpose](#purpose)
- [How to add new Wikipedia snapshots to IPFS](#how-to-add-new-wikipedia-snapshots-to-ipfs)
- [Manual steps](#manual-steps)
- [Docker](#docker)
- [How to help](#how-to-help)
- [Cohost a lazy copy](#cohost-a-lazy-copy)
- [Cohost a full copy](#cohost-a-full-copy)

## Purpose

“We believe that information—knowledge—makes the world better. That when we ask questions, get the facts, and are able to understand all perspectives on an issue, it allows us to build the foundation for a more just and tolerant society”
@@ -35,11 +43,28 @@ The long term goal is to get the full-fledged read-write Wikipedia to work on to

A full read-write version (2) would require a strong collaboration with Wikipedia.org itself, and finishing work on important dynamic content challenges -- we are working on all the technology (2) needs, but it's not ready for prime-time yet. We will update when it is.

## How to add new Wikipedia snapshots to IPFS
# How to add new Wikipedia snapshots to IPFS

The process can be almost fully automated; however, it consists of many stages,
and understanding what happens during each stage is paramount if the ZIM format
changes and our build toolchain requires debugging and updating.

- [Manual steps](#manual-steps) are useful in debugging situations, when a specific stage needs to be executed multiple times to fix a bug.
- [mirrorzim.sh](#mirrorzimsh) automates some steps for QA purposes and ad hoc experimentation.
- [Docker build](#docker-build) is a fully automated black box that takes a ZIM file and produces a CID and an `IPFS_PATH` with the datastore.

**Note: This is a work in progress.** We intend to make it easy for anyone to
create their own Wikipedia snapshots and add them to IPFS, making sure those
builds are deterministic and auditable, but our first emphasis has been to get
the initial snapshots onto the network. This means some of the steps aren't as
easy as we want them to be. If you run into trouble, seek help through a GitHub
issue, by commenting in the `#ipfs` channel on IRC, or by posting a thread at
https://discuss.ipfs.io.

## Manual steps

If you would like to create an updated Wikipedia snapshot on IPFS, you can follow these steps.

**Note: This is a work in progress.** We intend to make it easy for anyone to create their own Wikipedia snapshots and add them to IPFS, but our first emphasis has been to get the initial snapshots onto the network. This means some of the steps aren't as easy as we want them to be. If you run into trouble, seek help through a GitHub issue, by commenting in the `#ipfs` channel on IRC, or by posting a thread at https://discuss.ipfs.io.

### Step 0: Clone this repository

@@ -83,7 +108,7 @@ $ export IPFS_PATH=/path/to/IPFS_PATH_WIKIPEDIA_MIRROR
Make sure the repo is initialized with a [datastore backed by BadgerDB](https://github.com/ipfs/go-ds-badger) for improved performance:

```
ipfs init -p badgerds
ipfs init -p badgerds --empty-repo
```


@@ -103,7 +128,7 @@ $ ipfs config --json 'Experimental.ShardingEnabled' true
```


### Step 1: Download the latest snapshot from kiwix.org
### Step 3: Download the latest snapshot from kiwix.org

Source of ZIM files is at https://download.kiwix.org/zim/wikipedia/
Make sure you download `_all_maxi_` snapshots, as those include images.
@@ -123,7 +148,7 @@ Running the command will download the chosen zim file to the `./snapshots` directory



### Step 3: Unpack the ZIM snapshot
### Step 4: Unpack the ZIM snapshot

Unpack the ZIM snapshot using `extract_zim`:

@@ -137,7 +162,7 @@ $ zimdump dump ./snapshots/wikipedia_tr_all_maxi_2021-01.zim --dir ./tmp/wikiped
> It is often different than the "main page" of upstream Wikipedia.
> Kiwix Main page needs to be passed in the next step, so until there is an automated way to determine "main page" of ZIM, you need to open ZIM in Kiwix reader and eyeball the name of the landing page.
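
Until that is automated in the manual flow, the `main page` entry can be scripted the same way `mirrorzim.sh` in this repo does it. A minimal sketch with canned `zimdump info` output (the sample lines are illustrative; a real ZIM prints more fields):

```shell
# Sketch: extract the main-page name from `zimdump info` output.
# grep's \K (a PCRE feature, hence -P) drops the matched prefix,
# leaving only the page name after "A/".
printf 'count-entries: 1234\nmain page: A/Kullanıcı:The_other_Kiwix_guy/Landing\n' \
  | grep -oP 'main page: A/\K\S+'
# prints: Kullanıcı:The_other_Kiwix_guy/Landing
```

In a real run, replace the `printf` with `zimdump info ./snapshots/$ZIM_FILE`.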
### Step 4: Convert the unpacked zim directory to a website with mirror info
### Step 5: Convert the unpacked zim directory to a website with mirror info

IMPORTANT: The snapshots must say who disseminated them. This effort to mirror Wikipedia snapshots is not affiliated with the Wikimedia foundation and is not connected to the volunteers whose contributions are contained in the snapshots. The snapshots must include information explaining that they were created and disseminated by independent parties, not by Wikipedia.

@@ -147,7 +172,7 @@ The conversion to a working website and the appending of necessary information i
$ node ./bin/run --help
```

First though the main page, as the archive appears on the appropriate wikimedia website, must be determined. For instance, the zim file for Turkish Wikipedia has a main page of `Kullanıcı:The_other_Kiwix_guy/Landing` but `https://tr.wikipedia.org` uses `Anasayfa` as the main page. Both must be passed to the node script.
The program requires the main page for both the ZIM and the online version as inputs. For instance, the ZIM file for Turkish Wikipedia has a main page of `Kullanıcı:The_other_Kiwix_guy/Landing`, but `https://tr.wikipedia.org` uses `Anasayfa` as the main page. Both must be passed to the node script.

To determine the original main page use `./tools/find_main_page_name.sh`:

@@ -174,14 +199,14 @@ Kullanıcı:The_other_Kiwix_guy/Landing
The conversion is done on the unpacked zim directory:

```sh
node ./bin/run ./tmp/wikipedia_tr_all_maxi_2021-01 \
node ./bin/run ./tmp/wikipedia_tr_all_maxi_2021-02 \
--hostingdnsdomain=tr.wikipedia-on-ipfs.org \
--zimfilesourceurl=https://download.kiwix.org/zim/wikipedia/wikipedia_tr_all_maxi_2019-12.zim \
--zimfile=./snapshots/wikipedia_tr_all_maxi_2021-02.zim \
--kiwixmainpage=Kullanıcı:The_other_Kiwix_guy/Landing \
--mainpage=Anasayfa
```

### Step 4: Import website directory to IPFS
### Step 6: Import website directory to IPFS

#### Add immutable copy

@@ -193,13 +218,28 @@ $ ipfs add -r --cid-version 1 --offline $unpacked_wiki

Save the last hash of the output from the above process. It is the CID of the website.
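
Rather than eyeballing the output, the root CID can be captured in a variable. A sketch with a hypothetical helper and canned output (the CIDs below are placeholders, not real hashes):

```shell
# Hypothetical helper (not part of this repo): the last
# "added <cid> <path>" line printed by `ipfs add -r` is the root
# directory, so its CID is the CID of the website.
extract_root_cid() {
  tail -n 1 | cut -d' ' -f2
}

# Canned output for illustration; the real pipeline would be:
#   CID=$(ipfs add -r --cid-version 1 --offline "$unpacked_wiki" | extract_root_cid)
printf 'added bafy...page wiki/Anasayfa\nadded bafy...root wikipedia_tr\n' \
  | extract_root_cid
# prints: bafy...root
```

`ipfs add` also supports `-Q`, which prints only the final root CID; the `mirrorzim.sh` script in this repo relies on that instead.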

### Step 6: Share the hash
### Step 7: Share the root CID

Share the CID of your new snapshot so people can access it and replicate it onto their machines.

# Docker
### Step 8: Update *.wikipedia-on-ipfs.org

Make sure at least two full reliable copies exist before updating DNSLink.

## mirrorzim.sh

It is possible to automate steps 3-6 via a wrapper script named `mirrorzim.sh`.
It will download the latest snapshot for the specified language (if needed), unpack it, and add it to IPFS.

To see how the script behaves try running it on one of the smallest wikis, such as `cu`:

```console
$ ./mirrorzim.sh --languagecode=cu --wikitype=wikipedia --hostingdnsdomain=cu.wikipedia-on-ipfs.org
```

## Docker build

A dockerfile with the software requirements is provided.
A `Dockerfile` with the software requirements is provided.

To build the docker image:

@@ -215,4 +255,70 @@ docker run -it -v $(pwd):/root/distributed-wikipedia-mirror -p 8080:8080 --entry

# How to Help

If you would like to contribute to this effort, look at the [issues](https://github.com/ipfs/distributed-wikipedia-mirror/issues) in this github repo. Especially check for [issues marked with the "wishlist" label](https://github.com/ipfs/distributed-wikipedia-mirror/labels/wishlist) and issues marked ["help wanted"](https://github.com/ipfs/distributed-wikipedia-mirror/labels/help%20wanted).
If you don't mind the command-line interface and have plenty of disk space,
bandwidth, or coding skills, continue reading.

## Share mirror CID with people who can't trust DNS

Sharing a CID instead of a DNS name is useful when DNS is not reliable or
trustworthy. The latest CID for a specific language mirror can be found via
DNSLink:

```console
$ ipfs resolve -r /ipns/tr.wikipedia-on-ipfs.org
/ipfs/bafy..
```
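
If you need the bare CID without the `/ipfs/` prefix, shell parameter expansion is enough. A sketch with a placeholder value:

```shell
# Strip the /ipfs/ prefix from a resolved DNSLink path using shell
# parameter expansion. The value below is a placeholder, not a real CID.
resolved="/ipfs/bafybeiexamplecid"
cid="${resolved#/ipfs/}"
echo "$cid"
# prints: bafybeiexamplecid
```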

The CID can then be opened as `ipfs://bafy..` in a web browser with the [IPFS Companion](https://github.com/ipfs-shipyard/ipfs-companion) extension
resolving IPFS addresses via a local [IPFS Desktop](https://docs.ipfs.io/install/ipfs-desktop/) node.

You can also try [Brave browser](https://brave.com), which ships with [native support for IPFS](https://brave.com/ipfs-support/).

## Cohost a lazy copy

Using MFS makes it easier to protect snapshots from being garbage collected
than low-level pinning: you can assign meaningful names, and it won't
prefetch any blocks unless you explicitly ask for them.

Every mirrored Wikipedia article you visit will be added to your lazy
copy and will contribute to your partial mirror, so you won't need to host
the entire thing.

To cohost a lazy copy, execute:

```console
$ export LNG="tr"
$ ipfs files mkdir -p /wikipedia-mirror/$LNG
$ ipfs files cp $(ipfs resolve -r /ipns/$LNG.wikipedia-on-ipfs.org) /wikipedia-mirror/$LNG/${LNG}_$(date +%F_%T)
```
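
One detail worth noting: the timestamped destination name should be written as `${LNG}_...` with braces, because without them the shell would look up a variable literally named `LNG_...`. A quick sketch of the expansion:

```shell
# With braces the underscore stays outside the variable name;
# without them, $LNG_2021 would be looked up as a (likely unset)
# variable named LNG_2021 and expand to an empty string.
LNG="tr"
name="${LNG}_2021-02-12"
echo "$name"
# prints: tr_2021-02-12
```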

Then simply start browsing the `$LNG.wikipedia-on-ipfs.org` site via your node.
Every visited page will be cached, cohosted, and protected from garbage collection.

## Cohost a full copy

The steps are the same as for a lazy copy, but after the lazy copy is in
place you run an additional preload:

```console
$ # export LNG="tr"
$ ipfs refs -r /ipns/$LNG.wikipedia-on-ipfs.org
```

Before you execute this, check if you have enough disk space to fit `CumulativeSize`:

```console
$ # export LNG="tr"
$ ipfs object stat --human /ipns/$LNG.wikipedia-on-ipfs.org
NumLinks: 5
BlockSize: 281
LinksSize: 251
DataSize: 30
CumulativeSize: 15 GB
```
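
To check this in a script, the sample output above can be parsed with standard tools. A sketch with the output hardcoded (field names assumed stable across go-ipfs versions):

```shell
# Pull the CumulativeSize value out of `ipfs object stat --human` output.
# Sample output is hardcoded here; pipe the real command into the same awk.
printf 'NumLinks: 5\nBlockSize: 281\nLinksSize: 251\nDataSize: 30\nCumulativeSize: 15 GB\n' \
  | awk -F': ' '/^CumulativeSize/ {print $2}'
# prints: 15 GB
```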

We are working on improving deduplication between snapshots, but for now YMMV.

## Code

If you would like to contribute more to this effort, look at the [issues](https://github.com/ipfs/distributed-wikipedia-mirror/issues) in this github repo. Especially check for [issues marked with the "wishlist" label](https://github.com/ipfs/distributed-wikipedia-mirror/labels/wishlist) and issues marked ["help wanted"](https://github.com/ipfs/distributed-wikipedia-mirror/labels/help%20wanted).
23 changes: 17 additions & 6 deletions mirrorzim.sh
@@ -7,7 +7,7 @@ set -euo pipefail

usage() {
echo "USAGE:"
echo " $0 - download a zim file, unpack it, convert to website then push to local ipfs instance"
echo " $0 - download a zim file, unpack it, convert it to a website, then add it to MFS on the local IPFS instance"
echo ""
echo "SYNOPSIS"
echo " $0 --languagecode=<LANGUAGE_CODE> --wikitype=<WIKI_TYPE>"
@@ -91,21 +91,32 @@ printf "\nRemove tmp directory $TMP_DIRECTORY before run ..."
rm -rf $TMP_DIRECTORY

printf "\nUnpack the zim file into $TMP_DIRECTORY...\n"
ZIM_FILE_MAIN_PAGE=$(./extract_zim/extract_zim ./snapshots/$ZIM_FILE --out $TMP_DIRECTORY | grep 'Main page is' | cut -d' ' -f4)
zimdump dump ./snapshots/$ZIM_FILE --dir $TMP_DIRECTORY

# Find the main page of ZIM
ZIM_FILE_MAIN_PAGE=$(zimdump info ./snapshots/$ZIM_FILE | grep -oP 'main page: A/\K\S+')

# Resolve the main page as it is on wikipedia over http
MAIN_PAGE=$(./tools/find_main_page_name.sh "$LANGUAGE_CODE.$WIKI_TYPE.org")

printf "\nConvert the unpacked zim directory to a website\n"
node ./bin/run $TMP_DIRECTORY \
--zimfilesourceurl=$ZIM_FILE_SOURCE_URL \
--zimfile=./snapshots/$ZIM_FILE \
--kiwixmainpage=$ZIM_FILE_MAIN_PAGE \
--mainpage=$MAIN_PAGE \
${HOSTING_DNS_DOMAIN:+--hostingdnsdomain=$HOSTING_DNS_DOMAIN} \
${HOSTING_IPNS_HASH:+--hostingipnshash=$HOSTING_IPNS_HASH} \
${MAIN_PAGE_VERSION:+--mainpageversion=$MAIN_PAGE_VERSION}

printf "\nAdd the processed tmp directory to IPFS\n"
CID=$(ipfs add -r --cid-version 1 --offline $TMP_DIRECTORY | tail -n -1 | cut -d' ' -f2 )
printf "\nAdding original ZIM to the root dir:\n"
cp -v ./snapshots/$ZIM_FILE $TMP_DIRECTORY

printf "\nAdding the processed tmp directory to IPFS: be patient (it takes a few hours), and keep this terminal open\n"
CID=$(ipfs add -r --cid-version 1 --pin=false --offline -Qp $TMP_DIRECTORY)
MFS_DIR="/${ZIM_FILE}_$(date +%F_%T)"

# pin by adding to MFS under a meaningful name
ipfs files cp /ipfs/$CID "$MFS_DIR"

printf "\nThe CID of $ZIM_FILE is:\n$CID\n"
printf "\nThe root CID of $ZIM_FILE is:\n$CID\n"
printf "\nSaved in MFS under:\n$MFS_DIR\n"
5 changes: 2 additions & 3 deletions src/article-transforms.ts
@@ -1,5 +1,6 @@
import { format } from 'date-fns'
import { readFileSync } from 'fs'
import { basename, relative } from 'path'
import Handlebars from 'handlebars'

import { EnhancedOpts } from './domain'
@@ -22,9 +23,7 @@ const generateFooterFrom = (options: EnhancedOpts) => {
CANONICAL_URL: options.canonicalUrl,
CANONICAL_URL_DISPLAY: decodeURIComponent(options.canonicalUrl),
IMAGES_DIR: options.relativeImagePath,
ZIM_URL:
options.zimFileSourceUrl ??
'https://wiki.kiwix.org/wiki/Content_in_all_languages'
ZIM_NAME: basename(options.zimFile)
}

const footerTemplate = Handlebars.compile(footerFragment.toString())
2 changes: 1 addition & 1 deletion src/domain.ts
@@ -5,7 +5,7 @@ export interface Options {
mainPageVersion?: number
hostingDNSDomain?: string
hostingIPNSHash?: string
zimFileSourceUrl: string
zimFile: string
noOfWorkerThreads: number
}

8 changes: 4 additions & 4 deletions src/index.ts
@@ -7,15 +7,15 @@ class ZimToWebsite extends Command {
static description = 'Convert unpacked zim files to usable websites'

static examples = [
'$ zim-to-website ./tmp \\\n --hostingdnsdomain=tr.wikipedia-on-ipfs.org \\\n --hostingipnshash=QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W \\\n --zimfilesourceurl=https://download.kiwix.org/zim/wikipedia/wikipedia_tr_all_maxi_2019-12.zim \\\n --kiwixmainpage=Kullanıcı:The_other_Kiwix_guy/Landing \\\n --mainpage=Anasayfa'
'$ zim-to-website ./tmp \\\n --hostingdnsdomain=tr.wikipedia-on-ipfs.org \\\n --hostingipnshash=QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W \\\n --zimfile=/path/to/wikipedia_tr_all_maxi_2019-12.zim \\\n --kiwixmainpage=Kullanıcı:The_other_Kiwix_guy/Landing \\\n --mainpage=Anasayfa'
]

static flags = {
version: flags.version({ char: 'v' }),
help: flags.help({ char: 'h' }),
zimfilesourceurl: flags.string({
zimfile: flags.string({
required: true,
description: 'the url of the original (before unpacking) source zim file'
description: 'the location of the original (before unpacking) source zim file'
}),
kiwixmainpage: flags.string({
required: true,
@@ -55,7 +55,7 @@ class ZimToWebsite extends Command {
unpackedZimDir: args.unpackedzimdir,
hostingDNSDomain: flags.hostingdnsdomain,
hostingIPNSHash: flags.hostingipnshash,
zimFileSourceUrl: flags.zimfilesourceurl,
zimFile: flags.zimfile,
kiwixMainPage: flags.kiwixmainpage,
mainPage: flags.mainpage,
mainPageVersion: flags.mainpageversion,
20 changes: 16 additions & 4 deletions src/site-transforms.ts
@@ -15,7 +15,7 @@ import {
import Handlebars from 'handlebars'
import fetch from 'node-fetch'
import path from 'path'
import { join } from 'path'
import { join, basename, relative } from 'path'

import {
appendHtmlPostfix,
@@ -70,6 +70,20 @@ export const moveArticleFolderToWiki = ({
cli.action.stop()
}

export const includeSourceZim = ({
zimFile,
unpackedZimDir
}: Options) => {
const zimCopy = join(unpackedZimDir, basename(zimFile))
if (existsSync(zimCopy)) {
return
}

cli.action.start(' Copying source ZIM to the root of unpacked version ')
copyFileSync(zimFile, zimCopy)
cli.action.stop()
}

export const insertIndexRedirect = (options: Options) => {
cli.action.start(" Inserting root 'index.html' as redirect to main page")
const template = Handlebars.compile(indexRedirectFragment.toString())
@@ -283,9 +297,7 @@ export const appendJavscript = (
SNAPSHOT_DATE: format(new Date(), 'yyyy-MM'),
HOSTING_IPNS_HASH: options.hostingIPNSHash,
HOSTING_DNS_DOMAIN: options.hostingDNSDomain,
ZIM_URL:
options.zimFileSourceUrl ??
'https://wiki.kiwix.org/wiki/Content_in_all_languages'
ZIM_NAME: basename(options.zimFile)
}

const dwmSitejs = Handlebars.compile(dwmSitejsTemplate.toString())({
6 changes: 3 additions & 3 deletions src/templates/footer_fragment.handlebars
@@ -95,7 +95,7 @@
<div class="footer-titles">
<div class="footer-titles-title">Distributed Wikipedia Mirror Project</div>
<div class="footer-titles-subtitle">
Powered by <a href="https://ipfs.io">IPFS</a>
Powered by <a class="external" href="https://ipfs.io">IPFS</a>
</div>
</div>
<div class="footer-header-blank-slot">
@@ -111,8 +111,8 @@
a global effort, independent from Wikipedia.
</div>
<div>
Created on {{SNAPSHOT_DATE}} from the
<a class="external" href="{{ZIM_URL}}">kiwix ZIM file</a>
Created on {{SNAPSHOT_DATE}} from the kiwix ZIM file:
<a href="{{ZIM_URL}}">{{ZIM_NAME}}</a>
</div>
<div id="footer-ipns-link">
<!-- Placeholder -->