This repository has been archived by the owner on Mar 10, 2022. It is now read-only.

libgen: use local database copies instead #88

Open
tmplt opened this issue May 26, 2019 · 8 comments

Comments

@tmplt
Owner

tmplt commented May 26, 2019

The HTML from http://libgen.io/foreignfiction/index.php is not parsed correctly. While the page renders fine in a browser, its HTML cannot be fed directly into BeautifulSoup because some tags appear in places they shouldn't. An alternative interface (currently in beta, but much easier to parse) is available at http://gen.lib.rus.ec/fiction/. I expect more changes to this interface in the coming months, so parsing it correctly is likely a moving target.

It would probably be a good idea to ask the devs if there are plans to expand the JSON API (See #85), or check how the desktop application gets its data.

@tmplt tmplt added the python label May 26, 2019
@tmplt
Owner Author

tmplt commented May 26, 2019

The desktop application uses an imported local copy of the databases. These are all publicly available, but the current latest backups will consume roughly 1 GB (compressed). If we had these, we could query our way to whatever we want with plain SQL. Should bookwyrm download these databases? Some plugin preparation step? New DB releases are tagged with a proper "Last modified" date.

@tmplt
Owner Author

tmplt commented May 26, 2019

Databases can be downloaded to ~/.local/share. For a start, bookwyrm can expect these files to exist; we'll find some neat way for it to automatically download them later.
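A rough sketch of how the plugin could check for the expected files under ~/.local/share before querying. The `bookwyrm/libgen` subdirectory and the `libgen.db`/`fiction.db` file names are hypothetical, not something the issue settles on:

```python
from pathlib import Path
from typing import Dict, List, Optional

def libgen_db_paths(data_home: Optional[Path] = None) -> Dict[str, Path]:
    """Expected locations of the converted sqlite databases (names are hypothetical)."""
    if data_home is None:
        data_home = Path.home() / ".local" / "share"
    base = data_home / "bookwyrm" / "libgen"
    return {name: base / f"{name}.db" for name in ("libgen", "fiction")}

def missing_dbs(data_home: Optional[Path] = None) -> List[str]:
    """Names of the databases that have not been downloaded and converted yet."""
    return [name for name, path in libgen_db_paths(data_home).items()
            if not path.is_file()]
```

The plugin could call `missing_dbs()` on startup and bail out with a helpful message listing what is missing.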

@tmplt
Owner Author

tmplt commented Jun 25, 2019

The database backups are MySQL dumps, but we do not want to host our own MySQL server, so instead we can convert the dumps to sqlite-compatible statements via mysql2sqlite. After some minor adjustments to the produced file (removing the `libgen.` prefix from created tables in the non-fiction dump; removing `USING BTREE` from the fiction dump; etc.) we can give the database to bookwyrm.
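The two adjustments mentioned could be a small post-processing pass over mysql2sqlite's output. A sketch covering only those two fixes (the exact shape of the dumps may vary):

```python
import re

def fix_sqlite_statements(sql: str) -> str:
    """Post-process mysql2sqlite output: the two adjustments described above."""
    # Non-fiction dump: drop the `libgen.` schema prefix from created tables.
    sql = re.sub(r"CREATE TABLE\s+(?:`libgen`|libgen)\.", "CREATE TABLE ", sql)
    # Fiction dump: sqlite does not understand MySQL's `USING BTREE` index hint.
    return sql.replace(" USING BTREE", "")
```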

@tmplt
Owner Author

tmplt commented Jun 25, 2019

RAR-compressed (application/x-rar-compressed) databases can be downloaded from http://libgen.io/dbdumps/.

Downloads (with wget, at least) seem to cut out every once in a while. Automated downloads should probably use a low timeout and many retries.
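The low-timeout/many-retries idea can be factored out into a generic wrapper, independent of whichever HTTP library ends up doing the actual transfer. A minimal sketch:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], retries: int = 20, delay: float = 1.0) -> T:
    """Call fn until it succeeds, retrying on any exception up to `retries` times."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # give up: re-raise the last failure
            time.sleep(delay)
    raise RuntimeError("unreachable")
```

The download call itself (with a short socket timeout set) would be passed in as `fn`, e.g. `with_retries(lambda: fetch(url))` where `fetch` is whatever transfer function we settle on.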

@tmplt
Owner Author

tmplt commented Jun 25, 2019

The easiest implementation of this approach would be to feed every entry to bookwyrm and let it do all the heavy lifting (this takes no more than a few seconds on an SSD). The complexity of the libgen plugin will then lie in preparation: downloading the databases, unarchiving them, and converting them to sqlite databases.

@tmplt tmplt changed the title libgen: foreign fiction not parsed correctly libgen: use local database copies instead Jun 30, 2019
@tmplt tmplt added this to the v0.8.0 milestone Jun 30, 2019
tmplt added a commit that referenced this issue Jun 30, 2019
tmplt added a commit that referenced this issue Jun 30, 2019
tmplt added a commit that referenced this issue Jul 7, 2019
tmplt added a commit that referenced this issue Jul 7, 2019
tmplt added a commit that referenced this issue Jul 7, 2019
This enables us to raise any eventual exceptions to bookwyrm, instead of
dumping them to std{out,err}.

However, the exception will only be raised after remaining threads have
finished.

Related to #88.
tmplt added a commit that referenced this issue Jul 7, 2019
tmplt added a commit that referenced this issue Jul 7, 2019
If no non-empty fields exist, series is None.

Related to #88.
@tmplt
Owner Author

tmplt commented Jul 7, 2019

Local databases are now queried. The bottleneck is not the disk but feeding the items to bookwyrm. This can likely be sped up by spawning some feeder threads, but the current implementation is sufficient for now.
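The feeder-thread idea could look something like the sketch below, with `feed` standing in for whatever callable hands an item to bookwyrm (assumed thread-safe; that assumption would need checking first):

```python
import queue
import threading
from typing import Callable, Iterable

def feed_all(items: Iterable, feed: Callable, workers: int = 4) -> None:
    """Drain `items` into `feed` from several threads."""
    q: queue.Queue = queue.Queue()
    for item in items:
        q.put(item)

    def worker() -> None:
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                return  # queue drained; this worker is done
            feed(item)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```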

@tmplt
Owner Author

tmplt commented Jul 7, 2019

List of things to do/consider before we can close this issue:

  • Add --prepare option
  • Prompt user if database dumps should be downloaded (from fresh, updated, etc.)
  • Download dumps for fiction and libgen
  • Uncompress the dumps
  • Remove MySQL-ness and convert to sqlite3 statements. (see https://github.com/dumblob/mysql2sqlite)
  • Generate actual database with sqlite3.
  • Delete temporary files

The JSON API can apparently be used to apply future updates to the database, but we'll tackle that later.

For now, the biggest question is how we should convert the dumps to sqlite3 statements. Do we need all the replacements done in the awk script, or only a subset? The best outcome is if we can do everything in pure Python. In either case, the whole awk script can probably be converted to Python via re (ugh).
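To get a feel for what the awk-to-re conversion would look like, here is a hypothetical subset of the rewrites (nowhere near mysql2sqlite's full rule set, just the flavor of it):

```python
import re

# Hypothetical subset of mysql2sqlite's rewrites, expressed with Python's re.
MYSQL_TO_SQLITE = [
    (re.compile(r"\s+ENGINE=\w+[^;]*;"), ";"),        # drop MySQL table options
    (re.compile(r"\\'"), "''"),                       # MySQL string escaping -> standard SQL
    (re.compile(r"\b(?:UN)?LOCK TABLES[^;]*;"), ""),  # sqlite has no LOCK TABLES
]

def convert_line(line: str) -> str:
    for pattern, repl in MYSQL_TO_SQLITE:
        line = pattern.sub(repl, line)
    return line
```

Processing the dump line by line like this keeps memory usage flat even for the ~1 GB dumps, since we never need the whole file in memory at once.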

@tmplt tmplt modified the milestones: v0.8.0, v0.9.0 Jul 8, 2019
@tmplt
Owner Author

tmplt commented Jul 15, 2019

For the time being, this behavior should be wrapped into a --prepare option, that, as the option implies, prepares any and all plugins that require preparation.
