This repository has been archived by the owner on Mar 10, 2022. It is now read-only.

libgen: use local database copies instead #88

Open
tmplt opened this issue May 26, 2019 · 8 comments

Comments

@tmplt
Owner

tmplt commented May 26, 2019

The HTML from http://libgen.io/foreignfiction/index.php is not parsed correctly. While the page renders fine in a browser, its HTML cannot be fed directly into BeautifulSoup because some tags appear in places they shouldn't. An alternative interface (currently in beta, but much easier to parse) is available at http://gen.lib.rus.ec/fiction/. I expect more changes to this interface in the coming months, so parsing it correctly is likely a moving target.

It would probably be a good idea to ask the devs if there are plans to expand the JSON API (See #85), or check how the desktop application gets its data.

@tmplt tmplt added the python label May 26, 2019
@tmplt
Owner Author

tmplt commented May 26, 2019

The desktop application uses an imported local copy of the databases. These are all publicly available, but the current latest backups will consume roughly 1 GB (compressed). If we had these, we could query our way to whatever we want with plain SQL. Should bookwyrm download these databases? Some plugin preparation step? New DB releases are tagged with a proper "Last modified" date.

@tmplt
Owner Author

tmplt commented May 26, 2019

Databases can be downloaded to ~/.local/share. For a start, bookwyrm can expect these files to exist; we'll find some neat way for it to automatically download them later.
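A rough sketch of how the plugin could check for the expected files under ~/.local/share before querying. The `bookwyrm/libgen` subdirectory and the `libgen.db`/`fiction.db` file names are hypothetical, not something the issue settles on:

```python
from pathlib import Path
from typing import Dict, List, Optional

def libgen_db_paths(data_home: Optional[Path] = None) -> Dict[str, Path]:
    """Expected locations of the converted sqlite databases (names are hypothetical)."""
    if data_home is None:
        data_home = Path.home() / ".local" / "share"
    base = data_home / "bookwyrm" / "libgen"
    return {name: base / f"{name}.db" for name in ("libgen", "fiction")}

def missing_dbs(data_home: Optional[Path] = None) -> List[str]:
    """Names of the databases that have not been downloaded and converted yet."""
    return [name for name, path in libgen_db_paths(data_home).items()
            if not path.is_file()]
```

The plugin could call `missing_dbs()` on startup and bail out with a helpful message listing what is missing.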

@tmplt
Owner Author

tmplt commented Jun 25, 2019

The database backups are MySQL dumps, but we do not want to host our own MySQL server, so instead we can convert the dumps to sqlite-compatible statements via mysql2sqlite. After some minor adjustments to the produced file (removing the `libgen.` prefix from created tables in the non-fiction dump; removing `USING BTREE` from the fiction dump; etc.) we can give the database to bookwyrm.
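The two adjustments mentioned could be a small post-processing pass over mysql2sqlite's output. A sketch covering only those two fixes (the exact shape of the dumps may vary):

```python
import re

def fix_sqlite_statements(sql: str) -> str:
    """Post-process mysql2sqlite output: the two adjustments described above."""
    # Non-fiction dump: drop the `libgen.` schema prefix from created tables.
    sql = re.sub(r"CREATE TABLE\s+(?:`libgen`|libgen)\.", "CREATE TABLE ", sql)
    # Fiction dump: sqlite does not understand MySQL's `USING BTREE` index hint.
    return sql.replace(" USING BTREE", "")
```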

@tmplt
Owner Author

tmplt commented Jun 25, 2019

RAR-compressed (application/x-rar-compressed) databases can be downloaded from http://libgen.io/dbdumps/.

Downloads (with wget, at least) seem to cut out every once in a while. Automated downloads should probably use a low timeout and many retries.
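The low-timeout/many-retries idea can be factored out into a generic wrapper, independent of whichever HTTP library ends up doing the actual transfer. A minimal sketch:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], retries: int = 20, delay: float = 1.0) -> T:
    """Call fn until it succeeds, retrying on any exception up to `retries` times."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # give up: re-raise the last failure
            time.sleep(delay)
    raise RuntimeError("unreachable")
```

The download call itself (with a short socket timeout set) would be passed in as `fn`, e.g. `with_retries(lambda: fetch(url))` where `fetch` is whatever transfer function we settle on.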

@tmplt
Owner Author

tmplt commented Jun 25, 2019

The easiest implementation of this approach would be to feed every entry to bookwyrm and let it do all the heavy lifting (this takes no more than a few seconds on an SSD). The complexity of the libgen plugin will then lie in preparation: downloading the databases, unarchiving them, and converting them to sqlite databases.

@tmplt tmplt changed the title libgen: foreign fiction not parsed correctly libgen: use local database copies instead Jun 30, 2019
@tmplt tmplt added this to the v0.8.0 milestone Jun 30, 2019
tmplt added a commit that referenced this issue Jun 30, 2019
tmplt added a commit that referenced this issue Jun 30, 2019
tmplt added a commit that referenced this issue Jul 7, 2019
tmplt added a commit that referenced this issue Jul 7, 2019
tmplt added a commit that referenced this issue Jul 7, 2019
This enables us to raise any eventual exceptions to bookwyrm, instead of
dumping them to std{out,err}.

However, the exception will only be raised after remaining threads have
finished.

Related to #88.
tmplt added a commit that referenced this issue Jul 7, 2019
tmplt added a commit that referenced this issue Jul 7, 2019
If no non-empty fields exist, series is None.

Related to #88.
@tmplt
Owner Author

tmplt commented Jul 7, 2019

Local databases are now queried. The bottleneck is not the disk but feeding the items to bookwyrm. This can likely be sped up by spawning some feeder threads, but the current implementation is sufficient for now.
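The feeder-thread idea could look something like the sketch below, with `feed` standing in for whatever callable hands an item to bookwyrm (assumed thread-safe; that assumption would need checking first):

```python
import queue
import threading
from typing import Callable, Iterable

def feed_all(items: Iterable, feed: Callable, workers: int = 4) -> None:
    """Drain `items` into `feed` from several threads."""
    q: queue.Queue = queue.Queue()
    for item in items:
        q.put(item)

    def worker() -> None:
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                return  # queue drained; this worker is done
            feed(item)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```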

@tmplt
Owner Author

tmplt commented Jul 7, 2019

List of things to do/consider before we can close this issue:

  • Add --prepare option
  • Prompt user if database dumps should be downloaded (from fresh, updated, etc.)
  • Download dumps for fiction and libgen
  • Uncompress the dumps
  • Remove MySQL-ness and convert to sqlite3 statements. (see https://github.com/dumblob/mysql2sqlite)
  • Generate actual database with sqlite3.
  • Delete temporary files

The JSON API can apparently be used to apply future updates to the database, but we'll tackle that later.

For now, the biggest question is how we should convert the dumps to sqlite3 statements. Do we need all the replacements done in the awk script, or only a subset? The best outcome is if we can do everything in pure Python. In either case, the whole awk script can probably be converted to Python via re (ugh).
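To get a feel for what the awk-to-re conversion would look like, here is a hypothetical subset of the rewrites (nowhere near mysql2sqlite's full rule set, just the flavor of it):

```python
import re

# Hypothetical subset of mysql2sqlite's rewrites, expressed with Python's re.
MYSQL_TO_SQLITE = [
    (re.compile(r"\s+ENGINE=\w+[^;]*;"), ";"),        # drop MySQL table options
    (re.compile(r"\\'"), "''"),                       # MySQL string escaping -> standard SQL
    (re.compile(r"\b(?:UN)?LOCK TABLES[^;]*;"), ""),  # sqlite has no LOCK TABLES
]

def convert_line(line: str) -> str:
    for pattern, repl in MYSQL_TO_SQLITE:
        line = pattern.sub(repl, line)
    return line
```

Processing the dump line by line like this keeps memory usage flat even for the ~1 GB dumps, since we never need the whole file in memory at once.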

@tmplt tmplt modified the milestones: v0.8.0, v0.9.0 Jul 8, 2019
@tmplt
Owner Author

tmplt commented Jul 15, 2019

For the time being, this behavior should be wrapped into a --prepare option, that, as the option implies, prepares any and all plugins that require preparation.
