libgen: use local database copies instead #88
The desktop application uses an imported local copy of the databases. These are all publicly available, but the current latest backups will consume roughly 1 GB (compressed). If we had these, we could just SQL our way to what we want. Should bookwyrm download these databases? Some plugin preparation step? New DB releases are tagged with a proper "Last modified".
Databases can be downloaded from …
The database backups are MySQL dumps, but we do not want to host our own MySQL server, so instead we can convert the dump to sqlite-compatible statements via mysql2sqlite. After some minor adjustments to the produced file (removing the …
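For reference, the conversion can be driven from Python with a plain subprocess pipeline. A minimal sketch, assuming the dump has already been unarchived and that the mysql2sqlite script and the sqlite3 binary are available; the file names are placeholders:

```python
import subprocess

dump_path = "libgen.sql"  # hypothetical: the unarchived MySQL dump
db_path = "libgen.db"     # hypothetical: the sqlite database to produce

# Run the mysql2sqlite awk script and pipe its output straight into
# sqlite3, mirroring the usage documented in the mysql2sqlite README.
convert = subprocess.Popen(["./mysql2sqlite", dump_path], stdout=subprocess.PIPE)
subprocess.run(["sqlite3", db_path], stdin=convert.stdout, check=True)
convert.stdout.close()
convert.wait()
```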
Downloads (with wget, at least) seem to cut out every once in a while. An automated download should probably use a low timeout and many retries.
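A minimal sketch of such a download helper, assuming the requests library; the function name and back-off policy are illustrative, and a production version would probably also resume partial downloads via Range headers:

```python
import time
import requests  # assumed available; urllib.request would also work

def fetch(url, dest, timeout=30, retries=10):
    """Download url to dest, retrying when the connection cuts out."""
    for attempt in range(retries):
        try:
            with requests.get(url, stream=True, timeout=timeout) as r:
                r.raise_for_status()
                with open(dest, "wb") as f:
                    for chunk in r.iter_content(chunk_size=1 << 16):
                        f.write(chunk)
            return
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off before the next try
    raise RuntimeError(f"gave up on {url} after {retries} attempts")
```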
The easiest implementation of this approach would be to just feed every entry to bookwyrm so that it can do all the heavy lifting (which takes no more than a few seconds on an SSD). The complexity of the libgen plugin will then lie in preparation: downloading the databases, unarchiving them, and converting them to sqlite databases.
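Sketched out, the query step then reduces to streaming rows into bookwyrm. The table and column names below are illustrative, and `feed` stands in for whatever entry point bookwyrm actually exposes to plugins:

```python
import sqlite3

def feed_all(db_path, feed):
    """Stream every row of the converted database to bookwyrm."""
    conn = sqlite3.connect(db_path)
    # "updated" and the column names here are assumptions about the
    # libgen schema, not a verified layout.
    cur = conn.execute("SELECT Title, Author, Year, Extension FROM updated")
    for title, author, year, ext in cur:
        feed({"title": title, "authors": [author],
              "year": year, "extension": ext})
    conn.close()
```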
This enables us to raise any exceptions to bookwyrm, instead of dumping them to std{out,err}. However, an exception will only be raised after the remaining threads have finished. Related to #88.
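The mechanism is roughly the following: worker threads record their exceptions instead of printing them, and the first one is re-raised once every thread has joined. A sketch with hypothetical names, not the plugin's actual code:

```python
import threading

def run_workers(targets):
    """Run worker functions, deferring any exception until all threads
    have finished, then re-raise the first one for bookwyrm to report."""
    errors = []

    def wrap(fn):
        try:
            fn()
        except Exception as exc:  # recorded here, surfaced below
            errors.append(exc)

    threads = [threading.Thread(target=wrap, args=(fn,)) for fn in targets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for the remaining threads first
    if errors:
        raise errors[0]  # raised to bookwyrm, not dumped to std{out,err}
```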
If no non-empty fields exist, `series` is `None`. Related to #88.
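In other words (an illustrative helper, not the plugin's actual code):

```python
def extract_series(*fields):
    """Return the first non-empty field, or None if all are empty."""
    return next((f for f in fields if f), None)
```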
Local databases are now queried. The bottleneck is not the disk but feeding the items to bookwyrm. This can likely be sped up by spawning some feeder threads (see the sketch below), but the current implementation is sufficient for now.
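A sketch of what those feeder threads might look like, using a bounded queue between the query results and bookwyrm; all names are hypothetical:

```python
import queue
import threading

def feed_parallel(items, feed, workers=4):
    """Drain a queue of query results with several feeder threads
    instead of feeding bookwyrm from a single loop."""
    q = queue.Queue(maxsize=1024)

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: no more items
                break
            feed(item)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for item in items:
        q.put(item)
    for _ in threads:
        q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
```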
List of things to do/consider before we can close this issue:
The JSON API can apparently be used to apply future updates to the database, but we'll tackle that later. For now, the biggest question is how we should convert the dumps to sqlite3 statements. Do we need all the replacements done in the awk script, or only a subset? The best outcome is if we can do everything in pure Python. In either case, the whole awk script can probably be converted to Python via …
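To make the question concrete, here is what a pure-Python subset of those replacements might look like. Only a handful of the awk script's rules are shown, and whether this subset suffices for the libgen dumps is exactly what needs checking:

```python
import re

def mysql_to_sqlite_line(line):
    """A toy subset of mysql2sqlite's rewrites, in pure Python."""
    line = line.replace("`", '"')                      # identifier quoting
    line = re.sub(r"\bAUTO_INCREMENT=?\d*", "", line)  # sqlite manages rowids
    line = re.sub(r"\bUNSIGNED\b", "", line, flags=re.I)
    line = re.sub(r"ENGINE\s*=\s*\w+", "", line)       # MySQL table options
    line = line.replace(r"\'", "''")                   # escaped single quotes
    return line
```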
For the time being, this behavior should be wrapped into a …
The HTML from http://libgen.io/foreignfiction/index.php is not parsed correctly. While the page renders correctly, the HTML cannot be fed directly into BeautifulSoup because some tags are in places they shouldn't be. An alternative interface (currently in some beta phase, but much easier to parse) is available at http://gen.lib.rus.ec/fiction/. I expect more changes to this interface in the coming months, so parsing it correctly is likely a moving target. It would probably be a good idea to ask the devs if there are plans to expand the JSON API (see #85), or to check how the desktop application gets its data.
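For what it's worth, the beta interface can already be scraped along these lines; the query parameter and the table selector are guesses that will need revisiting as the markup changes:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative only: the fiction interface is in beta, so both the
# "q" parameter and the row structure assumed below may change.
resp = requests.get("http://gen.lib.rus.ec/fiction/",
                    params={"q": "tolkien"}, timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")
for row in soup.select("table tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    print(cells)
```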