Making a scraper
Create a folder in distros with the following structure:
distro_name
├── info.json
├── logo.png
└── scraper.py
If distro_name starts with an underscore (e.g. _disabled), it will not be counted.
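The underscore rule can be pictured with a minimal sketch. The function name enabled_distros and the way the folder is scanned are assumptions made for illustration, not the project's actual loader:

```python
from pathlib import Path


def enabled_distros(distros_dir):
    """Return the sorted names of distro folders that are not disabled.

    Hypothetical helper: a folder counts as disabled when its name
    starts with an underscore, as described above.
    """
    root = Path(distros_dir)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and not entry.name.startswith("_")
    )
```

Called on a distros folder containing arch, ubuntu and _disabled, this sketch would list only arch and ubuntu.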
Let's take a look at each file.
info.json contains the distro name and a link to its official website. Example info.json for Arch Linux:
{
    "name": "Arch Linux",
    "url": "https://archlinux.org"
}
Fallback values are used if info.json is missing or a value is not provided. For Arch Linux, the fallback values are:
{
    "name": "arch",
    "url": "https://distrowatch.com/table.php?distribution=arch"
}
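The fallback behaviour can be sketched roughly as follows. This is an illustration only: the helper names resolve_info and load_info are hypothetical, and the sketch assumes that both fallback values are derived from the folder name, which is inferred from the Arch Linux example rather than confirmed by the project code.

```python
import json
from pathlib import Path

# Assumed fallback URL pattern, inferred from the Arch Linux example.
FALLBACK_URL = "https://distrowatch.com/table.php?distribution={}"


def resolve_info(folder_name, info=None):
    """Merge parsed info.json data (possibly missing) with fallback values."""
    info = info or {}
    return {
        "name": info.get("name") or folder_name,
        "url": info.get("url") or FALLBACK_URL.format(folder_name),
    }


def load_info(distro_dir):
    """Read info.json from a distro folder, tolerating a missing file."""
    distro_dir = Path(distro_dir)
    info_path = distro_dir / "info.json"
    info = None
    if info_path.is_file():
        info = json.loads(info_path.read_text(encoding="utf-8"))
    return resolve_info(distro_dir.name, info)
```

With this sketch, resolve_info("arch") yields the fallback dictionary, while any key present in info.json takes precedence over its fallback.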
logo.png should be 128x128 px with a transparent background.
If logo.png is missing, a fallback logo is used.
scraper.py can be written however you like, as long as it returns the expected values.
It must return a list of tuples; each tuple contains iso_url, iso_arch, iso_size, and iso_version, in that order.
The Arch Linux scraper returns the following values:
[
    (
        'https://mirror.yandex.ru/archlinux/iso/2021.05.01/archlinux-2021.05.01-x86_64.iso',
        'x86_64',
        792014848,
        '2021.05.01'
    ),
    (
        'https://mirror.yandex.ru/archlinux/iso/2021.06.01/archlinux-2021.06.01-x86_64.iso',
        'x86_64',
        811937792,
        '2021.06.01'
    ),
    (
        'https://mirror.yandex.ru/archlinux/iso/2021.07.01/archlinux-2021.07.01-x86_64.iso',
        'x86_64',
        817180672,
        '2021.07.01'
    ),
    (
        'https://mirror.yandex.ru/archlinux/iso/archboot/2020.07/archlinux-2020.07-1-archboot-network.iso',
        'x86_64',
        516947968,
        '2020.07'
    ),
    (
        'https://mirror.yandex.ru/archlinux/iso/archboot/2020.07/archlinux-2020.07-1-archboot.iso',
        'x86_64',
        1280491520,
        '2020.07'
    )
]
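When writing a new scraper, it can help to check the shape of the returned data before submitting it. The following is a minimal sketch of such a check; validate_values is a hypothetical helper, not part of the project, and the type rules (strings for URL, arch and version, a non-negative integer for size) are assumptions based on the example output.

```python
def validate_values(values):
    """Raise ValueError unless values is a list of
    (iso_url, iso_arch, iso_size, iso_version) tuples."""
    if not isinstance(values, list):
        raise ValueError("scraper must return a list")
    for index, item in enumerate(values):
        if not isinstance(item, tuple) or len(item) != 4:
            raise ValueError(f"item {index} is not a 4-tuple")
        iso_url, iso_arch, iso_size, iso_version = item
        # The three text fields are assumed to be non-empty strings.
        for label, field in (("iso_url", iso_url),
                             ("iso_arch", iso_arch),
                             ("iso_version", iso_version)):
            if not isinstance(field, str) or not field:
                raise ValueError(f"item {index}: {label} must be a non-empty string")
        # The size is assumed to be a byte count; bool is rejected explicitly
        # because it is a subclass of int.
        if isinstance(iso_size, bool) or not isinstance(iso_size, int) or iso_size < 0:
            raise ValueError(f"item {index}: iso_size must be a non-negative integer")
    return values
```

A scraper under development could end with return validate_values(values) to catch a misordered or incomplete tuple early.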
A scraper has the line
from main import *  # noqa
at the top, which imports the following names into its namespace:
- get
- json
- re
- requests
- rq (custom requests class with some tweaks)
Some examples of scrapers with explanations:
from main import *  # noqa


def init():
    values = []  # start with an empty list
    exceptions = ['arch/', 'latest/', 'archlinux-x86_64']  # skip ISO URLs that contain any of these substrings
    regexp_version = re.compile(r'-(\d+\.\d+(\.\d+)?)')  # regexp that extracts the version from an ISO URL
    url_base = 'https://mirror.yandex.ru/archlinux/iso/'  # base URL from which links are parsed
    for iso_url in get.urls(url_base, exclude=exceptions,
                            recurse=True):  # recursively search the base URL for ISO URLs
        iso_arch = get.arch(iso_url)  # detect the architecture (you may hard-code a string if there is
        # a single ISO URL or all ISO URLs share the same arch)
        iso_size = get.size(iso_url)  # get the ISO size in bytes
        iso_version = re.search(regexp_version, iso_url).group(1)  # extract the version from the ISO URL
        values.append((iso_url, iso_arch, iso_size, iso_version))
    return values
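The version-extraction step can be tried on its own, without the project's get helper. The sketch below uses only the standard re module; the pattern escapes the dots so that they match literal periods only, and extract_version is a hypothetical helper that returns None instead of failing when a URL carries no version (re.search returns None in that case, so calling .group(1) directly would raise AttributeError). The example.org URL is made up for illustration.

```python
import re

# A hyphen followed by a dotted version such as 2021.05.01 or 2020.07.
regexp_version = re.compile(r'-(\d+\.\d+(\.\d+)?)')


def extract_version(iso_url):
    """Return the version found in iso_url, or None if there is none."""
    match = regexp_version.search(iso_url)
    return match.group(1) if match else None


urls = [
    'https://mirror.yandex.ru/archlinux/iso/2021.05.01/archlinux-2021.05.01-x86_64.iso',
    'https://mirror.yandex.ru/archlinux/iso/archboot/2020.07/archlinux-2020.07-1-archboot-network.iso',
    'https://example.org/distro-latest.iso',  # hypothetical URL without a version
]
versions = [extract_version(url) for url in urls]
# versions == ['2021.05.01', '2020.07', None]
```

Whether a scraper should skip such URLs or raise an error is a design choice left to the scraper author.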