Making a scraper

Evgeny edited this page May 30, 2023 · 6 revisions

Create a folder in distros with the following structure:

distro_name
├── info.json
├── logo.png
└── scraper.py

If distro_name starts with an underscore (e.g. _disabled), the folder is skipped.
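The underscore rule can be illustrated with a minimal sketch. The function name list_distros and the way the folder is walked are assumptions made for illustration; the project's own discovery code may differ.

```python
from pathlib import Path


def list_distros(distros_dir):
    """Return distro folder names, skipping any that start with an underscore.

    Hypothetical helper: only the underscore rule comes from this page.
    """
    return sorted(
        entry.name
        for entry in Path(distros_dir).iterdir()
        if entry.is_dir() and not entry.name.startswith("_")
    )
```

With folders arch, debian and _disabled inside distros, this would return only arch and debian.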

Let's take a look at each file.

info.json

info.json contains the distro name and a link to its official website. Example info.json for Arch Linux:

{
    "name": "Arch Linux",
    "url": "https://archlinux.org"
}

Fallback values are used if info.json is missing or a value is not provided. For Arch Linux (folder name arch), the fallback values are:

{
    "name": "arch",
    "url": "https://distrowatch.com/table.php?distribution=arch"
}
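The fallback behaviour described above might be sketched as follows. The function name load_info is hypothetical, and building the fallback from the folder name is an assumption inferred from the Arch Linux example, not a description of the project's actual code.

```python
import json
from pathlib import Path


def load_info(distro_dir):
    """Read info.json from a distro folder, filling gaps with fallback values."""
    distro_dir = Path(distro_dir)
    distro_id = distro_dir.name  # assumption: the folder name drives the fallbacks

    info = {
        "name": distro_id,
        "url": f"https://distrowatch.com/table.php?distribution={distro_id}",
    }

    try:
        text = (distro_dir / "info.json").read_text(encoding="utf-8")
    except FileNotFoundError:
        return info  # info.json is missing: use fallbacks only

    loaded = json.loads(text)
    for key in info:
        if loaded.get(key):  # keep the fallback when a value is absent or empty
            info[key] = loaded[key]
    return info
```

For a folder named arch without info.json, this returns the same fallback values as shown above.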

logo.png

The logo should be 128x128 px with a transparent background. Example logo.png for Arch Linux:


[Image: Arch Linux logo]


If logo.png is missing, the fallback logo will be used:


[Image: DriveDroid logo]
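To check a logo against the size requirement before committing it, a small standard-library sketch could read the PNG header directly. The function name check_logo_header is hypothetical and not part of the project. The alpha check only looks at the colour type in the IHDR chunk, so it cannot tell whether the background is actually transparent.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_logo_header(data, expected=(128, 128)):
    """Return a list of problems found in the first 26 bytes of a PNG file."""
    if len(data) < 26 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return ["not a PNG file"]

    # IHDR layout: width (4 bytes), height (4 bytes), bit depth (1), colour type (1)
    width, height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])

    problems = []
    if (width, height) != expected:
        problems.append(
            f"size is {width}x{height}, expected {expected[0]}x{expected[1]}"
        )
    # Colour types 4 and 6 carry an alpha channel. Palette images may still be
    # transparent through a tRNS chunk, which this sketch does not inspect.
    if color_type not in (4, 6):
        problems.append("no alpha channel in the IHDR colour type")
    return problems
```

A typical call would be check_logo_header(open('logo.png', 'rb').read(26)); an empty list means the header passed both checks.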


scraper.py

A scraper can be written however you like, as long as it returns the expected values.

It must return a list of tuples. Each tuple contains iso_url, iso_arch, iso_size and iso_version, in that order.

The Arch Linux scraper returns the following values:

[
  (
    'https://mirror.yandex.ru/archlinux/iso/2021.05.01/archlinux-2021.05.01-x86_64.iso',
    'x86_64',
    792014848,
    '2021.05.01'
  ),
  (
    'https://mirror.yandex.ru/archlinux/iso/2021.06.01/archlinux-2021.06.01-x86_64.iso',
    'x86_64',
    811937792,
    '2021.06.01'
  ),
  (
    'https://mirror.yandex.ru/archlinux/iso/2021.07.01/archlinux-2021.07.01-x86_64.iso',
    'x86_64',
    817180672,
    '2021.07.01'
  ),
  (
    'https://mirror.yandex.ru/archlinux/iso/archboot/2020.07/archlinux-2020.07-1-archboot-network.iso',
    'x86_64',
    516947968,
    '2020.07'
  ),
  (
    'https://mirror.yandex.ru/archlinux/iso/archboot/2020.07/archlinux-2020.07-1-archboot.iso',
    'x86_64',
    1280491520,
    '2020.07'
  )
]
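While developing a scraper, it can help to check the shape of its return value. The following is a minimal sketch; validate_values is a hypothetical helper, and the expected types (str, str, int, str) are inferred from the Arch Linux example above rather than from a stated rule.

```python
def validate_values(values):
    """Raise ValueError unless values is a list of
    (iso_url, iso_arch, iso_size, iso_version) tuples."""
    if not isinstance(values, list):
        raise ValueError("scraper must return a list")

    expected_types = (str, str, int, str)  # assumption based on the example output
    names = ("iso_url", "iso_arch", "iso_size", "iso_version")

    for index, item in enumerate(values):
        if not isinstance(item, tuple) or len(item) != 4:
            raise ValueError(f"item {index} is not a tuple of 4 values")
        for name, value, expected in zip(names, item, expected_types):
            if not isinstance(value, expected):
                raise ValueError(
                    f"item {index}: {name} should be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
    return values
```

Calling validate_values(init()) at the end of a local test run would then fail early if, for example, the size and version are swapped.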

A scraper starts with the line from main import *  # noqa, which imports the following names into the namespace:

  • get
  • json
  • re
  • requests
  • rq (custom requests class with some tweaks)

Some examples of scrapers with explanations:

from main import *  # noqa


def init():

    values = []  # start with an empty list
    exceptions = ['arch/', 'latest/', 'archlinux-x86_64']  # skip iso urls that contain any of these values
    regexp_version = re.compile(r'-(\d+\.\d+(\.\d+)?)')  # regexp to extract the version from an iso url
    url_base = 'https://mirror.yandex.ru/archlinux/iso/'  # base url from which links are parsed

    for iso_url in get.urls(url_base, exclude=exceptions,
                            recurse=True):  # recursively search for iso urls under the base url
        iso_arch = get.arch(iso_url)  # detect the architecture (you may enter the string manually if there is
                                      # a single iso url or all iso urls share the same arch)
        iso_size = get.size(iso_url)  # get the iso size in bytes
        iso_version = regexp_version.search(iso_url).group(1)  # extract the version from the iso url
        
        values.append((iso_url, iso_arch, iso_size, iso_version))

    return values
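
The version regexp can be tried on its own, outside of a scraper. The sketch below assumes the dots in the pattern are meant as literal dots (written as \.), which is what yields 2020.07 rather than 2020.07-1 for the archboot URLs listed earlier. The helper extract_version is hypothetical. It returns None when nothing matches, whereas calling .group(1) directly on a failed search raises AttributeError.

```python
import re

# A hyphen followed by two or three dot-separated groups of digits
regexp_version = re.compile(r'-(\d+\.\d+(\.\d+)?)')


def extract_version(iso_url):
    """Return the version found in iso_url, or None if there is none."""
    match = regexp_version.search(iso_url)
    return match.group(1) if match else None


urls = [
    'https://mirror.yandex.ru/archlinux/iso/2021.05.01/archlinux-2021.05.01-x86_64.iso',
    'https://mirror.yandex.ru/archlinux/iso/archboot/2020.07/archlinux-2020.07-1-archboot.iso',
]
versions = [extract_version(url) for url in urls]
print(versions)  # ['2021.05.01', '2020.07']
```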