Is there a documentation for Extractor class? #1656

Ailothaen · 2021-06-29T17:12:50Z

Related to this issue: #1443

I want to start writing an extractor, but I feel a bit confused by how I should do it. I looked at some extractors and at the base Extractor class, and I do not really see how I could adapt it to my use case.

Is there any documentation for the Extractor class that explains the role of each attribute and method, and how they are meant to be used?

mikf (Maintainer) · 2021-06-29T18:58:28Z

There is no documentation for that, but I should be able to answer your questions and give a quick overview of how things are supposed to be used:

Important methods are items() and request().

items() should yield the results: yield Message.Directory, metadata sets the target directory, and yield Message.Url, url, metadata causes url to be downloaded.
request() is used for HTTP requests. It works more or less like requests' Session.request(): to fetch a JSON resource, for example, you'd do something like self.request(url, params=params, headers=headers).json()

Important attributes are category, subcategory, directory_fmt, filename_fmt, archive_fmt, and pattern:

category in your case should be "wikipedia" or "wikimedia"
subcategory should describe in one word what this extractor class is capable of handling, e.g. "article"
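*_fmt (directory_fmt, filename_fmt, archive_fmt) are the default format strings for the target directory, the filename, and the archive entry
pattern is a regular expression that should match all URLs the extractor can handle. The resulting match object is the first real argument of an extractor's __init__()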
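
The pieces listed above fit together roughly as in the following sketch. It is self-contained and does not import gallery-dl: Message here is a stand-in with placeholder values, the class does not inherit from the real Extractor, and the Wikipedia URL layout, the file list, the upload.example.org host and the metadata keys are all made up for illustration. The run() helper is hypothetical as well; it only mimics the idea that the pattern is matched first and the match object is then passed to __init__().

```python
import re


class Message:
    """Stand-in for gallery-dl's Message constants (placeholder values)."""
    Directory = 2
    Url = 3


class WikipediaArticleExtractor:
    """Hypothetical extractor; a real one would subclass gallery-dl's Extractor."""
    category = "wikipedia"
    subcategory = "article"
    # Default format strings (the metadata keys are made up).
    directory_fmt = ("{category}", "{title}")
    filename_fmt = "{filename}.{extension}"
    archive_fmt = "{title}_{filename}"
    # Must match every URL this extractor can handle.
    pattern = r"(?:https?://)?en\.wikipedia\.org/wiki/([^/?#]+)"

    def __init__(self, match):
        # 'match' is the re.Match object produced by 'pattern'.
        self.title = match.group(1)

    def items(self):
        # First set the target directory, then emit one message per file.
        yield Message.Directory, {"title": self.title}
        for name in ("a.jpg", "b.png"):  # made-up file list
            url = "https://upload.example.org/" + name
            stem, _, ext = name.rpartition(".")
            metadata = {"title": self.title, "filename": stem, "extension": ext}
            yield Message.Url, url, metadata


def run(url):
    """Match 'url' against the pattern and collect the extractor's messages."""
    match = re.match(WikipediaArticleExtractor.pattern, url)
    if match is None:
        return []
    return list(WikipediaArticleExtractor(match).items())


messages = run("https://en.wikipedia.org/wiki/Python")
```

In a real extractor, the file list would come from self.request() calls rather than a hard-coded tuple.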

Ailothaen (Author) · 2021-09-08T13:33:31Z

Hello,

I wanted to get back to implementing a Wikipedia extractor (for articles first, at least), and I tried again to look at some extractors and the main Extractor object to understand their base structure and how they work.
However, I still cannot wrap my head around most of the stuff that I see. The most confusing part, I would say, is how the process is supposed to go: what triggers the extractor, which class is called first, and what do we have to pass in apart from the URL? I could understand the parts involving yield and the Message object, but for the rest I am quite lost.

I think it would be really great if you could write documentation on how to write an extractor, starting from base code and going step by step: maybe starting with a simple example (direct URL?), then moving on to more complex examples (using config variables, tokens, cookies, involving postprocessors...). The goal would not really be to explain "how to code" (we know how to Python, after all), but rather to give an overview of the "API" offered by gallery-dl.

Thank you!

rachmadaniHaryono · 2021-11-22T12:19:40Z

task1

The extractor should return https://example.com/p1/main.jpg when given a URL under https://example.com/p2/

Create the extractor module; in this case we will create example.py in gallery_dl/extractor.
Add your module name to gallery_dl.extractor.__init__.modules; in this case, example (from example.py).

modules = [
  # ... existing module names ...
  'example',
  # ... existing module names ...
]

Create the extractor class

From here there are two solutions. Solution 1 is the simplest possible extractor.
Solution 2 is the one chosen here because later tasks build on it.
Your extractor's name will be checked by test/test_extractor.py:TestExtractorModule

3.1 solution 1

from .common import Extractor, Message

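class ExampleExtractor(Extractor):
  category = 'example'
  # raw string with escaped dots, so '.' only matches a literal dot
  pattern = r'https://example\.com/p2/'

  def items(self):
    # Message.Url tells gallery-dl to download the given URL
    yield Message.Url, 'https://example.com/p1/main.jpg', {}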

3.2 solution 2

from .common import Extractor, Message

BASE_PATTERN = r"(?:https?://)?example\.com"

class ExampleExtractor(Extractor):
  category = 'example'

class ExampleP2Extractor(ExampleExtractor):
  subcategory = 'p2'
  pattern = BASE_PATTERN + '/p2/'

  def items(self):
    yield Message.Url, 'https://example.com/p1/main.jpg', {}
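
To see which URLs solution 2 would claim, the pattern can be tried with Python's re module alone. This is only a sketch: it assumes that an extractor's pattern is applied with a match anchored at the start of the URL, which is an assumption about gallery-dl's internals rather than something stated in this thread. The handled() helper and the sample URLs are made up.

```python
import re

BASE_PATTERN = r"(?:https?://)?example\.com"
P2_PATTERN = re.compile(BASE_PATTERN + r"/p2/")


def handled(url):
    """Return True if the p2 pattern matches at the start of 'url'."""
    return P2_PATTERN.match(url) is not None


samples = [
    "https://example.com/p2/1.html",   # scheme present
    "example.com/p2/",                 # scheme is optional in BASE_PATTERN
    "https://example.com/p3/1.json",   # different path, not handled
    "https://exampleXcom/p2/",         # the escaped dot must be a literal '.'
]
results = {url: handled(url) for url in samples}
```

Only the first two sample URLs end up marked as handled.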

test your extractor

TODO how to use gallery-dl extractor test

You can use gallery-dl's extractor tests,
or, if you use gallery-dl from your own application, its DataJob.

example with DataJob

>>> from gallery_dl.job import DataJob
... job = DataJob(url='https://example.com/p2/1.html')
... job.run()
... list(job.data)
[
  [
    3,
    "https://example.com/p1/main.jpg",
    {
      "category": "example",
      "subcategory": "p2"
    }
  ]
]
[(3, 'https://example.com/p1/main.jpg', {'category': 'example', 'subcategory': 'p2'})]
>>> import pprint
... pprint.pprint(vars(job))
{'_logger_extra': {'extractor': <gallery_dl.extractor.example.ExampleP2Extractor object at 0x7fdba4afcfa0>,
                   'job': <gallery_dl.job.DataJob object at 0x7fdba428e0d0>,
                   'keywords': <gallery_dl.output.KwdictProxy object at 0x7fdb9e04ed00>,
                   'path': <gallery_dl.output.PathfmtProxy object at 0x7fdb9e04ee20>},
 'ascii': True,
 'data': [(3,
           'https://example.com/p1/main.jpg',
           {'category': 'example', 'subcategory': 'p2'})],
 'extractor': <gallery_dl.extractor.example.ExampleP2Extractor object at 0x7fdba4afcfa0>,
 'file': <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>,
 'filter': <function filter_dict at 0x7fdb9e6c83a0>,
 'kwdict': {},
 'pathfmt': None,
 'pred_queue': <function build_predicate.<locals>.<lambda> at 0x7fdb9e03eca0>,
 'pred_url': <function build_predicate.<locals>.<lambda> at 0x7fdb9e03ec10>,
 'status': 0,
 'url_key': None}

task2

The extractor should return multiple URLs when given a URL under https://example.com/p3/

For this task, https://example.com/p3/1.json returns the following JSON data:

{
  "url": ["https://example.com/p3/direct_link1.jpg", "https://example.com/p3/direct_link2.jpg"],
  "youtube": ["https://www.youtube.com/watch?v=wf-BqAjZb8M"]
}

create extractor class

...
class ExampleP3Extractor(ExampleExtractor):
  subcategory = 'p3'
  pattern = BASE_PATTERN + r'/p3/[^.]+\.json'

  def items(self):
    # TODO
    pass

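Write the items method

import requests  # only needed for the type annotation below
...
class ExampleP3Extractor(ExampleExtractor):
  ...
  def items(self):
    # self.url is the URL that matched the pattern
    resp: requests.Response = self.request(self.url)
    json_data = resp.json()
    # direct links are downloaded as files
    for url in json_data.get('url', []):
      yield Message.Url, url, {}
    # other links are queued for another extractor to handle
    for url in json_data.get('youtube', []):
      yield Message.Queue, url, {}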
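
The message-building part of items() can be tried without any network access. The sketch below moves that logic into a standalone function, build_messages (a hypothetical helper, not part of gallery-dl), and feeds it the sample JSON from this task. The constants 3 and 6 are the values that appear for Message.Url and Message.Queue in the DataJob output in this post.

```python
import json

# Values of Message.Url / Message.Queue as seen in the DataJob output.
URL, QUEUE = 3, 6

SAMPLE = """
{
  "url": ["https://example.com/p3/direct_link1.jpg",
          "https://example.com/p3/direct_link2.jpg"],
  "youtube": ["https://www.youtube.com/watch?v=wf-BqAjZb8M"]
}
"""


def build_messages(json_data):
    """Yield download messages for direct links and queue messages for the rest."""
    for url in json_data.get("url", []):
        yield URL, url, {}
    for url in json_data.get("youtube", []):
        # Queued URLs are handed off for another extractor to process.
        yield QUEUE, url, {}


messages = list(build_messages(json.loads(SAMPLE)))
```

Keeping the parsing separate from the HTTP call like this is one way to check the logic before wiring it into a real extractor.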

test your extractor

DataJob example

>>> from gallery_dl.job import DataJob
... job = DataJob(url='https://example.com/p3/1.json')
... job.run()
... pprint.pprint(job.data)
[
  [
    3,
    "https://example.com/p3/direct_link1.jpg",
    {
      "category": "example",
      "subcategory": "p3"
    }
  ],
  [
    3,
    "https://example.com/p3/direct_link2.jpg",
    {
      "category": "example",
      "subcategory": "p3"
    }
  ],
  [
    6,
    "https://www.youtube.com/watch?v=wf-BqAjZb8M",
    {}
  ]
]
[(3,
  'https://example.com/p3/direct_link1.jpg',
  {'category': 'example', 'subcategory': 'p3'}),
 (3,
  'https://example.com/p3/direct_link2.jpg',
  {'category': 'example', 'subcategory': 'p3'}),
 (6, 'https://www.youtube.com/watch?v=wf-BqAjZb8M', {})]
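
The tuples in job.data are easier to read once the numeric message type is translated back to a name. A small sketch, assuming only what the output above shows (3 marks a Message.Url entry, 6 a Message.Queue entry); describe is a hypothetical helper, and the data list is copied from the output above rather than produced by a live DataJob.

```python
NAMES = {3: "Url", 6: "Queue"}  # message types seen in the output above

data = [
    (3, "https://example.com/p3/direct_link1.jpg",
     {"category": "example", "subcategory": "p3"}),
    (3, "https://example.com/p3/direct_link2.jpg",
     {"category": "example", "subcategory": "p3"}),
    (6, "https://www.youtube.com/watch?v=wf-BqAjZb8M", {}),
]


def describe(entries):
    """Group URLs by message type name; unknown types keep their number."""
    grouped = {}
    for msg_type, url, _metadata in entries:
        name = NAMES.get(msg_type, str(msg_type))
        grouped.setdefault(name, []).append(url)
    return grouped


grouped = describe(data)
```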

Note

from #1656 (comment)

I think it would be really great if you could write documentation on how to write an extractor,
starting from base code and going step by step: maybe starting with a simple example (direct URL?),
then moving on to more complex examples (using config variables, tokens, cookies, involving postprocessors...).
The goal would not really be to explain "how to code" (we know how to Python, after all),
but rather to give an overview of the "API" offered by gallery-dl.

so basically

direct url
self.request use
config variable
token
cookies
postprocessors related

I don't know how to write a task for config variables, but here is my use case.

example

{ "extractor": { "example": { "key": "value" } } }

...
class ExampleP4Extractor(ExampleExtractor):
  subcategory = 'p4'
  pattern = BASE_PATTERN + '/p4/'

  def __init__(self, match):
    super().__init__(match)
    self.my_var = self.config("key", 'default_value')

  def items(self):
    if self.my_var == 'default_value':
      pass  # do something
    elif self.my_var == 'value':
      pass  # do something
    else:
      pass  # do something
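
How a value such as "key" might be resolved from that nested config can be modelled in a few lines. This is a simplified stand-in, not gallery-dl's actual lookup code: it assumes the config is searched along extractor, then category, then subcategory, and that the most specific level containing the key wins. The lookup function and the "other" category are hypothetical.

```python
CONFIG = {"extractor": {"example": {"key": "value"}}}


def lookup(conf, path, key, default=None):
    """Walk 'path' through nested dicts; deeper levels override shallower ones."""
    result = default
    node = conf
    for part in path:
        node = node.get(part)
        if not isinstance(node, dict):
            break  # this level does not exist, keep what was found so far
        if key in node:
            result = node[key]
    return result


# Found under extractor.example, even though there is no "p4" section.
my_var = lookup(CONFIG, ("extractor", "example", "p4"), "key", "default_value")
# No matching category section, so the default is returned.
missing = lookup(CONFIG, ("extractor", "other", "p4"), "key", "default_value")
```

Under these assumptions, the extractor above would see my_var == 'value' with the example config and 'default_value' without it.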

I don't know enough about the other items (tokens, cookies, etc.) to write documentation for them.

I also use gallery-dl only as an extractor, so I may be missing something on the downloader side, for example metadata.

I also don't know how to add tests for an extractor using gallery-dl's current test setup.

Unrelated, but there is another discussion here:

#1822

Maybe close it or redirect it to this discussion.
