Replies: 3 comments
-
There is no documentation for that, but I should be able to answer your questions and give a quick overview of how things are supposed to be used:
Important methods are
Important attributes are
-
Hello, I wanted to get back to trying to implement a Wikipedia extractor (for articles first, at least), and I tried again to look at some extractors + the main

I think it would be really great if you could write documentation on how to write an extractor, starting from a base code and going step by step: maybe starting with a simple example (direct URL?), then moving on to slightly more complex examples (using config variables, tokens, cookies, involving postprocessors...). The goal would not really be to explain "how to code" (we know how to Python, after all), but rather to give an overview of the "API" offered by gallery-dl.

Thank you!
-
Task 1: the extractor should return https://example.com/p1/main.jpg when given a URL under https://example.com/p2/

Add the new module name to the modules list (in gallery_dl/extractor/__init__.py):

modules = [
    ...
    'example',
    ...
]
From here there are two solutions.

3.1 Solution 1

from .common import Extractor, Message

class ExampleExtractor(Extractor):
    category = 'example'
    pattern = r'https://example\.com/p2/'

    def items(self):
        yield Message.Url, 'https://example.com/p1/main.jpg', {}

3.2 Solution 2

from .common import Extractor, Message

BASE_PATTERN = r"(?:https?://)?example\.com"

class ExampleExtractor(Extractor):
    category = 'example'

class ExampleP2Extractor(ExampleExtractor):
    subcategory = 'p2'
    pattern = BASE_PATTERN + '/p2/'

    def items(self):
        yield Message.Url, 'https://example.com/p1/main.jpg', {}
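To see which URLs a `pattern` like the one in solution 2 accepts, it can be tried on its own with Python's `re` module. This is only a sketch: it assumes gallery-dl picks an extractor by applying `pattern` with `re.match`-style semantics (anchored at the start of the URL), and the helper name `matches` and the extra test URLs are made up for this example.

```python
import re

BASE_PATTERN = r"(?:https?://)?example\.com"
P2_PATTERN = BASE_PATTERN + "/p2/"


def matches(pattern, url):
    """Hypothetical helper: True if the URL matches from its start."""
    return re.match(pattern, url) is not None


# URLs under /p2/ are accepted, with or without a scheme.
print(matches(P2_PATTERN, "https://example.com/p2/1.html"))
print(matches(P2_PATTERN, "example.com/p2/1.html"))

# URLs under another path are rejected.
print(matches(P2_PATTERN, "https://example.com/p1/main.jpg"))

# An unescaped dot matches any character, so a pattern written without
# backslashes also accepts a different host name.
LOOSE_PATTERN = "https://example.com/p2/"
print(matches(LOOSE_PATTERN, "https://exampleXcom/p2/"))
print(matches(P2_PATTERN, "https://exampleXcom/p2/"))
```

Escaping the dots and sharing one `BASE_PATTERN` keeps every subcategory pattern consistent, which is the main practical difference between the two solutions.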
TODO: how to use the gallery-dl extractor tests

You can test the extractor with a DataJob, for example:

>>> from gallery_dl.job import DataJob
... job = DataJob(url='https://example.com/p2/1.html')
... job.run()
... list(job.data)
[
[
3,
"https://example.com/p1/main.jpg",
{
"category": "example",
"subcategory": "p2"
}
]
]
[(3, 'https://example.com/p1/main.jpg', {'category': 'example', 'subcategory': 'p2'})]
>>> import pprint
... pprint.pprint(vars(job))
{'_logger_extra': {'extractor': <gallery_dl.extractor.example.ExampleP2Extractor object at 0x7fdba4afcfa0>,
'job': <gallery_dl.job.DataJob object at 0x7fdba428e0d0>,
'keywords': <gallery_dl.output.KwdictProxy object at 0x7fdb9e04ed00>,
'path': <gallery_dl.output.PathfmtProxy object at 0x7fdb9e04ee20>},
'ascii': True,
'data': [(3,
'https://example.com/p1/main.jpg',
{'category': 'example', 'subcategory': 'p2'})],
'extractor': <gallery_dl.extractor.example.ExampleP2Extractor object at 0x7fdba4afcfa0>,
'file': <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>,
'filter': <function filter_dict at 0x7fdb9e6c83a0>,
'kwdict': {},
'pathfmt': None,
'pred_queue': <function build_predicate.<locals>.<lambda> at 0x7fdb9e03eca0>,
'pred_url': <function build_predicate.<locals>.<lambda> at 0x7fdb9e03ec10>,
'status': 0,
'url_key': None}

Task 2: the extractor should return multiple URLs when given a URL under https://example.com/p3/

For this task, https://example.com/p3/1.json returns the following JSON data:

{
    "url": ["https://example.com/p3/direct_link1.jpg", "https://example.com/p3/direct_link2.jpg"],
    "youtube": ["https://www.youtube.com/watch?v=wf-BqAjZb8M"]
}
...
class ExampleP3Extractor(ExampleExtractor):
    subcategory = 'p3'
    pattern = BASE_PATTERN + r'/p3/[^.]+\.json'

    def items(self):
        # TODO
        pass

import requests
...
class ExampleP3Extractor(ExampleExtractor):
    ...
    def items(self):
        # Fetch the JSON document the URL points to
        resp: requests.Response = self.request(self.url)
        json_data = resp.json()
        # Direct links are yielded as Url messages
        for url in json_data.get('url', []):
            yield Message.Url, url, {}
        # Links that need another extractor are yielded as Queue messages
        for url in json_data.get('youtube', []):
            yield Message.Queue, url, {}
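The mapping from the JSON response to messages can also be tried without network access or gallery-dl itself. The sketch below feeds the sample JSON from this task into a plain generator; `URL = 3` and `QUEUE = 6` are stand-ins that mirror the numeric codes visible in the DataJob output, and `messages_from_json` is a hypothetical helper name, not a gallery-dl function.

```python
import json

# Stand-ins for Message.Url and Message.Queue (assumed values, taken from
# the codes 3 and 6 shown in the DataJob output).
URL, QUEUE = 3, 6

SAMPLE = """
{
  "url": ["https://example.com/p3/direct_link1.jpg",
          "https://example.com/p3/direct_link2.jpg"],
  "youtube": ["https://www.youtube.com/watch?v=wf-BqAjZb8M"]
}
"""


def messages_from_json(text):
    """Yield (type, url, metadata) tuples in the same order as items()."""
    data = json.loads(text)
    for url in data.get("url", []):
        yield URL, url, {}    # direct file link
    for url in data.get("youtube", []):
        yield QUEUE, url, {}  # link meant for another extractor


for message in messages_from_json(SAMPLE):
    print(message)
```

Using `.get(key, [])` means a response that lacks one of the two keys simply produces no messages for it instead of raising `KeyError`.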
DataJob example:

>>> from gallery_dl.job import DataJob
... job = DataJob(url='https://example.com/p3/1.json')
... job.run()
... pprint.pprint(job.data)
[
[
3,
"https://example.com/p3/direct_link1.jpg",
{
"category": "example",
"subcategory": "p3"
}
],
[
3,
"https://example.com/p3/direct_link2.jpg",
{
"category": "example",
"subcategory": "p3"
}
],
[
6,
"https://www.youtube.com/watch?v=wf-BqAjZb8M",
{}
]
]
[(3,
'https://example.com/p3/direct_link1.jpg',
{'category': 'example', 'subcategory': 'p3'}),
(3,
'https://example.com/p3/direct_link2.jpg',
{'category': 'example', 'subcategory': 'p3'}),
(6, 'https://www.youtube.com/watch?v=wf-BqAjZb8M', {})]

Note from #1656 (comment)
so basically
I don't know how to write a task for config variables, but here is my use case. Example config:

{ "extractor": { "example": { "key": "value" } } }

...
class ExampleP4Extractor(ExampleExtractor):
    subcategory = 'p4'
    pattern = BASE_PATTERN + '/p4/'

    def __init__(self, match):
        super().__init__(match)
        # Read "key" from the config, falling back to a default
        self.my_var = self.config("key", 'default_value')

    def items(self):
        if self.my_var == 'default_value':
            pass  # do something
        elif self.my_var == 'value':
            pass  # do something
        else:
            pass  # do something

I don't know enough about the other items (tokens, cookies, etc.) to write documentation for them.

I also use gallery-dl as an extractor only, so I may be missing things that matter for the downloader, for example metadata.

I also don't know how to add tests to an extractor using the current gallery-dl test setup.

Unrelated, but there is another discussion here; maybe close it or redirect it to this discussion.
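To make the effect of that config snippet concrete, here is a simplified model of what a call like `self.config("key", 'default_value')` resolves to. It assumes the value is searched under `extractor`, then the category, then the subcategory, with the more specific level winning; `lookup` is a hypothetical helper written for this example, and gallery-dl's real resolution may differ in detail.

```python
import json

CONFIG = json.loads('{ "extractor": { "example": { "key": "value" } } }')


def lookup(config, category, subcategory, key, default=None):
    """Return the most specific value stored for key, or default."""
    value = default
    node = config.get("extractor", {})
    # Walk from the general level to the specific one; later hits override
    # earlier ones (assumed precedence, see the note above).
    for name in (None, category, subcategory):
        if name is not None:
            node = node.get(name)
            if not isinstance(node, dict):
                break
        if key in node:
            value = node[key]
    return value


# Category "example" has the key set, so the configured value is returned.
print(lookup(CONFIG, "example", "p4", "key", "default_value"))

# A category without an entry falls back to the default.
print(lookup(CONFIG, "other", "p4", "key", "default_value"))
```

With the example config, the first call corresponds to the `elif self.my_var == 'value'` branch and the second to the `'default_value'` branch.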
-
Related to this issue: #1443
I want to start writing an extractor, but I feel a bit confused by how I should do it. I looked at some extractors and at the base Extractor class, and I do not really see how I could adapt it to my use case.
Is there any documentation for the Extractor class that explains the role of each attribute and method, and how they are meant to be used?