-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy patharchivebot-own.py
executable file
·956 lines (891 loc) · 39.4 KB
/
archivebot-own.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
#!/usr/bin/env python3
"""archivebot.py - discussion page archiving bot.
usage:
python pwb.py archivebot [OPTIONS] [TEMPLATE_PAGE]
Several TEMPLATE_PAGE templates can be given at once. Default is
`User:MiszaBot/config`. Bot examines backlinks (Special:WhatLinksHere)
to all TEMPLATE_PAGE templates. Then goes through all pages (unless a
specific page specified using options) and archives old discussions.
This is done by breaking a page into threads, then scanning each thread
for timestamps. Threads older than a specified threshold are then moved
to another page (the archive), which can be named either basing on the
thread's name or then name can contain a counter which will be
incremented when the archive reaches a certain size.
Transcluded template may contain the following parameters:
{{TEMPLATE_PAGE
|archive =
|algo =
|counter =
|maxarchivesize =
|minthreadsleft =
|minthreadstoarchive =
|archiveheader =
|key =
}}
Meanings of parameters are:
archive Name of the page to which archived threads will be put.
Must be a subpage of the current page. Variables are
supported.
algo Specifies the maximum age of a thread. Must be
in the form old(<delay>) where <delay> specifies
the age in seconds (s), hours (h), days (d),
weeks (w), or years (y) like 24h or 5d. Default is
old(24h).
counter The current value of a counter which could be assigned as
variable. Will be updated by bot. Initial value is 1.
maxarchivesize The maximum archive size before incrementing the counter.
Value can be given with appending letter like K or M
which indicates KByte or MByte. Default value is 200K.
minthreadsleft Minimum number of threads that should be left on a page.
Default value is 5.
minthreadstoarchive The minimum number of threads to archive at once. Default
value is 2.
archiveheader Content that will be put on new archive pages as the
header. This parameter supports the use of variables.
Default value is {{talkarchive}}
key A secret key that (if valid) allows archives not to be
subpages of the page being archived.
Variables below can be used in the value for "archive" in the template
above; numbers are latin digits:
%(counter)d the current value of the counter
%(year)d year of the thread being archived
%(isoyear)d ISO year of the thread being archived
%(isoweek)d ISO week number of the thread being archived
%(semester)d semester term of the year of the thread being archived
%(quarter)d quarter of the year of the thread being archived
%(month)d month (as a number 1-12) of the thread being archived
%(monthname)s localized name of the month above
%(monthnameshort)s first three letters of the name above
%(week)d week number of the thread being archived
Alternatively you may use localized digits. This is only available for a
few site languages. Refer :attr:`NON_LATIN_DIGITS
<userinterfaces.transliteration.NON_LATIN_DIGITS>` whether
there is a localized one:
%(localcounter)s the current value of the counter
%(localyear)s year of the thread being archived
%(localisoyear)s ISO year of the thread being archived
%(localisoweek)s ISO week number of the thread being archived
%(localsemester)s semester term of the year of the thread being archived
%(localquarter)s quarter of the year of the thread being archived
%(localmonth)s month (as a number 1-12) of the thread being archived
%(localweek)s week number of the thread being archived
The ISO calendar starts with the Monday of the week which has at least four
days in the new Gregorian calendar. If January 1st is between Monday and
Thursday (including), the first week of that year started the Monday of that
week, which is in the year before if January 1st is not a Monday. If it's
between Friday or Sunday (including) the following week is then the first week
of the year. So up to three days are still counted as the year before.
.. seealso:: Python :python:`datetime.date.isocalendar
<library/datetime.html#datetime.date.isocalendar>`,
https://webspace.science.uu.nl/~gent0113/calendar/isocalendar.htm
Options (may be omitted):
-help show this help message and exit
-calc:PAGE calculate key for PAGE and exit
-file:FILE load list of pages from FILE
-force override security options
-locale:LOCALE switch to locale LOCALE
-namespace:NS only archive pages from a given namespace
-page:PAGE archive a single PAGE, default ns is a user talk page
-salt:SALT specify salt
-keep Preserve thread order in archive even if threads are
archived later
-sort Sort archive by timestamp; should not be used with -keep
-async Run the bot in parallel tasks.
.. versionchanged:: 7.6
Localized variables for "archive" template parameter are supported.
`User:MiszaBot/config` is the default template. `-keep` option was
added.
.. versionchanged:: 7.7
``-sort`` and ``-async`` options were added.
.. versionchanged:: 8.2
KeyboardInterrupt was enabled with ``-async`` option.
"""
#
# (C) Pywikibot team, 2006-2023
#
# Distributed under the terms of the MIT license.
#
from __future__ import annotations
import datetime
import locale
import os
import re
import signal
import threading
import time
from collections import OrderedDict, defaultdict
from concurrent.futures import ThreadPoolExecutor
from contextlib import nullcontext
from hashlib import md5
from math import ceil
from textwrap import fill
from typing import Any, Pattern
from warnings import warn
import pywikibot
from pywikibot import i18n
from pywikibot.backports import pairwise
from pywikibot.exceptions import Error, NoPageError
from pywikibot.textlib import (
case_escape,
extract_sections,
findmarker,
to_local_digits, TimeStripperPatterns,
)
from pywikibot.time import MW_KEYS, parse_duration, str2timedelta, TZoneFixedOffset
from pywikibot.tools import PYTHON_VERSION
class ArchiveBotSiteConfigError(Error):
"""There is an error originated by archivebot's on-site configuration."""
class MalformedConfigError(ArchiveBotSiteConfigError):
"""There is an error in the configuration template."""
class MissingConfigError(ArchiveBotSiteConfigError):
"""The config is missing in the header.
It's in one of the threads or transcluded from another page.
"""
class ArchiveSecurityError(ArchiveBotSiteConfigError):
"""Page title is not a valid archive of page being archived.
The page title is neither a subpage of the page being archived,
nor does it match the key specified in the archive configuration template.
"""
def str2localized_duration(site, string: str) -> str:
"""Localise a shorthand duration.
Translates a duration written in the shorthand notation (ex. "24h", "7d")
into an expression in the local wiki language ("24 hours", "7 days").
"""
try:
key, duration = parse_duration(string)
except ValueError as e:
raise MalformedConfigError(e) from None
template = site.mediawiki_message(MW_KEYS[key])
if template:
# replace plural variants
exp = i18n.translate(site.code, template, {'$1': duration})
return exp.replace('$1', to_local_digits(duration, site.code))
return to_local_digits(string, site.code)
def str2size(string: str) -> tuple[int, str]:
"""Return a size for a shorthand size.
Accepts a string defining a size::
1337 - 1337 bytes
150K - 150 kilobytes
2M - 2 megabytes
:Returns: a tuple ``(size, unit)``, where ``size`` is an integer and
unit is ``'B'`` (bytes) or ``'T'`` (threads).
"""
match = re.fullmatch(r'(\d{1,3}(?: \d{3})+|\d+) *([BkKMT]?)', string)
if not match:
raise MalformedConfigError(f"Couldn't parse size: {string}")
val, unit = (int(match[1].replace(' ', '')), match[2])
if unit == 'M':
val *= 1024
unit = 'K'
if unit in ('K', 'k'):
val *= 1024
if unit != 'T':
unit = 'B'
return val, unit
def template_title_regex(tpl_page: pywikibot.Page) -> Pattern:
"""Return a regex that matches to variations of the template title.
It supports the transcluding variant as well as localized namespaces and
case-insensitivity depending on the namespace.
:param tpl_page: The template page
:type tpl_page: pywikibot.page.Page
"""
ns = tpl_page.site.namespaces[tpl_page.namespace()]
marker = '?' if ns.id == 10 else ''
title = tpl_page.title(with_ns=False)
title = case_escape(ns.case, title)
return re.compile(r'(?:(?:{}):){}{}'.format('|'.join(ns), marker, title))
def calc_md5_hexdigest(txt, salt) -> str:
"""Return md5 hexdigest computed from text and salt."""
s = md5()
s.update(salt.encode('utf-8'))
s.update(b'\n')
s.update(txt.encode('utf8'))
s.update(b'\n')
return s.hexdigest()
class DiscussionThread:
"""
An object representing a discussion thread on a page.
It represents something that is of the form::
== Title of thread ==
Thread content here. ~~~~
:Reply, etc. ~~~~
"""
def __init__(self, title: str, timestripper: MyTimeStripper) -> None:
"""Initializer."""
self.title = title
self.ts = timestripper
self.code = self.ts.site.code
self.content = ''
self.timestamp = None
def __repr__(self) -> str:
"""Return a string representation."""
return '{}("{}",{} bytes)'.format(self.__class__.__name__, self.title,
len(self.content.encode('utf-8')))
def feed_line(self, line: str) -> None:
"""Add a line to the content and find the newest timestamp."""
if not self.content and not line:
return
self.content += line + '\n'
timestamp = self.ts.timestripper(line)
if not self.timestamp: # first time
self.timestamp = timestamp
if timestamp:
self.timestamp = max(self.timestamp, timestamp)
def size(self) -> int:
"""Return size of discussion thread.
Note that the result is NOT equal to that of
len(self.to_text()). This method counts bytes, rather than
codepoints (characters). This corresponds to MediaWiki's
definition of page size.
"""
return len(self.title.encode('utf-8')) + len(
self.content.encode('utf-8')) + 12
def to_text(self) -> str:
"""Return wikitext discussion thread."""
return f'== {self.title} ==\n\n{self.content}'
class DiscussionPage(pywikibot.Page):
"""A class that represents a single page of discussion threads.
Feed threads to it and run an update() afterwards.
"""
def __init__(self, source, archiver, params=None, keep=False) -> None:
"""Initializer."""
super().__init__(source)
self.archiver = archiver
# for testing purposes we allow archiver to be None and we are able
# to create the a DiscussionPage in this way:
# >>> import pywikibot as py
# >>> from scripts.archivebot import DiscussionPage
# >>> d = DiscussionPage(py.Page(py.Site(), <talk page name>), None)
self.params = params
self.keep = keep
self.full = False
self.archived_threads = 0
if archiver is None:
self.timestripper = MyTimeStripper(self.site)
else:
self.timestripper = self.archiver.timestripper
def __getattr__(self, name):
"""Lazy load page if archives, header or threads attribute is missing.
.. versionadded:: 8.1
"""
if name in ('archives', 'header', 'threads'):
self.load_page()
return self.__getattribute__(name)
@staticmethod
def max(
ts1: pywikibot.Timestamp | None,
ts2: pywikibot.Timestamp | None
) -> pywikibot.Timestamp | None:
"""Calculate the maximum of two timestamps but allow None as value.
.. versionadded:: 7.6
"""
if ts1 is None:
return ts2
if ts2 is None:
return ts1
return max(ts1, ts2)
def load_page(self) -> None:
"""Load the page to be archived and break it up into threads.
.. versionchanged:: 7.6
If `-keep` option is given run through all threads and set
the current timestamp to the previous if the current is lower.
.. versionchanged:: 7.7
Load unsigned threads using timestamp of the next thread.
"""
self.header = ''
self.threads = []
self.archives = {}
try:
text = self.get()
except NoPageError:
self.header = self.archiver.get_attr(
'archiveheader',
i18n.twtranslate(self.site.code, 'archivebot-archiveheader'))
if self.params:
self.header = self.header % self.params
return
# Exclude unsupported headings (h1, h3, etc):
# adding the marker will make them ignored by extract_sections()
marker = findmarker(text)
text = re.sub(r'^((=|={3,})[^=])', marker + r'\1', text, flags=re.M)
# Find threads, avoid archiving categories or interwiki
header, threads, footer = extract_sections(text, self.site)
header = header.replace(marker, '')
if header and footer:
self.header = '\n\n'.join((header.rstrip(), footer, ''))
else:
self.header = header + footer
for thread in threads:
cur_thread = DiscussionThread(thread.heading, self.timestripper)
# remove heading line
_, *lines = thread.content.replace(marker, '').splitlines()
for line in lines:
cur_thread.feed_line(line)
self.threads.append(cur_thread)
# add latter timestamp to predecessor if it is None
for last, prev in pairwise(reversed(self.threads)):
if not prev.timestamp:
prev.timestamp = last.timestamp
if self.keep:
# set the timestamp to the previous if the current is lower
for first, second in pairwise(self.threads):
second.timestamp = self.max(first.timestamp, second.timestamp)
# This extra info is not desirable when run under the unittest
# framework, which may be run either directly or via setup.py
if pywikibot.calledModuleName() not in ['archivebot_tests', 'setup']:
pywikibot.info(f'{len(self.threads)} thread(s) found on {self}')
def is_full(self, max_archive_size: tuple[int, str]) -> bool:
"""Check whether archive size exceeded."""
if self.full:
return True
size, unit = max_archive_size
self_size = self.size()
if (unit == 'B' and self_size >= size
or unit == 'T' and len(self.threads) >= size
or self_size > self.archiver.maxsize):
self.full = True # note: this is one-way flag
return self.full
def feed_thread(self, thread: DiscussionThread,
max_archive_size: tuple[int, str]) -> bool:
"""Append a new thread to the archive."""
self.threads.append(thread)
self.archived_threads += 1
return self.is_full(max_archive_size)
def size(self) -> int:
"""Return size of talk page threads.
Note that this method counts bytes, rather than codepoints
(characters). This corresponds to MediaWiki's definition
of page size.
.. versionchanged:: 7.6
return 0 if archive page neither exists nor has threads
(:phab:`T313886`).
"""
if not (self.exists() or self.threads):
return 0
return len(self.header.encode('utf-8')) + sum(t.size()
for t in self.threads)
def update(self, summary, sort_threads: bool = False) -> None:
"""Recombine threads and save page."""
if sort_threads:
pywikibot.info('Sorting threads...')
self.threads.sort(key=lambda t: t.timestamp)
newtext = self.header.strip() + '\n\n' # Fix trailing newlines
for t in self.threads:
newtext += t.to_text()
if self.full:
summary += ' ' + i18n.twtranslate(self.site.code,
'archivebot-archive-full')
self.text = newtext
self.save(summary)
class PageArchiver:
"""A class that encapsulates all archiving methods."""
algo = 'none'
def __init__(self, page, template, salt: str, force: bool = False,
keep: bool = False, sort: bool = False) -> None:
"""Initializer.
:param page: a page object to be archived
:type page: :py:obj:`pywikibot.Page`
:param template: a template with configuration settings
:type template: :py:obj:`pywikibot.Page`
:param salt: salt value
:param force: override security value
"""
self.attributes = OrderedDict([
('archive', ['', False]),
('algo', ['old(24h)', False]),
('counter', ['1', False]),
('maxarchivesize', ['200K', False]),
])
self.salt = salt
self.force = force
self.sort = sort
self.site = page.site
self.tpl = template
self.timestripper = MyTimeStripper(site=self.site)
# read maxarticlesize
try:
# keep a gap of 1 KB not to block later changes
self.maxsize = self.site.siteinfo['maxarticlesize'] - 1024
except KeyError: # mw < 1.28
self.maxsize = 2_096_128 # 2 MB - 1 KB gap
self.page = DiscussionPage(page, self, keep=keep)
self.comment_params = {
'from': self.page.title(),
}
self.now = datetime.datetime.now(datetime.timezone.utc)
self.archives = {}
self.archived_threads = 0
self.month_num2orig_names = {}
for n, (long, short) in enumerate(self.site.months_names, start=1):
self.month_num2orig_names[n] = {'long': long, 'short': short}
self.load_config()
def get_attr(self, attr, default='') -> Any:
"""Get an archiver attribute."""
return self.attributes.get(attr, [default])[0]
def set_attr(self, attr, value, out: bool = True) -> None:
"""Set an archiver attribute."""
if attr == 'archive':
value = value.replace('_', ' ')
elif attr == 'maxarchivesize':
size, unit = str2size(value)
if unit == 'B' and size > self.maxsize:
value = f'{self.maxsize // 1024} K'
warn('Siteinfo "maxarticlesize" exceeded. Decreasing '
'"maxarchivesize" to ' + value,
ResourceWarning, stacklevel=2)
self.attributes[attr] = [value, out]
def saveables(self) -> list[str]:
"""Return a list of saveable attributes."""
return [a for a in self.attributes if self.attributes[a][1]
and a != 'maxage']
def attr2text(self) -> str:
"""Return a template with archiver saveable attributes."""
return '{{%s\n%s\n}}' \
% (self.tpl.title(with_ns=(self.tpl.namespace() != 10)),
'\n'.join(f'|{a} = {self.get_attr(a)}'
for a in self.saveables()))
def key_ok(self) -> bool:
"""Return whether key is valid."""
hexdigest = calc_md5_hexdigest(self.page.title(), self.salt)
return self.get_attr('key') == hexdigest
def load_config(self) -> None:
"""Load and validate archiver template."""
pywikibot.info(
f'Looking for: {{{{{self.tpl.title()}}}}} in {self.page}')
for tpl, params in self.page.raw_extracted_templates:
try: # Check tpl name before comparing; it might be invalid.
tpl_page = pywikibot.Page(self.site, tpl, ns=10)
tpl_page.title()
except Error:
continue
if tpl_page == self.tpl:
for item, value in params.items():
self.set_attr(item, value)
break
else:
raise MissingConfigError('Missing or malformed template')
for field in ('algo', 'archive'):
if not self.get_attr(field, ''):
raise MissingConfigError(
f'Missing argument {field!r} in template')
def should_archive_thread(self, thread: DiscussionThread
) -> tuple[str, str] | None:
"""Check whether a thread has to be archived.
:return: the archivation reason as a tuple of localization args
"""
# Only if we have the template
is_there = self.get_attr('isthere')
if is_there and not re.search(r'\{\{ՔՓ\|', thread.content):
return None
# Archived by timestamp
algo = self.get_attr('algo')
re_t = re.fullmatch(r'old\((.*)\)', algo)
if re_t:
if not thread.timestamp:
return None
# TODO: handle unsigned
try:
maxage = str2timedelta(re_t[1], thread.timestamp)
except ValueError as e:
raise MalformedConfigError(e) from None
if self.now - thread.timestamp > maxage:
duration = str2localized_duration(self.site, re_t[1])
return ('duration', duration)
# TODO: handle marked with template
return None
def get_archive_page(self, title: str, params=None) -> DiscussionPage:
"""Return the page for archiving.
If it doesn't exist yet, create and cache it.
Also check for security violations.
"""
if title not in self.archives:
page_title = self.page.title()
archive_link = pywikibot.Link(title, self.site)
if not (title.startswith(page_title + '/') or self.force
or self.key_ok()):
raise ArchiveSecurityError(
'Archive page {} does not start with page title ({})!'
.format(archive_link, page_title))
self.archives[title] = DiscussionPage(archive_link, self, params)
return self.archives[title]
def get_params(self, timestamp, counter: int) -> dict:
"""Make params for archiving template."""
lang = self.site.lang
params = {
'counter': counter,
'year': timestamp.year,
'isoyear': timestamp.isocalendar()[0],
'isoweek': timestamp.isocalendar()[1],
'semester': int(ceil(timestamp.month / 6)),
'quarter': int(ceil(timestamp.month / 3)),
'month': timestamp.month,
'week': int(time.strftime('%W', timestamp.timetuple())),
}
params.update({'local' + key: to_local_digits(value, lang)
for key, value in params.items()})
monthnames = self.month_num2orig_names[timestamp.month]
params['monthname'] = monthnames['long']
params['monthnameshort'] = monthnames['short']
return params
def preload_pages(self, counter: int, thread, pattern) -> None:
"""Preload pages if counter matters."""
if counter < 25:
return
for c in range(counter):
params = self.get_params(thread.timestamp, c + 1)
self.get_archive_page(pattern % params, params)
list(self.site.preloadpages(self.archives.values()))
def analyze_page(self) -> set[tuple[str, str]]:
"""Analyze DiscussionPage."""
max_size = self.get_attr('maxarchivesize')
max_arch_size = str2size(max_size)
if not max_arch_size[0]:
raise MalformedConfigError(f'invalid maxarchivesize {max_size!r}')
counter = int(self.get_attr('counter', '1'))
pattern = self.get_attr('archive')
keep_threads = []
threads_per_archive = defaultdict(list)
whys = set()
pywikibot.info(f'Processing {len(self.page.threads)} threads')
fields = self.get_params(self.now, 0).keys() # dummy parameters
regex = re.compile(r'%(\((?:{})\))d'.format('|'.join(fields)))
stringpattern = regex.sub(r'%\1s', pattern)
for i, thread in enumerate(self.page.threads):
# TODO: Make an option so that unstamped (unsigned) posts get
# archived.
why = self.should_archive_thread(thread)
if not why or why[0] != 'duration':
keep_threads.append(i)
continue
params = self.get_params(thread.timestamp, counter)
# this is actually just a dummy key to group the threads by
# "era" regardless of the counter and deal with it later
try:
key = pattern % params
except TypeError as e:
if 'a real number is required' in str(e):
pywikibot.error(e)
pywikibot.info(
fill('<<lightblue>>Use string format field like '
'%(localfield)s instead of %(localfield)d. '
'Trying to solve it...'))
pywikibot.info()
pattern = stringpattern
key = pattern % params
else:
raise MalformedConfigError(e)
threads_per_archive[key].append((i, thread))
whys.add(why) # FIXME: we don't know if we ever archive anything
params = self.get_params(self.now, counter)
aux_params = self.get_params(self.now, counter + 1)
counter_matters = (pattern % params) != (pattern % aux_params)
# we need to start with the oldest archive since that is
# the one the saved counter applies to, so sort the groups
# by the oldest timestamp
groups = sorted(threads_per_archive.values(),
key=lambda group: min(t.timestamp for _, t in group))
era_change = False
for group in groups:
# We will reset counter IFF:
# 1. it matters (AND)
# 2. "era" (year, month, etc.) changes (AND)
# 3. there is something to put to the new archive.
counter_found = False
for i, thread in group:
threads_left = len(self.page.threads) - self.archived_threads
if threads_left <= int(self.get_attr('minthreadsleft', 5)):
keep_threads.append(i)
continue # Because there's too little threads left.
if era_change:
era_change = False
counter = 1
params = self.get_params(thread.timestamp, counter)
archive = self.get_archive_page(pattern % params, params)
if counter_matters:
self.preload_pages(counter, thread, pattern)
while not counter_found and counter > 1 \
and not archive.exists():
# This may happen when either:
# 1. a previous version of the bot run and reset
# the counter without archiving anything
# (number #3 above)
# 2. era changed between runs.
# Decrease the counter.
counter -= 1
params = self.get_params(thread.timestamp, counter)
archive = self.get_archive_page(
pattern % params, params)
# There are only non existing pages found by countdown
counter_found = True
while archive.is_full(max_arch_size):
counter += 1
params = self.get_params(thread.timestamp, counter)
archive = self.get_archive_page(
pattern % params, params)
archive.feed_thread(thread, max_arch_size)
self.archived_threads += 1
if counter_matters:
era_change = True
if self.archived_threads:
self.page.threads = [self.page.threads[i]
for i in sorted(keep_threads)]
self.set_attr('counter', str(counter))
return whys
return set()
def run(self) -> None:
"""Process a single DiscussionPage object."""
if not self.page.botMayEdit():
return
whys = self.analyze_page()
mintoarchive = int(self.get_attr('minthreadstoarchive', 2))
if self.archived_threads < mintoarchive:
# We might not want to archive a measly few threads
# (lowers edit frequency)
pywikibot.info(f'Only {self.archived_threads} (< {mintoarchive}) '
f'threads are old enough. Skipping')
return
if whys:
# Search for the marker template
rx = re.compile(r'\{\{%s\s*?\n.*?\n\}\}'
% (template_title_regex(self.tpl).pattern),
re.DOTALL)
if not rx.search(self.page.header):
raise MalformedConfigError(
"Couldn't find the template in the header"
)
pywikibot.info(f'Archiving {self.archived_threads} thread(s).')
# Save the archives first (so that bugs don't cause a loss of data)
for archive in self.archives.values():
count = archive.archived_threads
if not count:
continue
self.comment_params['count'] = count
comment = i18n.twtranslate(self.site.code,
'archivebot-archive-summary',
self.comment_params)
archive.update(comment, sort_threads=self.sort)
# Save the page itself
self.page.header = rx.sub(self.attr2text(), self.page.header)
self.comment_params['count'] = self.archived_threads
comma = self.site.mediawiki_message('comma-separator')
self.comment_params['archives'] = comma.join(
a.title(as_link=True) for a in self.archives.values()
if a.archived_threads > 0
)
# Find out the reasons and return them localized
translated_whys = set()
for why, arg in whys:
# Archived by timestamp
if why == 'duration':
translated_whys.add(
i18n.twtranslate(self.site.code,
'archivebot-older-than',
{'duration': arg,
'count': self.archived_threads}))
# TODO: handle unsigned or archived by template
self.comment_params['why'] = comma.join(translated_whys)
comment = i18n.twtranslate(self.site.code,
'archivebot-page-summary',
self.comment_params)
self.page.update(comment)
class MyTimeStripper(pywikibot.textlib.TimeStripper):
def __init__(self, site=None) -> None:
"""Initializer."""
#super().__init__(site)
self.site = pywikibot.Site() if site is None else site
self.origNames2monthNum = {}
# use first_lower/first_upper for 'vi' language because monthsnames
# were changed: T324310
functions = [pywikibot.tools.first_upper, pywikibot.tools.first_lower]
for n, (long, short) in enumerate(self.site.months_names, start=1):
for func in functions:
self.origNames2monthNum[func(long)] = n
self.origNames2monthNum[func(short)] = n
# in some cases month in ~~~~ might end without dot even if
# site.months_names do not.
if short.endswith('.'):
self.origNames2monthNum[func(short[:-1])] = n
timeR = (r'(?P<time>(?P<hour>([0-1]\d|2[0-3]))[:\.h]'
r'(?P<minute>[0-5]\d))')
timeznR = r'\((?P<tzinfo>[A-Z]+)\)'
yearR = r'(?P<year>(19|20)\d\d)(?:{})?'.format('\ub144')
# if months have 'digits' as names, they need to be
# removed; will be handled as digits in regex, adding d+{1,2}\.?
escaped_months = [month for month in self.origNames2monthNum if
not month.strip('.').isdigit()]
# match longest names first.
escaped_months = [re.escape(month) for
month in sorted(escaped_months, reverse=True)]
# work around for cs wiki: if month are in digits, we assume
# that format is dd. mm. (with dot and spaces optional)
# the last one is workaround for Korean
if any(month.isdigit() for month in self.origNames2monthNum):
self.is_digit_month = True
monthR = r'(?P<month>({})|(?:1[012]|0?[1-9])\.)' \
.format('|'.join(escaped_months))
dayR = r'(?P<day>(3[01]|[12]\d|0?[1-9]))(?:{})' \
r'?\.?\s*(?:[01]?\d\.)?'.format('\uc77c')
else:
self.is_digit_month = False
monthR = r'(?P<month>({}))'.format('|'.join(escaped_months))
dayR = r'(?P<day>(3[01]|[12]\d|0?[1-9]))\.?'
self.patterns = TimeStripperPatterns(
re.compile(timeR),
re.compile(timeznR),
re.compile(yearR),
re.compile(monthR),
re.compile(dayR),
)
self._hyperlink_pat = re.compile(r'\[\s*?http[s]?://[^\]]*?\]')
self._comment_pat = re.compile(r'<!--(.*?)-->')
self._wikilink_pat = re.compile(
r'\[\[(?P<link>[^\]\|]*?)(?P<anchor>\|[^\]]*)?\]\]')
self.tzinfo = TZoneFixedOffset(self.site.siteinfo['timeoffset'],
self.site.siteinfo['timezone'])
def process_page(page, *args: Any) -> bool:
"""Call PageArchiver for a single page.
:return: Return True to continue with the next page, False to break
the loop.
.. versionadded:: 7.6
.. versionchanged:: 7.7
pass an unspecified number of arguments to the bot using ``*args``
"""
if not page.exists():
pywikibot.info(f'{page} does not exist, skipping...')
return True
pywikibot.info(f'\n\n>>> <<lightpurple>>{page}<<default>> <<<')
# Catching exceptions, so that errors in one page do not bail out
# the entire process
try:
archiver = PageArchiver(page, *args)
archiver.run()
except ArchiveBotSiteConfigError as e:
# no stack trace for errors originated by pages on-site
pywikibot.error(f'Missing or malformed template in page {page}: {e}')
except Exception as ex:
pywikibot.exception(f'Error occurred while processing page {page}')
except KeyboardInterrupt:
pywikibot.info('\nUser quit bot run...')
return False
return True
def show_md5_key(calc, salt, site) -> bool:
"""Show calculated MD5 hexdigest."""
if not calc:
return False
if not salt:
pywikibot.bot.suggest_help(missing_parameters=['-salt'])
else:
page = pywikibot.Page(site, calc)
if page.exists():
calc = page.title()
else:
pywikibot.info(
f'NOTE: the specified page "{calc}" does not (yet) exist.')
pywikibot.info(f'key = {calc_md5_hexdigest(calc, salt)}')
return True
def main(*args: str) -> None:
"""
Process command line arguments and invoke bot.
If args is an empty list, sys.argv is used.
:param args: command line arguments
"""
def signal_handler(signum, frame):
pywikibot.info('\n<<lightyellow>>User quit bot run...')
exiting.set()
exiting = threading.Event()
filename = None
pagename = None
namespace = None
salt = ''
force = False
calc = None
keep = False
sort = False
asynchronous = False
templates = []
local_args = pywikibot.handle_args(['Մասնակից:Աշբոտ/Արխիվ'])
for arg in local_args:
option, _, value = arg.partition(':')
if not option.startswith('-'):
templates.append(arg)
continue
option = option[1:]
if option in ('file', 'filename'):
filename = value
elif option == 'locale':
# Required for english month names
locale.setlocale(locale.LC_TIME, value.encode('utf8'))
elif option == 'timezone':
os.environ['TZ'] = value.timezone
# Or use the preset value
if hasattr(time, 'tzset'):
time.tzset()
elif option == 'calc':
calc = value
elif option == 'salt':
salt = value
elif option == 'force':
force = True
elif option == 'page':
pagename = value
elif option == 'namespace':
namespace = value
elif option == 'keep':
keep = True
elif option == 'sort':
sort = True
elif option == 'async':
asynchronous = True
site = pywikibot.Site()
if show_md5_key(calc, salt, site):
return
if not templates:
templates = ['User:MiszaBot/config']
pywikibot.info('No template was specified, using default {{{{{}}}}}.'
.format(templates[0]))
if asynchronous:
signal.signal(signal.SIGINT, signal_handler)
context = ThreadPoolExecutor
else:
context = nullcontext
for template_name in templates:
tmpl = pywikibot.Page(site, template_name, ns=10)
if filename:
with open(filename) as f:
gen = [pywikibot.Page(site, line, ns=10) for line in f]
elif pagename:
gen = [pywikibot.Page(site, pagename, ns=3)]
else:
ns = [str(namespace)] if namespace is not None else []
pywikibot.info('Fetching template transclusions...')
gen = tmpl.getReferences(only_template_inclusion=True,
follow_redirects=False,
namespaces=ns,
content=True)
botargs = tmpl, salt, force, keep, sort
futures = [] # needed for Python < 3.9
with context() as executor:
for pg in gen:
if asynchronous:
future = executor.submit(process_page, pg, *botargs)
if PYTHON_VERSION < (3, 9):
futures.append(future)
if not exiting.is_set():
continue
canceled: str | int = ''
pywikibot.info(
'<<lightyellow>>Canceling pending Futures... ',
newline=False)
if PYTHON_VERSION < (3, 9):
canceled = sum(future.cancel() for future in futures)
else:
executor.shutdown(cancel_futures=True)
pywikibot.info(f'{canceled} done')
break
elif not process_page(pg, *botargs):
break
if __name__ == '__main__':
start = datetime.datetime.now()
main()
pywikibot.info('\nExecution time: {} seconds'
.format((datetime.datetime.now() - start).seconds))