Deploying to gh-pages from @ 88ef5e2 🚀
claromes committed Jun 14, 2024
1 parent eec076e commit 742046f
Showing 27 changed files with 468 additions and 107 deletions.
2 changes: 1 addition & 1 deletion 404.html
Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@
<title>Page Not Found &#8212; Wayback Tweets Documentation (1.0.x)</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=4f649999" />
<link rel="stylesheet" type="text/css" href="_static/flask.css?v=b87c8d14" />
<link rel="stylesheet" type="text/css" href="_static/css/custom.css?v=dc030f7e" />
<link rel="stylesheet" type="text/css" href="_static/css/custom.css?v=34aeb135" />
<script src="_static/documentation_options.js?v=f2a433a1"></script>
<script src="_static/doctools.js?v=9a2dae69"></script>
<script src="_static/sphinx_highlight.js?v=dc90522c"></script>
5 changes: 3 additions & 2 deletions _sources/api.rst.txt
@@ -22,8 +22,9 @@ Parse
.. autoclass:: TwitterEmbed
:members:

.. autoclass:: JsonParser
:members:
.. TODO: JSON Issue
.. .. autoclass:: JsonParser
.. :members:
Export
33 changes: 33 additions & 0 deletions _sources/cli.rst.txt
@@ -7,3 +7,36 @@ Usage
.. click:: waybacktweets.cli.main:cli
:prog: waybacktweets
:nested: full

Collapsing
------------

The Wayback Tweets command line tool recommends three fields for collapsing results: ``urlkey``, ``digest``, and ``timestamp``.

- ``urlkey``: (`str`) A canonical transformation of the URL you supplied, for example, ``org,eserver,tc)/``. Such keys are useful for indexing.

- ``digest``: (`str`) The ``SHA1`` hash digest of the content, excluding the headers. It's usually a base-32-encoded string.

- ``timestamp``: (`datetime`) A 14-digit date-time representation in the ``YYYYMMDDhhmmss`` format. We recommend ``YYYYMMDD``.
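
As a rough illustration of how these fields end up in a CDX query, the sketch below builds a request URL with one ``collapse`` parameter per field. The helper name ``build_cdx_url`` is hypothetical and not part of the package, and reading the ``YYYYMMDD`` recommendation as ``timestamp:8`` is an assumption. No request is sent.

```python
from urllib.parse import urlencode

# Public CDX endpoint as it appears in the Wayback CDX Server API docs.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, collapse_fields):
    """Return a CDX query URL with one collapse parameter per field.

    Hypothetical helper for illustration; not part of waybacktweets.
    """
    params = [("url", url)]
    params.extend(("collapse", field) for field in collapse_fields)
    # Keep "/" and ":" unescaped so the URL stays readable.
    return f"{CDX_ENDPOINT}?{urlencode(params, safe='/:')}"


# Assumption: "YYYYMMDD" corresponds to the first 8 timestamp digits.
print(build_cdx_url("archive.org", ["urlkey", "timestamp:8"]))
print(build_cdx_url("archive.org", ["digest"]))
```

Passing the parameters as a list of tuples, rather than a dict, is what allows ``collapse`` to be repeated.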

However, collapsing can also be used with other options. The text below is extracted from the official Wayback CDX Server API (Beta) documentation.

.. note::

A new form of filtering is the option to "collapse" results based on a field, or a substring of a field. Collapsing is done on adjacent CDX lines where all captures after the first one that are duplicate are filtered out. This is useful for filtering out captures that are "too dense" or when looking for unique captures.

To use collapsing, add one or more ``collapse=field`` or ``collapse=field:N`` where ``N`` is the first ``N`` characters of field to test.

- Ex: Only show at most 1 capture per hour (compare the first 10 digits of the ``timestamp`` field). Given 2 captures ``20130226010000`` and ``20130226010800``, since first 10 digits ``2013022601`` match, the 2nd capture will be filtered out:

http://web.archive.org/cdx/search/cdx?url=google.com&collapse=timestamp:10

The calendar page at `web.archive.org` uses this filter by default: `http://web.archive.org/web/*/archive.org`

- Ex: Only show unique captures by ``digest`` (note that only adjacent digest are collapsed, duplicates elsewhere in the cdx are not affected):

http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=digest

- Ex: Only show unique urls in a prefix query (filtering out captures except first capture of a given url). This is similar to the old prefix query in wayback (note: this query may be slow at the moment):

http://web.archive.org/cdx/search/cdx?url=archive.org&collapse=urlkey&matchType=prefix
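
The adjacency rule quoted above can be approximated locally. This is a minimal sketch of the idea, not the CDX server's implementation: the function ``collapse`` is hypothetical, and the third capture and all digest values are made up for illustration.

```python
from itertools import groupby


def collapse(rows, field, length=None):
    """Keep the first row of each run of adjacent rows sharing a key.

    rows: dicts standing in for CDX lines, already in CDX order.
    field: field name to compare, e.g. "timestamp" or "digest".
    length: compare only the first `length` characters when given.
    """
    def key(row):
        value = row[field]
        return value if length is None else value[:length]

    # groupby only merges neighbours, which mirrors "adjacent CDX lines".
    return [next(group) for _, group in groupby(rows, key=key)]


captures = [
    {"timestamp": "20130226010000", "digest": "AAA"},  # made-up digest
    {"timestamp": "20130226010800", "digest": "BBB"},  # made-up digest
    {"timestamp": "20130226020000", "digest": "AAA"},  # made-up capture
]

# Same idea as collapse=timestamp:10 (at most one capture per hour).
hourly = collapse(captures, "timestamp", 10)
print([row["timestamp"] for row in hourly])

# Same idea as collapse=digest: the two "AAA" rows are not adjacent.
print(len(collapse(captures, "digest")))
```

With these rows, the hourly collapse drops the second capture, while the digest collapse keeps all three because the repeated digest is not on adjacent lines.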
5 changes: 3 additions & 2 deletions _sources/errors.rst.txt
@@ -17,9 +17,10 @@ This error is raised when the package fails to establish a new connection with w

The output message from the package would be: ``Failed to establish a new connection with web.archive.org. Max retries exceeded.``

This is the error often returned when performing experimental parsing of URLs with the mimetype ``application/json``.
.. TODO: JSON Issue
.. This is the error often returned when performing experimental parsing of URLs with the mimetype ``application/json``.
The warning output message from the package would be: ``Connection error with https://web.archive.org/web/<TIMESTAMP>/https://twitter.com/<USERNAME>/status/<TWEET_ID>. Max retries exceeded. Error parsing the JSON, but the CDX data was saved.``
.. The warning output message from the package would be: ``Connection error with https://web.archive.org/web/<TIMESTAMP>/https://twitter.com/<USERNAME>/status/<TWEET_ID>. Max retries exceeded. Error parsing the JSON, but the CDX data was saved.``
HTTPError
----------------
3 changes: 2 additions & 1 deletion _sources/index.rst.txt
@@ -7,7 +7,7 @@ Wayback Tweets
Wayback Tweets Documentation
------------------------------

Retrieves archived tweets' CDX data from the Wayback Machine, performs necessary parsing, and saves the data.
Retrieves archived tweets CDX data from the Wayback Machine, performs necessary parsing, and saves the data in CSV, JSON, and HTML formats.


User Guide
@@ -22,6 +22,7 @@ User Guide
result
errors
contribute
todo


Command-Line Interface
6 changes: 6 additions & 0 deletions _sources/installation.rst.txt
@@ -37,4 +37,10 @@ From source
poetry install
Run Streamlit App:

.. code-block:: shell

    streamlit run app/app.py

`Read the Poetry CLI documentation <https://python-poetry.org/docs/cli/>`_.
3 changes: 2 additions & 1 deletion _sources/result.rst.txt
@@ -15,7 +15,8 @@ The package saves in three formats: CSV, JSON, and HTML. The files have the foll

- ``parsed_archived_tweet_url``: (`str`) The original archived URL after parsing. `Check the utility functions <api.html#module-waybacktweets.utils.utils>`_.

- ``parsed_tweet_text_mimetype_json``: (`str`) The tweet text extracted from the archived URL that has mimetype ``application/json``.
.. TODO: JSON Issue
.. - ``parsed_tweet_text_mimetype_json``: (`str`) The tweet text extracted from the archived URL that has mimetype ``application/json``.
- ``available_tweet_text``: (`str`) The tweet text extracted from the URL that is still available on the Twitter account.

13 changes: 13 additions & 0 deletions _sources/todo.rst.txt
@@ -0,0 +1,13 @@
TODO
================

.. |uncheck| raw:: html

<input type="checkbox">

|uncheck| Code: JSON Issue: Create a separate function to handle JSON return, apply JsonParser (``waybacktweets/api/parse_tweets.py:73``), and avoid rate limiting

|uncheck| Docs: Add a tutorial on how to save a tweet via the command line

|uncheck| Web App: Return complete JSON when mimetype is ``application/json``

2 changes: 1 addition & 1 deletion _sources/workflow.rst.txt
@@ -19,5 +19,5 @@ Use the mouse to zoom in and out the flowchart.
C--> |4xx| E[return None]
E--> F{request Archived\nTweet URL}
F--> |4xx| G[return Only CDX data]
F--> |2xx/3xx: application/json| J[return JSON text]
F--> |TODO: 2xx/3xx: application/json| J[return JSON text]
F--> |2xx/3xx: text/html, warc/revisit, unk| K[return HTML iframe tag]
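
The branches leaving node ``F`` in the flowchart excerpt above can be read as a small decision function. This is a sketch under stated assumptions: ``choose_output`` and its return labels are hypothetical, the package's real code may be organised differently, and cases the excerpt does not show return ``None``.

```python
def choose_output(status_code, mimetype):
    """Map an archived-tweet response to the outcome named in the flowchart.

    Hypothetical helper for illustration; not part of waybacktweets.
    """
    if 400 <= status_code < 500:
        return "only CDX data"
    if 200 <= status_code < 400:
        if mimetype == "application/json":
            # Marked as TODO in the flowchart after this commit.
            return "JSON text (TODO)"
        if mimetype in {"text/html", "warc/revisit", "unk"}:
            return "HTML iframe tag"
    # Anything else is not covered by the flowchart excerpt.
    return None


print(choose_output(200, "text/html"))
print(choose_output(404, "application/json"))
```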
3 changes: 2 additions & 1 deletion _static/css/custom.css
@@ -1,4 +1,5 @@
#cli #usage #waybacktweets h3,
.sphinxsidebarwrapper li ul li ul:has(a[href="#waybacktweets"]):last-child{
#cli .admonition-title,
.sphinxsidebarwrapper li ul li ul:has(a[href="#waybacktweets"]):last-child {
display: none;
}