
Commit

review tweet url parser, update docs
claromes committed Jun 17, 2024
1 parent b70d207 commit 81a6bca
Showing 9 changed files with 61 additions and 24 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

[![PyPI](https://img.shields.io/pypi/v/waybacktweets)](https://pypi.org/project/waybacktweets) [![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://waybacktweets.streamlit.app)

- Retrieves archived tweets CDX data from the Wayback Machine, performs necessary parsing, and saves the data in CSV, JSON, and HTML formats.
+ Retrieves archived tweets' CDX data from the Wayback Machine, performs several parsing steps to facilitate the analysis of archived tweets and tweet types (see [Field Options](https://claromes.github.io/waybacktweets/field_options.html)), and saves the data in CSV, JSON, and HTML formats.

## Installation

6 changes: 6 additions & 0 deletions docs/api.rst
@@ -9,6 +9,8 @@ Request
.. autoclass:: WaybackTweets
    :members:

.. _parser:

Parse
---------

@@ -24,6 +26,8 @@
.. autoclass:: JsonParser
    :members:

.. _exporter:

Export
---------

@@ -42,6 +46,8 @@ Visualize
    :members:
    :private-members:

.. _utils:

Utils
-------

9 changes: 3 additions & 6 deletions docs/field_options.rst
@@ -3,7 +3,7 @@
Field Options
================

- The package saves in three formats: CSV, JSON, and HTML. The files have the following fields:
+ The package performs several parsing steps to facilitate the analysis of archived tweets and tweet types. The fields below can be passed to the :ref:`parser` and the :ref:`exporter`; the command-line tool returns all of these fields.

- ``archived_urlkey``: (`str`) A canonical transformation of the URL you supplied, for example, ``org,eserver,tc)/``. Such keys are useful for indexing.

@@ -13,12 +13,9 @@

- ``archived_tweet_url``: (`str`) The original archived URL.

- - ``parsed_tweet_url``: (`str`) The original tweet URL after parsing. `Check the utility functions <api.html#module-waybacktweets.utils.utils>`_.
+ - ``parsed_tweet_url``: (`str`) The original tweet URL after parsing. Older tweets were archived with nested URLs; the parsing applied here unnests them when necessary. Check the :ref:`utils`.

- - ``parsed_archived_tweet_url``: (`str`) The original archived URL after parsing. `Check the utility functions <api.html#module-waybacktweets.utils.utils>`_.
-
- .. TODO: JSON Issue
- .. - ``parsed_tweet_text_mimetype_json``: (`str`) The tweet text extracted from the archived URL that has mimetype ``application/json``.
+ - ``parsed_archived_tweet_url``: (`str`) The original archived URL after parsing. There is no guarantee that this URL was actually archived; it is provided as a convenience, since the originally archived URL does not always exist due to changes in Twitter's URLs and web services. Check the :ref:`utils`.

- ``available_tweet_text``: (`str`) The tweet text extracted from the URL that is still available on the Twitter account.

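The updated intro says these fields can be passed to the parser and the exporter. Below is a minimal sketch of that flow; the class names WaybackTweets, TweetsParser, and TweetsExporter and the methods get, parse, and save_to_csv are assumptions for illustration only — the commit confirms just the (archived_tweets_response, username, field_options) constructor arguments shown later in the parse.py diff.

```python
# Hedged sketch only: WaybackTweets, TweetsParser, TweetsExporter and their
# get/parse/save_to_csv methods are assumed names; this commit only confirms
# the (archived_tweets_response, username, field_options) constructor arguments.
from waybacktweets import TweetsExporter, TweetsParser, WaybackTweets

username = "jack"  # hypothetical account
field_options = [
    "archived_timestamp",
    "original_tweet_url",
    "parsed_tweet_url",
    "archived_mimetype",
]

api = WaybackTweets(username)
archived_tweets = api.get()  # CDX data for the account's archived tweets

if archived_tweets:
    parser = TweetsParser(archived_tweets, username, field_options)
    parsed_tweets = parser.parse()

    exporter = TweetsExporter(parsed_tweets, username, field_options)
    exporter.save_to_csv()
```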
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -6,7 +6,7 @@ Wayback Tweets
.. image:: ../assets/waybacktweets.png
    :align: center

- Retrieves archived tweets CDX data from the Wayback Machine, performs necessary parsing, and saves the data in CSV, JSON, and HTML formats.
+ Retrieves archived tweets' CDX data from the Wayback Machine, performs several parsing steps to facilitate the analysis of archived tweets and tweet types (see :ref:`field_options`), and saves the data in CSV, JSON, and HTML formats.

.. note::
    Intensive queries can lead to rate limiting, resulting in a temporary ban of a few minutes from web.archive.org.
13 changes: 7 additions & 6 deletions docs/todo.rst
@@ -5,15 +5,16 @@ TODO

<input type="checkbox">

- |uncheck| JSON Parser: Create a separate function to handle JSON return, apply JsonParser (``waybacktweets/api/parse.py:111``), and avoid rate limiting (`Planned for v1.2`)
+ |uncheck| Review and publish the new version of the Streamlit Web App

- |uncheck| Download images when tweet URL has extensions like JPG or PNG (`Planned for v1.2`)
+ |uncheck| Unit Tests

- |uncheck| Develop a scraper to download snapshots from https://archive.today (`Not planned`)
+ |uncheck| JSON Parser: Create a separate function to handle JSON return, apply JsonParser (``waybacktweets/api/parse.py:111``), and avoid rate limiting

- |uncheck| Unit Tests (`Planned for v1.1`)
+ |uncheck| Download images when tweet URL has extensions like JPG or PNG

- |uncheck| Mapping and parsing of other Twitter-related URLs (`Planned`)
+ |uncheck| Implement logging system (remove print statements)

- |uncheck| Review and publish the new version of the Streamlit Web App (`Planned for v1.0.1`)
+ |uncheck| Mapping and parsing of other Twitter-related URLs

+ |uncheck| Develop a scraper to download snapshots from https://archive.today
4 changes: 4 additions & 0 deletions waybacktweets/api/parse.py
@@ -12,6 +12,7 @@
from rich.progress import Progress

from waybacktweets.config.config import config
from waybacktweets.config.field_options import FIELD_OPTIONS
from waybacktweets.exceptions.exceptions import (
    ConnectionError,
    GetResponseError,
@@ -167,6 +168,9 @@ def __init__(
        username: str,
        field_options: List[str],
    ):
        if not all(option in FIELD_OPTIONS for option in field_options):
            raise ValueError("Some field options are not valid.")

        self.archived_tweets_response = archived_tweets_response
        self.username = username
        self.field_options = field_options
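To make the new check concrete, here is a standalone snippet of the same validation logic; the helper name validate_field_options exists only for this sketch, and the FIELD_OPTIONS list is a shortened excerpt of the real one defined in the new config/field_options.py further down.

```python
# Standalone illustration of the validation added to __init__ above; this
# FIELD_OPTIONS list is a shortened excerpt of the real one in
# waybacktweets/config/field_options.py, and validate_field_options is a
# hypothetical helper used only for this sketch.
FIELD_OPTIONS = ["archived_timestamp", "original_tweet_url", "parsed_tweet_url"]

def validate_field_options(field_options):
    # Same rule as in __init__: every requested field must be a known option.
    if not all(option in FIELD_OPTIONS for option in field_options):
        raise ValueError("Some field options are not valid.")

validate_field_options(["parsed_tweet_url", "archived_timestamp"])  # passes
try:
    validate_field_options(["tweet_text"])  # unknown field
except ValueError as error:
    print(error)  # -> Some field options are not valid.
```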
1 change: 1 addition & 0 deletions waybacktweets/config/__init__.py
@@ -1,3 +1,4 @@
# flake8: noqa: F401

from waybacktweets.config.config import config
from waybacktweets.config.field_options import FIELD_OPTIONS
19 changes: 19 additions & 0 deletions waybacktweets/config/field_options.py
@@ -0,0 +1,19 @@
"""
List of valid field options that can be used for parsing tweets.
"""

FIELD_OPTIONS = [
    "archived_urlkey",
    "archived_timestamp",
    "original_tweet_url",
    "archived_tweet_url",
    "parsed_tweet_url",
    "parsed_archived_tweet_url",
    "available_tweet_text",
    "available_tweet_is_RT",
    "available_tweet_info",
    "archived_mimetype",
    "archived_statuscode",
    "archived_digest",
    "archived_length",
]
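Because of the re-export added to waybacktweets/config/__init__.py above, the list should also be importable from the package's config namespace; a quick sanity check:

```python
# Both import paths should now resolve to the same list, thanks to the
# re-export in waybacktweets/config/__init__.py.
from waybacktweets.config import FIELD_OPTIONS

print(len(FIELD_OPTIONS))  # 13 field options in this commit
```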
29 changes: 19 additions & 10 deletions waybacktweets/utils/utils.py
@@ -114,21 +114,30 @@ def clean_wayback_machine_url(

def check_pattern_tweet(tweet_url: str) -> str:
"""
Extracts the tweet ID from a tweet URL.
Extracts the URL from a tweet URL with patterns such as:
- Reply: /status//
- Link: /status///
- Twimg: /status/https://pbs
Args:
tweet_url (str): The tweet URL to extract the ID from.
tweet_url (str): The tweet URL to extract the URL from.
Returns:
The extracted tweet ID.
Only the extracted URL from a tweet.
"""
-    pattern = re.compile(r'/status/"([^"]+)"')
-
-    match = pattern.search(tweet_url)
-    if match:
-        return match.group(1).lstrip("/")
-    else:
-        return tweet_url
+    patterns = [
+        re.compile(r'/status/"([^"]+)"'),
+        re.compile(r'/status/&quot;([^"]+)&quot;'),
+        re.compile(r'/status/%3B([^"]+)%3B'),
+    ]
+
+    # Try each pattern in turn; fall back to the original URL only after none
+    # of them matches (an early else-return would skip the remaining patterns).
+    for pattern in patterns:
+        match = pattern.search(tweet_url)
+        if match:
+            return match.group(1).lstrip("/")
+
+    return tweet_url


def delete_tweet_pathnames(tweet_url: str) -> str:
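A short usage sketch of the reworked check_pattern_tweet; the nested tweet URL below is made up for illustration, and a URL without any of the quoted patterns should come back unchanged:

```python
# Illustrative inputs only; the nested status URL is made up.
from waybacktweets.utils.utils import check_pattern_tweet

nested = 'https://twitter.com/jack/status/"https://twitter.com/jack/status/20"'
plain = "https://twitter.com/jack/status/20"

print(check_pattern_tweet(nested))  # -> https://twitter.com/jack/status/20
print(check_pattern_tweet(plain))   # -> unchanged
```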
