Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache YouTube video titles in the DB #1093

Merged
merged 1 commit into from
Aug 3, 2023
Merged

Cache YouTube video titles in the DB #1093

merged 1 commit into from
Aug 3, 2023

Conversation

seanh
Copy link
Contributor

@seanh seanh commented Jul 17, 2023

Fixes #1118.

Testing

Test models.Video

$ make shell
>>> video = models.Video(video_id="VIDEO_ID", title="TITLE")
>>> video
Video(id=None, video_id='VIDEO_ID')
>>> db.add(video)
>>> tm.commit()
>>> tm.begin()
<transaction._transaction.Transaction at 0x7f5235b13040>
>>> from sqlalchemy import select
>>> db.scalars(select(models.Video)).all()
[Video(id=1, video_id='VIDEO_ID')]

Test VideoFactory

>>> factories.VideoFactory()
Video(id=None, video_id='video_id_0')
>>> factories.VideoFactory()
Video(id=None, video_id='video_id_1')

Test the web UI

http://localhost:9083/https://www.youtube.com/watch?v=byN4S8WDPt8 (the video title is used for the tab's title and for the document title if you create an annotation)

@seanh seanh changed the base branch from main to db July 17, 2023 17:17
@seanh seanh changed the title cache video titles Cache YouTube video titles in the DB Jul 17, 2023
@seanh seanh force-pushed the cache-video-titles branch 3 times, most recently from 1d2075d to 518ac76 Compare July 18, 2023 16:37
via/models/_mixins.py Outdated Show resolved Hide resolved
via/models/video.py Outdated Show resolved Hide resolved
@marcospri
Copy link
Member

marcospri commented Jul 19, 2023

Testing this a a way to test the boilerplate in other PRs, doing:

  • make services
  • make db
  • This PR doesn't include the migration, that needs to happen before this. Tested that also
  • mkdir via/migrations/versions
  • tox -e dev --run-command 'alembic revision --autogenerate -m "video"'
  • tox -e dev --run-command 'alembic history'
  • tox -e dev --run-command 'alembic upgrade head'
  • tox -e dev --run-command 'alembic downgrade -1'
  • tox -e dev --run-command 'alembic upgrade head'
  • Got around Integrate Via's new DB with Pyramid and pytest #1019 (comment) comenting create_all.
  • make sql works
  • The actual purpose of this PR also works (ie new videos appear on the table)

class CreatedUpdatedMixin(MappedAsDataclass):
created: Mapped[datetime] = mapped_column(
init=False,
server_default=func.now(), # pylint:disable=not-callable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was holding the SQLA alchemy PR over LMS because this is annoying (not-callable)

I guess I'll have to bite the bullet and spirnkle a few of these there
hypothesis/lms#5283

@seanh seanh force-pushed the db branch 2 times, most recently from 0b3210c to 35c1239 Compare July 25, 2023 16:03
seanh added a commit that referenced this pull request Jul 26, 2023
Context
-------

At startup (when you run `make dev`) Via's Gunicorn processes attempt to
create the DB tables if they don't already exist. This is done in dev
only, not in production:

    # via/db.py

    def includeme(config):
        ...

        if config.registry.settings.get("dev"):
            # Create the database tables if they don't already exist.
            Base.metadata.create_all(engine)

Problem
-------

Via actually has four Gunicorn workers in dev (see the `workers = 4` in
`conf/gunicorn/dev.conf.py`). All 4 try to create the DB tables at the
same time and 3 of them crash, printing out error messages.

The issue is kind of difficult to produce:

1. You need to actually have a model, so you won't be able to produce
   the issue on the `db` branch without changes. See
   #1072 and
   #1093 for branches on which you
   can reproduce the issue.

2. The model's table needs to actually be missing from the DB table. You
   can do that by running this command:

       make services args=down && make services && make dev

3. Since the issue is a race condition you won't be able to reproduce it
   reliably, but if you run the above command several times you should
   see it.

Solution
--------

This commit moves the `sqlalchemy.create_all()` call out of the Gunicorn
worker process and instead adds a separate `make_db` process to the list
of processes that `supervisord-dev.conf` runs when you run `make dev`.
This new `make_db` process starts up, calls `create_all()`, and then
exits, and since it's a singular process there's no collision.
seanh added a commit that referenced this pull request Jul 26, 2023
Context
-------

At startup (when you run `make dev`) Via's Gunicorn processes attempt to
create the DB tables if they don't already exist. This is done in dev
only, not in production:

    # via/db.py

    def includeme(config):
        ...

        if config.registry.settings.get("dev"):
            # Create the database tables if they don't already exist.
            Base.metadata.create_all(engine)

Problem
-------

Via actually has four Gunicorn workers in dev (see the `workers = 4` in
`conf/gunicorn/dev.conf.py`). All 4 try to create the DB tables at the
same time and 3 of them crash, printing out error messages.

The issue is kind of difficult to produce:

1. You need to actually have a model, so you won't be able to produce
   the issue on the `db` branch without changes. See
   #1072 and
   #1093 for branches on which you
   can reproduce the issue.

2. The model's table needs to actually be missing from the DB table. You
   can do that by running this command:

       make services args=down && make services && make dev

3. Since the issue is a race condition you won't be able to reproduce it
   reliably, but if you run the above command several times you should
   see it.

Solution
--------

This commit moves the `sqlalchemy.create_all()` call out of the Gunicorn
worker process and instead adds a separate `make_db` process to the list
of processes that `supervisord-dev.conf` runs when you run `make dev`.
This new `make_db` process starts up, calls `create_all()`, and then
exits, and since it's a singular process there's no collision.
seanh added a commit that referenced this pull request Jul 26, 2023
Context
-------

At startup (when you run `make dev`) Via's Gunicorn processes attempt to
create the DB tables if they don't already exist. This is done in dev
only, not in production:

    # via/db.py

    def includeme(config):
        ...

        if config.registry.settings.get("dev"):
            # Create the database tables if they don't already exist.
            Base.metadata.create_all(engine)

Problem
-------

Via actually has four Gunicorn workers in dev (see the `workers = 4` in
`conf/gunicorn/dev.conf.py`). All 4 try to create the DB tables at the
same time and 3 of them crash, printing out error messages.

The issue is kind of difficult to produce:

1. You need to actually have a model, so you won't be able to produce
   the issue on the `db` branch without changes. See
   #1072 and
   #1093 for branches on which you
   can reproduce the issue.

2. The model's table needs to actually be missing from the DB table. You
   can do that by running this command:

       make services args=down && make services && make dev

3. Since the issue is a race condition you won't be able to reproduce it
   reliably, but if you run the above command several times you should
   see it.

Solution
--------

This commit moves the `sqlalchemy.create_all()` call out of the Gunicorn
worker process and instead adds a separate `make_db` process to the list
of processes that `supervisord-dev.conf` runs when you run `make dev`.
This new `make_db` process starts up, calls `create_all()`, and then
exits, and since it's a singular process there's no collision.
@seanh seanh force-pushed the db branch 2 times, most recently from 1ec445e to 7f96bf1 Compare July 27, 2023 13:26
seanh added a commit that referenced this pull request Jul 27, 2023
Context
-------

At startup (when you run `make dev`) Via's Gunicorn processes attempt to
create the DB tables if they don't already exist. This is done in dev
only, not in production:

    # via/db.py

    def includeme(config):
        ...

        if config.registry.settings.get("dev"):
            # Create the database tables if they don't already exist.
            Base.metadata.create_all(engine)

Problem
-------

Via actually has four Gunicorn workers in dev (see the `workers = 4` in
`conf/gunicorn/dev.conf.py`). All 4 try to create the DB tables at the
same time and 3 of them crash, printing out error messages.

The issue is kind of difficult to produce:

1. You need to actually have a model, so you won't be able to produce
   the issue on the `db` branch without changes. See
   #1072 and
   #1093 for branches on which you
   can reproduce the issue.

2. The model's table needs to actually be missing from the DB table. You
   can do that by running this command:

       make services args=down && make services && make dev

3. Since the issue is a race condition you won't be able to reproduce it
   reliably, but if you run the above command several times you should
   see it.

Solution
--------

This commit moves the `sqlalchemy.create_all()` call out of the Gunicorn
worker process and instead adds a separate `make_db` process to the list
of processes that `supervisord-dev.conf` runs when you run `make dev`.
This new `make_db` process starts up, calls `create_all()`, and then
exits, and since it's a singular process there's no collision.
seanh added a commit that referenced this pull request Jul 27, 2023
Context
-------

At startup (when you run `make dev`) Via's Gunicorn processes attempt to
create the DB tables if they don't already exist. This is done in dev
only, not in production:

    # via/db.py

    def includeme(config):
        ...

        if config.registry.settings.get("dev"):
            # Create the database tables if they don't already exist.
            Base.metadata.create_all(engine)

Problem
-------

Via actually has four Gunicorn workers in dev (see the `workers = 4` in
`conf/gunicorn/dev.conf.py`). All 4 try to create the DB tables at the
same time and 3 of them crash, printing out error messages.

The issue is kind of difficult to reproduce:

1. You need to actually have a model, so you won't be able to produce
   the issue on the `db` branch without changes. See
   #1072 and
   #1093 for branches on which you
   can reproduce the issue.

2. The model's table needs to actually be missing from the DB table. You
   can do that by running this command:

       make services args=down && make services && make dev

3. Since the issue is a race condition you won't be able to reproduce it
   reliably, but if you run the above command several times you should
   see it.

Solution
--------

This commit moves the `sqlalchemy.create_all()` call out of the Gunicorn
worker process and instead adds a separate `make_db` process to the list
of processes that `supervisord-dev.conf` runs when you run `make dev`.
This new `make_db` process starts up, calls `create_all()`, and then
exits, and since it's a singular process there's no collision.
seanh added a commit that referenced this pull request Jul 30, 2023
Context
-------

At startup (when you run `make dev`) Via's Gunicorn processes attempt to
create the DB tables if they don't already exist. This is done in dev
only, not in production:

    # via/db.py

    def includeme(config):
        ...

        if config.registry.settings.get("dev"):
            # Create the database tables if they don't already exist.
            Base.metadata.create_all(engine)

Problem
-------

Via actually has four Gunicorn workers in dev (see the `workers = 4` in
`conf/gunicorn/dev.conf.py`). All 4 try to create the DB tables at the
same time and 3 of them crash, printing out error messages.

The issue is kind of difficult to reproduce:

1. You need to actually have a model, so you won't be able to produce
   the issue on the `db` branch without changes. See
   #1072 and
   #1093 for branches on which you
   can reproduce the issue.

2. The model's table needs to actually be missing from the DB table. You
   can do that by running this command:

       make services args=down && make services && make dev

3. Since the issue is a race condition you won't be able to reproduce it
   reliably, but if you run the above command several times you should
   see it.

Solution
--------

This commit moves the `sqlalchemy.create_all()` call out of the Gunicorn
worker process and instead adds a separate `make_db` process to the list
of processes that `supervisord-dev.conf` runs when you run `make dev`.
This new `make_db` process starts up, calls `create_all()`, and then
exits, and since it's a singular process there's no collision.
seanh added a commit that referenced this pull request Jul 30, 2023
Context
-------

At startup (when you run `make dev`) Via's Gunicorn processes attempt to
create the DB tables if they don't already exist. This is done in dev
only, not in production:

    # via/db.py

    def includeme(config):
        ...

        if config.registry.settings.get("dev"):
            # Create the database tables if they don't already exist.
            Base.metadata.create_all(engine)

Problem
-------

Via actually has four Gunicorn workers in dev (see the `workers = 4` in
`conf/gunicorn/dev.conf.py`). All 4 try to create the DB tables at the
same time and 3 of them crash, printing out error messages.

The issue is kind of difficult to reproduce:

1. You need to actually have a model, so you won't be able to produce
   the issue on the `db` branch without changes. See
   #1072 and
   #1093 for branches on which you
   can reproduce the issue.

2. The model's table needs to actually be missing from the DB table. You
   can do that by running this command:

       make services args=down && make services && make dev

3. Since the issue is a race condition you won't be able to reproduce it
   reliably, but if you run the above command several times you should
   see it.

Solution
--------

This commit moves the `sqlalchemy.create_all()` call out of the Gunicorn
worker process and instead adds a separate `make_db` process to the list
of processes that `supervisord-dev.conf` runs when you run `make dev`.
This new `make_db` process starts up, calls `create_all()`, and then
exits, and since it's a singular process there's no collision.
Base automatically changed from db to main July 31, 2023 12:16
seanh added a commit that referenced this pull request Jul 31, 2023
Context
-------

At startup (when you run `make dev`) Via's Gunicorn processes attempt to
create the DB tables if they don't already exist. This is done in dev
only, not in production:

    # via/db.py

    def includeme(config):
        ...

        if config.registry.settings.get("dev"):
            # Create the database tables if they don't already exist.
            Base.metadata.create_all(engine)

Problem
-------

Via actually has four Gunicorn workers in dev (see the `workers = 4` in
`conf/gunicorn/dev.conf.py`). All 4 try to create the DB tables at the
same time and 3 of them crash, printing out error messages.

The issue is kind of difficult to reproduce:

1. You need to actually have a model, so you won't be able to produce
   the issue on the `db` branch without changes. See
   #1072 and
   #1093 for branches on which you
   can reproduce the issue.

2. The model's table needs to actually be missing from the DB table. You
   can do that by running this command:

       make services args=down && make services && make dev

3. Since the issue is a race condition you won't be able to reproduce it
   reliably, but if you run the above command several times you should
   see it.

Solution
--------

This commit moves the `sqlalchemy.create_all()` call out of the Gunicorn
worker process and instead adds a separate `make_db` process to the list
of processes that `supervisord-dev.conf` runs when you run `make dev`.
This new `make_db` process starts up, calls `create_all()`, and then
exits, and since it's a singular process there's no collision.
@seanh seanh force-pushed the cache-video-titles branch 2 times, most recently from 60b0edb to 964fb9f Compare July 31, 2023 14:35
@seanh seanh changed the base branch from main to transcript July 31, 2023 14:35
Comment on lines 7 to 11
class Video(CreatedUpdatedMixin, Base):
__tablename__ = "video"

id: Mapped[int] = mapped_column(primary_key=True, autoincrement=True, init=False)
video_id: Mapped[str] = mapped_column(unique=True)
title: Mapped[str] = mapped_column(repr=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we've done with transcripts (over in #1072) I've decided not to namespace videos by type/source yet (e.g. by adding a type="youtube" column). We can add this later if and when we ever add support for a second type of video

@seanh seanh marked this pull request as ready for review July 31, 2023 15:12
@seanh seanh requested a review from marcospri July 31, 2023 16:35
Base automatically changed from transcript to main August 1, 2023 14:06
@seanh seanh force-pushed the cache-video-titles branch 2 times, most recently from 1a75004 to d99233a Compare August 1, 2023 16:58
@seanh seanh changed the base branch from main to video-table-migration August 1, 2023 16:58
Copy link
Member

@marcospri marcospri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

Added the do-not-merge tag here just as remainder to deploy and run the migration first

# https://developers.google.com/youtube/v3/docs/videos/list
video = self._db.scalar(select(Video).where(Video.video_id == video_id))

if video:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess is now idiomatic python to use if video := here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Nice! Done

class Video(AutoincrementingIntegerIDMixin, CreatedUpdatedMixin, Base):
__tablename__ = "video"

video_id: Mapped[str] = mapped_column(unique=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we already have transcripts in the DB it's probably too complicated to be worth it but we could have a FK in transcripts pointing to this table.

We can join now by video_id if we need to, it has an index in both tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about this but I didn't really think it was worth the complexity. We don't need an fkey in order to look up a Video given a Transcript, as you said. So I'm not sure an fkey would really buy us anything. In fact, I think we might actually not want to enforce that every Video in the DB must have a matching Transcript or vice-versa, the order of operations (API calls etc) may call for the Video alone to first be saved and then the transcript not downloaded until a later request or vice-versa. I suppose this does mean that any code that has a Video cannot assume that a matching Transcript exists or vice-versa, which may lead to a little more code complexity at some point.

Base automatically changed from video-table-migration to main August 3, 2023 11:36
@seanh seanh merged commit 4687525 into main Aug 3, 2023
7 checks passed
@seanh seanh deleted the cache-video-titles branch August 3, 2023 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Save video titles in the DB
2 participants