Return the oldest transcript matching the video ID #1163

seanh · 2023-08-15T11:32:59Z

Just return the oldest saved transcript in the DB that matches the given video_id, instead of also searching for the hard-coded transcript_id value of "en".

This is a slightly simplified rewrite of #1159.

YouTubeService.get_transcript() currently always saves transcripts in the DB with transcript.transcript_id="en" (hard-coded) and it always queries the DB for a transcript with transcript.video_id=video_id and transcript.transcript_id="en".

In #1162 we're going to start saving transcripts in the DB with different transcript_id's (we're going to be generating unique transcript IDs that work even when a video has multiple transcripts with the same language code) so we can no longer always query for a hard-coded transcript.transcript_id="en".

Instead, this PR changes get_transcript() to query for transcript.video_id=video_id only and then, since there could be more than one saved transcript with the same video_id, it just returns the oldest saved transcript.

This should work, and it'll also mean that the default transcripts of previously-annotated videos and assignments don't change when we change the default transcript selection algorithm.

What we should really do here (probably) is persist the choice of default transcript by adding a nullable video.default_transcript column to the video table (foreign key to the transcript table). But for the sake of expedience this will do for now. We can always back-fill a video.default_transcript column in future by finding the oldest saved transcript for each video_id.

Just return the oldest saved transcript in the DB that matches the given `video_id`, instead of also searching for the hard-coded `transcript_id` value of `"en"`.

seanh · 2023-08-15T12:03:32Z

tests/factories/transcript.py

+    transcript = Sequence(
+        lambda n: [
+            {
+                "text": f"This is the first line of transcript {n}",
+                "start": 0.0,
+                "duration": 7.52,
+            },
+            {
+                "text": f"This is the second line of transcript {n}",
+                "start": 5.6,
+                "duration": 4.72,
+            },
+        ]
+    )


Fixed TranscriptFactory so that each Transcript object has a different value for Transcript.transcript. This is needed by the tests below, and is probably a good idea in general: TranscriptFactory always returning Transcript's with the same Transcript.transcript is a bit of a footgun (tests that compare the values of two Transcript.transcript's could have false positives thinking they came from the same Transcript when in fact they belonged to different ones).

seanh · 2023-08-15T12:05:07Z

tests/unit/via/services/youtube_test.py

@@ -116,7 +117,6 @@ def test_get_transcript(self, db_session, svc, YouTubeTranscriptApi):
        ]

    @pytest.mark.usefixtures("db_session")
-    @pytest.mark.parametrize("transcript__transcript_id", ["en"])


This parametrize was fixing the value of transcript.transcript_id to "en" because get_transcript()'s DB query would only find transcripts with Transcript.transcript_id="en". It's no longer necessary for the test to do this.

marcospri

Looks good 👍

Tested in my local DB with an existing transcript from main and then:

insert into transcript (select now(), now(), 2, video_id, transcript_id || 'new', transcript from transcript limit 1);

update trasnscript set transcript='[]' where id = 2;

all good

update trasnscript set transcript='[]' where id = 1;

no transcript lines now

seanh requested a review from marcospri August 15, 2023 11:33

seanh mentioned this pull request Aug 15, 2023

Return the first transcript we find, instead of searching only for "en" #1159

Closed

Return the oldest transcript matching the video ID

f431a6d

Just return the oldest saved transcript in the DB that matches the given `video_id`, instead of also searching for the hard-coded `transcript_id` value of `"en"`.

seanh force-pushed the query-for-oldest-transcript branch from afbf2b0 to f431a6d Compare August 15, 2023 12:01

seanh commented Aug 15, 2023

View reviewed changes

seanh marked this pull request as ready for review August 15, 2023 12:06

marcospri approved these changes Aug 16, 2023

View reviewed changes

marcospri merged commit e54c2f6 into main Aug 22, 2023
7 checks passed

marcospri deleted the query-for-oldest-transcript branch August 22, 2023 07:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return the oldest transcript matching the video ID #1163

Return the oldest transcript matching the video ID #1163

seanh commented Aug 15, 2023 •

edited

Loading

seanh Aug 15, 2023

seanh Aug 15, 2023

marcospri left a comment

Return the oldest transcript matching the video ID #1163

Return the oldest transcript matching the video ID #1163

Conversation

seanh commented Aug 15, 2023 • edited Loading

seanh Aug 15, 2023

Choose a reason for hiding this comment

seanh Aug 15, 2023

Choose a reason for hiding this comment

marcospri left a comment

Choose a reason for hiding this comment

seanh commented Aug 15, 2023 •

edited

Loading