Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return the oldest transcript matching the video ID #1163

Merged
merged 1 commit into from
Aug 22, 2023

Conversation

seanh
Copy link
Contributor

@seanh seanh commented Aug 15, 2023

Just return the oldest saved transcript in the DB that matches the given video_id, instead of also searching for the hard-coded transcript_id value of "en".

This is a slightly simplified rewrite of #1159.

YouTubeService.get_transcript() currently always saves transcripts in the DB with transcript.transcript_id="en" (hard-coded) and it always queries the DB for a transcript with transcript.video_id=video_id and transcript.transcript_id="en".

In #1162 we're going to start saving transcripts in the DB with different transcript_id's (we're going to be generating unique transcript IDs that work even when a video has multiple transcripts with the same language code) so we can no longer always query for a hard-coded transcript.transcript_id="en".

Instead, this PR changes get_transcript() to query for transcript.video_id=video_id only and then, since there could be more than one saved transcript with the same video_id, it just returns the oldest saved transcript.

This should work, and it'll also mean that the default transcripts of previously-annotated videos and assignments don't change when we change the default transcript selection algorithm.

What we should really do here (probably) is persist the choice of default transcript by adding a nullable video.default_transcript column to the video table (foreign key to the transcript table). But for the sake of expedience this will do for now. We can always back-fill a video.default_transcript column in future by finding the oldest saved transcript for each video_id.

Just return the oldest saved transcript in the DB that matches the given
`video_id`, instead of also searching for the hard-coded `transcript_id`
value of `"en"`.
@seanh seanh force-pushed the query-for-oldest-transcript branch from afbf2b0 to f431a6d Compare August 15, 2023 12:01
Comment on lines +13 to +26
transcript = Sequence(
lambda n: [
{
"text": f"This is the first line of transcript {n}",
"start": 0.0,
"duration": 7.52,
},
{
"text": f"This is the second line of transcript {n}",
"start": 5.6,
"duration": 4.72,
},
]
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed TranscriptFactory so that each Transcript object has a different value for Transcript.transcript. This is needed by the tests below, and is probably a good idea in general: TranscriptFactory always returning Transcript's with the same Transcript.transcript is a bit of a footgun (tests that compare the values of two Transcript.transcript's could have false positives thinking they came from the same Transcript when in fact they belonged to different ones).

@@ -116,7 +117,6 @@ def test_get_transcript(self, db_session, svc, YouTubeTranscriptApi):
]

@pytest.mark.usefixtures("db_session")
@pytest.mark.parametrize("transcript__transcript_id", ["en"])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parametrize was fixing the value of transcript.transcript_id to "en" because get_transcript()'s DB query would only find transcripts with Transcript.transcript_id="en". It's no longer necessary for the test to do this.

@seanh seanh marked this pull request as ready for review August 15, 2023 12:06
Copy link
Member

@marcospri marcospri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

Tested in my local DB with an existing transcript from main and then:

insert into transcript (select now(), now(), 2, video_id, transcript_id || 'new', transcript from transcript limit 1);

update trasnscript set transcript='[]' where id = 2;  

all good

update trasnscript set transcript='[]' where id = 1;  

no transcript lines now

@marcospri marcospri merged commit e54c2f6 into main Aug 22, 2023
7 checks passed
@marcospri marcospri deleted the query-for-oldest-transcript branch August 22, 2023 07:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants