-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return the oldest transcript matching the video ID #1163
Conversation
Just return the oldest saved transcript in the DB that matches the given `video_id`, instead of also searching for the hard-coded `transcript_id` value of `"en"`.
afbf2b0
to
f431a6d
Compare
transcript = Sequence( | ||
lambda n: [ | ||
{ | ||
"text": f"This is the first line of transcript {n}", | ||
"start": 0.0, | ||
"duration": 7.52, | ||
}, | ||
{ | ||
"text": f"This is the second line of transcript {n}", | ||
"start": 5.6, | ||
"duration": 4.72, | ||
}, | ||
] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed TranscriptFactory
so that each Transcript
object has a different value for Transcript.transcript
. This is needed by the tests below, and is probably a good idea in general: TranscriptFactory
always returning Transcript
's with the same Transcript.transcript
is a bit of a footgun (tests that compare the values of two Transcript.transcript
's could have false positives thinking they came from the same Transcript
when in fact they belonged to different ones).
@@ -116,7 +117,6 @@ def test_get_transcript(self, db_session, svc, YouTubeTranscriptApi): | |||
] | |||
|
|||
@pytest.mark.usefixtures("db_session") | |||
@pytest.mark.parametrize("transcript__transcript_id", ["en"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parametrize
was fixing the value of transcript.transcript_id
to "en"
because get_transcript()
's DB query would only find transcripts with Transcript.transcript_id="en"
. It's no longer necessary for the test to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good 👍
Tested in my local DB with an existing transcript from main and then:
insert into transcript (select now(), now(), 2, video_id, transcript_id || 'new', transcript from transcript limit 1);
update trasnscript set transcript='[]' where id = 2;
all good
update trasnscript set transcript='[]' where id = 1;
no transcript lines now
Just return the oldest saved transcript in the DB that matches the given
video_id
, instead of also searching for the hard-codedtranscript_id
value of"en"
.This is a slightly simplified rewrite of #1159.
YouTubeService.get_transcript()
currently always saves transcripts in the DB withtranscript.transcript_id="en"
(hard-coded) and it always queries the DB for a transcript withtranscript.video_id=video_id
andtranscript.transcript_id="en"
.In #1162 we're going to start saving transcripts in the DB with different
transcript_id
's (we're going to be generating unique transcript IDs that work even when a video has multiple transcripts with the same language code) so we can no longer always query for a hard-codedtranscript.transcript_id="en"
.Instead, this PR changes
get_transcript()
to query fortranscript.video_id=video_id
only and then, since there could be more than one saved transcript with the samevideo_id
, it just returns the oldest saved transcript.This should work, and it'll also mean that the default transcripts of previously-annotated videos and assignments don't change when we change the default transcript selection algorithm.
What we should really do here (probably) is persist the choice of default transcript by adding a nullable
video.default_transcript
column to thevideo
table (foreign key to thetranscript
table). But for the sake of expedience this will do for now. We can always back-fill avideo.default_transcript
column in future by finding the oldest saved transcript for eachvideo_id
.