-
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added chapter-level filtering #161
Conversation
My csharpier was broken (+ I forgot about the mismatched .NET versions). I'll go ahead and take care of that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than require that all engine implementations be able to parse the Scripture range string, I think it would be better to perform the parsing in Serval. Serval will already have a dependency on SIL.Machine for USFM generation, so we can do the same for the Scripture range parsing.
Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine.AspNetCore/Services/BiblicalRangeStringParser.cs
line 0 at r1 (raw file):
I order to better match up with Machine.py:
- this class should be moved to
SIL.Machine/Scripture
- Turned into a static class and renamed to
ScriptureRangeParser
- The
Parse
should be made static and renamed toGetChapters
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then pass a dictionary rather than a string from Serval to Machine? You're not suggesting the other filtering logic is somehow put into Serval, right?
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)
src/SIL.Machine.AspNetCore/Services/BiblicalRangeStringParser.cs
line at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I order to better match up with Machine.py:
- this class should be moved to
SIL.Machine/Scripture
- Turned into a static class and renamed to
ScriptureRangeParser
- The
Parse
should be made static and renamed toGetChapters
.
That seems reasonable.
So, it's calculated for every row of text? That should be fixed... |
So, to be clear, if the verse is in either the source or the target versification, it will work? Is that what we want? |
Yes, apparently, I missed my own TODO haha. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we would just parse the string into a data structure that we pass to Machine.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
Also this one needs pulled out of the loop too. |
Previously, johnml1135 (John Lambert) wrote…
Yep! Thank you |
We should probably only do these checks if there is a biblical range - if not, just skip it. Less computation? |
What is the purpose of these term renderings and other files? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @Enkidu93)
tests/SIL.Machine.AspNetCore.Tests/Services/BiblicalRangeStringParserTests.cs
line 73 at r1 (raw file):
yield return new TestCaseData("NT,OT,-MRK,-EXO", new Dictionary<string, List<int>>(), true); yield return new TestCaseData("OT,MAT1", new Dictionary<string, List<int>>(), true); yield return new TestCaseData("OT,MAT-LUK", new Dictionary<string, List<int>>(), true);
Is this the same list from python?
Hmmm. It could make errors easier to handle. Wouldn't it also be good then to pull textId filtering into Serval as well? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think the idea is to keep the filtering in Machine, just move the actual parsing of the string to Machine. Ultimately, there will still be filtration code in all engine jobs regardless - it just means that Serval is responsible for parsing the range - which kind of makes sense semantically. This way too, we can catch that parsing error earlier on.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
tests/SIL.Machine.AspNetCore.Tests/Services/BiblicalRangeStringParserTests.cs
line 73 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Is this the same list from python?
Yep, all the same test cases
tests/SIL.Machine.AspNetCore.Tests/Services/data/paratext2/TermRenderings.xml
line 2 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
What is the purpose of these term renderings and other files?
Originally, our only Paratext NmtPreprocessBuildJob
tests were mapping a single Paratext project as source and target. That didn't allow me enough control to make good biblical range tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ddaspit I'm realizing the downside of this approach is that we don't have access to the versification in Serval (which the ScriptureRangeParser
is using to determine books/counts accurately). I'm not sure how we'd get around that unless you don't imagine that will really be an issue. Let me know as soon as possible what you think.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to write code that accesses the versification for the USFM generation PR. I will abstract that functionality into a service that you can use.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I guess if you're interdependent, you might as well go all the way haha. Thank you Let me know when that's in place.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
Chapters or ScriptureRange? |
Previously, johnml1135 (John Lambert) wrote…
I see Damien's comment at the bottom. Ok. |
That logic appears solid, namely:
The only thing I wonder is if could be optimized more (see here for intersection explanation - https://stackoverflow.com/questions/14527595/intersection-of-two-sets-in-most-optimized-way). An alternative could be:
It may save some cycles - what do you think? I am fine with it as is as I am not anticipating a meaningful performance hit even with O(NMP). |
Just to clarify how references work. There can be multiple references - is it the source and target reference? Are we assuming only one actual verse reference? And therefore only one book (Daniel vs. Bell and the Dragon)? It's a severe edge case - just checking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93)
src/SIL.Machine/Scripture/ScriptureRangeParser.cs
line 8 at r2 (raw file):
using SIL.Scripture; public class ScriptureRangeParser
It would be nice to add a static convenience method GetChapters(string chapterSelection, ScrVers versification = null)
.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs
line 138 at r2 (raw file):
{ Dictionary<string, List<int>>? rowChaptersPerBook = null; if (corpus.TrainOnChapters != null || corpus.PretranslateChapters != null)
This seems much simpler and clearer to me:
bool IsInChapters(Dictionary<string, List<int>> bookChapters, object rowRef)
{
if (rowRef is not VerseRef vr)
return false;
return bookChapters.TryGetValue(vr.Book, out List<int>? chapters)
&& chapters.Contains(vr.ChapterNum);
}
if (corpus.TrainOnChapters is not null)
isInTrainOnChapters = row.Refs.Any(r => IsInChapters(corpus.TrainOnChapters, r));
if (corpus.PretranslateChapters is not null)
isInPretranslateChapters = row.Refs.Any(r => IsInChapters(corpus.PretranslateChapters, r));
Also, it would be slightly more efficient ifTrainOnChapters
and PretranslateChapters
were Dictionary<string, HashSet<int>>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs
line 137 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
We should probably only do these checks if there is a biblical range - if not, just skip it. Less computation?
Yeah - and that should be rolled in with the refactoring of the logic Damien suggests.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs
line 142 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
So, to be clear, if the verse is in either the source or the target versification, it will work? Is that what we want?
I'm not sure what you mean here. Are you saying we should just use one of either SourceRefs or TargetRefs here (whichever we use Serval-side)? The refs is kind of a shorthand for SourceRefs if there are any; otehrwise, targetrefs.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs
line 138 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This seems much simpler and clearer to me:
bool IsInChapters(Dictionary<string, List<int>> bookChapters, object rowRef) { if (rowRef is not VerseRef vr) return false; return bookChapters.TryGetValue(vr.Book, out List<int>? chapters) && chapters.Contains(vr.ChapterNum); } if (corpus.TrainOnChapters is not null) isInTrainOnChapters = row.Refs.Any(r => IsInChapters(corpus.TrainOnChapters, r)); if (corpus.PretranslateChapters is not null) isInPretranslateChapters = row.Refs.Any(r => IsInChapters(corpus.PretranslateChapters, r));
Also, it would be slightly more efficient if
TrainOnChapters
andPretranslateChapters
wereDictionary<string, HashSet<int>>
.
Good call. This can be collapsed in a way that was more difficult before when the parsing was here. My mind is working at flu speed at the moment, but I don't think your code would properly handle the 'empty list = all chapters' idiom, right? It's an easy addition, but it just brought to my attention that we should make sure we're all on the same page about that.
As for the HashSet - definitely, I can convert it to a hash set when it's being mapped (which is a cost in and of itself), but probably still cleaner.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs
line 156 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
That logic appears solid, namely:
- Both train on chapters and rows are dictionaries: book (string): chapters (list of ints)
- Join the two, looking for intersections. If there is an intersection, well and good. If, not, if both row and filter have a book, but the filter has no chapter, it is valid as well. Any matches, include the verse.
The only thing I wonder is if could be optimized more (see here for intersection explanation - https://stackoverflow.com/questions/14527595/intersection-of-two-sets-in-most-optimized-way). An alternative could be:
- Per row verse reference, get filter's book. if ref chapter in filter list, good, or if filter list size == 0. Any hits and it is included.
- This would be O(# of verse references in row * number of chapters in book * rows)
- We could be crazy about optimization and make an array 256 boolean for the filter and then check the chapter as a lookup in that array and get O(# of verse references in row * rows), but it would be more obtuse and give diminishing returns.
It may save some cycles - what do you think? I am fine with it as is as I am not anticipating a meaningful performance hit even with O(NMP).
Definitely - the optimizations are endless haha. Because this is bounded (we know how many chapters there are), I don't know how crazy we need to get here. Let me know if you'd like a motion in this direction.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs
line 183 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Just to clarify how references work. There can be multiple references - is it the source and target reference? Are we assuming only one actual verse reference? And therefore only one book (Daniel vs. Bell and the Dragon)? It's a severe edge case - just checking.
No, not source and target. If you peek at ParallelTextRow
, you'll see that the refs is shorthand for source refs if there are any; otherwise, target refs. For the most part, there should only be one verse reference. Damien can correct me, but I think the idea of multiple references is more that sometimes a chunk of verses is translated as a unit because of the peculiarities of a language (like say it's topic-comment and rearranging phrases from contiguous verses is necessary to make the reading natural). I could be way off haha. @ddaspit ?
Previously, Enkidu93 (Eli C. Lowry) wrote…
I think Damien resolves all optimization issues, including the hashset - bringing it to O(n). |
Previously, Enkidu93 (Eli C. Lowry) wrote…
It should be good enough. If we run into future problems, we can seek to address it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs
line 183 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
No, not source and target. If you peek at
ParallelTextRow
, you'll see that the refs is shorthand for source refs if there are any; otherwise, target refs. For the most part, there should only be one verse reference. Damien can correct me, but I think the idea of multiple references is more that sometimes a chunk of verses is translated as a unit because of the peculiarities of a language (like say it's topic-comment and rearranging phrases from contiguous verses is necessary to make the reading natural). I could be way off haha. @ddaspit ?
@Enkidu93, you are correct. That is an accurate description.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs
line 142 at r3 (raw file):
return false; return bookChapters.TryGetValue(vr.Book, out HashSet<int>? chapters) && (chapters.Contains(vr.ChapterNum) || chapters.Count() == 0);
You should use the Count
property instead of the Count()
LINQ method. Using Count()
can be dangerous, because in certain situations it computes the count by enumerating over the whole collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 14 of 15 files reviewed, 2 unresolved discussions (waiting on @ddaspit)
src/SIL.Machine/Scripture/ScriptureRangeParser.cs
line 8 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
It would be nice to add a static convenience method
GetChapters(string chapterSelection, ScrVers versification = null)
.
Done.
src/SIL.Machine.AspNetCore/Services/NmtPreprocessBuildJob.cs
line 142 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You should use the
Count
property instead of theCount()
LINQ method. UsingCount()
can be dangerous, because in certain situations it computes the count by enumerating over the whole collection.
Done. Thank you! Good catch. I'm so accustomed to using the method because I'm working with a generic IEnumerable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 5 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
8d461f8
to
b9a9aab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
Addresses sillsdev/serval#150
I'm noticing that (as far as I can tell) the "trainOn" logic is not present in SMT jobs. It makes sense why pretranslate logic wouldn't be there, but should we incorporate the trainOn logic? I won't add the biblicalRange logic to SMT for training until it's been verified that that was a mistake.
This change is