Check the data sets #1
Comments
With the changes that led to v0.2.0, the most severe issues in the data sets should be fixed. The fixes led to a significant improvement in performance. I'll leave this issue open, as a second pair of eyes would still be really useful.
Hello,
Hello @papoteur-mga! The method of checking is very open. What I did so far is:
So the task is actually very open and will require a lot of time. I marked it as "Good first issue" as it doesn't require an understanding of
Yes, one can do that, but you have to be careful: if we only did this, the transformer would likely optimize itself on MuseScore-rendered files, which you want to avoid. So having different sources for your data sets is expected to be beneficial. The semantic format is nothing official and seems to be used only in OMR research, so I don't think there is a standard tool for doing this, but the script mentioned above would be a starting point; it's far from perfect itself.
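For context, the semantic encoding (in the style of the PrIMuS data set) represents a single staff as a flat sequence of tokens. An illustrative line, made up for this example rather than taken from any of the data sets, might look like this:

```
clef-G2 keySignature-DM timeSignature-3/4 note-D4_quarter note-F#4_eighth note-A4_eighth barline note-B4_half rest-quarter barline
```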
Here is a thought I had which might automate the process a bit. It has the downside that you need to retrain the transformer, which is time-consuming and requires a GPU:
The idea is that if a data set was excluded from training, then the transformer itself can be used to gauge how well that data matches what was learned from the other data sets. Of course, one has to be careful not to remove something which is novel in one set. This process likely requires some coding to apply the transformer to the excluded data set and to compare the semantic results; https://github.com/liebharc/homr/blob/main/validation/symbol_error_rate.py can be a template for how this is done, and a rough sketch follows below. Oh, and in general: if anyone is aware of another data set which could be used for training, we could add it and see how it affects the performance. And assuming that you are located in France: Happy holidays!
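As a rough illustration of that comparison, here is a minimal sketch of a symbol error rate check over a held-out data set. It assumes a hypothetical `predict(image_path)` callable that runs the trained transformer and returns a semantic token string, and a hypothetical file layout where each `.semantic` file sits next to a `.png` of the same name; the error-rate idea loosely follows validation/symbol_error_rate.py rather than reproducing it.

```python
# Minimal sketch: symbol error rate of the transformer on an excluded data set.
# `predict` is a hypothetical stand-in for running the trained model.
from pathlib import Path

import editdistance  # pip install editdistance


def symbol_error_rate(expected: str, actual: str) -> float:
    """Edit distance between token sequences, normalized by the reference length."""
    expected_tokens = expected.split()
    actual_tokens = actual.split()
    return editdistance.eval(expected_tokens, actual_tokens) / max(len(expected_tokens), 1)


def check_excluded_dataset(dataset_dir: Path, predict) -> None:
    rates = []
    for semantic_path in sorted(dataset_dir.rglob("*.semantic")):
        image_path = semantic_path.with_suffix(".png")  # assumed naming convention
        rate = symbol_error_rate(semantic_path.read_text(), predict(image_path))
        rates.append((rate, semantic_path))
    # Samples with the highest error rate are the first candidates for manual review.
    for rate, path in sorted(rates, reverse=True)[:20]:
        print(f"{rate:.2f}  {path}")
```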
Introduction
The efficacy of a transformer model is significantly influenced by the quality of its training data. However, the original training dataset used by https://github.com/NetEase/Polyphonic-TrOMR/tree/master remains unpublished. Consequently, this repository relies on https://github.com/liebharc/Polyphonic-TrOMR/tree/master, which trains the transformer on datasets sourced from https://grfia.dlsi.ua.es/primus/, https://sites.google.com/view/multiscore-project/datasets, and https://github.com/itec-hust/CPMS. Notably, the grandstaff dataset requires extensive preprocessing, including the segmentation of the grand staff into individual staves (a sketch of this step is below). In the past, significant performance improvements have been achieved by rectifying errors in the datasets, such as in stave segmentation, accidental placement, or the conversion of humdrum files into the TrOMR semantic format.
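To make the stave-segmentation step concrete, here is a minimal sketch of one common approach: splitting a grand-staff image at the emptiest row near the vertical centre, found via a horizontal ink projection. This illustrates the general technique under simple assumptions (two staves, roughly horizontal, with a clear gap between them); it is not the preprocessing code actually used in this repository.

```python
# Minimal sketch: split a grand-staff image into two single staves by finding
# the row with the least ink near the vertical centre. Illustrative only.
import numpy as np
from PIL import Image


def split_grand_staff(image_path: str) -> tuple[np.ndarray, np.ndarray]:
    gray = np.array(Image.open(image_path).convert("L"))
    ink = 255 - gray.astype(int)   # dark pixels carry the notation
    profile = ink.sum(axis=1)      # horizontal projection: total ink per row
    height = gray.shape[0]
    # Search only the middle third, where the gap between the staves should be.
    lo, hi = height // 3, 2 * height // 3
    split = lo + int(np.argmin(profile[lo:hi]))
    return gray[:split, :], gray[split:, :]
```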
The Task Itself
It would be helpful to have another set of eyes go through all the datasets, especially the grandstaff one: take a peek at some random staff images and their corresponding semantic representations (a sketch of one way to sample pairs follows below). If you spot any issues, we should either tweak our preprocessing to fix them or kick the problematic cases out of the datasets, so that they don't confuse the transformer during training.
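One low-tech way to do this spot check, assuming image/semantic pairs share a base name (a hypothetical layout; adjust the paths to match the actual data sets):

```python
# Minimal sketch: show random (image, semantic) pairs for manual review.
import random
from pathlib import Path

from PIL import Image

DATASET_DIR = Path("datasets/grandstaff")  # hypothetical location

semantic_files = sorted(DATASET_DIR.rglob("*.semantic"))
for semantic_path in random.sample(semantic_files, k=min(10, len(semantic_files))):
    image_path = semantic_path.with_suffix(".png")  # assumed naming convention
    print(f"\n{image_path}")
    print(semantic_path.read_text().strip())
    Image.open(image_path).show()                   # opens the default image viewer
    input("Press Enter for the next sample...")
```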
Update
The CPMS dataset has been removed for now, and the "Lieder" dataset has been added. The task itself remains important.