
Check the data sets #1

Open
liebharc opened this issue May 15, 2024 · 4 comments
Labels
good first issue · help wanted

Comments

@liebharc
Owner

liebharc commented May 15, 2024

Introduction

The efficacy of a transformer model is significantly influenced by the quality of its training data. However, the original training dataset utilized by https://github.com/NetEase/Polyphonic-TrOMR/tree/master remains unpublished. Consequently, this repository relies on https://github.com/liebharc/Polyphonic-TrOMR/tree/master, which trains the transformer on datasets sourced from https://grfia.dlsi.ua.es/primus/, https://sites.google.com/view/multiscore-project/datasets, and https://github.com/itec-hust/CPMS. Notably, for the grandstaff dataset, extensive preprocessing is essential, including the segmentation of the grandstaff into individual staves. In the past, significant improvements in performance have been achieved through rectifying errors in datasets, such as stave segmentation, accidental placement, or the conversion of humdrum files into the TrOMR semantic format.

The Task Itself

It would be helpful to have another set of eyes go through all the datasets, especially the grandstaff one. Just take a peek at some random staff images and their corresponding semantic representations. If you spot any issues, we should either tweak our preprocessing methods to fix them or just kick those problematic cases out of the datasets. That way, we won't confuse the transformer during training.

Update

The CPMS dataset has been removed for now. And the "Lieder" dataset has been added. The task itself remains important.

liebharc added the help wanted and good first issue labels on May 15, 2024
@liebharc
Owner Author

With the changes that led to v0.2.0, the most severe issues in the datasets should be fixed. The fixes led to a significant improvement in performance. I'll leave this issue open, as a second pair of eyes would still be really useful.

@papoteur-mga

Hello,
Can you explain the method of checking?
On another note: could datasets be generated from scores that are already digitized, producing both images and semantics? Are there tools for that?

@liebharc
Owner Author

Hello @papoteur-mga! The method of checking is very open. What I've done so far is:

  1. Run the steps mentioned in https://github.com/liebharc/homr/blob/main/Training.md to start the training process. This will download and process all datasets; you can abort the actual training, as its results aren't needed for this task.
  2. Run the script https://github.com/liebharc/homr/blob/main/training/show_examples_from_index.py on the index of a dataset (the index is just a text file listing all files of the dataset)
  3. The script will print random examples from the dataset, showing you the image and the notation.
  4. You then have to check manually whether image and notation really match. They do in most cases, which makes the task tedious. (A rough sketch of this kind of spot check follows after this list.)
  5. If an image doesn't match its notation, it might just be an outlier, in which case the image should be excluded, as it can degrade the model's performance. Alternatively, one can try to find the reason for the mismatch and debug the preprocessing steps to see if an improvement can be achieved.
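
For orientation, here is a stripped-down version of what such a spot check could look like. It is not the actual show_examples_from_index.py; it assumes the index is a plain text file with one image path per line and that the notation lives in a sibling *.semantic file - both assumptions about the dataset layout:

```python
# Minimal spot-check sketch (illustrative, not homr's actual script).
# Assumptions: the index file lists one image path per line, and the
# notation lives next to it as "<image>.semantic".
import random
import sys
from pathlib import Path

from PIL import Image  # pip install pillow

def spot_check(index_file: str, samples: int = 5) -> None:
    paths = [line.strip() for line in Path(index_file).read_text().splitlines()
             if line.strip()]
    for image_path in random.sample(paths, min(samples, len(paths))):
        semantic_path = Path(image_path).with_suffix(".semantic")  # assumed naming
        print(f"\n{image_path}\n{semantic_path.read_text().strip()}")
        Image.open(image_path).show()  # opens the system image viewer
        input("Press Enter for the next sample...")

if __name__ == "__main__":
    spot_check(sys.argv[1])
```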

So the task is actually very open and will require a lot of time. I marked it as "good first issue" as it doesn't require an understanding of homr itself; it's sufficient to get an understanding of the notation. It still might be a very time-consuming task - just as a warning :). What would be amazing is if one could find an automated way of doing this. However, one likely can't use the homr transformer for that, because it was trained on these datasets: if there is e.g. a systematic error, it will have learned that error. But perhaps sanity checks or utilizing another OMR software might help. That's all just very vague thoughts so far.
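
To make the sanity-check idea a bit more concrete: one cheap automated check that doesn't depend on the homr transformer is verifying that every token in a semantic line matches one of the expected token shapes, which catches garbled conversions early. The patterns below are illustrative guesses at a PrIMuS-style vocabulary, not homr's actual token set:

```python
import re

# Illustrative token shapes; the real vocabulary should be derived from
# homr's training code, not from this sketch.
TOKEN_PATTERNS = [
    re.compile(r"^clef-[CFG]\d$"),
    re.compile(r"^keySignature-[A-G][#b]?[Mm]?$"),
    re.compile(r"^timeSignature-\d+/\d+$"),
    re.compile(r"^note-[A-G][#b]{0,2}\d_[a-z_.]+$"),
    re.compile(r"^rest-[a-z_.]+$"),
    re.compile(r"^barline$"),
]

def unknown_tokens(semantic_line: str) -> list[str]:
    """Return all tokens that match none of the expected patterns."""
    return [token for token in semantic_line.split()
            if not any(p.match(token) for p in TOKEN_PATTERNS)]

# Any sample with unknown tokens is a candidate for manual review:
# print(unknown_tokens("clef-G2 keySignature-DM note-C4_quarter barline"))
```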

> On another note: could datasets be generated from scores that are already digitized, producing both images and semantics? Are there tools for that?

Yes, one can do that, and homr is doing it already. E.g. https://github.com/liebharc/homr/blob/main/training/convert_lieder.py takes https://github.com/OpenScore/Lieder, which is essentially a collection of *.mscx files, renders them with MuseScore to sheet music, distorts the images (to make the transformer more robust against distortions), splits them into staff images and creates a dataset from this.
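
As a rough illustration of the rendering step (not the actual convert_lieder.py), batch-rendering *.mscx files could look like this; the MuseScore binary name varies by platform and version, and MuseScore writes one PNG per page with a -1, -2, ... suffix:

```python
import subprocess
from pathlib import Path

# Adjust to your installation: "mscore", "musescore3", "mscore4portable", ...
MUSESCORE = "musescore3"

def render_scores(src_dir: str, out_dir: str) -> None:
    """Render every *.mscx score under src_dir to PNG pages in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for score in Path(src_dir).rglob("*.mscx"):
        target = Path(out_dir) / (score.stem + ".png")
        # "-o <file>.png" asks MuseScore to export the score; multi-page
        # scores come out as <file>-1.png, <file>-2.png, ...
        subprocess.run([MUSESCORE, "-o", str(target), str(score)], check=True)
```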

Now you have to be careful. If we only did this, the transformer would likely optimize itself for MuseScore-rendered files, which you want to avoid. So having different sources for your datasets is expected to be beneficial.
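
On the distortion step mentioned above: a mild random perspective warp is one typical way to make the transformer less sensitive to how a staff was rendered or scanned. Here is a minimal sketch with Pillow; the jitter amount is an illustrative choice, not the value homr actually uses:

```python
import random

from PIL import Image  # Pillow >= 9.1 for Image.Transform / Image.Resampling

def random_perspective(img: Image.Image, jitter: float = 0.02) -> Image.Image:
    """Warp the image by jittering the four corners of the source quad."""
    w, h = img.size
    dx, dy = jitter * w, jitter * h

    def jit(d: float) -> float:
        return random.uniform(-d, d)

    # Source quad corners in PIL order: upper left, lower left,
    # lower right, upper right.
    quad = (jit(dx), jit(dy),
            jit(dx), h + jit(dy),
            w + jit(dx), h + jit(dy),
            w + jit(dx), jit(dy))
    return img.transform((w, h), Image.Transform.QUAD, quad,
                         resample=Image.Resampling.BILINEAR, fillcolor="white")
```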

The semantic format is nothing official and seems to be used only in OMR research. So I don't think there is a standard tool for this, but the script mentioned above would be a starting point - it's far from perfect itself.

@liebharc
Owner Author

liebharc commented Dec 19, 2024

Here is a thought I had which might automate the process a bit. It has the downside that you need to retrain the transformer, which is time-consuming and requires a GPU:

  1. Remove one dataset from the training, e.g. remove grandstaff
  2. Retrain the transformer. You likely don't need the best performance, so you can reduce the number of epochs in https://github.com/liebharc/homr/blob/main/training/transformer/train.py to e.g. 10
  3. Now run the transformer on the excluded dataset - in the example, run it on grandstaff
  4. Compare the transformer output with the expected semantic data from the dataset
  5. For all results which show poor performance, investigate whether:
     • it is an individual outlier: exclude it from the dataset
     • there is a systematic error: try to fix the preprocessing
  6. Repeat the same process for the next dataset

The idea is that if the dataset was excluded from training, then the transformer itself can be used to get an understanding of how well the data matches what was learned from the other datasets. Of course, one has to be careful not to remove a set which contains something novel that the other sets don't cover.

This process likely requires some coding to apply the transformer to the excluded dataset and to compare the semantic results. https://github.com/liebharc/homr/blob/main/validation/symbol_error_rate.py can serve as a template for how this is done.
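
For the comparison in step 4, a minimal notion of a symbol error rate is the edit distance between the predicted and expected token sequences, normalized by the reference length. The sketch below is illustrative only; homr's validation/symbol_error_rate.py remains the authoritative reference:

```python
# Illustrative symbol error rate: token-level Levenshtein distance
# divided by the number of expected tokens.

def levenshtein(a: list[str], b: list[str]) -> int:
    """Edit distance between two token sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        curr = [i]
        for j, token_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (token_a != token_b),  # substitution
            ))
        prev = curr
    return prev[-1]

def symbol_error_rate(expected: str, predicted: str) -> float:
    """SER = edit distance / number of expected symbols."""
    ref, hyp = expected.split(), predicted.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

# Samples with a suspiciously high SER are candidates for manual review:
# print(symbol_error_rate("clef-G2 note-C4_quarter", "clef-G2 note-D4_quarter"))
```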

Oh, and in general: if anyone is aware of another dataset which could be used for training, we could add it and see how it affects performance.

And assuming that you are located in France: Happy holidays!
