Release v3.0.3 Training with WordPiece and Unigram + abc files support · Natooz/MidiTok

Highlights

Support for abc files, which can be loaded and dumped with symusic similarly to MIDI files;
The tokenizers can now also be trained with the WordPiece and Unigram algorithms!
Tokenizer training and token ids encoding can now be performed "bar-wise" or "beat-wise", meaning the tokenizer can learn new tokens from successions of base tokens strictly within bars or beats. This is set by the encode_ids_split attribute of the tokenizer config;
symusic v0.4.3 or higher is now required to comply with the usage of the clip method;
Better handling of file loading errors in DatasetMIDI and DataCollator;
Introducing a new filter_dataset to clean a dataset of MIDI/abc files before using it;
MMM tokenizer has been cleaned up, and is now fully modular: it now works on top of other tokenizations (REMI, TSD and MIDILike) to allow more flexibility and interoperability;
TokSequence objects can now be sliced and concatenated (e.g. seq3 = seq1[:50] + seq2[50:]);
TokSequence objects produced by a tokenizer can now be split into per-bar or per-beat subsequences;
Minor fixes, code improvements and cleaning.
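
To make the "bar-wise" idea concrete, here is a minimal, library-independent sketch of what restricting training to bars means: the token sequence is cut at each bar token, and adjacent pairs are only counted inside a chunk, so a learned token can never span a bar line. This is not MidiTok's implementation; the helper names (split_per_bar, count_pairs) and the token strings are illustrative assumptions. In MidiTok itself this behaviour is selected through the encode_ids_split attribute of the tokenizer config.

```python
from collections import Counter


def split_per_bar(tokens, bar_token="Bar_None"):
    """Split a flat token list into chunks, each starting at a bar token."""
    chunks, current = [], []
    for tok in tokens:
        # A bar token closes the chunk in progress and opens a new one.
        if tok == bar_token and current:
            chunks.append(current)
            current = []
        current.append(tok)
    if current:
        chunks.append(current)
    return chunks


def count_pairs(chunks):
    """Count adjacent token pairs inside each chunk, never across chunks."""
    counts = Counter()
    for chunk in chunks:
        counts.update(zip(chunk, chunk[1:]))
    return counts


# Illustrative token strings (assumed, not taken from a real vocabulary).
tokens = [
    "Bar_None", "Pitch_60", "Duration_1.0",
    "Bar_None", "Pitch_62", "Duration_1.0",
]
bars = split_per_bar(tokens)
pairs = count_pairs(bars)
```

Because the pair ("Duration_1.0", "Bar_None") straddles a bar line, it is never counted, so a merge-based trainer working from these counts could not learn a token that crosses bars.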

Methods renaming

A few methods and properties were previously named after "bpe" and "midi". To align with the more general usages of these methods (support for several file formats and training algorithms), they have been renamed with more idiomatic and accurate names.

Methods renamed with deprecation warning:

midi_to_tokens --> encode;
tokens_to_midi --> decode;
learn_bpe --> train;
apply_bpe --> encode_token_ids;
decode_bpe --> decode_token_ids;
ids_bpe_encoded --> are_ids_encoded;
vocab_bpe --> vocab_model;
tokenize_midi_dataset --> tokenize_dataset.

Methods renamed without deprecation warning (fewer usages, reduces code clutter):

MIDITokenizer --> MusicTokenizer;
augment_midi --> augment_score;
augment_midi_dataset --> augment_dataset;
augment_midi_multiple_offsets --> augment_score_multiple_offsets;
split_midis_for_training --> split_files_for_training;
split_midi_per_note_density --> split_score_per_note_density;
get_midi_programs --> get_score_programs;
merge_midis --> merge_scores;
get_midi_ticks_per_beat --> get_score_ticks_per_beat;
split_midi_per_ticks --> split_score_per_ticks;
split_midi_per_beats --> split_score_per_beats;
split_midi_per_tracks --> split_score_per_tracks;
concat_midis --> concat_scores.

Protected internal methods (no deprecation warning, advanced usages):

MIDITokenizer._tokens_to_midi --> MusicTokenizer._tokens_to_score;
MIDITokenizer._midi_to_tokens --> MusicTokenizer._score_to_tokens;
MIDITokenizer._create_midi_events --> MusicTokenizer._create_global_events.

There are no other compatibility issues besides these renamings.
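
When migrating existing code, the renamings listed in this section can be applied mechanically. The sketch below is a hypothetical helper (migrate_source is not part of MidiTok) that rewrites old identifiers to their new names in a source string. The mapping is copied from the lists in this section. It is a plain textual replacement on whole identifiers, so it would also rewrite unrelated names that happen to match; review the resulting diff before committing.

```python
import re

# Old name -> new name, copied from the renaming lists in these notes.
RENAMES = {
    "midi_to_tokens": "encode",
    "tokens_to_midi": "decode",
    "learn_bpe": "train",
    "apply_bpe": "encode_token_ids",
    "decode_bpe": "decode_token_ids",
    "ids_bpe_encoded": "are_ids_encoded",
    "vocab_bpe": "vocab_model",
    "tokenize_midi_dataset": "tokenize_dataset",
    "MIDITokenizer": "MusicTokenizer",
    "augment_midi": "augment_score",
    "augment_midi_dataset": "augment_dataset",
    "augment_midi_multiple_offsets": "augment_score_multiple_offsets",
    "split_midis_for_training": "split_files_for_training",
    "split_midi_per_note_density": "split_score_per_note_density",
    "get_midi_programs": "get_score_programs",
    "merge_midis": "merge_scores",
    "get_midi_ticks_per_beat": "get_score_ticks_per_beat",
    "split_midi_per_ticks": "split_score_per_ticks",
    "split_midi_per_beats": "split_score_per_beats",
    "split_midi_per_tracks": "split_score_per_tracks",
    "concat_midis": "concat_scores",
    "_tokens_to_midi": "_tokens_to_score",
    "_midi_to_tokens": "_score_to_tokens",
    "_create_midi_events": "_create_global_events",
}

# Match whole identifiers only. Underscores count as word characters,
# so "tokens_to_midi" does not match inside "_tokens_to_midi".
_PATTERN = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, RENAMES), key=len, reverse=True))
    + r")\b"
)


def migrate_source(text: str) -> str:
    """Return text with every old identifier replaced by its new name."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], text)
```

For instance, migrate_source("tokenizer.learn_bpe(...)") returns "tokenizer.train(...)". Only the name changes; arguments are left untouched.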

Full Changelog: v3.0.2...v3.0.3
