v3.0.3 Training with WordPiece and Unigram + abc files support
Highlights
- Support for abc files, which can be loaded and dumped with symusic similarly to MIDI files;
- The tokenizers can now also be trained with the WordPiece and Unigram algorithms!
- Tokenizer training and token ids encoding can now be performed "bar-wise" or "beat-wise", meaning the tokenizer can learn new tokens from successions of base tokens strictly within bars or beats. This is set by the `encode_ids_split` attribute of the tokenizer config;
- symusic v0.4.3 or higher is now required to comply with the usage of the `clip` method;
- Better handling of file loading errors in `DatasetMIDI` and `DataCollator`;
- Introducing a new `filter_dataset` to clean a dataset of MIDI/abc files before using it;
- `MMM` tokenizer has been cleaned up, and is now fully modular: it now works on top of other tokenizations (`REMI`, `TSD` and `MIDILike`) to allow more flexibility and interoperability;
- `TokSequence` objects can now be sliced and concatenated (e.g. `seq3 = seq1[:50] + seq2[50:]`);
- `TokSequence` objects tokenized from a tokenizer can now be split per bars or beats subsequences;
- minor fixes, code improvements and cleaning;
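To illustrate what "bar-wise" training means, here is a minimal, library-free sketch. It is not MidiTok's implementation: the token names (`Bar_None`, `Pitch_60`, `Duration_1.0`), the helper names `split_per_bar` and `count_pairs`, and the choice to keep the bar token at the start of each chunk are all assumptions made for illustration. It counts adjacent token pairs, as a BPE-style trainer would, and shows that splitting per bar prevents any pair from spanning a bar boundary.

```python
from collections import Counter


def split_per_bar(tokens, bar_prefix="Bar"):
    """Group a flat token list into chunks, each starting at a bar token (assumed convention)."""
    chunks, current = [], []
    for tok in tokens:
        # A new bar token closes the chunk being built, if any.
        if tok.startswith(bar_prefix) and current:
            chunks.append(current)
            current = []
        current.append(tok)
    if current:
        chunks.append(current)
    return chunks


def count_pairs(chunks):
    """Count adjacent token pairs inside each chunk, never across chunks."""
    pairs = Counter()
    for chunk in chunks:
        pairs.update(zip(chunk, chunk[1:]))
    return pairs


# Two identical bars of hypothetical base tokens.
tokens = [
    "Bar_None", "Pitch_60", "Duration_1.0",
    "Bar_None", "Pitch_60", "Duration_1.0",
]

bar_chunks = split_per_bar(tokens)
bar_pairs = count_pairs(bar_chunks)   # pairs restricted to single bars
flat_pairs = count_pairs([tokens])    # pairs over the whole sequence

# The cross-bar pair only exists when the sequence is not split.
crossing = ("Duration_1.0", "Bar_None")
print(crossing in bar_pairs, crossing in flat_pairs)
```

With the split in place, a trainer that merges frequent pairs can only ever learn tokens covering base tokens from the same bar, which is the behaviour the `encode_ids_split` option is described as providing.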
Methods renaming
A few methods and properties were previously named after "bpe" and "midi". To align with the more general usages of these methods (support for several file formats and training algorithms), they have been renamed with more idiomatic and accurate names.
Methods renamed with deprecation warning:
- `midi_to_tokens` --> `encode`;
- `tokens_to_midi` --> `decode`;
- `learn_bpe` --> `train`;
- `apply_bpe` --> `encode_token_ids`;
- `decode_bpe` --> `decode_token_ids`;
- `ids_bpe_encoded` --> `are_ids_encoded`;
- `vocab_bpe` --> `vocab_model`;
- `tokenize_midi_dataset` --> `tokenize_dataset`;
Methods renamed without deprecation warning (fewer usages, and this keeps the code less cluttered):
- `MIDITokenizer` --> `MusicTokenizer`;
- `augment_midi` --> `augment_score`;
- `augment_midi_dataset` --> `augment_dataset`;
- `augment_midi_multiple_offsets` --> `augment_score_multiple_offsets`;
- `split_midis_for_training` --> `split_files_for_training`;
- `split_midi_per_note_density` --> `split_score_per_note_density`;
- `get_midi_programs` --> `get_score_programs`;
- `merge_midis` --> `merge_scores`;
- `get_midi_ticks_per_beat` --> `get_score_ticks_per_beat`;
- `split_midi_per_ticks` --> `split_score_per_ticks`;
- `split_midi_per_beats` --> `split_score_per_beats`;
- `split_midi_per_tracks` --> `split_score_per_tracks`;
- `concat_midis` --> `concat_scores`;
Protected internal methods (no deprecation warning, advanced usages):
- `MIDITokenizer._tokens_to_midi` --> `MusicTokenizer._tokens_to_score`;
- `MIDITokenizer._midi_to_tokens` --> `MusicTokenizer._score_to_tokens`;
- `MIDITokenizer._create_midi_events` --> `MusicTokenizer._create_global_events`.
There are no other compatibility issues besides these renamings.
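The renames listed above can be applied mechanically to existing code. The following is a minimal sketch of such a migration helper, using only the standard library. The `RENAMES` table is copied from the lists above, while the function name `migrate_source` is hypothetical and not part of any library. It performs a purely textual, whole-identifier replacement, so it may also rewrite unrelated identifiers that happen to share an old name; review the result before committing it.

```python
import re

# Old name -> new name, copied from the rename lists in these release notes.
RENAMES = {
    # Renamed with deprecation warning
    "midi_to_tokens": "encode",
    "tokens_to_midi": "decode",
    "learn_bpe": "train",
    "apply_bpe": "encode_token_ids",
    "decode_bpe": "decode_token_ids",
    "ids_bpe_encoded": "are_ids_encoded",
    "vocab_bpe": "vocab_model",
    "tokenize_midi_dataset": "tokenize_dataset",
    # Renamed without deprecation warning
    "MIDITokenizer": "MusicTokenizer",
    "augment_midi": "augment_score",
    "augment_midi_dataset": "augment_dataset",
    "augment_midi_multiple_offsets": "augment_score_multiple_offsets",
    "split_midis_for_training": "split_files_for_training",
    "split_midi_per_note_density": "split_score_per_note_density",
    "get_midi_programs": "get_score_programs",
    "merge_midis": "merge_scores",
    "get_midi_ticks_per_beat": "get_score_ticks_per_beat",
    "split_midi_per_ticks": "split_score_per_ticks",
    "split_midi_per_beats": "split_score_per_beats",
    "split_midi_per_tracks": "split_score_per_tracks",
    "concat_midis": "concat_scores",
    # Protected internal methods
    "_tokens_to_midi": "_tokens_to_score",
    "_midi_to_tokens": "_score_to_tokens",
    "_create_midi_events": "_create_global_events",
}

# Longest names first, wrapped in word boundaries so that only whole
# identifiers match (e.g. `augment_midi` is not matched inside
# `augment_midi_dataset`, because `_` counts as a word character).
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(name) for name in sorted(RENAMES, key=len, reverse=True))
    + r")\b"
)


def migrate_source(source: str) -> str:
    """Return `source` with every old identifier replaced by its new name."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)


old_code = "tokens = tokenizer.midi_to_tokens(midi)"
print(migrate_source(old_code))  # prints: tokens = tokenizer.encode(midi)
```

The replacement is done in a single pass, so a new name is never rewritten a second time.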
Full Changelog: v3.0.2...v3.0.3