Will there be training code #6
Will there be training and finetuning code for Sesame?
Comments
It's a Llama-based model - all you have to do is train the adapter and the audio tokens, so even if they don't give you the code, that's fairly easy. And since it's so lifelike, I wish they wouldn't give anyone training code, as there are so many legal considerations around that - at least not unless the encoder is fixed in place and applies robust watermarking (11labs / OAI and everyone else marks their audio too, left, right and center). Moshi, albeit lackluster in audio quality, got the approach to watermarking right.
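For anyone wondering what "train the adapter and the audio tokens" could look like in practice, here is a minimal sketch using Hugging Face `transformers` and `peft`. It is not Sesame's actual training code or architecture - the backbone name, audio-token count, and LoRA settings are placeholders.

```python
# Hypothetical sketch: bolt audio-codec tokens and a LoRA adapter onto a
# Llama-style backbone. Names and sizes are placeholders, not Sesame's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.2-1B"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Register one token per audio codebook entry (size is made up here).
num_audio_tokens = 2048
tokenizer.add_tokens([f"<audio_{i}>" for i in range(num_audio_tokens)])
model.resize_token_embeddings(len(tokenizer))

# Train only a small adapter plus the (new) embeddings and output head.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# From here it is a standard next-token-prediction loop over interleaved
# text and audio-token sequences produced by a codec such as Mimi.
```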
@darkacorn There is no more risk than what we already have. Not only do we already have TTS with good voice cloning, we also have real-time RVC, so you could just pipe the audio through RVC to change the voice - they may as well let us do it natively.
Even with RVC, robust watermarking won't go away, and they do that with Mimi already anyway, as it imprints across the whole mel spectrum. https://github.com/facebookresearch/audioseal is a good example, though bypassable with local AI - Moshi / Mimi did it in the encoder, so it can't be changed, and the output will be classifiable as AI, non-removable.

No one cares how strongly it can mimic a voice - this is unlikely to have voice cloning ability anyway - but even if you pump the output through RVC, the watermarking will still be there. What most companies want is AI classification, so that the better "audio models", be it TTS/S2S or others, are detectable in scam situations.

It's not about what can be done. Just because a few guys can rob a lollipop machine doesn't mean a bank should ignore vaults, does it? This is far more lifelike than most audio models, and that comes with ethical considerations - and regulations.
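For reference, a minimal sketch of what embedding and detecting such a mark looks like with AudioSeal. This assumes the `audioseal` package is installed and uses the model-card names from the project README; the audio tensor is a random placeholder, not real generated speech.

```python
# Hedged sketch of AudioSeal-style watermark embedding + detection.
import torch
from audioseal import AudioSeal

sr = 16_000
wav = torch.randn(1, 1, sr)  # placeholder audio: (batch, channels, samples)

# Embedding side (what a generator/service would do before returning audio)
generator = AudioSeal.load_generator("audioseal_wm_16bits")
watermark = generator.get_watermark(wav, sr)
watermarked = wav + watermark  # imperceptible, spread across the spectrum

# Detection side (what a downstream classifier or scam filter would run)
detector = AudioSeal.load_detector("audioseal_detector_16bits")
prob, message = detector.detect_watermark(watermarked, sr)
print(f"P(watermarked) = {prob:.3f}")
```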
@darkacorn All the more reason not to gimp the model.
My point is that this argument is bad, and even if they don't release it, we'll probably figure it out anyway.
Yeah, no - not so easy. They may be able to figure out how to train the LLM, but the encoder is shipped as an untrainable model - since it's Mimi - and that has the watermarking in it, which persists, so it won't be broken anytime soon. Same as you can't get rid of AudioSeal if it's served through APIs. As I said, it's pretty much industry practice - everyone watermarks their generations. I don't think we'll get training code, and they for sure don't owe us anything - we should be glad we get such a model in the first place. For a company, however, if they are in the Western economic area (and they are US-based), there are more legal implications than what some OSS guy wants.
Neither my parents nor my grandparents have ever been called up by a different number with my unique voice pretending to be me in need of money. Why not (and why doesn't every YouTuber complain about being a target of this)? Are the scammers stupid? "It's already possible" cannot be a justification to proliferate capabilities blindly. It is clearly a matter of how attainable the capabilities are. Harmful things should not be readily attainable. Does releasing the models and/or training code make harmful things more readily attainable? That's a question that takes some thinking. Anyone who stops thinking at such a black-and-white early stage should consider the consequences of their motivated reasoning. We are all nerds who love to play with cool shiny things and to share them around, but trade-offs must be made and acknowledged.
@CaelumF Life comes with risk; if you think we should ban everything that is potentially harmful, society couldn't exist. Likewise, the same capabilities can be used AGAINST scammers, e.g. an LLM virtual grandma wasting scammers' time by going off on tangents (which is already being done - you can search it up). Also, we can already do voice cloning pretty easily, yet it hasn't actually been used much for scams, because most scammers are not very good with tech.
I feel like you didn't actually read my message and are replying to the loose classification of messages you assigned it to at a glance.
@CaelumF I replied to you specifically. Could it lead to harmful things? Sure, but not more than what can already be done anyway, and you can't hold back progress indefinitely; if they don't do it, someone else will eventually. Besides, the risks are not that high, especially with the tech we already have right now. You "safety" guys just sound like OpenAI shills trying to fuck with the market and possible competition. In general, I'd much rather have this technology out in the open than in the hands of a few megacorps that don't have our best interests in mind.
I think there is a major case of tunnel vision here. The encoder is Mimi (what they used for Moshi); it is non-trainable and watermarked by its creator (if their claims are to be trusted). The tokens are embedded in the model, and you cannot retrain it without the embedder - and that's not necessary for using the model (Moshi did not come with that either, for good reasons, whether you like it or not - nothing to do with commercial interest). This argument is a dead one.

I was advocating for watermarking, which this model has by default anyway - non-removable, BY DESIGN. It's not a question of whether someone wants it or not, and no, you cannot bypass it, as it lives in the part that is NOT trained by them - they don't have the code for that either, so go figure.

I don't know what kind of issue you have with differentiating AI vs. non-AI. In general, every service does that. Even without a watermark, just by statistically sampling the mel spectrum you could get to a certain % threshold by default, even more so as the regular tokenizer defines the core probabilities.

This also has nothing to do with open or not - it's open, and I'm all for that. It's just not reusable for everything everyone wants. You're dragging out empty arguments - what is your scam that you NEED it that unmarked? No one else has a problem with inaudible marks. I'd highly recommend actually reading the full post before going all hothead and posting a reply after line two.
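A toy illustration of the "statistically sampling the mel spectrum" idea, assuming `librosa` and `numpy` are available. The features and the threshold are made up for the sketch; a real detector would be a classifier trained on labelled real vs. generated audio.

```python
# Illustrative only: crude mel-spectrum statistics as an AI-vs-real signal.
import numpy as np
import librosa

def mel_stats(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Summary statistics of the log-mel spectrogram."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Per-band mean/variance captures the coarse spectral envelope a codec
    # or vocoder tends to imprint on its output.
    return np.concatenate([log_mel.mean(axis=1), log_mel.var(axis=1)])

# Placeholder audio: one second of a 220 Hz tone.
sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)

feats = mel_stats(y, sr)
score = float(np.mean(feats))                            # stand-in for a trained classifier
print("flagged as synthetic" if score > -40.0 else "likely real")  # arbitrary threshold
```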
It is easy to train such a model based on our open-sourced code. We share the same model design as Sesame. You can refer to https://github.com/yangdongchao/RSTnet
There are a few models like that - MiniCPM-o 2.6 / GLM-4-Voice as well - but this one uses Mimi, and the corpus of 1M hours of audio is a rather substantial investment, so I would not say "easy" or "cheap".
OK, I stand corrected: https://github.com/yangdongchao/RSTnet/tree/main/MLLM_v2
@darkacorn My point was that someone who really wanted to could remove the watermarking through fancy post-processing, even if they can't train Mimi. I don't care about watermarking though.
Yeah, then we didn't really have an argument at all - I just care about the ability to differentiate.
No, it is just not in a Pythonic style. Moreover, that person has already attacked me with vulgar language.
@darkacorn No, my argument was always against your statement.
I want them to release training code, but I wasn't talking about Mimi training code, more like the model in general.
My bad then - probably got tunnel vision.
Yeah, but style and whatnot is fixable. Mind you, a lot of those guys come from a C/C++ background, and their code will look quite a bit less "Pythonic" - I've seen that all over the place. At the end of the day, what really matters is whether it works, and if it does, fine - it's not like you'd use it as production code, plus refactoring exists. This appears, at least at first glance, to be further along than anything else I've seen for Mimi/Moshi - if it works, that is.
Yes. However, when someone gets angry simply because another person comments on the code style, that is another sign that the code may not be reliable.