
Will there be training code #6

Open
jpgallegoar opened this issue Mar 2, 2025 · 22 comments

@jpgallegoar

Will there be training and fine-tuning code for Sesame?

@darkacorn commented Mar 2, 2025

It's a Llama-based model - all you have to do is train the adaptor and the audio tokens, so even if they don't give you the code, that's fairly easy.
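(Purely as a hypothetical sketch of what "train the adaptor and the audio tokens" could look like: a Llama-style backbone with a new linear head predicting discrete audio-codec tokens, training only the head. The checkpoint name, head shape, and codebook sizes below are illustrative assumptions, not Sesame's actual setup.)

```python
import torch
import torch.nn as nn
from transformers import AutoModel

# Assumed backbone; any Llama-family checkpoint would do for this sketch.
backbone = AutoModel.from_pretrained("meta-llama/Llama-3.2-1B")

# Mimi-like codec shape, assumed for illustration.
NUM_CODEBOOKS, CODEBOOK_SIZE = 8, 2048

# New trainable "adaptor": hidden state -> logits over each audio codebook.
audio_head = nn.Linear(backbone.config.hidden_size, NUM_CODEBOOKS * CODEBOOK_SIZE)

# Freeze the backbone and train only the head.
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(audio_head.parameters(), lr=1e-4)

def step(input_ids: torch.Tensor, audio_codes: torch.Tensor) -> float:
    """One training step. audio_codes: (batch, seq, NUM_CODEBOOKS) codec tokens."""
    hidden = backbone(input_ids=input_ids).last_hidden_state
    logits = audio_head(hidden).view(*audio_codes.shape, CODEBOOK_SIZE)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, CODEBOOK_SIZE), audio_codes.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```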

And since it's so life-like, I hope they don't give anyone training code, as there are so many legal considerations around that - at least not unless the encoder is fixed in place and applies robust watermarking (ElevenLabs / OpenAI and everyone else marks their audio too) left, right, and center.

Moshi, albeit lackluster audio-quality-wise, got watermarking right.

@alkeryn commented Mar 2, 2025

@darkacorn There is no more risk than what we already have. Not only do we already have TTS with good voice cloning, we also have real-time RVC, so you could just pipe the audio through RVC to change the voice; they may as well allow us to do it natively.
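(For what it's worth, the pipeline being described is just two stages. In sketch form, with `synthesize` and `convert_voice` as hypothetical stand-ins for any TTS model and any RVC-style voice-conversion model - neither is a real library call:)

```python
import numpy as np

def synthesize(text: str, sample_rate: int = 16000) -> np.ndarray:
    """Hypothetical TTS stage: returns mono PCM audio for `text`."""
    raise NotImplementedError("plug in any TTS model here")

def convert_voice(audio: np.ndarray, target_speaker: str) -> np.ndarray:
    """Hypothetical RVC-style stage: re-voices `audio` as `target_speaker`."""
    raise NotImplementedError("plug in any voice-conversion model here")

# TTS output piped straight through voice conversion.
speech = synthesize("Hello there.")
revoiced = convert_voice(speech, target_speaker="target")
```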

@darkacorn commented Mar 2, 2025

Even with RVC, robust watermarking won't go away, and they do that with Mimi already anyway, as it imprints all over the mel spectrum. https://github.com/facebookresearch/audioseal is a good example, though bypassable with local AI - Moshi / Mimi did it in the encoder, so it won't be changeable, and it will be classifiable as AI. Non-removable.
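(To make the AudioSeal reference concrete, here is a minimal embed-and-detect sketch following the usage documented in the facebookresearch/audioseal README; the model card names come from there.)

```python
import torch
from audioseal import AudioSeal

# Load the watermark generator and its matching detector (names from the README).
model = AudioSeal.load_generator("audioseal_wm_16bits")
detector = AudioSeal.load_detector("audioseal_detect_16bits")

# AudioSeal expects (batch, channels, samples) tensors at 16 kHz;
# one second of noise stands in for real audio here.
sr = 16000
wav = torch.randn(1, 1, sr)

# Embed: the generator predicts an additive watermark signal.
watermark = model.get_watermark(wav, sr)
watermarked = wav + watermark

# Detect: probability that the audio is watermarked, plus the 16-bit message.
result, message = detector.detect_watermark(watermarked, sr)
print(result, message)
```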

No one cares how well it can mimic a voice, as this is unlikely to have voice-cloning ability anyway - but even if you pump the output through RVC, the watermarking will be there.

What most companies want is AI classification, so that the better "audio models", be it TTS / S2S or others, will be detectable in scam situations.

It's not about what can be done. Just because a few guys can rob a lollipop machine doesn't mean a bank should just ignore vaults, does it now?

This is far more lifelike than most audio models, and that comes with ethical considerations - and regulations.

@alkeryn commented Mar 2, 2025

@darkacorn All the more reason not to gimp the model.
But my point is that if someone wants to get around it, they'll figure out how.
So it is dumb to reduce the model's capability out of fear of something that can already be done now.

"since its so "life" like - i wish they dont give anyone training code as there are so many legal considerations around that"

My point is that this argument is bad: there is no risk that can't already be exploited now.
So no, I think they should release the training code.

And even if they don't, we'll probably figure it out anyway.

@darkacorn commented Mar 2, 2025

Ya, no - not so easy. They may be able to figure out how to train the LLM, but the encoder is shipped as an untrainable model - it's Mimi - and that has the watermarking in it, which persists, so that won't be broken anytime soon. Same as you can't get rid of AudioSeal if it's served in APIs. As I said, it's pretty much industry practice - everyone watermarks their generations.
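(For context, a minimal sketch of using Mimi as a frozen tokenizer via its Hugging Face transformers port; the kyutai/mimi checkpoint and the encode/decode API follow the transformers Mimi documentation, and the explicit freeze just illustrates the "untrainable" point.)

```python
import torch
from transformers import MimiModel, AutoFeatureExtractor

# Load the Mimi codec and freeze it: it is used here as a fixed audio
# tokenizer, not as a trainable module.
model = MimiModel.from_pretrained("kyutai/mimi").eval()
for p in model.parameters():
    p.requires_grad = False

feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of silent dummy audio at the codec's sample rate (24 kHz).
audio = torch.zeros(feature_extractor.sampling_rate).numpy()
inputs = feature_extractor(raw_audio=audio,
                           sampling_rate=feature_extractor.sampling_rate,
                           return_tensors="pt")

with torch.no_grad():
    # Encode to discrete codebook indices, then reconstruct audio from them.
    codes = model.encode(inputs["input_values"]).audio_codes
    reconstructed = model.decode(codes).audio_values

print(codes.shape)  # (batch, num_codebooks, frames)
```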

I don't think we get training code, and they for sure don't owe us anything - we've got to be glad we get such a model in the first place. For a company in the Western economic area (and they are US-based), there are more legal implications than what some OSS guy wants.

@CaelumF commented Mar 3, 2025

Neither my parents nor my grandparents have ever been called from a different number by my unique voice pretending to be me in need of money. Why not (and why doesn't every YouTuber complain about being a target of this)? Are the scammers stupid?

"It's already possible" cannot be a justification to proliferate capabilities blindly. It is clearly a matter of how attainable the capabilities are. Harmful things should not be readily attainable.

Does releasing the models and/or training code lead to harmful things being more readily attainable? That's a question that takes some thinking about.

Anyone who stops thinking at such a black-and-white early stage should consider the consequences of their motivated reasoning. We are all nerds who love to play with cool shiny things and to share them around, but trade-offs must be made and acknowledged.

@alkeryn commented Mar 3, 2025

@CaelumF Life comes with risk. If we banned all things that are potentially harmful, society couldn't exist.

Likewise, the same capabilities can be used AGAINST scammers, e.g. an LLM-powered virtual grandma wasting scammers' time by going on tangents (which is already being done; you can just search it up).

Also, we can already do voice cloning pretty easily.
Any scammer who could figure out how to train CSM could definitely use the existing technology already, and voice-cloning models are now expressive enough to be indistinguishable from humans, especially over a phone line and to people who don't know tech well enough not to be fooled.

But it's not actually been done much, because most scammers are not very good with tech.

@CaelumF commented Mar 3, 2025

I feel like you didn't actually read my message and are replying to the loose classification you assigned it at a glance.

@alkeryn commented Mar 3, 2025

@CaelumF I replied to you specifically.

Could it lead to harmful things? Sure, but not more than what can already be done anyway, and you can't hold back progress indefinitely; if they don't do it, someone else will eventually.

Besides, the risks are not that high, especially with the tech we already have right now.

You "safety" guys just sound like OpenAI shills that try to fuck with the market and possible competition.

In general, I'd much rather have this technology out in the open than in the hands of a few megacorps that don't have our best interests in mind.

@darkacorn commented Mar 3, 2025

I think there is a major case of tunnel vision here. The encoder is Mimi (what they used for Moshi), which is non-trainable and watermarked by its creator (if their claims are to be trusted). The tokens are embedded in the model, and you cannot retrain it without the embedder - which isn't necessary for using the model (Moshi did not come with that either, for good reasons, whether you like it or not; nothing to do with commercial interest).

This argument is a dead one. I was advocating for watermarking, which this model has by default anyway - non-removable, BY DESIGN. It's not a question of whether someone wants it or not, and no, you cannot bypass it, as it lives in the part that is NOT TRAINED by them - they don't have the code for that either, so go figure.

Idk what kind of issue you have with differentiating AI vs. non-AI; in general any service does that. Even if it's not watermarked, just by statistically sampling the mel spectrum you could get to a certain % confidence threshold by default - even more so since the regular tokenizer defines the core probabilities.
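(As a toy illustration of that "statistically sample the mel spectrum" idea, under the big assumption that simple band statistics carry a detectable signal: featurize clips as per-band log-mel mean/std and fit a plain classifier on labeled real vs. synthetic audio. This is a concept sketch with random stand-in data, not a working AI-audio detector.)

```python
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def mel_stats(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Per-band mean/std of the log-mel spectrogram as a fixed-size feature."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=64)
    logmel = librosa.power_to_db(mel)
    return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

# Random noise stands in for labeled clips; in practice you would collect
# real recordings and known-synthetic generations.
rng = np.random.default_rng(0)
real_clips = [rng.standard_normal(16000) for _ in range(8)]
synthetic_clips = [rng.standard_normal(16000) for _ in range(8)]

X = np.stack([mel_stats(w) for w in real_clips + synthetic_clips])
y = np.array([0] * len(real_clips) + [1] * len(synthetic_clips))

clf = LogisticRegression(max_iter=1000).fit(X, y)
# Probability a new clip is synthetic:
p = clf.predict_proba(mel_stats(real_clips[0])[None])[:, 1]
print(p)
```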

This also has nothing to do with open or not - it's open, and I'm all for that. It's just not reusable for everything everyone wants. You're dragging out empty arguments - what is your scam that you NEED it unmarked? No one else has a problem with inaudible marks.

I'd highly recommend actually reading the full post before going all hothead and posting a reply after line two.

@yangdongchao

> Will there be training and fine-tuning code for Sesame?

It is easy to train such a model based on our open-sourced code. We share the same model design as Sesame. You can refer to https://github.com/yangdongchao/RSTnet

@darkacorn commented Mar 3, 2025

There are a few models like that - MiniCPM-o 2.6 / GLM-4-Voice as well - but this one uses Mimi. A corpus of 1M hours of audio is a rather substantial investment, so I would not say "easy" or "cheap".

@darkacorn

> > Will there be training and fine-tuning code for Sesame?
>
> It is easy to train such a model based on our open-sourced code. We share the same model design as Sesame. You can refer to https://github.com/yangdongchao/RSTnet

OK, I stand corrected: https://github.com/yangdongchao/RSTnet/tree/main/MLLM_v2
Amazing share.

@alkeryn commented Mar 3, 2025

@darkacorn My point was that someone who really wished to could remove the watermarking through fancy postprocessing, even if they can't train Mimi.

I don't care about watermarking, though.

@MonolithFoundation

Please, I dare not use this code.

[screenshot]

@darkacorn

> Please, I dare not use this code.
>
> [screenshot]

I don't care about a typo - especially since the author is not a native English speaker. It just means it's more likely that a human wrote it. What matters here is whether the methods are correct - fixing spelling is fairly trivial.

Can't speak much to that either, as pretraining would require a substantial amount of data, and the data-prep part is somewhat lackluster so far.

@darkacorn

> @darkacorn My point was that someone who really wished to could remove the watermarking through fancy postprocessing, even if they can't train Mimi.
>
> I don't care about watermarking, though.

Ya, then we didn't have an argument at all - I just care about the ability to differentiate.

@MonolithFoundation commented Mar 4, 2025

No, it is just not in a Pythonic style. Moreover, that person has already attacked me with vulgar language.

@alkeryn commented Mar 4, 2025

@darkacorn No, my argument was always against your statement:

> I hope they don't give anyone training code, as there are so many legal considerations around that

I want them to release training code, but I wasn't talking about Mimi training code, more about the model in general.
You were the one who started talking about watermarking-specific stuff, when it was unrelated to the original topic, which was whether or not they'd release training code (and that was not Mimi-specific).

@darkacorn

> I want them to release training code, but I wasn't talking about Mimi training code, more about the model in general.

My bad then - probably got tunnel vision.

@darkacorn

> No, it is just not in a Pythonic style. Moreover, that person has already attacked me with vulgar language.

Ya, but style and whatnot is fixable. Mind you, a lot of those guys come from a C/C++ background, and their code will look quite a bit less "Pythonic" - I've seen that all over the place. At the end of the day, what really matters is whether it works, and if it does, fine - it's not like you'd use that as production code, plus refactoring exists. This appears, at least at first glance, to be further along than anything else I've seen for Mimi/Moshi. If it works, that is.

@MonolithFoundation

Yes. However, when someone gets angry simply because someone else suggests a code-style fix, that is another sign that the code is not reliable.
