Replies: 2 comments
-
I also see that NVIDIA published a package named NeMo-Guardrails which does a similar thing.
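A minimal sketch of what wiring NeMo-Guardrails around a model could look like, assuming a local rails configuration directory (the `./config` path and the rail definitions inside it are assumptions, not something from this thread):

```python
# Sketch only: assumes "./config" contains a NeMo-Guardrails configuration
# (config.yml plus Colang rail definitions) describing which topics to block.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")  # assumed local rails config
rails = LLMRails(config)

# The rails wrapper can refuse or redirect instead of answering directly.
response = rails.generate(messages=[
    {"role": "user", "content": "Tell me something unsafe."}
])
print(response["content"])
```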
-
I wrote an extension that applies the lightweight linear model from profanity-check. It's not perfect, but it's something. Attached: requirements.txt, script.py.
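For anyone who wants the idea without the attachments, here is a rough sketch of what such a script.py can look like. The `output_modifier` hook signature and the blocking threshold are assumptions and may need adjusting to your webui version:

```python
# script.py - sketch of a text-generation-webui extension that blocks replies
# flagged by profanity-check's linear classifier.
# Assumptions: the webui exposes an output_modifier(string, state) hook and
# profanity-check is listed in this extension's requirements.txt.
from profanity_check import predict_prob

BLOCK_THRESHOLD = 0.8  # assumed cutoff; tune to taste


def output_modifier(string, state):
    """Called on each generated reply before it is displayed."""
    score = predict_prob([string])[0]  # probability the text is offensive
    if score >= BLOCK_THRESHOLD:
        return "[Blocked: the generated text was classified as unsafe.]"
    return string
```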
-
I want to offer my installation of the text-generation-webui to minors for educational purposes. Currently I am trying out a Llama 2 Chat model, and it does a pretty good job of self-censoring unsafe topics in the chat and instruct modes. However, in the default and notebook tabs it is trivial to delete the Q&A context and unlock the production of unsafe content, so I need some additional guardrails to keep minors from seeing that content.
I saw that Meta released a Llama Guard model that can classify a model's input and output as safe or unsafe. Can I use a model like this to block unsafe content in the default and notebook tabs? (A rough sketch of what that might look like is included below.)
https://huggingface.co/meta-llama/LlamaGuard-7b
For reference, my Stable Diffusion WebUI installation has this extension, which uses an image classification model to detect unsafe images and block them.
https://github.com/AUTOMATIC1111/stable-diffusion-webui-nsfw-censor
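A minimal sketch of how LlamaGuard-7b could be called to screen text from the default/notebook tabs, loosely following the usage shown on the model card. The generation parameters, the threshold on the "unsafe" prefix, and the blocking message are assumptions:

```python
# Sketch: classify a prompt (or a generated reply) with LlamaGuard-7b and only
# pass it through when it is judged "safe".
# Assumes access to meta-llama/LlamaGuard-7b and a CUDA device.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)


def moderate(chat):
    """Return LlamaGuard's verdict, e.g. 'safe' or 'unsafe' plus a category."""
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)


verdict = moderate([{"role": "user", "content": "Text entered in the notebook tab"}])
if verdict.strip().startswith("unsafe"):
    print("[Blocked: LlamaGuard classified this content as unsafe.]")
```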