On-Prem LLM Model Selector
Install the transformers, torch, and sentencepiece libraries (sentencepiece is typically required by the T5 tokenizers) for loading and running the Hugging Face models:
pip install transformers torch sentencepiece
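To confirm the environment is ready before downloading anything, a quick sanity check can be run (a minimal sketch; the versions printed will vary with your installation):
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())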
Python code for downloading the t5-small model from Hugging Face (download_t5small_model.py)
Make sure a ./models/t5-small folder exists.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "t5-small" # The model you want to use
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Save the model locally
model.save_pretrained("./models/t5-small")
tokenizer.save_pretrained("./models/t5-small")
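To verify the download landed correctly, the model can be reloaded strictly from the local folder (a minimal sketch using the local_files_only flag, which fails if any file is missing):
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Reload from disk only; no network access is attempted
model = AutoModelForSeq2SeqLM.from_pretrained("./models/t5-small", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained("./models/t5-small", local_files_only=True)
print("Loaded", model.config.model_type, "with", model.num_parameters(), "parameters")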
Python code for downloading the fastchat-t5-3b model from Hugging Face (download_fastchat-t5-3b_model.py)
Make sure a ./models/fastchat-t5-3b-v1.0 folder exists.
Make sure you are logged in to Hugging Face:
huggingface-cli login
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "fastchat-t5-3b-v1.0" # The model you want to use
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Save the model locally
model.save_pretrained("./models/fastchat-t5-3b-v1.0")
tokenizer.save_pretrained("./models/fastchat-t5-3b-v1.0")
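If the interactive CLI prompt is inconvenient (for example in automated setups), the login can also be done programmatically via huggingface_hub (a sketch; the token string is a placeholder):
from huggingface_hub import login

login(token="YOUR HUGGING FACE TOKEN")  # placeholder; avoid hard-coding real tokens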
Python code for downloading the Llama-3.2-1B model from Hugging Face (download_Llama-3.2-1B_model.py)
Make sure a ./models/Llama-3.2-1B folder exists.
You need to request access to this gated model first: https://huggingface.co/meta-llama/Llama-3.2-1B
Make sure you are logged in to Hugging Face:
huggingface-cli login
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-3.2-1B" # The model you want to use
token = "YOUR HUGGING FACE TOKEN" # placeholder; optional if huggingface-cli login was used
# Llama is a causal (decoder-only) model, so use AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_name, token=token)
tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
# Save the model locally
model.save_pretrained("./models/Llama-3.2-1B")
tokenizer.save_pretrained("./models/Llama-3.2-1B")
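As an alternative to loading and re-saving the model, huggingface_hub can mirror the whole repository in one call (a sketch, assuming the same folder layout as above):
from huggingface_hub import snapshot_download

# Downloads every file in the repo straight into the target folder
snapshot_download(
    repo_id="meta-llama/Llama-3.2-1B",
    local_dir="./models/Llama-3.2-1B",
    token="YOUR HUGGING FACE TOKEN",  # placeholder; omit if already logged in
)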
- Run each download script, e.g. python3 download_t5small_model.py, python3 download_fastchat-t5-3b_model.py, and python3 download_Llama-3.2-1B_model.py
Python code for chatting with a model (chat_with_model.py)
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, AutoConfig
import sys

def generate_response(model_name, user_input):
    # Load the configuration to check the model type
    config = AutoConfig.from_pretrained(f'./models/{model_name}')
    # Choose the model class based on the configuration
    if config.model_type == "llama":
        model = AutoModelForCausalLM.from_pretrained(f'./models/{model_name}')
    else:
        model = AutoModelForSeq2SeqLM.from_pretrained(f'./models/{model_name}')
    # Load the tokenizer (the same call works for both model types)
    tokenizer = AutoTokenizer.from_pretrained(f'./models/{model_name}')
    # Tokenize the user input
    inputs = tokenizer(user_input, return_tensors="pt")
    # Generate a response (pass the attention mask along with the input ids)
    outputs = model.generate(**inputs, max_length=150, num_return_sequences=1)
    # Decode the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

if __name__ == "__main__":
    model_name = sys.argv[1]  # Folder name under ./models, e.g. "t5-small" or "Llama-3.2-1B"
    user_input = sys.argv[2]  # User input (message)
    print(generate_response(model_name, user_input))
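The same function can also be driven interactively; a minimal sketch, assuming the script above is saved as chat_with_model.py (the name the test commands below use):
from chat_with_model import generate_response

model_name = "t5-small"  # any folder under ./models
while True:
    user_input = input("> ")
    if user_input.strip().lower() in ("quit", "exit"):
        break
    print(generate_response(model_name, user_input))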
- Test the models:
python3 chat_with_model.py t5-small "What is new?"
python3 chat_with_model.py fastchat-t5-3b-v1.0 "What is new?"
python3 chat_with_model.py Llama-3.2-1B "What is new?"
- Run the Rails server