Large Language Models

Salamandra7b-Instruct

Model: BSC-LT/salamandra-7b-instruct
Inference: https://cx5unbuv4o2z8fhp.us-east-1.aws.endpoints.huggingface.cloud
GPU: A100

Salamandra2b-Instruct

Model: BSC-LT/salamandra-2b-instruct
Inference: https://o9wl2lsjfs4966jz.eu-west-1.aws.endpoints.huggingface.cloud
GPU: A10

Salamandra2b-Instruct-Aina-hack (Recommended)

This is the new version of Salamandra. It is based on the same foundation model but is tuned to follow the system prompt more closely.

Model: BSC-LT/salamandra-2b-instruct-aina-hack
Inference: https://j292uzvvh7z6h2r4.us-east-1.aws.endpoints.huggingface.cloud
GPU: A10

Salamandra7b-Instruct-Aina-hack (Recommended)

This is the new version of Salamandra. It is based on the same foundation model but is tuned to follow the system prompt more closely.

Model: BSC-LT/salamandra-7b-instruct-aina-hack
Inference: https://hijbc1ux6ie03ouo.us-east-1.aws.endpoints.huggingface.cloud
GPU: A100
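
The endpoints listed above can be collected into a small lookup table so that a script can switch models by name. This is only a minimal sketch: `ENDPOINTS` and `endpoint_for` are hypothetical names introduced here for illustration, and the URLs are copied from the list above.

```python
# Hypothetical lookup table: model name -> inference endpoint (URLs copied from the list above).
ENDPOINTS = {
    "BSC-LT/salamandra-7b-instruct": "https://cx5unbuv4o2z8fhp.us-east-1.aws.endpoints.huggingface.cloud",
    "BSC-LT/salamandra-2b-instruct": "https://o9wl2lsjfs4966jz.eu-west-1.aws.endpoints.huggingface.cloud",
    "BSC-LT/salamandra-2b-instruct-aina-hack": "https://j292uzvvh7z6h2r4.us-east-1.aws.endpoints.huggingface.cloud",
    "BSC-LT/salamandra-7b-instruct-aina-hack": "https://hijbc1ux6ie03ouo.us-east-1.aws.endpoints.huggingface.cloud",
}


def endpoint_for(model_name: str) -> str:
    """Return the inference URL for a model, or raise ValueError if it is not listed."""
    try:
        return ENDPOINTS[model_name]
    except KeyError:
        known = ", ".join(sorted(ENDPOINTS))
        raise ValueError(f"Unknown model {model_name!r}; expected one of: {known}") from None
```

The returned URL can be used as the `BASE_URL` value in the examples that follow.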

Code examples

OpenAI Chat Completions

# pip install openai python-dotenv
import os

from dotenv import load_dotenv
from openai import OpenAI

# Load HF_TOKEN and BASE_URL from a local .env file
load_dotenv(".env")

HF_TOKEN = os.environ["HF_TOKEN"]
BASE_URL = os.environ["BASE_URL"]  # one of the Inference URLs listed above

client = OpenAI(
    base_url=BASE_URL + "/v1/",
    api_key=HF_TOKEN,
)
messages = [{"role": "system", "content": "you are a helpful assistant"}]
messages.append({"role": "user", "content": "Tell me something about AI"})
stream = False
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=messages,
    stream=stream,
    max_tokens=1000,
    # temperature=0.1,
    # top_p=0.95,
    # frequency_penalty=0.2,
)
text = ""
if stream:
    # Streaming: print each chunk as it arrives and accumulate the full text
    for chunk in chat_completion:
        delta = chunk.choices[0].delta.content or ""  # content can be None on some chunks
        text += delta
        print(delta, end="")
    print()
else:
    text = chat_completion.choices[0].message.content
    print(text)

Generate with requests

# pip install requests transformers
import os

import requests
from transformers import AutoTokenizer

HF_TOKEN = os.environ["HF_TOKEN"]
BASE_URL = os.environ["BASE_URL"]
model_name = "BSC-LT/salamandra-7b-instruct-aina-hack"
# The tokenizer is only used to apply the model's chat template to the messages
tokenizer = AutoTokenizer.from_pretrained(model_name)

headers = {
    "Accept": "application/json",
    "Authorization": f"Bearer {HF_TOKEN}",
    "Content-Type": "application/json",
}
system_prompt = "you are a helpful assistant"
text = "Tell me something about AI"
message = [{"role": "system", "content": system_prompt}]
message += [{"role": "user", "content": text}]
# Render the conversation as a single prompt string, ending with the assistant turn marker
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
)

payload = {
    "inputs": prompt,
    "parameters": {},
}
api_url = BASE_URL + "/generate"
response = requests.post(api_url, headers=headers, json=payload)
print(response.json())
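
The example above sends an empty "parameters" object. As a minimal sketch of how generation settings could be passed, the helper below fills that object with a few options. It assumes the endpoint runs Text Generation Inference, whose /generate route accepts fields such as `max_new_tokens`, `temperature` and `top_p` and returns a JSON object with a `generated_text` key. The function names `build_generate_payload` and `extract_generated_text` are hypothetical.

```python
from typing import Any, Dict, Optional


def build_generate_payload(
    prompt: str,
    max_new_tokens: int = 150,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
) -> Dict[str, Any]:
    """Build the JSON body for a /generate request (assumed TGI schema)."""
    parameters: Dict[str, Any] = {"max_new_tokens": max_new_tokens}
    # Only include sampling options that were explicitly set
    if temperature is not None:
        parameters["temperature"] = temperature
    if top_p is not None:
        parameters["top_p"] = top_p
    return {"inputs": prompt, "parameters": parameters}


def extract_generated_text(response_json: Dict[str, Any]) -> str:
    """Pull the completion out of a /generate response (assumed to hold 'generated_text')."""
    return response_json["generated_text"]
```

The result of `build_generate_payload(prompt, max_new_tokens=200, temperature=0.1)` can be passed as the `json=` argument of `requests.post`, and `extract_generated_text(response.json())` then returns only the generated string.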

Curl Chat Completions

URL=replace_with_endpoint_hf_url
TOKEN=replace_with_provided_token
curl "${URL}/v1/chat/completions" -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d '{
	"model": "tgi",
	"messages": [
    	{
        	"role": "user",
        	"content": "What is deep learning?"
    	}
	],
	"max_tokens": 150,
	"stream": true
}'
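
With "stream": true the endpoint answers with server-sent events instead of a single JSON object. The sketch below shows one way to read such a stream in Python. It assumes the OpenAI-compatible chunk layout (`choices[0].delta.content`) and a final `data: [DONE]` line; `iter_sse_content` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by a stream of 'data: {...}' lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content
```

When using requests, the lines can come from `requests.post(..., stream=True).iter_lines(decode_unicode=True)`, and `"".join(iter_sse_content(lines))` rebuilds the full answer.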

Curl Generate

URL=replace_with_endpoint_hf_url
TOKEN=replace_with_provided_token
curl "${URL}/generate" -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d '{
	"inputs": "what is AI",
	"parameters": {
		"max_new_tokens": 150
	}
}'

How to fine tune a model

You can follow this PEFT fine-tuning quickstart from Meta, replacing the Llama model name with one of the Salamandra models:

https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/finetuning/quickstart_peft_finetuning.ipynb
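
As a rough sketch of what "pointing to the Salamandra models" could look like, the snippet below wraps a Salamandra checkpoint in a LoRA adapter with the peft library. It is not taken from the notebook: the hyperparameters are illustrative, the `target_modules` names assume Llama-style attention layers (`q_proj`, `v_proj`), and `load_peft_model` is a hypothetical helper.

```python
# pip install transformers peft
MODEL_NAME = "BSC-LT/salamandra-2b-instruct"

# Illustrative LoRA hyperparameters (assumptions, not tuned values).
LORA_SETTINGS = {
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],  # assumes Llama-style layer names
    "task_type": "CAUSAL_LM",
}


def load_peft_model(model_name: str = MODEL_NAME):
    """Load the base model and wrap it with a LoRA adapter (downloads the weights)."""
    # Imported lazily so the settings above can be inspected without the heavy dependencies
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    peft_model = get_peft_model(model, LoraConfig(**LORA_SETTINGS))
    peft_model.print_trainable_parameters()  # reports how few weights are actually trained
    return tokenizer, peft_model
```

Calling `load_peft_model()` returns a tokenizer and a model that can be handed to a training loop or trainer of your choice; dataset preparation and training are covered in the notebook.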
