Local LLM Deployment
To create assistants that run entirely on your machine, you must run a model locally. We recommend the OpenHermes-NeuralChat merged model: a 7-billion-parameter model that is roughly 6 GB on disk. We have tested Rubra with this model, but you can use any model you want at your own risk. Let us know if you'd like support for other models by opening a GitHub issue!
We leverage llamafile to distribute and run local LLMs.
Prerequisites
Make sure you meet the prerequisites before you start.
Setup
- Manually download the llamafile for your OS from Hugging Face, or run the command for your platform (a quick sanity check for the downloaded file is sketched after these setup steps):

  macOS + Linux:

  curl -L -o rubra.llamafile https://huggingface.co/rubra-ai/rubra-llamafile/resolve/main/rubra.llamafile

  Windows (downloads 2 files: llamafile.exe and openhermes-2.5-neural-chat-v3-3-slerp.Q6_K.gguf):

  curl -L -o llamafile.exe https://huggingface.co/rubra-ai/rubra-llamafile/resolve/main/llamafile.exe
  curl -L -o openhermes-2.5-neural-chat-v3-3-slerp.Q6_K.gguf https://huggingface.co/rubra-ai/rubra-llamafile/resolve/main/openhermes-2.5-neural-chat-v3-3-slerp.Q6_K.gguf
- Give the file executable permissions:

  macOS + Linux:

  chmod +x rubra.llamafile

  Windows:

  llamafile.exe should be executable by default. However, if you find that the llamafile is not executable or you want to ensure that it has the correct permissions, you can adjust the file properties through File Explorer or use the icacls command in the command prompt to modify the file's access control lists.

  To set the execute permission on llamafile.exe using the GUI, you would:

  - Right-click on the file and select "Properties."
  - Go to the "Security" tab.
  - Click on "Edit..." to change permissions.
  - Select the user or group you want to grant execute permissions to.
  - Check the "Allow" box for "Read & execute" under the "Permissions for Users" section.
  - Click "Apply" and then "OK."

  To do something similar from the command line, you can use the icacls command:

  icacls "llamafile.exe" /grant Everyone:RX
- Run the model:

  macOS + Linux:

  ./rubra.llamafile --ctx-size 16000

  Windows:

  ./llamafile.exe -m openhermes-2.5-neural-chat-v3-3-slerp.Q6_K.gguf --ctx-size 16000 --host 0.0.0.0 --port 1234 --nobrowser -ngl 35

  - You must run the model on port 1234.

  Note:

  - (Optional) Increase or decrease the context window size with the --ctx-size flag. The default is 16000. A larger context window increases the model's memory usage but results in higher-quality responses. Those without a GPU and/or with limited RAM (e.g., 8 GB) should keep this value low (see the example after these steps).
  - GPU support: -ngl is the number of layers offloaded to the GPU. The default is 35. You can adjust this value to offload more or fewer layers to the GPU. Add this to your command:

    ./rubra.llamafile --ctx-size 16000 -ngl 35
  - Apple Silicon on macOS
    - You need to have the Xcode Command Line Tools installed for llamafile to be able to bootstrap itself (a quick check is sketched after these steps).
    - If you use zsh and have trouble running llamafile, try running sh -c "./rubra.llamafile --ctx-size 16000". This is due to a bug that was fixed in zsh 5.9+.
  - NVIDIA GPUs
  - AMD GPUs
    - Install the ROCm SDK
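If the server fails to start, you can sanity-check the file downloaded in step 1 and the permissions set in step 2. A minimal sketch for macOS and Linux (the size should be roughly 6 GB for the bundled model):

# Size should be roughly 6 GB
ls -lh rubra.llamafile
# Prints "ok" if the execute bit from step 2 is set
test -x rubra.llamafile && echo ok || echo "run: chmod +x rubra.llamafile"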
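As noted above, machines without a GPU or with limited RAM should keep the context window small and can skip GPU offloading entirely. An illustrative invocation (the exact values are examples, not requirements):

./rubra.llamafile --ctx-size 4096 -ngl 0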
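For the Apple Silicon note above, you can check whether the Xcode Command Line Tools are installed, and trigger the installer if they are not, with:

xcode-select -p || xcode-select --install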
Testing
Congrats! You have a model running on your machine. To test it out, you can run the following command:
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"messages": [
{
"role": "system",
"content": "You are a friendly assistant"
},
{
"role": "user",
"content": "Hello world!"
}
]
}'
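The server exposes an OpenAI-compatible chat-completions endpoint, so if you have jq installed you can extract just the assistant's reply. A small convenience sketch, assuming the response follows the standard choices[0].message.content shape:

curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{"messages": [{"role": "user", "content": "Hello world!"}]}' \
  | jq -r '.choices[0].message.content'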