Local LLM Deployment
To create assistants that run entirely on your machine, you must run a model locally. We recommend the OpenHermes-NeuralChat merged model: a 7-billion-parameter model that is roughly 6 GB on disk. We have tested Rubra with this model, but you can use any model you want at your own risk. Let us know if you'd like support for other models by opening a GitHub issue!
We leverage llamafile to distribute and run local LLMs.
Prerequisites
Make sure you meet the prerequisites before you start.
Setup
- Manually download the llamafile for your OS from Hugging Face, or run the command for your platform (a quick sanity check for the downloaded file is sketched after these setup steps):

  macOS + Linux:

  curl -L -o rubra.llamafile https://huggingface.co/rubra-ai/rubra-llamafile/resolve/main/rubra.llamafile

  Windows (downloads 2 files: llamafile.exe and openhermes-2.5-neural-chat-v3-3-slerp.Q6_K.gguf):

  curl -L -o llamafile.exe https://huggingface.co/rubra-ai/rubra-llamafile/resolve/main/llamafile.exe
  curl -L -o openhermes-2.5-neural-chat-v3-3-slerp.Q6_K.gguf https://huggingface.co/rubra-ai/rubra-llamafile/resolve/main/openhermes-2.5-neural-chat-v3-3-slerp.Q6_K.gguf
- Give the file executable permissions:

  macOS + Linux:

  chmod +x rubra.llamafile

  Windows:

  llamafile.exe should be executable by default. However, if you find that the llamafile is not executable or you want to ensure that it has the correct permissions, you can adjust the file properties through File Explorer or use the icacls command in the command prompt to modify the file's access control lists.

  To set the execute permission on llamafile.exe using the GUI, you would:

  - Right-click on the file and select "Properties."
  - Go to the "Security" tab.
  - Click on "Edit..." to change permissions.
  - Select the user or group you want to grant execute permissions to.
  - Check the "Allow" box for "Read & execute" under the "Permissions for Users" section.
  - Click "Apply" and then "OK."

  To do something similar from the command line, you can use the icacls command:

  icacls "llamafile.exe" /grant Everyone:RX
- Run the model:

  macOS + Linux:

  ./rubra.llamafile --ctx-size 16000

  Windows:

  ./llamafile.exe -m openhermes-2.5-neural-chat-v3-3-slerp.Q6_K.gguf --ctx-size 16000 --host 0.0.0.0 --port 1234 --nobrowser -ngl 35

  - You must run the model on port 1234.

  Note:

  - (Optional) Increase or decrease the context window size with the --ctx-size flag. The default is 16000. A larger context window increases the model's memory usage but results in higher-quality responses. Those without a GPU and/or with limited RAM (e.g., 8 GB) should keep this value low (see the example after these steps).
  - GPU support: -ngl is the number of layers offloaded to the GPU. The default is 35. You can adjust this value to offload more or fewer layers to the GPU. Add this to your command:

    ./rubra.llamafile --ctx-size 16000 -ngl 35
  - Apple Silicon on macOS
    - You need to have the Xcode Command Line Tools installed for llamafile to be able to bootstrap itself (a quick check is sketched after these steps).
    - If you use zsh and have trouble running llamafile, try running sh -c "./rubra.llamafile --ctx-size 16000". This is due to a bug that was fixed in zsh 5.9+.
  - NVIDIA GPUs
  - AMD GPUs
    - Install the ROCm SDK
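If the server fails to start, you can sanity-check the file downloaded in step 1 and the permissions set in step 2. A minimal sketch for macOS and Linux (the size should be roughly 6 GB for the bundled model):

# Size should be roughly 6 GB
ls -lh rubra.llamafile
# Prints "ok" if the execute bit from step 2 is set
test -x rubra.llamafile && echo ok || echo "run: chmod +x rubra.llamafile"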
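As noted above, machines without a GPU or with limited RAM should keep the context window small and can skip GPU offloading entirely. An illustrative invocation (the exact values are examples, not requirements):

./rubra.llamafile --ctx-size 4096 -ngl 0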
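For the Apple Silicon note above, you can check whether the Xcode Command Line Tools are installed, and trigger the installer if they are not, with:

xcode-select -p || xcode-select --install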
Testing
Congrats! You have a model running on your machine. To test it out, you can run the following command:
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"messages": [
{
"role": "system",
"content": "You are a friendly assistant"
},
{
"role": "user",
"content": "Hello world!"
}
]
}'
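The server exposes an OpenAI-compatible chat-completions endpoint, so if you have jq installed you can extract just the assistant's reply. A small convenience sketch, assuming the response follows the standard choices[0].message.content shape:

curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{"messages": [{"role": "user", "content": "Hello world!"}]}' \
  | jq -r '.choices[0].message.content'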