Introduction
Meta has made Llama 2 openly available: anyone can get the model by applying on the Meta AI website, accepting the license, and providing an email address. Meta will then send a download link by email.
Download Llama 2
- Get the download.sh file and store it on your Mac.
- Open the macOS Terminal and run
chmod +x ./download.sh
to make the script executable.
- Run
./download.sh
to start the download process.
- Copy the download link from the email and paste it into the terminal.
- Download only the 13B-chat and 70B-chat models (a sketch of the full session follows below).
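Putting the steps above together, a typical session looks roughly like this. It is only a sketch: the exact prompts printed by download.sh may differ, and the presigned URL comes from Meta's email.
# Make the script executable, then run it
chmod +x ./download.sh
./download.sh
# When prompted, paste the presigned URL from Meta's email,
# then request the 13B-chat and 70B-chat models.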
Install system dependencies
The Xcode Command Line Tools must be installed to compile the C++ project. If you don’t have them, install them with:
xcode-select --install
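To check whether the Command Line Tools are already present, you can print the active developer directory; if it prints a path, you are set.
xcode-select -p
# prints something like /Library/Developer/CommandLineTools when the tools are installed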
Next, install dependencies for building the C++ project.
brew install pkgconfig cmake
Finally, we install Torch.
If you do not have Python 3 installed, install it with:
brew install python@3.11
Create a virtual env like this:
/opt/homebrew/bin/python3.11 -m venv venv
Activate the venv. I am using bash.
source venv/bin/activate
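To confirm the virtual environment is active, check which Python the shell now resolves to; it should point inside the venv folder.
which python3
# expected: .../venv/bin/python3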
Install PyTorch:
pip install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
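To verify that the nightly CPU build of PyTorch installed correctly, import it and print its version (the exact version string will vary):
python3 -c "import torch; print(torch.__version__)"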
Compile llama.cpp
Clone the repo:
git clone https://github.com/ggerganov/llama.cpp.git
Change into the repository and install the Python dependencies:
cd llama.cpp
pip3 install -r requirements.txt
Compile it:
LLAMA_METAL=1 make
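If the build succeeds, the main binary (and the quantize binary, which we use later) should appear in the project directory. Running it with -h prints the available options:
./main -h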
Move the downloaded 13B-chat and 70B-chat models into the llama.cpp project, under the models folder.
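For the conversion step below I assume a layout roughly like the following; the exact file names come from Meta's download and may differ, and tokenizer.model (which ships with the download) can sit either in the models folder or inside each model folder.
models/llama-2-13b-chat/    # consolidated.*.pth, params.json, ...
models/llama-2-70b-chat/    # consolidated.*.pth, params.json, ...
models/tokenizer.model      # copied from Meta's download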
Convert Model to ggml format
The command is slightly different for 13B and 70B. Note that convert-pth-to-ggml.py has been deprecated; use convert.py instead.
13B-chat:
python3 convert.py --outfile ./models/llama-2-13b-chat/ggml-model-f16.bin --outtype f16 ./models/llama-2-13b-chat
70B-chat:
python3 convert.py --outfile models/llama-2-70b-chat/ggml-model-f16.bin --outtype f16 ./models/llama-2-70b-chat
Quantize the models
To run these huge LLMs on a small laptop, we need to quantize the models with the following commands. Quantization converts the model weights from float16 to int4, so they need far less memory to run, at the cost of only a small loss in quality.
13B-chat:
./quantize ./models/llama-2-13b-chat/ggml-model-f16.bin ./models/llama-2-13b-chat/ggml-model-q4_0.bin q4_0
70B-chat:
./quantize ./models/llama-2-70b-chat/ggml-model-f16.bin ./models/llama-2-70b-chat/ggml-model-q4_0.bin q4_0
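As a rough sanity check, compare the sizes of the f16 and q4_0 files; the quantized files should be roughly a quarter of the size of the float16 ones (the exact numbers depend on the model).
ls -lh ./models/llama-2-13b-chat/ggml-model-*.bin
ls -lh ./models/llama-2-70b-chat/ggml-model-*.bin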
Run the model
Now we can run an interactive chat in the terminal, with the model running entirely on the CPU, without needing a GPU, an internet connection, OpenAI, or any cloud provider.
13B-chat cpu only:
./main -m ./models/llama-2-13b-chat/ggml-model-q4_0.bin -t 4 -c 2048 -n 2048 --color -i -r '### Question:' -p '### Question:'
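For reference, the flags used above mean roughly the following (run ./main -h for the authoritative list):
# -t 4      number of CPU threads
# -c 2048   context window size, in tokens
# -n 2048   maximum number of tokens to generate
# --color   colorize the output
# -i        interactive mode
# -r / -p   reverse prompt (where control is handed back to you) and initial prompt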
You can enable GPU inference with the -ngl 1 command-line argument. Any value larger than 0 will offload the computation to the GPU. For example:
./main -m ./models/llama-2-13b-chat/ggml-model-q4_0.bin -t 4 -c 2048 -n 2048 --color -i -ngl 1 -r '### Question:' -p '### Question:'
In my testing it was about 25% faster than pure CPU on my Mac.
For 70B-chat we need to pass -gqa 8 as a command-line argument to make it work. You can read more in the related llama2 issue and llama.cpp PR.
70B-chat cpu only:
./main -m ./models/llama-2-70b-chat/ggml-model-q4_0.bin --no-mmap --ignore-eos -t 8 -c 2048 -n 2048 --color -i -gqa 8 -r '### Question:' -p '### Question:'
Currently, 70B-chat runs on CPU only.