How to run llama2 (13B/70B) on Mac

zhi tao
3 min read · Jul 27, 2023


Introduction

Meta has made llama2 openly available. Anyone can get the model by applying on the Meta AI website, accepting the license, and providing an email address. Meta will then send a download link to that email.

Download the llama2 models

  1. Get the download.sh file and store it on your Mac.
  2. Open the Mac terminal and run chmod +x ./download.sh to make the script executable.
  3. Run ./download.sh to start the download process (the exact commands are shown below).
  4. Copy the download link from the email and paste it into the terminal when prompted.
  5. Download the 13B-chat and 70B-chat models only.
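
For reference, steps 2–4 boil down to the following two commands, run from the folder where you saved download.sh (the script will ask you to paste the link from the email):

chmod +x ./download.sh
./download.sh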

Install system dependencies

The Xcode Command Line Tools must be installed to compile the C++ project. If you don’t have them, run the following:

xcode-select --install
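
If you are not sure whether the Command Line Tools are already installed, xcode-select -p prints the path of the active developer directory when they are:

xcode-select -p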

Next, install dependencies for building the C++ project.

brew install pkgconfig cmake

Finally, we install Torch.

If you do not have Python 3 installed, install it with:

brew install python@3.11

Create a virtual env like this:

/opt/homebrew/bin/python3.11 -m venv venv

Activate the venv. I am using bash.

source venv/bin/activate

Install PyTorch:

pip install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
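
As an optional sanity check that PyTorch landed inside the virtual environment, you can print its version:

python3 -c "import torch; print(torch.__version__)"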

Compile llama.cpp

Clone the repo:

git clone https://github.com/ggerganov/llama.cpp.git

Install the Python dependencies from inside the cloned directory:

cd llama.cpp
pip3 install -r requirements.txt

Compile it:

LLAMA_METAL=1 make
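
If the build succeeded, the main and quantize binaries used in the following steps should now sit in the project root; listing them is a quick way to confirm before moving on (binary names as of this version of llama.cpp):

ls -l main quantize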

Move the downloaded 13B and 70B models into the llama.cpp project, under the models folder.
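
A minimal sketch of that move, assuming the download script left the weights in folders named llama-2-13b-chat and llama-2-70b-chat next to it (adjust the source paths to wherever your download actually landed):

mv llama-2-13b-chat llama.cpp/models/
mv llama-2-70b-chat llama.cpp/models/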

Convert Model to ggml format

The command is slightly different for 13B and 70B. Note that convert-pth-to-ggml.py has been deprecated; use convert.py instead.

13B-chat:

python3 convert.py --outfile ./models/llama-2-13b-chat/ggml-model-f16.bin --outtype f16 ./models/llama-2-13b-chat

70B-chat:

python3 convert.py --outfile models/llama-2-70b-chat/ggml-model-f16.bin --outtype f16 ./models/llama-2-70b-chat

Quantize the model:

To run these huge LLMs on a small laptop, we need to quantize the model with the following commands. This converts the model’s weights from float16 to int4, which requires far less memory at the cost of only a small loss in quality.

13B-chat:

./quantize ./models/llama-2-13b-chat/ggml-model-f16.bin ./models/llama-2-13b-chat/ggml-model-q4_0.bin q4_0

70B-chat:

./quantize ./models/llama-2-70b-chat/ggml-model-f16.bin ./models/llama-2-70b-chat/ggml-model-q4_0.bin q4_0
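
To see how much the quantization saves, you can compare the f16 and q4_0 files side by side (paths as produced by the commands above):

ls -lh ./models/llama-2-13b-chat/ggml-model-f16.bin ./models/llama-2-13b-chat/ggml-model-q4_0.bin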

Run the model

Now we can run an interactive chat in the terminal, with the model running entirely on the CPU, without needing a GPU, an internet connection, OpenAI, or any cloud provider.

13B-chat cpu only:

./main -m ./models/llama-2-13b-chat/ggml-model-q4_0.bin -t 4 -c 2048 -n 2048 --color -i -r '### Question:' -p '### Question:'

You can enable GPU inference with the -ngl 1 command-line argument. Any value larger than 0 will offload the computation to the GPU. For example:

./main -m ./models/llama-2-13b-chat/ggml-model-q4_0.bin -t 4 -c 2048 -n 2048 --color -i -ngl 1 -r '### Question:' -p '### Question:'

In my testing it was about 25% faster than pure CPU on my Mac.

For 70B-chat we need to pass -gqa 8 as a command-line argument to make it work. You can read more in the llama2 issue and the llama.cpp PR.

70B-chat cpu only:

./main -m ./models/llama-2-70b-chat/ggml-model-q4_0.bin --no-mmap --ignore-eos -t 8 -c 2048 -n 2048 --color -i -gqa 8 -r '### Question:' -p '### Question:' 

Currently, 70B-chat runs on CPU only.
