Yesterday I was writing a blog post on Polars when my LinkedIn feed was suddenly flooded with news of Llama 2. It is a great day for the open-source ecosystem. Congratulations to M&M (Microsoft and Meta).
ℹ️ Details
🛎️ Models
🛎️ Demo
🛎️ Paper
🛎️ User Guide
🛎️ Model Card
📈 Business
Satya Nadella is probably the best dealmaker of our generation. Little surprise: under Satya's leadership, Microsoft's share price has grown 940% in 9 years. Can you name any other current CEO who can move a 2,600-billion-dollar behemoth at such a pace?
🔹 Early investment in OpenAI.
🔹 Now an investment in open source.
🕵️♀️ Observations
🔹 A threat to open-source LLM startups. Mosaic (sold), Red Pajama, etc. are in significant trouble.
🔹 This propels Meta onto the AI scene. With this announcement, Zuck is signaling how strong Meta's AI position is. They will now own one of the most widely adopted LLMs and have one of the best training data sets in the world.
🔹 This further strengthens Microsoft's dominant position in the AI space. With this partnership, they now have partnerships with the top LLM providers (OpenAI, Meta), priority access to Nvidia GPUs, and strategic assets like GitHub and Azure. It will certainly push Azure.
🔹 The collaboration between Microsoft and AMD will get stronger. Just as Nvidia has TensorRT and AWS has the Neuron SDK, I expect AMD to publish its own quantization SDK so that Llama models can run on AMD chips.
🔹 Qualcomm chips are to run Meta's AI on mobile devices by 2024. Data from WhatsApp and Threads (Instagram) will be processed on the edge device. A threat to privacy.
🏋️ Training Cost
This is an important plot from the Llama 2 paper. It directly outlines the pretraining GPU-hours for each model! Costs below, assuming $1.50 per A100 GPU-hour from Lambda.
🔹 The 7B model cost $276,480 (184,320 GPU-hours)
🔹 The 13B model cost $552,960 (368,640 GPU-hours)
🔹 The 34B model cost about $1.56M (1.04M GPU-hours)!
🔹 The 70B model cost about $2.58M (1.72M GPU-hours)!
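The cost figures are simply GPU-hours multiplied by an hourly price. Here is a minimal sketch of that arithmetic, assuming the GPU-hour counts as I recall them from the paper's pretraining table and the $1.50 per A100 GPU-hour rate quoted above. Note that roughly 1.04M and 1.72M are the GPU-hour counts for the 34B and 70B models, so their dollar costs at this rate come out higher than those numbers. The names `GPU_HOURS` and `training_cost_usd` are mine, purely for illustration.

```python
# GPU-hours per model size, as recalled from the Llama 2 paper's
# pretraining table (an assumption: double-check against the paper).
GPU_HOURS = {
    "7B": 184_320,
    "13B": 368_640,
    "34B": 1_038_336,
    "70B": 1_720_320,
}

# Assumed rental price in USD per A100 GPU-hour (the rate used in this post).
PRICE_PER_GPU_HOUR = 1.50


def training_cost_usd(gpu_hours: int, price: float = PRICE_PER_GPU_HOUR) -> float:
    """Rental cost of a pretraining run: GPU-hours times hourly price."""
    return gpu_hours * price


costs = {name: training_cost_usd(hours) for name, hours in GPU_HOURS.items()}

if __name__ == "__main__":
    for name, cost in costs.items():
        print(f"{name}: {GPU_HOURS[name]:>9,} GPU-hours -> ${cost:,.0f}")
```

This only covers raw GPU rental for the final pretraining runs; experiments, failed runs, data work, and fine-tuning are not included.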
🌻 Model Download
🔹 Please request here
🔹 I received the email from Meta in an hour with the token.
🔹 chmod 755 download.sh
🔹 Execute the file with the token. When passing the URL to the download script, make sure you paste a URL that begins with https://download.llamameta.net and not with https://l.facebook.com. If you copy the link address from the e-mail, you will get an https://l.facebook.com address, which is incorrect. It is better to copy the URL as plain text through a text editor first, before passing it to the download script.
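If you want to guard against pasting the wrong link, a small check like the one below may help. It is only a sketch: the function name `is_direct_download_url` and the example URLs are hypothetical, and the only thing it verifies is the scheme and host mentioned above.

```python
from urllib.parse import urlparse

# Host that the download URL should point to, as described in this post.
EXPECTED_HOST = "download.llamameta.net"


def is_direct_download_url(url: str) -> bool:
    """Return True only for https URLs on the expected download host."""
    parts = urlparse(url.strip())
    return parts.scheme == "https" and parts.hostname == EXPECTED_HOST


if __name__ == "__main__":
    # Hypothetical examples, not real signed URLs.
    good = "https://download.llamameta.net/example?Policy=abc"
    bad = "https://l.facebook.com/l.php?u=https%3A%2F%2Fdownload.llamameta.net"
    print(is_direct_download_url(good))
    print(is_direct_download_url(bad))
```

Running the check before calling download.sh makes the redirect-link mistake visible immediately instead of partway through a failed download.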
🧑🏫 Evaluation
🔹 The paper evaluates the model in many ways.
While reviewing these results, it is important to note that human evaluations can be noisy due to limitations of the prompt set, subjectivity of the review guidelines, subjectivity of individual raters, and the inherent difficulty of comparing generations.
I evaluated LLaMA-2 Chat! It seems to be of similar quality to the latest Vicuna models. Excited to see how much the community will be able to improve it using the LLaMA-2 base and their fine-tuning pipelines!
🔹 There are some claims that it is better than ChatGPT or StarCoder, but there is no solid proof yet.
🪪 License
🔹 I watched the episode `Mark Zuckerberg: Future of AI at Meta, Facebook, Instagram, and WhatsApp | Lex Fridman Podcast #383`. Lex asked many times about the license issue of the LLaMA 1 models because the weights were leaked before release. LLaMA was being openly distributed via torrents, so it was really hard to detect whether any organization was using the models in production.
🔹 So now businesses can use the models, with one condition: Llama 2 is available for commercial use only if you do not have more than 700M MAU (monthly active users). Products that cross that threshold:
YouTube (~2.5B)
WeChat (1.3B)
TikTok (1B)
LinkedIn (900M)
Snap (750M) - Snapchat: "We have 750M MAUs" - hmm that’s why 700M 🤔
💻 Model Execution on Mac
🔹 Llama 2 weights have already been quantized and are available for llama.cpp for local inference! Details
🔹 Run Llama2 on your MacBook with GPU support!
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build it with Metal (GPU) support
LLAMA_METAL=1 make
# Download the quantized model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"
# Run: read a prompt from the terminal, then generate a response
echo "Prompt: " \
&& read PROMPT \
&& ./main \
-t 8 \
-ngl 1 \
-m "${MODEL}" \
--color \
-c 2048 \
--temp 0.7 \
--repeat_penalty 1.1 \
-n -1 \
-p "### Instruction: ${PROMPT} \n### Response:"
🔹 Both Llama 2 7B and 13B are now available on MLC LLM through CLI. 7B model generating ~46 tok/s on Apple M2 Max and ~156 tok/s on RTX 4090. Stay tuned for the web version, as well as more soon-landing 13B and 70B model optimizations! Details
🌸 Fine Tune Llama-2 with few lines of code!
🔹 With Llama 2 landing in the Hugging Face ecosystem, it is very easy to fine-tune this architecture using various tools from the HF ecosystem (TRL, PEFT, ...). 4-bit quantization and PEFT are enough to fine-tune llama2-7b on a single Google Colab instance!
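To get a feel for why this fits on a single Colab GPU, here is a rough back-of-the-envelope sketch. It assumes LoRA as the PEFT method with rank 16 applied to the query and value projections (my choice for illustration, not something stated above), and uses the 7B model's 32 layers and hidden size of 4096. It counts adapter parameters and the size of the 4-bit base weights only, ignoring activations, optimizer state, and quantization overhead.

```python
# Llama 2 7B architecture facts: 32 decoder layers, hidden size 4096.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
BASE_PARAMS = 7e9  # nominal parameter count

# Assumptions for this sketch (hypothetical fine-tuning setup).
LORA_RANK = 16
TARGET_PROJECTIONS_PER_LAYER = 2  # q_proj and v_proj, each 4096 x 4096


def lora_params_per_matrix(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two small matrices: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


lora_params = (
    NUM_LAYERS
    * TARGET_PROJECTIONS_PER_LAYER
    * lora_params_per_matrix(HIDDEN_SIZE, HIDDEN_SIZE, LORA_RANK)
)
trainable_fraction = lora_params / BASE_PARAMS

# 4-bit weights take half a byte per parameter.
four_bit_weights_gb = BASE_PARAMS * 0.5 / 1e9

if __name__ == "__main__":
    print(f"LoRA trainable parameters: {lora_params:,}")
    print(f"Share of base model: {trainable_fraction:.4%}")
    print(f"4-bit base weights: {four_bit_weights_gb:.1f} GB")
```

Under these assumptions only a fraction of a percent of the parameters are trained, and the frozen base weights shrink to a few GB, which is what makes a single-GPU setup plausible.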
🏁 Deployment
🔹 My intuition was that the 70B model needs 70B * 4 bytes (parameters are fp32, so 4 bytes each) = 280 GB, and roughly 340 to 360 GB of GPU memory in practice (Adam needs at most 40% extra to run).
🔹 I found a script that was tested on a Ray cluster of 4 x g5.24xlarge (32xA10Gs), ~384 GB, and works for all of the model sizes (7B, 13B, and 70B).
🔹 The PR
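The sizing intuition above can be written down as a tiny helper. This is a sketch that counts weight memory only (no KV cache, activations, or optimizer state), and it takes the ~384 GB cluster figure from the bullet above at face value. `weight_memory_gb` is a name I made up for the illustration.

```python
# Bytes needed to store one parameter in each data type.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

# GPU memory of the cluster quoted above, in GB.
CLUSTER_GPU_MEMORY_GB = 384


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


fp32_gb = weight_memory_gb(70e9, "fp32")
fp16_gb = weight_memory_gb(70e9, "fp16")
headroom_gb = CLUSTER_GPU_MEMORY_GB - fp32_gb

if __name__ == "__main__":
    print(f"70B in fp32: {fp32_gb:.0f} GB")
    print(f"70B in fp16: {fp16_gb:.0f} GB")
    print(f"Headroom on the cluster with fp32 weights: {headroom_gb:.0f} GB")
```

The headroom is what remains for everything else the serving process needs, so loading the weights in half precision leaves considerably more room.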
🤯 Load Test
💫 Cost of Summarization: 7 Billion Self-Hosted Model
💰 Pricing of the Model
Instance cost: p4d.24xlarge, ~10 Euro per hour (with AWS discount)
🤖 Model Details:
🏎️ Cost Formula
Total Tokens (Input + Output) * ( 1 / Tokens Per Hour) * Node Cost(Hour)
💰 Derived Details of the Model
Cost per Million Tokens:
(10 <node cost per hour> / (2725.52 <tokens_per_s per GPU> * 8 <GPUs in the node> * 3600 <seconds per hour>)) * 1000000
= $0.13
Average Latency:
~120 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens each
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens each
9000000000 * (1 / (2725.52 <tokens_per_s that I achieved on a single node with a single GPU> * 8 * 3600)) * 10
= $1146
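The cost formula above is easy to reproduce in a few lines. The sketch below uses the measured throughput and the node price from this section and assumes, as the formula does, that throughput scales linearly across the 8 GPUs of the node. The function names are mine.

```python
TOKENS_PER_SECOND_PER_GPU = 2725.52  # measured in the load test
GPUS_PER_NODE = 8                    # p4d.24xlarge
NODE_COST_PER_HOUR = 10.0            # approximate price with discount


def tokens_per_hour() -> float:
    """Node throughput per hour, assuming linear scaling across GPUs."""
    return TOKENS_PER_SECOND_PER_GPU * GPUS_PER_NODE * 3600


def cost_for_tokens(total_tokens: float) -> float:
    """Total tokens * (1 / tokens per hour) * node cost per hour."""
    return total_tokens / tokens_per_hour() * NODE_COST_PER_HOUR


articles = 6_000_000
total_tokens = articles * (1000 + 500)  # input + output tokens per article

cost_per_million = cost_for_tokens(1_000_000)
total_cost = cost_for_tokens(total_tokens)

if __name__ == "__main__":
    print(f"Cost per million tokens: {cost_per_million:.2f}")
    print(f"Total cost for {total_tokens:,} tokens: {total_cost:.0f}")
```

Changing `NODE_COST_PER_HOUR` or the measured throughput is enough to redo the estimate for another instance type or model.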
ℹ️ Details
backend vLLM
dur_s 65.96
tokens_per_s 2725.52
qps 1.52
successful_responses 100
prompt_token_count 74999
response_token_count 104779
median_token_latency 0.1208686280123731
median_e2e_latency 45.40771961212158
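The throughput and QPS values in this listing appear to follow directly from the raw counts. The sketch below recomputes them, assuming that tokens_per_s counts prompt and response tokens together over the run duration (my reading of the numbers, not something the benchmark output states).

```python
# Raw values from the load test output above.
duration_s = 65.96
prompt_token_count = 74_999
response_token_count = 104_779
successful_responses = 100

# Assumption: throughput counts prompt and response tokens together.
tokens_per_s = (prompt_token_count + response_token_count) / duration_s

# Completed requests per second over the whole run.
qps = successful_responses / duration_s

if __name__ == "__main__":
    print(f"tokens_per_s = {tokens_per_s:.2f}")
    print(f"qps          = {qps:.2f}")
```

The small difference from the reported tokens_per_s is consistent with the duration having been rounded to two decimals in the output.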
⚙️ Configuration
Let's talk about a scenario: we need to summarize 6 million Wikipedia articles.
So I used 1000 tokens as input and 500 tokens as output per article.
The sample prompts follow that philosophy: the prompt token length varies from 900 to 1100 and the response token length varies from 450 to 550 tokens.
I used vLLM with continuous batching in the backend.
I have not tested a fine-tuned model in this experiment. If you fine-tune with a different data type, the cost should be lower.
For comparison, the cost will be higher if you use OpenAI or another hosted service. I will publish the detailed load test analysis in the next edition.
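A synthetic workload with these length ranges can be sketched as follows. I am assuming a uniform distribution within each range and a fixed seed for repeatability; the actual load test script may sample lengths differently, and `sample_request_sizes` is a hypothetical helper.

```python
import random


def sample_request_sizes(n: int, seed: int = 0) -> list:
    """Return n (prompt_tokens, response_tokens) pairs.

    Prompt lengths fall in 900..1100 and response lengths in 450..550,
    sampled uniformly (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    return [(rng.randint(900, 1100), rng.randint(450, 550)) for _ in range(n)]


requests = sample_request_sizes(100)
total_prompt_tokens = sum(prompt for prompt, _ in requests)
total_response_tokens = sum(response for _, response in requests)

if __name__ == "__main__":
    print(f"requests: {len(requests)}")
    print(f"prompt tokens: {total_prompt_tokens}")
    print(f"response tokens: {total_response_tokens}")
```

Each pair would then drive one request: a prompt of roughly that many tokens and a generation cap at the sampled response length.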
I will publish the next edition on Sunday.
This is the 4th edition. If you have any feedback, please don't hesitate to share it with me, and if you love my work, do share it with your colleagues.
Cheers!!
Raahul