⏳ The 5th Edition: Load Test Comparison of In-house Self-hosted LLM Models
💫 Cost of Summarization: GPT 4 - 8K Context Length
💰Pricing of the Model
Cost of Prompt:
$30 per million tokens
Cost of Response:
$60 per million tokens
🏎️ Cost Formula
Tokens per Article (in 1000s) * No. of Articles (in 1000s) * Unit Cost (per 1 million tokens)
💸 Cost of Input
1K (Tokens/Article) * 6 Million * $30 = $180,000
🏦 Cost of Output
0.5K (Tokens/Article) * 6 Million * $60 = $180,000
💵 Total Cost
Input Cost + Output Cost = $180,000 + $180,000 = $360,000
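The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `api_cost_usd` and its parameters are made up for illustration, and the prices are the GPT-4 8K figures quoted in this section.

```python
def api_cost_usd(tokens_per_article: int, num_articles: int,
                 usd_per_million_tokens: float) -> float:
    """Cost of sending `num_articles` articles through a per-token-priced API."""
    total_tokens = tokens_per_article * num_articles
    return total_tokens / 1_000_000 * usd_per_million_tokens


NUM_ARTICLES = 6_000_000  # 6 million articles, as in the scenario above

input_cost = api_cost_usd(1_000, NUM_ARTICLES, 30.0)   # 1K prompt tokens per article
output_cost = api_cost_usd(500, NUM_ARTICLES, 60.0)    # 0.5K response tokens per article
total_cost = input_cost + output_cost

print(f"input=${input_cost:,.0f} output=${output_cost:,.0f} total=${total_cost:,.0f}")
```

Swapping in another model's prompt and response prices gives the totals for the other API sections.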
💫 Cost of Summarization: GPT 4 - 32K Context Length
💰Pricing of the Model
Cost of Prompt:
$60 per million tokens
Cost of Response:
$120 per million tokens
🏎️ Cost Formula
Tokens per Article (in 1000s) * No. of Articles (in 1000s) * Unit Cost (per 1 million tokens)
💸 Cost of Input
1K (Tokens/Article) * 6 Million * $60 = $360,000
🏦 Cost of Output
0.5K (Tokens/Article) * 6 Million * $120 = $360,000
💵 Total Cost
Input Cost + Output Cost = $360,000 + $360,000 = $720,000
💫 Cost of Summarization: Anthropic Claude V1
💰Pricing of the Model
Cost of Prompt:
$11 per million tokens
Cost of Response:
$32 per million tokens
🏎️ Cost Formula
Tokens per Article (in 1000s) * No. of Articles (in 1000s) * Unit Cost (per 1 million tokens)
💸 Cost of Input
1K (Tokens/Article) * 6 Million * $11 = $66,000
🏦 Cost of Output
0.5K (Tokens/Article) * 6 Million * $32 = $96,000
💵 Total Cost
Input Cost + Output Cost = $66,000 + $96,000 = $162,000
💫 Cost of Summarization: InstructGPT DaVinci
💰Pricing of the Model
Cost of Prompt:
$20 per million tokens
Cost of Response:
$20 per million tokens
🏎️ Cost Formula
Tokens per Article (in 1000s) * No. of Articles (in 1000s) * Unit Cost (per 1 million tokens)
💸 Cost of Input
1K (Tokens/Article) * 6 Million * $20 = $120,000
🏦 Cost of Output
0.5K (Tokens/Article) * 6 Million * $20 = $60,000
💵 Total Cost
Input Cost + Output Cost = $120,000 + $60,000 = $180,000
💫 Cost of Summarization: Curie
💰Pricing of the Model
Cost of Prompt:
$2 per million tokens
Cost of Response:
$2 per million tokens
🏎️ Cost Formula
Tokens per Article (in 1000s) * No. of Articles (in 1000s) * Unit Cost (per 1 million tokens)
💸 Cost of Input
1K (Tokens/Article) * 6 Million * $2 = $12,000
🏦 Cost of Output
0.5K (Tokens/Article) * 6 Million * $2 = $6,000
💵 Total Cost
Input Cost + Output Cost = $12,000 + $6,000 = $18,000
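To see the five API options side by side, the same calculation can be run over a small price table. This is a sketch only: the dictionary keys are just labels, the prices are the ones quoted in the sections above, and the token volumes assume the same 6 million articles with 1,000 prompt tokens and 500 response tokens each.

```python
# (prompt $/1M tokens, response $/1M tokens), copied from the sections above
API_PRICES = {
    "GPT-4 8K": (30, 60),
    "GPT-4 32K": (60, 120),
    "Claude V1": (11, 32),
    "InstructGPT DaVinci": (20, 20),
    "Curie": (2, 2),
}

INPUT_MTOK = 6_000   # 6 million articles * 1,000 tokens = 6,000 million tokens
OUTPUT_MTOK = 3_000  # 6 million articles * 500 tokens = 3,000 million tokens


def total_cost(prompt_price: int, response_price: int) -> int:
    """Total cost in dollars for the whole summarization job."""
    return INPUT_MTOK * prompt_price + OUTPUT_MTOK * response_price


totals = {name: total_cost(*prices) for name, prices in API_PRICES.items()}

# Print cheapest first
for name, cost in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{name:<22} ${cost:>9,}")
```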
⚠️ Disclaimer
It's not a direct apples-to-apples comparison:
If you use an API service like Azure/OpenAI, you don't need to assemble a load balancer layer, autoscaling, and the other parts. The API encapsulates everything; you only bear the final price. By contrast, if you operate a self-hosted API, you must take care of the batcher service and the cluster operation (Kubernetes) yourself.
The API services use other frameworks (RLHF or a toxicity detection layer) in addition to the LLM models to provide a better-moderated response.
💫 Cost of Summarization: 1.3 Billion Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro
🤖 Model Details
facebook/opt-1.3b · Hugging Face
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens Per Hour) * Node Cost (Per Hour)
💰 Derived Details of the Model
Cost per Million Tokens:
(10 <node cost> / (9135.57 <tokens_per_s> * 8 <8-GPU machine> * 3600)) * 1,000,000
= $0.038
Latency (median per token):
~50 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens each
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens each
9,000,000,000 * (1 / (9135.57 <tokens_per_s, achieved on a single node with a single GPU> * 8 * 3600)) * 10
= $342
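The self-hosted formula above can be written as a small Python sketch. The function and parameter names are hypothetical, and the calculation assumes, as the formula does, that the single-GPU throughput scales linearly across all 8 GPUs of the node.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(node_cost_per_hour: float,
                            tokens_per_s_per_gpu: float,
                            gpus_per_node: int = 8) -> float:
    """Node cost divided by hourly token throughput, scaled to 1M tokens."""
    tokens_per_hour = tokens_per_s_per_gpu * gpus_per_node * SECONDS_PER_HOUR
    return node_cost_per_hour / tokens_per_hour * 1_000_000


def node_hours_needed(total_tokens: int,
                      tokens_per_s_per_gpu: float,
                      gpus_per_node: int = 8) -> float:
    """Hours one node needs to process `total_tokens` (linear scaling assumed)."""
    return total_tokens / (tokens_per_s_per_gpu * gpus_per_node * SECONDS_PER_HOUR)


# Figures for the 1.3B model from this section
per_million = cost_per_million_tokens(10, 9135.57)
node_hours = node_hours_needed(9_000_000_000, 9135.57)
total = node_hours * 10  # node cost per hour

print(f"cost per 1M tokens: {per_million:.3f}")
print(f"node hours:         {node_hours:.1f}")
print(f"total cost:         {total:.0f}")
```

The same two functions apply to every self-hosted model in this edition; only the measured `tokens_per_s` changes.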
Load Test Output:
backend: vLLM
dur_s: 20.12
tokens_per_s: 9135.57
qps: 4.97
successful_responses: 100
prompt_token_count: 74999
response_token_count: 108811
median_token_latency: 0.05344977245685902
median_e2e_latency: 20.09678626060486
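As a sanity check, the throughput figures in the load test output appear to follow from the raw counts. The sketch below assumes that `tokens_per_s` counts both prompt and response tokens, which is my reading of the numbers rather than something the tool's output states; the small difference from the reported value comes from `dur_s` being rounded.

```python
# Raw figures from the 1.3B load test output
dur_s = 20.12
successful_responses = 100
prompt_token_count = 74_999
response_token_count = 108_811

# Assumption: throughput counts prompt + response tokens
tokens_per_s = (prompt_token_count + response_token_count) / dur_s
qps = successful_responses / dur_s

print(f"tokens_per_s ~ {tokens_per_s:.2f} (reported: 9135.57)")
print(f"qps          ~ {qps:.2f} (reported: 4.97)")
```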
💫 Cost of Summarization: 2.7 Billion Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro
🤖 Model Details
facebook/opt-2.7b · Hugging Face
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens Per Hour) * Node Cost (Per Hour)
💰 Derived Details of the Model
Cost per Million Tokens:
(10 <node cost> / (7538.69 <tokens_per_s> * 8 <8-GPU machine> * 3600)) * 1,000,000
= $0.046
Latency (median per token):
~63 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens each
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens each
9,000,000,000 * (1 / (7538.69 <tokens_per_s, achieved on a single node with a single GPU> * 8 * 3600)) * 10
= $414.5
Load Test Output:
backend: vLLM
dur_s: 24.25
tokens_per_s: 7538.69
qps: 4.12
successful_responses: 100
prompt_token_count: 74999
response_token_count: 107817
median_token_latency: 0.06350673893664746
median_e2e_latency: 23.876333951950073
💫 Cost of Summarization: 6.7 Billion Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro
🤖 Model Details
facebook/opt-6.7b · Hugging Face
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens Per Hour) * Node Cost (Per Hour)
💰 Derived Details of the Model
Cost per Million Tokens:
(10 <node cost> / (4285.98 <tokens_per_s> * 8 <8-GPU machine> * 3600)) * 1,000,000
= $0.0810
Latency (median per token):
~70 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens each
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens each
9,000,000,000 * (1 / (4285.98 <tokens_per_s, achieved on a single node with a single GPU> * 8 * 3600)) * 10
= $729.12
Load Test Output:
backend: vLLM
dur_s: 42.51
tokens_per_s: 4285.98
qps: 2.35
successful_responses: 100
prompt_token_count: 74999
response_token_count: 107180
median_token_latency: 0.06988549321815504
median_e2e_latency: 26.218546986579895
💫 Cost of Summarization: 13 Billion Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge~ 10 Euro (With AWS Discount)
🤖 Model Details
facebook/opt-13b · Hugging Face
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens Per Hour) * Node Cost (Per Hour)
💰 Derived Details of the Model
Cost per Million Tokens:
(10 <node cost> / (1564.10 <tokens_per_s> * 8 <8-GPU machine> * 3600)) * 1,000,000
= $0.22
Latency (median per token):
~190 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens each
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens each
9,000,000,000 * (1 / (1564.10 <tokens_per_s, achieved on a single node with a single GPU> * 8 * 3600)) * 10
= $1998
Load Test Output:
backend: vLLM
dur_s: 118.59
tokens_per_s: 1564.10
qps: 0.84
successful_responses: 100
prompt_token_count: 74999
response_token_count: 110482
median_token_latency: 0.1900191184933304
median_e2e_latency: 71.35204672813416
💫 Cost of Summarization: 30 Billion Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro (With AWS Discount)
🤖 Model Details
facebook/opt-30b · Hugging Face
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens Per Hour) * Node Cost (Per Hour)
💰 Derived Details of the Model
Cost per Million Tokens:
(10 <node cost> / (370.2 <tokens_per_s> * 8 <8-GPU machine> * 3600)) * 1,000,000
= $0.937
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens each
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens each
9,000,000,000 * (1 / (370.2 <tokens_per_s, achieved on a single node with a single GPU> * 8 * 3600)) * 10
= $8441
The load test did not complete properly due to a CUDA memory error, which is why I couldn't generate the latency figures. In production, the model will be split into 8 shards across the node, so we won't face this problem.
💫 Cost of Summarization: 7 Billion LLAMA-2 Self-Hosted Model
💰 Pricing of the Model
Instance Cost: p4d.24xlarge ~ 10 Euro (With AWS Discount)
🤖 Model Details
🏎️ Cost Formula
Total Tokens (Input + Output) * (1 / Tokens Per Hour) * Node Cost (Per Hour)
💰 Derived Details of the Model
Cost per Million Tokens:
(10 <node cost> / (2725.52 <tokens_per_s> * 8 <8-GPU machine> * 3600)) * 1,000,000
= $0.13
Latency (median per token):
~120 ms
💵 Total Cost
Total Tokens (Input + Output): 9 Billion
Input Tokens - 6 Billion: 6 Million Articles with 1000 tokens each
Output Tokens - 3 Billion: 6 Million Articles with 500 tokens each
9,000,000,000 * (1 / (2725.52 <tokens_per_s, achieved on a single node with a single GPU> * 8 * 3600)) * 10
= $1146.57
Load Test Output:
backend: vLLM
dur_s: 65.96
tokens_per_s: 2725.52
qps: 1.52
successful_responses: 100
prompt_token_count: 74999
response_token_count: 104779
median_token_latency: 0.1208686280123731
median_e2e_latency: 45.40771961212158
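To compare all the self-hosted runs at once, the measured throughputs can be fed through the same cost formula. This is a rough sketch: the dictionary keys are just labels, every model uses the node cost of 10 per hour that appears in the formulas above, and linear scaling across the 8 GPUs is assumed throughout.

```python
# tokens_per_s as reported in each self-hosted section above
TOKENS_PER_S = {
    "opt-1.3b": 9135.57,
    "opt-2.7b": 7538.69,
    "opt-6.7b": 4285.98,
    "opt-13b": 1564.10,
    "opt-30b": 370.2,
    "llama-2-7b": 2725.52,
}

NODE_COST_PER_HOUR = 10        # as used in the formulas above
GPUS_PER_NODE = 8              # p4d.24xlarge
TOTAL_TOKENS = 9_000_000_000   # 6B input + 3B output


def total_cost(tokens_per_s: float) -> float:
    """Total job cost, assuming throughput scales linearly over all GPUs."""
    node_hours = TOTAL_TOKENS / (tokens_per_s * GPUS_PER_NODE * 3600)
    return node_hours * NODE_COST_PER_HOUR


costs = {model: total_cost(tps) for model, tps in TOKENS_PER_S.items()}

# Print cheapest first
for model, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{model:<12} {cost:>10,.2f}")
```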
💻 Self-hosted Backend
The In-house Code Repo
Continuous Batching: A Distributed Serving System for Transformer-Based Generative Models
vLLM - PagedAttention
🧐 Observation
The node (A100 - p4d.24xlarge) is really hard to get. I spent 20+ hours with AWS to reserve the capacity for the load test. I would recommend going with a 3-year reservation because it is affordable. The cost has been calculated with the 3-year reserved price.
I have not tested the fine-tuned model in this experiment. I have found that a 7b model would cost around ~$350 for the same scenario. In production, we should go with the fine-tuned model.
I explicitly tested the facebook/opt model family because it comes in different size variations. The Falcon, Llama, and MosaicML MPT models are built on modern, optimized architectures, so their inference time and cost will be lower than those of the OPT family.
I have used 1000 tokens per input and 500 tokens per output.
The sample prompts follow the same philosophy: the prompt token length varies from 900 to 1100 tokens and the response token length varies from 450 to 550 tokens.
I have used vLLM and continuous batching in the backend.
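For illustration, request lengths in the ranges described above could be drawn like this. It is only a sketch under assumptions of my own: the uniform distribution, the fixed seed, and the function name `sample_request_lengths` are not taken from the load test code.

```python
import random


def sample_request_lengths(n_requests: int, seed: int = 0) -> list[tuple[int, int]]:
    """Draw (prompt_tokens, response_tokens) pairs within the stated ranges."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return [
        (rng.randint(900, 1100), rng.randint(450, 550))  # both bounds inclusive
        for _ in range(n_requests)
    ]


lengths = sample_request_lengths(100)
mean_prompt = sum(p for p, _ in lengths) / len(lengths)
mean_response = sum(r for _, r in lengths) / len(lengths)

print(f"mean prompt tokens:   {mean_prompt:.0f}")
print(f"mean response tokens: {mean_response:.0f}")
```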
📜 The One Liner
I will publish the next edition on Thursday.
This is the 5th edition. If you have any feedback, please don’t hesitate to share it with me, and if you love my work, do share it with your colleagues.
Cheers!!
Raahul