Cost Of Inference
I was curious about the cost of GPT-4.5 inference compared to the cost of meat brains (“hooomans”), such as data scientists or machine learning engineers in Shenzhen, Manila, Cairo, or Bogota, who might earn $10–50/hr, or 4X as much in SF, London, or Singapore. I’m ignoring training costs, since those are a one-time expense amortized across potentially billions of users, whereas inference costs scale with every use.
When you interact with ChatGPT (or similar generative AI), your input (text, image, sound) is ultimately represented as numbers (“tokens”). The neural network analyzes token relationships through layers of abstraction and encodes meaning using “attention” (see Google’s paper Attention Is All You Need).
Punchline: Attention lets neural networks grasp context by relating tokens to each other.
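To make that punchline concrete, here is a minimal numpy sketch of the scaled dot-product attention described in that paper. The three toy tokens and random embeddings are purely illustrative assumptions, not anything taken from GPT-4.5 itself.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per 'Attention Is All You Need'."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how strongly each token relates to every other token
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability before softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each token's attention weights sum to 1
    return weights @ V                              # each output is a context-weighted mix of value vectors

# Three toy "tokens" with 4-dimensional embeddings (illustrative only).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))
contextualized = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention
print(contextualized.shape)  # (3, 4): one context-aware vector per input token
```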
Once trained, the network generates output by predicting the next token based on patterns it learned. But generative AI isn’t creating genuinely new knowledge; it just predicts what’s statistically likely given its training data (the weights having been fitted via the backpropagation algorithm during training). If it had been trained on data from 30 years ago, it couldn’t predict discoveries made since, like those from the Hubble telescope. This highlights a fundamental limitation: neural nets iteratively encode existing knowledge rather than creating fundamentally new insights. I don’t believe this approach will achieve Artificial General Intelligence. Our meat brains remain superior at that.
My opinion differs from what marketing teams at Microsoft or OpenAI will tell you. Maybe I’m wrong. Regardless, the cost of generative AI, especially large models like GPT-4.5 (estimated around 7+ trillion parameters—roughly 10x larger than GPT-4.0), becomes interesting compared to human labor.
The detailed analysis below estimates the hourly cost of GPT-4.5 inference, carefully accounting for hardware, electricity, and infrastructure. As we scale to even larger models (GPT-5.0 could be another 10x larger), these costs rise dramatically, raising important questions about the physical limits of our current approach (backpropagation on GPUs) compared to human brains.
Here’s the detailed analysis (generated using GPT-4.5 deep research), starting with the conclusion first.
Running a one-hour GPT-4.5 conversation on dedicated hardware could easily cost on the order of $40-70 in a straightforward setup. With optimized deployment (sharing GPUs across many sessions, bulk pricing, or owning hardware) the effective cost might drop to the low double-digits or even single-digit dollars per hour. This cost estimate includes the GPU rental or depreciation, power (~$1 or less/hr), cooling/infra (perhaps another ~$1/hr), and maintenance (a few cents to a dollar/hr).
Given all these factors, a reasonable estimate is that GPT-4.5 costs on the order of $50 per hour of conversation using cloud on-demand hardware, whereas a well-utilized in-house system might handle it for roughly half that. In practice, exact figures will vary, but it’s clear that supporting even a single GPT-4.5 chat session for an hour is quite expensive, more than the hourly rate of most data scientists or machine learning engineers around the world.
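For anyone who wants to sanity-check those headline numbers, here is a back-of-envelope sketch. Every input (GPU count, rental rates, power, overhead) is an assumption drawn from the analysis below, not a measured figure.

```python
# Back-of-envelope: one dedicated GPT-4.5 session for one hour (assumed inputs, see analysis below).
gpus               = 8        # assumed: 8xH100 dedicated to a single conversation
azure_h100_rate    = 6.98     # $/GPU-hour, Azure on-demand (cited below)
aws_h100_rate      = 12.29    # $/GPU-hour, AWS on-demand (~$98.3 for the 8-GPU instance)
owned_depreciation = 6.00     # $/hour for the whole server, ~5-year amortization
power_and_cooling  = 0.85     # $/hour, ~5.6 kW of GPUs plus cooling overhead at $0.10/kWh
maintenance        = 2.30     # $/hour, ~10%/year of a ~$200k system running 24/7

print(f"Azure on-demand : ~${gpus * azure_h100_rate:.0f}/hour")    # ~$56
print(f"AWS on-demand   : ~${gpus * aws_h100_rate:.0f}/hour")      # ~$98
print(f"Owned, 24/7 use : ~${owned_depreciation + power_and_cooling + maintenance:.0f}/hour")  # ~$9
print("Human baseline  : $10-50/hour (Shenzhen, Manila, Cairo, Bogota)")
```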
–
Below is the research returned by GPT-4.5
–
Cost Analysis for Running a GPT‑4.5 Conversation (1 Hour)
Estimated Model Size of GPT‑4.5
GPT-4.5 is expected to be extremely large in terms of parameters. OpenAI hasn’t released official numbers, but industry observers give some estimates. GPT-4 was rumored to use a mixture-of-experts architecture with over 1 trillion total parameters (though only ~200 billion “active” at once) (GPT-4.5: “Not a frontier model”? - by Nathan Lambert). Building on that, GPT-4.5 is believed to be significantly bigger – one analysis suggests on the order of 5–7 trillion parameters total, with roughly 600 billion active parameters during use (GPT-4.5: “Not a frontier model”? - by Nathan Lambert). Some rumors go even higher, claiming GPT-4.5 might approach ≈12 trillion parameters in total (about an order of magnitude above GPT-4) (GPT-4.5: “Not a frontier model”? | Hacker News). In any case, GPT-4.5 was described as “the largest model [OpenAI] had ever built” (GPT-4.5 explained: Everything you need to know), so it likely substantially exceeds GPT-4’s size. This massive scale drives up the resource requirements and cost for running the model.
GPU Requirements for Real-Time Inference
To have a natural conversation with GPT-4.5 at roughly 13,000 tokens per hour (≈3.6 tokens/sec), substantial hardware is needed. The model is far too large for a single GPU’s memory, so it must be sharded across multiple GPUs for inference. For example, a ~600B active parameter model in half-precision would require on the order of 1+ terabytes of memory, meaning easily 8–16 high-memory GPUs just to load the model. (By comparison, GPT-4’s ~200B active params reportedly already needed multiple GPUs in parallel (GPT-4.5: “Not a frontier model”? - by Nathan Lambert).)
In addition to memory, we need enough compute to generate ~3–4 tokens per second. GPT-4’s throughput has been observed around ~12 tokens/sec in practice, using substantial compute power, whereas GPT-4.5 (with ~3× more active parameters) would likely be slower per GPU (Inference Race To The Bottom – Make It Up On Volume? – SemiAnalysis). Thus, achieving real-time conversational speed might require distributing the inference across many GPUs working in parallel. For instance, a cluster of 8 top-tier GPUs (or more) can feasibly handle a single GPT-4.5 conversation with low latency, whereas a smaller setup could become a bottleneck and lag behind the ~13k tokens/hour pace. In short, running GPT-4.5 in real-time is likely to demand a server-class system (or cloud instance) with at least 8–16 GPUs dedicated to the task.
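A short sketch of the two constraints just described, memory and token rate, using the rumored ~600B active parameters and the ~13k tokens/hour conversational pace as assumptions:

```python
# Rough serving constraints for GPT-4.5 (parameter count is a rumor, not an official figure).
active_params   = 600e9      # assumed active parameters per token (mixture-of-experts)
bytes_per_param = 2          # FP16/BF16 weights
gpu_memory_gb   = 80         # A100/H100 80GB class

weights_gb = active_params * bytes_per_param / 1e9
gpus_to_hold_weights = weights_gb / gpu_memory_gb    # before KV cache and activation overhead
print(f"weights alone: ~{weights_gb:.0f} GB -> ~{gpus_to_hold_weights:.0f} x 80GB GPUs just to load the model")

tokens_per_hour = 13_000
print(f"conversation pace: ~{tokens_per_hour / 3600:.1f} tokens/sec sustained for the full hour")
```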
GPU Hardware Choices: H100 vs A100 vs TPUv5
The choice of accelerator heavily influences both performance and cost. Below we compare NVIDIA’s popular GPUs and Google’s TPU:
- NVIDIA A100 (80GB) – This was the previous-generation flagship for AI. It provides excellent FP16/BF16 throughput (~312 TFLOPs FP16) and high memory bandwidth (~2 TB/s). An A100 80GB (SXM form factor) draws about 400 W at full load ([NVIDIA H100 vs A100 vs L40S: Which GPU Should You Choose? | HorizonIQ](https://www.horizoniq.com/blog/h100-vs-a100-vs-l40s/#:~:text=What%20are%20the%20power%20and,form%20factor%20considerations)). It is still widely used for large-model inference due to its large memory and mature software support. Cost-wise, A100s are now cheaper to acquire or rent: cloud pricing is around $3.5–$4 per hour for one 80GB A100 on Azure (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch), and purchase prices range $10k–$18k per GPU depending on configuration ([How much is an Nvidia A100? | Modal Blog](https://modal.com/blog/nvidia-a100-price-article#:~:text=How%20much%20is%20an%20Nvidia,for%20a%2080GB%20SXM%20model)). However, GPT-4.5 would likely require many A100s working together, which can increase latency unless carefully optimized. In some scenarios, older A100s can actually be more cost-effective per memory-bandwidth than newer GPUs for inference, since memory (not compute) can be the bottleneck (Inference Race To The Bottom – Make It Up On Volume? – SemiAnalysis).
- NVIDIA H100 (80GB) – NVIDIA’s latest-generation “Hopper” GPU offers major performance gains but at higher cost. The H100 has roughly 3–4× the throughput of A100 on transformer models (up to ~4× faster than A100 in training large models) ([Google is rapidly turning into a formidable opponent to BFF Nvidia — the TPU v5p AI chip powering its hypercomputer is faster and has more memory and bandwidth than ever before, beating even the mighty H100 | TechRadar](https://www.techradar.com/pro/google-is-rapidly-turning-into-a-formidable-opponent-to-bff-nvidia-the-tpu-v5p-ai-chip-powering-its-hypercomputer-is-faster-and-has-more-memory-and-bandwidth-than-ever-before-beating-even-the-mighty-h100#:~:text=Google%27s%20TPU%20v4%2C%20meanwhile%2C%20is,any%20conclusions%20can%20be%20drawn)). It also supports faster data formats (FP8) that can accelerate inference if the model is quantized. Each H100 80GB has huge memory bandwidth (~3 TB/s) and can consume up to 700 W power at peak load ([NVIDIA H100 vs A100 vs L40S: Which GPU Should You Choose? | HorizonIQ](https://www.horizoniq.com/blog/h100-vs-a100-vs-l40s/#:~:text=What%20are%20the%20power%20and,form%20factor%20considerations)). The H100 excels at serving large language models and would likely reduce the number of GPUs needed (or boost tokens/sec) for GPT-4.5. The trade-off is cost: cloud rentals for H100s are expensive (on AWS on-demand an 8×H100 instance is ~$98.3/hour, ~$12.3 per GPU-hour (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch), though committed rates can bring it to ~$5–$7 per GPU-hour ([Unlocking Savings: Why Buying NVIDIA H100 GPUs Beat AWS Rental Costs | TRG Datacenters](https://www.trgdatacenters.com/resource/unlocking-savings-why-nvidia-h100-gpus-beat-aws-rental-costs/#:~:text=AWS%20offers%20the%20powerful%20EC2,for%20a%20single%20H100%20GPU)) (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch)). Buying H100s outright costs around $25k–$30k+ each ([Buy Used NVIDIA H100 GPU | Brightstar Systems](https://brightstarsystems.com/product/nvidia-h100-gpu/?srsltid=AfmBOop0kK5L5N4ffmN0zQXJ0bH_53nwosk8OZ-JlgniD0UidCErF7cI#:~:text=Buy%20Used%20NVIDIA%20H100%20GPU,Exact%20prices%20may%20differ)). In summary, H100s provide top performance for GPT-4.5 but at a high price; they’re best when maximum speed or lower latency is required.
- Google TPU v5 – Google’s TPUs are custom AI accelerators used internally and via Google Cloud. The TPU v5 family (recently deployed for models like Google’s Gemini) is designed for both training and inference at massive scale. The high-end TPU v5p pods have been reported as on par or faster than NVIDIA H100 in performance – roughly 3.4–4.8× the speed of an A100 ([Google is rapidly turning into a formidable opponent to BFF Nvidia — the TPU v5p AI chip powering its hypercomputer is faster and has more memory and bandwidth than ever before, beating even the mighty H100 | TechRadar](https://www.techradar.com/pro/google-is-rapidly-turning-into-a-formidable-opponent-to-bff-nvidia-the-tpu-v5p-ai-chip-powering-its-hypercomputer-is-faster-and-has-more-memory-and-bandwidth-than-ever-before-beating-even-the-mighty-h100#:~:text=Google%27s%20TPU%20v4%2C%20meanwhile%2C%20is,any%20conclusions%20can%20be%20drawn)), which puts v5p in the same league as H100 in raw capability. Each TPU v5 chip also has large onboard memory (v5 pods offer up to 95 GB HBM per chip) ([Google is rapidly turning into a formidable opponent to BFF Nvidia — the TPU v5p AI chip powering its hypercomputer is faster and has more memory and bandwidth than ever before, beating even the mighty H100 | TechRadar](https://www.techradar.com/pro/google-is-rapidly-turning-into-a-formidable-opponent-to-bff-nvidia-the-tpu-v5p-ai-chip-powering-its-hypercomputer-is-faster-and-has-more-memory-and-bandwidth-than-ever-before-beating-even-the-mighty-h100#:~:text=system,HBM%20RAM%20in%20TPU%20v4)) ([Google is rapidly turning into a formidable opponent to BFF Nvidia — the TPU v5p AI chip powering its hypercomputer is faster and has more memory and bandwidth than ever before, beating even the mighty H100 | TechRadar](https://www.techradar.com/pro/google-is-rapidly-turning-into-a-formidable-opponent-to-bff-nvidia-the-tpu-v5p-ai-chip-powering-its-hypercomputer-is-faster-and-has-more-memory-and-bandwidth-than-ever-before-beating-even-the-mighty-h100#:~:text=the%20company%27s%20own%20data)). Google also offers a cost-optimized TPU v5e variant aimed at superior price-efficiency for inference workloads. TPUs tend to run cooler and at lower power per chip than a comparable GPU (Google prioritized total cost of ownership over sheer FLOPS) (TPUv5e: The New Benchmark in Cost-Efficient Inference and Training for – SemiAnalysis). Performance/Cost: Google claims its TPU-based inference systems deliver 2–4× more performance per dollar than GPU setups (Performance per dollar of GPUs and TPUs for AI inference). In practical terms, if one could use TPUv5 for GPT-4.5, it might require a similar number of accelerator chips, but the cost per hour could be lower on Google Cloud’s TPU pricing vs. renting equivalent H100 instances. The caveat is that TPUs are generally available only through Google’s cloud (not for purchase), and software compatibility needs to be considered (one would need an XLA-compatible implementation of GPT-4.5, e.g. in TensorFlow or JAX). Still, TPUs present a compelling alternative especially for large-scale deployments where efficiency is critical.
(Other hardware: Some organizations might consider alternative AI chips like AMD’s MI300X GPUs or even custom accelerators. However, H100-class GPUs and Google TPUs represent the current cutting edge, so we focus on them.)
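To put the per-accelerator prices above side by side, here is a small sketch of what an 8-accelerator box costs per hour at the cloud rates cited in this comparison. The 8-GPU configuration and the specific rates are assumptions taken from those citations; TPUs are omitted because no per-chip rate is quoted above.

```python
# Rough hourly cost of an 8-accelerator inference box at the cloud rates cited above.
# Treat these as ballpark figures, not quotes; list prices change frequently.
rates_per_gpu_hour = {
    "A100 80GB, Azure on-demand":      3.67,
    "H100 80GB, AWS on-demand":        12.29,
    "H100 80GB, committed/discounted": 6.00,   # midpoint of the ~$5-$7 range cited above
}
for name, rate in rates_per_gpu_hour.items():
    print(f"{name:34s} 8x -> ~${8 * rate:6.2f}/hour")
```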
Cloud Hosting vs. On-Premise Deployment
Cloud hosting offers on-demand access to this hardware, while on-premise deployment means buying or leasing your own servers with GPUs. Each approach has trade-offs:
- Cloud Costs: Renting the required GPUs by the hour is expensive but flexible. As noted, on-demand prices for an 8×GPU server can range widely – e.g. ~$98/hour for 8×H100 on AWS (us-east-1) (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch), or about $44.50/hour on a discounted plan ([Unlocking Savings: Why Buying NVIDIA H100 GPUs Beat AWS Rental Costs | TRG Datacenters](https://www.trgdatacenters.com/resource/unlocking-savings-why-nvidia-h100-gpus-beat-aws-rental-costs/#:~:text=AWS%20offers%20the%20powerful%20EC2,for%20a%20single%20H100%20GPU)). That equates to roughly $5–$12 per GPU-hour for H100s in the cloud. For A100s, on-demand costs are lower (AWS offers 8×A100 40GB at ~$32.77/hour, ~$4/GPU-hour (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch)). Microsoft and Google clouds have similar pricing (Azure ~$6.98/hour for one H100, ~$3.67/hour for one A100 (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch) (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch)). Using cloud, you would likely need the cluster of GPUs for the entire duration of the conversation. For one hour of GPT-4.5 usage, the cloud rental cost is on the order of tens of dollars – e.g. on Azure, 8×H100 for an hour ≈ $56 (8×$6.98) (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch); on AWS on-demand it could be $90–$100. The cloud provider’s cost includes infrastructure, power, and maintenance built-in (and their profit margin), so it’s essentially a pay-as-you-go model with no upfront investment. This is ideal if you only occasionally need to run such a model or want to scale up/down dynamically.
- On-Premise Costs: Purchasing and hosting your own hardware is very costly up front but can be cheaper in the long run for heavy use. For instance, an H100 80GB GPU costs around $25k–$30k each ([Buy Used NVIDIA H100 GPU | Brightstar Systems](https://brightstarsystems.com/product/nvidia-h100-gpu/?srsltid=AfmBOop0kK5L5N4ffmN0zQXJ0bH_53nwosk8OZ-JlgniD0UidCErF7cI#:~:text=Buy%20Used%20NVIDIA%20H100%20GPU,Exact%20prices%20may%20differ)). An 8-GPU server might be $200k–$250k+ when you include networking and CPUs. You’d also need appropriate cooling and power delivery. However, if you amortize that cost over, say, a 3- to 5-year lifespan and keep the hardware busy, the per-hour cost plummets. One analysis shows that buying an H100 and colocating it (with ~$3,600/year in facility power/cooling costs) leads to about $6,600 per year total cost per GPU ([Unlocking Savings: Why Buying NVIDIA H100 GPUs Beat AWS Rental Costs | TRG Datacenters](https://www.trgdatacenters.com/resource/unlocking-savings-why-nvidia-h100-gpus-beat-aws-rental-costs/#:~:text=If%20a%20company%20opts%20to,business%20back%20%243%2C600%20per%20year)). If utilized 24/7, that’s roughly $0.75 per GPU-hour – an order of magnitude cheaper than cloud renting ([Unlocking Savings: Why Buying NVIDIA H100 GPUs Beat AWS Rental Costs | TRG Datacenters](https://www.trgdatacenters.com/resource/unlocking-savings-why-nvidia-h100-gpus-beat-aws-rental-costs/#:~:text=When%20you%20purchase%20an%20NVIDIA,3%2C600%20%3D%20%246%2C600%20per%20year)). In that scenario, the capital expense pays off in ~8–12 months versus cloud costs ([Unlocking Savings: Why Buying NVIDIA H100 GPUs Beat AWS Rental Costs | TRG Datacenters](https://www.trgdatacenters.com/resource/unlocking-savings-why-nvidia-h100-gpus-beat-aws-rental-costs/#:~:text=,Cost%20of%20Purchased%20GPU%3A%20%246%2C600)). On-premise deployment becomes cost-effective only if you have near-constant usage and the expertise to manage the infrastructure. For a single one-hour conversation, purchasing hardware is obviously impractical. But for companies like OpenAI or others serving many requests, owning the servers significantly lowers marginal costs per conversation (after the initial investment).
In summary, cloud is best for flexibility and low utilization, whereas on-premise wins for high, steady utilization (despite large upfront costs). Some large users even do a hybrid: e.g. handle baseline load on owned hardware and burst to cloud for spikes.
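A sketch of the amortization arithmetic behind that comparison. The purchase price, colocation fee, committed cloud rate, and the roughly $3k/year of depreciation implied by the cited ~$6,600/year figure are all assumptions, not vendor numbers.

```python
# Owning vs renting one H100, using the figures cited above (all approximate).
purchase_price  = 30_000    # $ per H100, upper end of the quoted range
colo_per_year   = 3_600     # $ per GPU-year for facility power/cooling in a colo
depr_per_year   = 3_000     # $ per GPU-year of depreciation implied by the ~$6,600/year total
cloud_committed = 5.00      # $/GPU-hour, low end of the ~$5-$7 committed range cited above
hours_per_year  = 24 * 365

owned_per_gpu_hour = (depr_per_year + colo_per_year) / hours_per_year
print(f"owned, 24/7 : ~${owned_per_gpu_hour:.2f}/GPU-hour")            # ~$0.75
print(f"cloud       : ~${cloud_committed:.2f}/GPU-hour (committed)")

payback_months = purchase_price / (cloud_committed - owned_per_gpu_hour) / hours_per_year * 12
print(f"purchase pays for itself after ~{payback_months:.0f} months of continuous use")  # ~10 months
```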
Power Consumption and Energy Cost
Running a GPT-4.5 model is power-intensive. The GPUs themselves draw a lot of electricity, and that translates to cost and heat. Each NVIDIA H100 (SXM5) can consume up to 700 watts at peak load (NVIDIA H100 vs A100 vs L40S: Which GPU Should You Choose? | HorizonIQ), and an A100 (SXM4) about 400 W. That means a server with 8×H100 could pull around 5.6 kW of power when the model is running full throttle. Over one hour, that’s 5.6 kilowatt-hours of energy used. If we assume an electricity rate of ~$0.10 per kWh (typical for data centers in the US), the direct electricity cost for the GPUs would be roughly $0.56 for that hour. An 8×A100 setup (~3.2 kW) would use ~3.2 kWh, costing maybe $0.30 in power.
However, total power cost includes more than just the GPUs. The server’s CPUs, memory, and other components also draw power, and the data center cooling systems consume additional energy. Data center efficiency is often measured by PUE (Power Usage Effectiveness) – many modern facilities operate with PUE around 1.2–1.5 (meaning for every 1 W to IT equipment, ~0.2–0.5 W is used for cooling, networking, etc.) (Importance of PUE on Data Center Costs - AKCP). So that 5.6 kW of GPU power might really be ~7 kW at the wall including overhead, pushing the hourly energy cost to roughly $0.70–$0.85 – and up to about $1/hour once CPUs, memory, and networking are counted – for the H100 cluster example.
Importantly, in cloud rentals you don’t pay this electricity directly – it’s baked into the rental price. But for on-premise, the power cost is an ongoing operational expense. Over a year, the GPUs in an 8×H100 server running 24/7 could consume ~49,000 kWh. At $0.10/kWh that’s ~$4,900/year in raw electricity – only part of the ~$3,600-per-GPU-per-year colocation figure cited earlier, which also covers cooling, space, and facility overhead ([Unlocking Savings: Why Buying NVIDIA H100 GPUs Beat AWS Rental Costs | TRG Datacenters](https://www.trgdatacenters.com/resource/unlocking-savings-why-nvidia-h100-gpus-beat-aws-rental-costs/#:~:text=If%20a%20company%20opts%20to,business%20back%20%243%2C600%20per%20year)). For a single hour of usage, the electricity cost is only on the order of $1 or less, which is a tiny fraction of the GPU rental cost (which might be $50–$100). This shows that hardware cost (and profit margin) far outweighs raw power cost for short-term use. But at massive scale, power adds up and becomes a significant part of TCO (total cost of ownership).
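The electricity arithmetic from this section, collected in one place. Power draw, PUE, and the $0.10/kWh rate are the assumptions stated above.

```python
# Electricity for an 8xH100 inference server (assumptions from the text above).
gpus, watts_each = 8, 700          # H100 SXM5 peak draw
pue              = 1.25            # data-center overhead for cooling, networking, etc.
usd_per_kwh      = 0.10            # typical US data-center rate

gpu_kw  = gpus * watts_each / 1000   # 5.6 kW of GPU load
wall_kw = gpu_kw * pue               # ~7 kW at the wall including overhead

print(f"GPUs only      : ~${gpu_kw * usd_per_kwh:.2f}/hour")    # ~$0.56
print(f"with PUE       : ~${wall_kw * usd_per_kwh:.2f}/hour")   # ~$0.70
annual_kwh = gpu_kw * 24 * 365
print(f"GPUs only, 24/7: ~{annual_kwh:,.0f} kWh/year -> ~${annual_kwh * usd_per_kwh:,.0f}/year")  # ~49,000 kWh, ~$4,900
```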
Infrastructure, Cooling, and Maintenance
Beyond power, running such a model requires robust infrastructure and incurs cooling and maintenance costs:
- Cooling: High-powered GPUs like the H100 generate a lot of heat – an 8×H100 server can output ~5–7 kW of heat that must be dissipated. Data centers use HVAC chillers or liquid cooling to maintain safe temperatures. Cooling equipment and electricity can account for roughly 20%–50% extra power as noted with PUE. In dollar terms, if our GPUs use ~$0.56/h in power, cooling might add roughly $0.10–$0.30/h on top. Advanced methods (liquid cooling, heat reuse) can reduce this overhead, but there’s always some cost. The cooling infrastructure (CRAC units, cooling towers or liquid loops, etc.) is part of facility costs often rolled into a colocation fee (e.g. the earlier figure of $3,600/year per H100 included power and cooling services in a colo data center ([Unlocking Savings: Why Buying NVIDIA H100 GPUs Beat AWS Rental Costs | TRG Datacenters](https://www.trgdatacenters.com/resource/unlocking-savings-why-nvidia-h100-gpus-beat-aws-rental-costs/#:~:text=If%20a%20company%20opts%20to,business%20back%20%243%2C600%20per%20year))).
Maintenance & Operations: If you deploy on-prem, you need skilled staff to maintain the hardware and software. GPUs occasionally fail or need driver updates; servers need spare parts, monitoring, and security. There’s also the space and networking – rack space in a data center and high-speed network fabric (like InfiniBand or NVLink switches for multi-GPU communication) which can be costly. These factors are generally included in cloud prices, but on-prem they translate to additional costs. A rule of thumb might be to add a certain percentage of hardware cost per year for maintenance – for example, maybe 5-10% of the hardware value per year in support contracts or staff time. For one hour of use, these are negligible, but for completeness: if an 8-GPU system costs $200k and one assumes ~10%/year maintenance+support, that’s $20k/year (~$2.28/hour if 24/7) added overhead.
- Infrastructure Depreciation: Hardware becomes obsolete – one must plan to upgrade GPUs every few years as newer models (like NVIDIA’s next-gen Blackwell GPUs, priced ~$30–$40k each (Nvidia’s latest AI chip will cost more than $30,000, CEO says - CNBC)) come out with far better efficiency. Companies often depreciate servers over 3-5 years. That depreciation (the effective “cost per year” of the capital) is a key part of on-prem cost. In the earlier per-GPU calculation, roughly $3k/year of depreciation plus ~$3.6k/year of power/cooling gave the ~$6.6k/year total ([Unlocking Savings: Why Buying NVIDIA H100 GPUs Beat AWS Rental Costs | TRG Datacenters](https://www.trgdatacenters.com/resource/unlocking-savings-why-nvidia-h100-gpus-beat-aws-rental-costs/#:~:text=If%20a%20company%20opts%20to,business%20back%20%243%2C600%20per%20year)); a steeper 3–5 year write-off of a $25k–$30k GPU would push depreciation alone to $5k–$10k/year. If usage is lower (not 24/7), the effective cost per hour of ownership rises because you’re paying the same yearly fixed costs for fewer compute-hours.
In a cloud scenario, infrastructure and maintenance are the provider’s responsibility – you just pay the hourly rate. In an on-prem scenario, you must account for these ongoing costs in your cost per hour. For a rough estimate, infrastructure and maintenance might add on the order of a few dollars per hour for an 8×GPU system (amortized), which is far less than the raw cloud rental rate, but in line with the much lower operating cost when hardware is fully utilized by its owner.
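As a rough illustration of how those overheads translate into an hourly adder, here is a sketch using the assumed figures above (a ~$200k 8-GPU system, ~10%/year maintenance, 5-year depreciation); the utilization sweep simply shows why idle hardware gets expensive per useful hour.

```python
# Amortized overhead for an owned 8-GPU system (assumed figures from the text above).
system_price     = 200_000   # $ for an 8xH100 server incl. CPUs and networking
maintenance_rate = 0.10      # ~10% of hardware value per year in support/staff time
depr_years       = 5         # typical 3-5 year depreciation schedule
hours_per_year   = 24 * 365

fixed_per_year = system_price * maintenance_rate + system_price / depr_years
print(f"maintenance ~${system_price * maintenance_rate / hours_per_year:.2f}/hr, "
      f"depreciation ~${system_price / depr_years / hours_per_year:.2f}/hr at 24/7 use")

# The same fixed yearly costs spread over fewer busy hours at lower utilization:
for utilization in (1.0, 0.5, 0.25):
    print(f"  {utilization:4.0%} utilized -> ~${fixed_per_year / (hours_per_year * utilization):.2f} per busy hour")
```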
Estimated Total Cost per Hour of GPT‑4.5
Bringing it all together, we can estimate a cost range for a one-hour GPT-4.5 conversation:
- Using Cloud GPUs: On-demand with top hardware, expect on the order of $50–$100 per hour. For instance, renting 8 H100 GPUs for one hour is ~$90 on AWS (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch) (or ~$56 on Azure (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch)). If using slightly cheaper or fewer GPUs (or A100s), it might be on the lower end (maybe $20–$40/hour with enough A100s to do the job). These costs are largely driven by rental pricing and assume the hardware is dedicated to your one conversation (no cost-sharing with other tasks).
- On-Premise (fully utilized): The effective cost can drop to single-digit dollars per hour if you already own the hardware and keep it busy. Using our earlier amortization: one hour on an 8×H100 server might cost on the order of ~$6–$10 in terms of hardware depreciation and power (e.g. ~$6/hour if fully amortized over 5 years, plus ~$1 for electricity, etc.). This is dramatically lower than cloud. However, this assumes the infrastructure was bought for continuous use. If you only occasionally run a conversation, the unused time doesn’t “cost nothing” – the capital is just idle. So for a one-off hour on owned hardware, it’s not literally $6; it’s only that cheap if you’re doing it all the time and averaging the costs.
- Including Overheads: If we factor in cooling, facilities, and maintenance on-prem, maybe add a couple dollars per hour. So it comes to roughly $8–$15/hour true cost for an 8×H100 equivalent when amortized. Cloud providers, by contrast, incorporate those overheads into the high hourly rates you pay them.
- OpenAI API Pricing (for context): OpenAI’s API price for GPT-4.5 is reportedly $75 per 1M input tokens and $150 per 1M output tokens (GPT-4.5 has an API price of $75/1M input and $150/1M output. ChatGPT Plus users are going to get 5 queries per month with this level of pricing. : r/OpenAI). At those rates, a 1-hour conversation (~13k tokens in+out) would cost the customer only about $1–$2 (even 13k tokens billed entirely at the output rate is under $2). This implies that OpenAI’s cost per hour per conversation is in that ballpark or lower – achieved through serving many users on one machine. They likely batch multiple conversations on the same GPU or otherwise utilize the hardware efficiently to amortize costs. High utilization and model optimizations (like faster kernels or quantization) allow the actual cost per user to be brought down significantly from the raw “one model on one box” cost.
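A final sketch tying the scenarios above together, plus a check of what the same hour costs through the API at the reported token prices. The even split between input and output tokens is an assumption.

```python
# One-hour GPT-4.5 conversation under the scenarios above (estimates, not quotes).
scenarios_per_hour = {
    "Cloud on-demand, 8xH100 (AWS)":       8 * 12.29,
    "Cloud on-demand, 8xH100 (Azure)":     8 * 6.98,
    "Owned 8xH100, 24/7, incl. overheads": 6 + 1 + 2,   # depreciation + power + cooling/maintenance
}
for name, cost in scenarios_per_hour.items():
    print(f"{name:38s} ~${cost:6.2f}/hour")

# What the customer pays via the API at the reported prices (assumed 50/50 token split).
in_tok, out_tok = 6_500, 6_500
api_cost = in_tok / 1e6 * 75 + out_tok / 1e6 * 150
print(f"{'OpenAI API, ~13k tokens':38s} ~${api_cost:6.2f} for the hour")   # ~$1.46
```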
Sources: The above analysis draws on known industry data and trends, including model size extrapolations (GPT-4.5: “Not a frontier model”? - by Nathan Lambert) ([GPT-4.5: “Not a frontier model”? | Hacker News](https://news.ycombinator.com/item?id=43230965#:~:text=sigmoid10%20%20%2022%20,24%20%5B%E2%80%93)), GPU performance/power specs ([NVIDIA H100 vs A100 vs L40S: Which GPU Should You Choose? | HorizonIQ](https://www.horizoniq.com/blog/h100-vs-a100-vs-l40s/#:~:text=What%20are%20the%20power%20and,form%20factor%20considerations)) ([Google is rapidly turning into a formidable opponent to BFF Nvidia — the TPU v5p AI chip powering its hypercomputer is faster and has more memory and bandwidth than ever before, beating even the mighty H100 | TechRadar](https://www.techradar.com/pro/google-is-rapidly-turning-into-a-formidable-opponent-to-bff-nvidia-the-tpu-v5p-ai-chip-powering-its-hypercomputer-is-faster-and-has-more-memory-and-bandwidth-than-ever-before-beating-even-the-mighty-h100#:~:text=Google%27s%20TPU%20v4%2C%20meanwhile%2C%20is,any%20conclusions%20can%20be%20drawn)), cloud pricing from AWS/Azure (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch) (Cloud GPU Pricing Comparison in 2025 — Blog — DataCrunch), reports of on-premise costs ([Unlocking Savings: Why Buying NVIDIA H100 GPUs Beat AWS Rental Costs | TRG Datacenters](https://www.trgdatacenters.com/resource/unlocking-savings-why-nvidia-h100-gpus-beat-aws-rental-costs/#:~:text=If%20a%20company%20opts%20to,business%20back%20%243%2C600%20per%20year)), and OpenAI’s own pricing disclosures (GPT-4.5 has an API price of $75/1M input and $150/1M output. ChatGPT Plus users are going to get 5 queries per month with this level of pricing. : r/OpenAI). Where exact numbers are unknown, we’ve provided reasoned estimates and ranges based on these sources and current-generation hardware capabilities. This should give a representative picture of the cost landscape for running GPT-4.5 in a conversational setting.