
Last week I sat down to do something I should have done six months ago. I worked out what each of our backend models actually costs us per call. Not the marketing number on the pricing page. The real number, after the free credits run out.

I had been quoting "around a cent per call" to anyone who asked. Turns out that number is true for one of our three backends and very wrong for the other two. The gap between them is much bigger than I expected.

What follows is what I found. It is not a recommendation. The right answer depends on your data, your customers, and your tolerance for being wrong.


01 / Setup

The setup, briefly

InspectDoc is the fraud detection product we run at Dheemai. A user uploads a receipt or invoice. Our backend runs ten or so checks on the image. Math analysis, C2PA signature verification, EXIF metadata, error level analysis for tampering. If we still aren't sure, we send the image to a vision LLM for a final read.

The VLM is the expensive part. Everything else is local Python and finishes in under a second.

We have three backends wired up behind a single environment variable.

  • Gemini 2.5 Pro. Google's flagship vision model. Cloud API.
  • Qwen2.5-VL-72B via OpenRouter. Alibaba's 72-billion-parameter open-weights model, pay-per-token hosting.
  • Qwen2.5-VL-7B locally. Same architecture family, smaller (7B), running on a machine we control.
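The single-environment-variable switch can be sketched in a few lines. The variable name `INSPECTDOC_VLM_BACKEND` and the backend identifiers below are illustrative, not the actual configuration:

```python
import os

# Illustrative backend registry; names and model IDs are assumptions,
# not the real InspectDoc configuration.
BACKENDS = {
    "gemini-pro": {"provider": "google", "model": "gemini-2.5-pro"},
    "qwen-72b": {"provider": "openrouter", "model": "qwen/qwen2.5-vl-72b-instruct"},
    "qwen-7b-local": {"provider": "local", "model": "Qwen/Qwen2.5-VL-7B-Instruct"},
}

def pick_backend() -> dict:
    """Select the VLM backend from a single environment variable."""
    name = os.environ.get("INSPECTDOC_VLM_BACKEND", "gemini-pro")
    try:
        return BACKENDS[name]
    except KeyError:
        raise ValueError(f"Unknown VLM backend: {name!r}") from None
```

The point of the single switch is that everything downstream, including the cost comparison below, can be re-run by changing one value.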

Each call sends roughly the same thing. An image worth about 1,500 tokens, a prompt worth another 1,500, and we ask for around 500 tokens of structured JSON back.


02 / The Numbers

What it actually costs (ignoring startup credits)

Backend                   Input            Output           Per call   At 10k/mo
Gemini 2.5 Pro            $1.25 / 1M tok   $10.00 / 1M tok  ~$0.009    ~$90
Gemini 2.5 Flash          $0.30 / 1M tok   $2.50 / 1M tok   ~$0.002    ~$20
Qwen 72B (OpenRouter)     $0.20 / 1M tok   $0.20 / 1M tok   ~$0.0007   ~$7
Qwen 7B local (Apple M4)  n/a              n/a              $0         $0

That is the headline number everyone wants. Qwen 7B local is free at the model layer. Qwen 72B is roughly twelve times cheaper than Gemini Pro. Gemini Flash sits in between.
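The per-call figures in the table are just token counts times list prices. A quick sketch of the arithmetic, using the roughly 3,000 input and 500 output tokens described above:

```python
def per_call_cost(in_tokens: int, out_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one call given per-million-token list prices."""
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

# ~1,500 image tokens + ~1,500 prompt tokens in, ~500 tokens of JSON out.
IN_TOK, OUT_TOK = 3_000, 500

gemini_pro   = per_call_cost(IN_TOK, OUT_TOK, 1.25, 10.00)  # ~$0.009
gemini_flash = per_call_cost(IN_TOK, OUT_TOK, 0.30, 2.50)   # ~$0.002
qwen_72b     = per_call_cost(IN_TOK, OUT_TOK, 0.20, 0.20)   # ~$0.0007
```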

If you stop reading here, you will probably choose wrong.


03 / What the headline hides

What the per-call number leaves out

Fixed infrastructure

The 7B is free to call but it has to run somewhere. On my MacBook, "free". On a Cloud Run instance with an L4 GPU, around $520 a month always-on. That is roughly 58,000 Gemini Pro calls before the GPU breaks even, or about 740,000 OpenRouter Qwen 72B calls. Below those volumes, "free" is more expensive than the API.
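That break-even arithmetic is worth having as a one-liner:

```python
def break_even_calls(fixed_monthly_usd: float, api_cost_per_call: float) -> int:
    """Monthly call volume at which an always-on GPU matches pay-per-call pricing."""
    return round(fixed_monthly_usd / api_cost_per_call)

# The $520/month L4 GPU against the per-call prices from the table above.
break_even_calls(520, 0.009)    # ~58,000 calls vs. Gemini Pro
break_even_calls(520, 0.0007)   # ~743,000 calls vs. Qwen 72B on OpenRouter
```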

The hosted options have the opposite shape. Zero infrastructure, you pay only when you call. Predictable below 10,000 calls a month, less predictable above.

Accuracy on your data

I ran all three through 14 documents. Five fakes generated with ChatGPT and Gemini, nine real ones from Indian businesses. Subway, Swiggy, Amazon, KSTDC, an Indian Railways receipt, a couple of local pharmacy bills.

On this set Gemini Pro got 14 out of 14 right. Qwen 7B local I had to abandon midway. The model was loading in float32 because I had configured it that way two months ago and forgotten. At float32 the 7B is roughly 28 GB. My MacBook has 16. The machine was paging memory like it was 2008 and each inference was taking over five minutes.

This says more about my configuration than about Qwen. With bfloat16 on a proper GPU it should run fine, and I will re-run the test that way. The point is just that "free" comes with a setup tax that the per-call price does not show.

For the Qwen 72B numbers via OpenRouter I have not yet finished a full 14-file run on the InspectDoc pipeline. I will publish that separately.

Cost of a wrong answer

When we flag a genuine receipt as fake, the customer writes in, somebody on my team replies, we re-run, we apologise, sometimes we comp the call. Easily half an hour of attention. At any reasonable cost-of-time number that is worth thousands of API calls regardless of which provider you picked.

So the comparison is not "which one is cheapest". It is "how often does each one get the answer wrong, and what does each wrong answer cost me". That second number is the one most pricing pages do not help you with.
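One way to make that concrete is to fold the error cost into an expected per-call number. The error rates and the $30-per-incident figure below are illustrative assumptions, not measurements from our pipeline:

```python
def effective_cost_per_call(api_cost: float, error_rate: float,
                            cost_per_error_usd: float) -> float:
    """Expected total cost of one call once wrong answers are priced in."""
    return api_cost + error_rate * cost_per_error_usd

# Hypothetical: even a 1% false-flag rate at $30 of support time per
# incident dwarfs the raw API price of either backend.
effective_cost_per_call(0.009, 0.01, 30.0)    # expensive model, fewer errors
effective_cost_per_call(0.0007, 0.02, 30.0)   # cheap model, more errors
```

With those made-up rates the "expensive" model comes out roughly half the price of the "cheap" one, which is the whole point of the section.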

Match to your domain

Different models are stronger on different kinds of documents. Indian documents (GSTIN format, regional language receipts, government forms) are where we live, and the three models perform differently on them. English-only receipts, American business forms, European invoices will all skew the comparison in different directions. Run your own evaluation set before trusting anyone else's.

The upshot

The cheap model is rarely the cheap model once you add fixed infrastructure, the cost of a wrong answer, and the fit to your specific document mix. The right comparison is total cost of being right, not dollars per API call.

"The comparison is not which one is cheapest. It is how often does each one get the answer wrong, and what does each wrong answer cost me." On choosing a VLM in production

04 / Infrastructure

What about the infrastructure underneath

The compute the backend itself runs on is roughly constant. You pay for it whether you serve zero calls or fifty thousand.

  • AWS ECS Fargate, ap-south-1. A 0.5 vCPU / 1 GB task works out to about $20 a month. Good for flat traffic.
  • GCP Cloud Run, asia-south1, pay-per-request. Cheaper than Fargate if your traffic is bursty. More expensive if it is flat. Scale-to-zero possible.
  • GCP Cloud Run with an L4 GPU for running a local Qwen in production. $520 a month always-on, or scale-to-zero with a cold start penalty of a few minutes while the model loads.

The $520 GPU number is the one that ends most "should we self-host" conversations until volume justifies it.
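Putting the fixed and variable pieces together, the monthly bill is one line of arithmetic. The pairings below are illustrative:

```python
def monthly_cost(calls: int, api_cost_per_call: float, fixed_infra_usd: float) -> float:
    """Total monthly spend: per-call API cost plus always-on infrastructure."""
    return calls * api_cost_per_call + fixed_infra_usd

# At 10k calls/month, using the numbers from the sections above:
monthly_cost(10_000, 0.009, 20)    # Gemini Pro on a small Fargate task
monthly_cost(10_000, 0.0, 520)     # local 7B on an always-on L4 GPU
```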


05 / What we run

How we run it today

Not because it is the right answer for everyone, but as a data point.

  • Cloud VLM as the default, with a second cloud VLM wired up as automatic fallback when the first one returns a 503 or hits quota.
  • Local model in the codebase but not on by default. We use it for a customer who specifically wants air-gapped, and they run the GPU.
  • A cheaper, faster model from the same family held in reserve. If we ever go viral, we degrade to it before we throttle anyone.
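The cloud-primary, cloud-fallback pattern reduces to a loop over backends. This is a hedged sketch: `call_vlm` stands in for whatever client each backend uses, and the error handling is illustrative rather than the actual InspectDoc code:

```python
def classify_with_fallback(image: bytes, call_vlm, backends: list[str]) -> dict:
    """Try each backend in order; fall through on provider errors (503, quota)."""
    last_error = None
    for name in backends:
        try:
            return call_vlm(name, image)
        except Exception as exc:  # e.g. a 503 or a quota error from the provider
            last_error = exc
    raise RuntimeError("All VLM backends failed") from last_error
```

The degrade-before-throttle option in the last bullet is the same loop with the cheaper model appended to the end of the list.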

A different team with English-only documents and 100,000 calls a month would probably make different choices. A team with one tightly regulated customer who needs everything on-prem would make different choices again. The numbers in the table above are the same. What you do with them is not.

Manjula Sridhar
