Cost vs Accuracy: The Tradeoffs Nobody Documents
Every ML system operates on a cost-accuracy frontier. Moving along that frontier is the core engineering challenge, but it's rarely discussed in technical content.
Model selection is the biggest lever. GPT-4 might give you 95% accuracy where GPT-4o-mini gives you 88% at roughly 20x lower cost. For many use cases, that 7-point gap doesn't matter to users — but the cost difference determines whether your feature is viable.
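One way to make this comparison concrete is to fold accuracy into price and look at cost per *correct* answer rather than cost per request. The numbers below are illustrative placeholders, not real pricing:

```python
# Hypothetical figures for illustration only — plug in your own
# measured accuracy and current provider pricing.
models = {
    "large-model": {"cost_per_1k_requests": 30.00, "accuracy": 0.95},
    "small-model": {"cost_per_1k_requests": 1.50, "accuracy": 0.88},
}

def cost_per_correct_answer(cost_per_1k: float, accuracy: float) -> float:
    """Cost to obtain one correct answer, folding accuracy into price."""
    return (cost_per_1k / 1000) / accuracy

for name, m in models.items():
    c = cost_per_correct_answer(m["cost_per_1k_requests"], m["accuracy"])
    print(f"{name}: ${c:.5f} per correct answer")
```

With these made-up numbers the smaller model is still ~18x cheaper per correct answer, which is the figure that actually belongs in a viability discussion.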
Caching is the second biggest lever. If 30% of your queries are semantically similar, a good semantic cache can serve that traffic at near-zero marginal cost with little to no accuracy loss. The investment in building the cache pays for itself within weeks.
Prompt engineering before fine-tuning. I've seen teams jump to fine-tuning when careful prompt engineering would have closed the gap. Fine-tuning has ongoing costs (data, training, evaluation, versioning) that prompt optimization doesn't.
Batch vs real-time inference. Not everything needs to be real-time. If you can batch process overnight, you can use larger models, longer prompts, and more sophisticated pipelines without the latency constraint.
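The mechanical part of moving work offline is trivial — chunk the backlog and process it on a schedule. A minimal sketch:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

def batches(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Split a backlog into fixed-size batches for offline processing."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# In an overnight job, each batch can go to a larger model with a
# longer prompt, since per-request latency no longer matters.
```

The real decision is upstream: classifying which requests actually need a real-time answer and which can tolerate an overnight turnaround.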
Document your tradeoffs. For every system I build, I maintain a simple table: approach, accuracy, cost per 1K requests, latency P95. This makes it trivial to justify decisions and evaluate alternatives.
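That table is simple enough to keep as code next to the system it describes. One possible shape, with the column names from above (the row values here are placeholders):

```python
from dataclasses import dataclass

@dataclass
class Tradeoff:
    approach: str
    accuracy: float       # fraction correct on your eval set
    cost_per_1k: float    # USD per 1,000 requests
    latency_p95_ms: int   # 95th-percentile latency

def to_markdown(rows: list[Tradeoff]) -> str:
    """Render the tradeoff table for a README or design doc."""
    lines = [
        "| approach | accuracy | cost / 1K req | latency P95 |",
        "|---|---|---|---|",
    ]
    for r in rows:
        lines.append(
            f"| {r.approach} | {r.accuracy:.0%} | "
            f"${r.cost_per_1k:.2f} | {r.latency_p95_ms} ms |"
        )
    return "\n".join(lines)

# Placeholder rows — replace with your own measurements.
rows = [
    Tradeoff("large model, no cache", 0.95, 30.00, 1800),
    Tradeoff("small model + semantic cache", 0.88, 1.10, 600),
]
print(to_markdown(rows))
```

Keeping it in code means the table can be regenerated from your eval pipeline instead of drifting out of date in a doc.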