Latency is the time gap between when a request is submitted and when the AI responds. It is determined by multiple factors: retrieval time, model size, the amount of text being processed, and where the infrastructure is located.
For real-time customer interactions, high latency breaks the experience. For batch processing, it affects throughput. An e-commerce company whose customer chat agent must respond within two seconds might use a smaller, faster model for initial classification and reserve a larger model only for the final response. Managing latency is a real engineering constraint, not just a performance footnote.