SFDC Developers
Agentforce & AI

Scaling AI Workloads 5x with a Distributed Persistent Queue

Vinay Vernekar · 3 min read

Orchestrating AI at Scale

Salesforce's Agentforce Lead Nurturing team faced a significant engineering challenge: how to allow autonomous agents and human workflows to coexist without exceeding shared LLM rate limits. When thousands of leads are processed simultaneously, naive execution leads to infrastructure failures and broken engagement chains.

To solve this, the engineering team developed a distributed persistent queue. This layer acts as an orchestration engine, shifting work from "execute-now" to a controlled, capacity-aware dispatch model.

Solving Capacity Contention

AI agents and human sellers share a 300 RPM LLM limit, but their usage patterns differ significantly. The queue handles this via two primary strategies:

  • Fair-Share Distribution: By grouping work by execution context (e.g., bot user or individual owner), the system employs a round-robin strategy to ensure no single entity monopolizes shared capacity.
  • Three-Tier Priority Queue: The system classifies outreach into three tiers:
    1. High Priority: Reply emails (active conversations).
    2. Medium Priority: Introductory emails.
    3. Low Priority: Nudge emails.
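The fair-share strategy can be sketched in a few lines. This is a minimal illustration, not the production implementation: the `(context_id, payload)` job shape and the flat job list are assumptions made for the example.

```python
from collections import defaultdict, deque

def fair_share_dispatch(jobs, capacity):
    """Round-robin across execution contexts so no single context
    (bot user or individual owner) monopolizes shared capacity.

    `jobs` is a list of (context_id, payload) tuples; `capacity`
    is the number of slots available this dispatch cycle.
    """
    # Group pending work by execution context.
    by_context = defaultdict(deque)
    for context_id, payload in jobs:
        by_context[context_id].append(payload)

    dispatched = []
    contexts = deque(by_context)
    # Take one job per context per pass until capacity is exhausted.
    while contexts and len(dispatched) < capacity:
        ctx = contexts.popleft()
        dispatched.append(by_context[ctx].popleft())
        if by_context[ctx]:
            contexts.append(ctx)  # context still has pending work
    return dispatched
```

Even if one context floods the queue, each pass hands out at most one slot per context, so starvation cannot occur.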

To ensure maximum throughput, the engine uses adaptive backfill. If high-priority slots are not fully utilized in a dispatch cycle, the system fills remaining capacity with lower-priority work, preventing wasted cycles.
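The tiering and backfill logic might look like the following sketch, where the `queues` and `slots` shapes are illustrative assumptions rather than the team's actual schema:

```python
def dispatch_cycle(queues, slots):
    """Three-tier priority dispatch with adaptive backfill.

    `queues` maps tier -> list of pending jobs; `slots` maps
    tier -> capacity reserved for that tier per cycle.
    """
    order = ["high", "medium", "low"]  # reply > intro > nudge
    picked, leftover = [], 0
    for tier in order:
        quota = slots[tier] + leftover   # backfill unused higher-tier slots
        taken = queues[tier][:quota]
        picked.extend(taken)
        leftover = quota - len(taken)    # carry spare capacity downward
        queues[tier] = queues[tier][quota:]
    return picked
```

If the high-priority queue is nearly empty, its unused quota flows down to medium and then low priority, so a dispatch cycle never wastes capacity that lower-priority work could consume.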

Adaptive Rate Modulation

Traditional retry mechanisms often exacerbate cascading failures. Instead, the team implemented proactive capacity-based admission control.

Before any dispatch, the system checks real-time organizational usage. If usage nears infrastructure limits, the dispatcher automatically scales down volume for that cycle. This prevents rate-limit violations before they occur, protecting critical multi-step AI conversations.
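The core of admission control is a simple budget calculation. The 300 RPM limit comes from the article; the `headroom` safety margin is an assumed parameter added for illustration:

```python
def admitted_batch_size(requested, current_rpm, limit_rpm=300, headroom=0.9):
    """Proactive, capacity-based admission control: shrink this
    cycle's dispatch volume to fit under the shared LLM rate limit,
    instead of retrying after a violation has already occurred.
    """
    # Reserve headroom below the hard limit, then subtract live usage.
    budget = int(limit_rpm * headroom) - current_rpm
    return max(0, min(requested, budget))
```

Because the check happens before dispatch, a cycle that would have blown the limit is simply scaled down (possibly to zero), and in-flight multi-step conversations keep their capacity.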

Dual-Path Architecture for Human-in-the-Loop

To maintain compliance while supporting autonomous agents, the team designed a dual-path pipeline:

  1. Autonomous Path: Standard generation and immediate dispatch, subject to real-time capacity checks.
  2. Human-Review Path:
    • Drafting occurs during the initial phase.
    • Messages are stored in a "queued-approved" state.
    • Upon human approval, the system schedules delivery, bypassing the expensive generation step to avoid unnecessary LLM overhead.
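The key property of the review path is that approval never re-invokes the LLM. A minimal sketch, in which the state names, the `ReviewPipeline` class, and the injected `generate` callable are all illustrative assumptions:

```python
from enum import Enum

class MessageState(Enum):
    QUEUED_APPROVED = "queued-approved"
    SCHEDULED = "scheduled"

class ReviewPipeline:
    """Human-review path: generation runs once at draft time;
    approval only flips state and schedules delivery."""

    def __init__(self, generate):
        self.generate = generate  # stand-in for the LLM generation call
        self.messages = {}
        self.llm_calls = 0

    def draft(self, msg_id, prompt):
        # Expensive LLM generation happens here, exactly once.
        self.llm_calls += 1
        self.messages[msg_id] = {
            "body": self.generate(prompt),
            "state": MessageState.QUEUED_APPROVED,
        }

    def approve(self, msg_id):
        # Approval schedules delivery without regenerating the draft.
        msg = self.messages[msg_id]
        assert msg["state"] is MessageState.QUEUED_APPROVED
        msg["state"] = MessageState.SCHEDULED
        return msg["body"]
```

Storing the draft in a "queued-approved" state is what decouples generation from dispatch: the reviewer's approval is a cheap state transition, not a second LLM round trip.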

Performance Gains

By optimizing database query patterns and implementing per-user capacity tracking, the team achieved a 5x increase in throughput. Queue operations use constant-time database access, so performance remains predictable even as queue depth fluctuates.
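Per-user capacity tracking can stay constant-time by keeping an in-flight counter per user rather than scanning the queue. A sketch under assumed names; the per-user cap value is an example, not a figure from the article:

```python
from collections import Counter

class PerUserCapacity:
    """O(1) reserve/release per user, so the dispatcher's capacity
    check does not slow down as the queue grows."""

    def __init__(self, per_user_cap=10):
        self.cap = per_user_cap
        self.in_flight = Counter()

    def try_reserve(self, user_id):
        # Constant-time check against this user's in-flight count.
        if self.in_flight[user_id] >= self.cap:
            return False
        self.in_flight[user_id] += 1
        return True

    def release(self, user_id):
        # Called when a dispatched job completes or fails.
        self.in_flight[user_id] -= 1
```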

Key Takeaways

  • Orchestration beats execution: Moving from immediate execution to a persistent queue allows for sophisticated flow control in AI systems.
  • Prioritize context, not just tasks: Using round-robin distribution across execution contexts prevents resource starvation.
  • Proactive vs. Reactive: Use capacity-based admission control rather than relying on error-prone retry logic.
  • Dual-path processing: Decouple generation from dispatch to accommodate human-in-the-loop requirements without sacrificing LLM efficiency.
