It was summer 2024, and "2025, the year of agents" felt just around the corner. Contemporary language agents were untrained: a foundation model such as Claude 3.5 connected to a task-specific toolset.
Aviary began with the goal of actually training our agents, improving their ability to complete scientific tasks beyond the skills of foundation models wrapped in a simple tool-calling loop. We were also aware of the lack of an overarching mathematical framework for language agents and tools: every agent paper featured a flowchart of emoji entities.
Aviary and LDP (Language Decision Process) were sister frameworks built concurrently within FutureHouse to address these problems and actually train agents.
- Aviary models stochastic environments designed for interaction with language agents. We included high-quality implementations of five environments for scientific tasks such as literature-based question answering.
- LDP connects trainable agents and stochastic environments together in what we called a "language decision process", a special case of a partially observable Markov decision process (POMDP) in which actions and observations are natural language. Agents are implemented as a stochastic compute graph traversable by an optimizer, and house learnable behaviors such as language models, memories, or prompts. (A minimal sketch of the resulting agent-environment loop follows this list.)
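To make the paradigm concrete, here is a minimal sketch of an agent-environment rollout where observations and actions are messages. The class and method names are illustrative, not the actual Aviary/LDP API:

```python
# Illustrative sketch only; names do not reflect the actual Aviary/LDP API.
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # e.g. "user", "assistant", "tool"
    content: str   # natural-language text or a serialized tool call


class Environment:
    """A stochastic, task-specific environment that speaks in messages."""

    def reset(self) -> list[Message]:
        raise NotImplementedError

    def step(self, action: Message) -> tuple[list[Message], float, bool]:
        """Execute the action; return (observations, reward, done)."""
        raise NotImplementedError


class Agent:
    """Maps the message history to the next action, e.g. via an LLM call."""

    def act(self, history: list[Message]) -> Message:
        raise NotImplementedError


def rollout(agent: Agent, env: Environment, max_steps: int = 10) -> float:
    """One episode of a language decision process."""
    history = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(history)
        observations, reward, done = env.step(action)
        history += [action, *observations]
        total_reward += reward
        if done:
            break
    return total_reward
```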
Our paper was accepted to ICLR 2025's Scaling Self-Improving Foundation Models workshop and featured in Jack Clark's Import AI newsletter.
## Findings
Aviary and LDP bridge the gap between reinforcement learning and language agents and environments. This is made practical by our open-source Python frameworks and the accompanying environment and agent implementations. Other notable findings include:
- Expert iteration is a simple and effective method for training agents (see the sketch after this list).
- Trained open-weight agents can match or exceed the performance of closed frontier models at less than 1% of the inference cost.
- Majority voting remains applicable to language decision processes, unlocking an additional 10% accuracy over majority voting with the base LLM (also sketched after this list).
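Expert iteration, referenced above, can be summarized in a few lines. This is a hedged sketch, not the paper's exact recipe: `sample_trajectory` and `fine_tune_on` are hypothetical helpers standing in for trajectory collection (e.g. via the `rollout` loop above) and supervised fine-tuning:

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    messages: list        # interleaved observations and actions
    total_reward: float


def expert_iteration(agent, env, rounds: int = 3, samples: int = 256,
                     threshold: float = 1.0):
    """Sample trajectories, keep the successful ones, behavior-clone, repeat."""
    for _ in range(rounds):
        # 1. Sample trajectories with the current agent.
        trajectories = [sample_trajectory(agent, env) for _ in range(samples)]
        # 2. Keep only high-reward ("expert") trajectories.
        experts = [t for t in trajectories if t.total_reward >= threshold]
        # 3. Fine-tune the agent's LLM on the expert trajectories
        #    (fine_tune_on is a hypothetical SFT helper).
        agent = fine_tune_on(agent, experts)
    return agent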
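Majority voting over a language decision process amounts to running k independent episodes and taking the most common final answer. In this sketch, `final_answer` is a hypothetical helper extracting the agent's terminal answer from a trajectory:

```python
from collections import Counter


def majority_vote(agent, make_env, k: int = 16) -> str:
    """Run k independent rollouts; return the most common final answer."""
    answers = [
        final_answer(sample_trajectory(agent, make_env()))  # hypothetical helpers
        for _ in range(k)
    ]
    return Counter(answers).most_common(1)[0][0]
```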
One LDP feature we did not fully explore was the utility of backing agents with a stochastic compute graph. We experimented with prompt and long-term memory optimization, but did not have the time to fully bake these techniques.
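For intuition, here is a toy illustration (again, not LDP's actual API) of what a stochastic compute graph with a learnable prompt might look like: the agent is a traversable structure of ops, and an optimizer could rewrite the `template` field while leaving the stochastic LLM-call op fixed:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptOp:
    template: str  # learnable: an optimizer may rewrite this string

    def __call__(self, question: str) -> str:
        return self.template.format(question=question)


@dataclass
class LLMCallOp:
    call_llm: Callable[[str], str]  # stochastic: e.g. a sampled LLM completion

    def __call__(self, prompt: str) -> str:
        return self.call_llm(prompt)


@dataclass
class AgentGraph:
    prompt_op: PromptOp
    llm_op: LLMCallOp

    def forward(self, question: str) -> str:
        # An optimizer can traverse these ops and update the learnable
        # pieces (here, prompt_op.template) based on downstream rewards.
        return self.llm_op(self.prompt_op(question))
```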
As of summer 2025, Aviary and LDP are actively maintained, and the paradigm they embody remains relevant. Going forward, it will be interesting to see whether:
- State space: code-writing systems such as Cursor or Claude Code move away from a message-based primitive.
- Action space: new agent approaches move beyond function calling. For example, Biomni uses code execution instead of standalone function calling.