Merge pull request #125 from plastic-labs/ben/conclusion

conclusion
Courtland Leer 2025-12-19 14:10:37 -05:00 committed by GitHub
commit e614e2cf9c


@@ -33,7 +33,7 @@ We also tune Honcho for various use cases. For example, the message batch size w
We currently use 3 different benchmarks to evaluate Honcho: [LongMem](https://arxiv.org/abs/2410.10813), [LoCoMo](https://arxiv.org/abs/2402.17753), and [BEAM](https://arxiv.org/abs/2510.27246).
## LongMem
LongMem S is a data set containing 500 "needle in a haystack" questions, each with about 550 messages distributed over 50 sessions, totaling ~115,000 tokens of context per question.
After ingesting this context, a single query is made and judged. The correct answer hinges on information divulged in one or a handful of messages: these are the "needles." Everything else is "hay." The questions come in six flavors:
@@ -83,7 +83,7 @@ After over a year, this is no longer the case. Gemini 3 Pro can run LongMem S, e
Honcho passes both of these tests. Running LongMem S directly with Gemini 3 Pro costs about \$115 for input tokens alone (the relevant part for retrieval -- output tokens don't really change). Honcho with the same model had a mean token efficiency of 16% -- bringing input cost down to \$18.40. Adding the cost of running Honcho's ingestion system with Gemini 2.5 Flash-Lite, a model quite effective for the task, brings total cost up to \$47.15 -- a **60% cost reduction**. The Honcho managed service *does not charge for ingestion* -- we operate our own fine-tuned models for the task. For more discussion of cost efficiency, see section 3.
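The 60% figure follows directly from the numbers above. A quick sketch of the arithmetic (the Flash-Lite ingestion cost is derived from the quoted total rather than priced independently):
```python
# Cost arithmetic for the LongMem S run above; all dollar figures are from the text.
baseline_input_cost = 115.00              # Gemini 3 Pro reading the full context directly
token_efficiency = 0.16                   # Honcho surfaces ~16% of the raw tokens

retrieval_cost = baseline_input_cost * token_efficiency    # ≈ $18.40 of input tokens
ingestion_cost = 47.15 - retrieval_cost                    # ≈ $28.75 for Flash-Lite ingestion
total_with_honcho = retrieval_cost + ingestion_cost        # $47.15

savings = 1 - total_with_honcho / baseline_input_cost
print(f"retrieval ≈ ${retrieval_cost:.2f}, total ≈ ${total_with_honcho:.2f}, savings ≈ {savings:.0%}")
# retrieval ≈ $18.40, total ≈ $47.15, savings ≈ 59% (rounded to 60% above)
```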
## LoCoMo
We stated regarding LongMem that it "does not test a memory system's ability to recall across truly large quantities of data": this is even more the case for LoCoMo. It takes a similar format to LongMem, but instead of 115,000 tokens per question, it provides a meager 16,000 tokens of context on average. Then, each of these 16k token conversations has a battery of 100 or more questions applied to them.
Given that models routinely offer a context window of 200,000 or more tokens nowadays, a 16,000 token conversation really isn't useful at all for evaluating a memory framework.
@@ -108,7 +108,7 @@ BEAM's judge is thoroughly defined, including a rubric, tool calls, and detailed
BEAM scoring is different from LongMem and LoCoMo: rather than setting a pass/fail criterion and scoring the overall test by pass rate, BEAM's judge grades each question individually, and the overall test grade is the average of these scores. The LLM judge is instructed to, and naturally leans towards, grading in a step-function pattern: each question's rubric makes it relatively easy to "pass" with a 0.5, and quite difficult to "ace" the question and score 1.0. A score of 0.5 would count as a "pass" in both the LongMem and LoCoMo judge systems. This property gives BEAM scores a much higher ceiling of excellence, and testing Honcho with BEAM has given insight into how we can improve our system beyond just "perfect recall."
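To make the difference concrete, here is an illustrative sketch (with made-up per-question grades) of how the same judge outputs yield a pass rate under the LongMem/LoCoMo convention versus a mean rubric score under BEAM's:
```python
# Illustrative only: contrast pass/fail scoring (LongMem/LoCoMo style) with
# BEAM's mean-of-rubric-scores, using hypothetical per-question grades.
rubric_scores = [1.0, 0.5, 0.5, 0.0, 0.75, 0.5]   # judge grades each question in [0, 1]

# LongMem/LoCoMo style: 0.5 or better counts as a pass; the test score is the pass rate.
PASS_THRESHOLD = 0.5
pass_rate = sum(s >= PASS_THRESHOLD for s in rubric_scores) / len(rubric_scores)

# BEAM style: the test score is the mean of the per-question grades, so partial
# credit still separates "passing" a question (0.5) from "acing" it (1.0).
beam_score = sum(rubric_scores) / len(rubric_scores)

print(f"pass rate = {pass_rate:.2f}, BEAM-style mean = {beam_score:.2f}")
# pass rate = 0.83, BEAM-style mean = 0.54
```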
Honcho produces state-of-the-art scores according to the judgment framework proposed by the paper. On the smallest test, BEAM 100K, we observe a baseline score of **0.53** from Claude Haiku 4.5, and a Honcho score of **0.63**. Since Haiku only has a context window of 200,000 tokens, the baseline scores stop there. But Honcho continues:
| BEAM | Top Score in Paper | Honcho Score | Dreaming | Batching |
|---|---|---|---|---|
@@ -156,6 +156,14 @@ Over the course of just a few days, a user chatting regularly with an AI assista
A matchmaking system with thousands of users wants to use Claude Opus 4.5 to create high-quality "date me" docs for each user who has volunteered their email inbox as source material to extract their lifestyle and preference data. Rather than having Opus churn through 10M+ tokens per inbox, costing \$50+ per user, use Honcho to ingest the data and perform a dozen targeted chat queries: using the same configuration as the BEAM 10M benchmark run above, this would cost about \$6.
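A rough sketch of that per-user math, assuming an Opus 4.5 input price of about \$5 per million tokens (consistent with the \$50+ figure above) and taking the ~\$6 per-user Honcho cost from the BEAM 10M configuration; the fleet size is hypothetical:
```python
# Back-of-the-envelope fleet cost; the $5/MTok price and 5,000-user fleet are assumptions.
INBOX_TOKENS = 10_000_000                  # ~10M tokens of email per user
OPUS_INPUT_PRICE_PER_MTOK = 5.00           # assumed Opus 4.5 input list price
HONCHO_PER_USER = 6.00                     # from the BEAM 10M configuration above
USERS = 5_000                              # hypothetical fleet size

direct_per_user = INBOX_TOKENS / 1_000_000 * OPUS_INPUT_PRICE_PER_MTOK   # ≈ $50
print(f"direct: ${direct_per_user * USERS:,.0f} vs Honcho: ${HONCHO_PER_USER * USERS:,.0f}")
# direct: $250,000 vs Honcho: $30,000
```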
## Conclusion
We're glad to hit state-of-the-art numbers—they validate that our architecture works—but scores on LongMem or LoCoMo are not the final goal. And it turns out that these benchmarks are starting to lead people astray from what agent memory really means. Other memory systems, focused only on recall, are beginning to *underperform* baseline with the latest models on small-context tests. The good news, for us and for agent builders, is that we're only interested in recall as a step towards true theory-of-mind simulation. Getting agents to correctly model users and other agents is far more of a roadblock to production-ready agents than simple recall. The goal is to build systems that form perfect representations of identity through [logical reasoning over social information](https://blog.plasticlabs.ai/blog/Memory-as-Reasoning), not systems that can regurgitate facts from a conversation history. We're looking for answers to questions like: What does this person *want*? What do they *think about*? What would *surprise* them? Benchmarks don't test for that.
BEAM 10M proves Honcho can reason over token counts that exceed any model's context window. That unlocks use cases that were previously impossible: agents with years of continuous memory, products that actually know their users, AI that improves its model of you faster than you update your model of yourself. We'll keep publishing benchmark results as we improve, but we'd rather you judge Honcho by what it enables you to build. Try the [managed service](https://honcho.dev), dig into the [open source](https://github.com/plastic-labs/honcho), and let us know what works and what doesn't.
[^1]: A theme throughout these benchmarks is the use of an LLM judge. All scores *must* be considered with variance in mind: not only is the model answering the question non-deterministic, so too is the judge model (yes, the judge prompt is run at temperature 0, but no, this does not equal determinism; plus, minute differences in the wording of the answer being judged can trigger large changes even at temperature 0). These non-deterministic data sources combine to form fairly high variance.
[^2]: In some cases the answer provided in the benchmark is arguably wrong. See the issues section of the [LongMem GitHub repo](https://github.com/xiaowu0162/LongMemEval).