Merge branch 'v4' into ben/conclusion
commit 1b33cdfd52
@ -149,7 +149,7 @@ easily be ported over to the `Peer` paradigm by simply creating a `Peer` for the
agent, and then different `Peers` for each human user.
We can push the Peer Paradigm even further with several 2nd-order features.
## Local & Global Representations
## Scoped Representations
By default, Honcho will create representations of `Peers` for every `Message` they
send, giving it the source of truth on the behavior of that entity. However,
there are situations where a developer would only want a `Peer` to have access to
@ -160,8 +160,7 @@ would want to create its own model of every other player to try and guess their
next move. Take another example of the game _Diplomacy_, which involves players
having private conversations along with group ones. It wouldn’t make sense for a
`Peer` “Alice” to be able to chat with a representation of another `Peer` “Bob” that
knew about all of “Alice’s” secret conversations. Enabling local representations
is as easy as changing a configuration value.
knew about all of “Alice’s” secret conversations. Enabling perspective-based representations is as easy as changing a configuration value.
```python
from honcho import Honcho
@ -190,7 +189,7 @@ charlie.chat("Can I trust that Alice won't attack me", target=alice)
```
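
To make the configuration step concrete, here is a minimal sketch of how enabling perspective-scoped representations might look. Only `from honcho import Honcho` and `charlie.chat("Can I trust that Alice won't attack me", target=alice)` appear in the post itself; the client constructor, the `peer(...)` helper, and the `observe_others` flag are illustrative assumptions, not the documented API.

```python
from honcho import Honcho

honcho = Honcho()  # hypothetical default client construction

# Assumed helper for registering peers; names are illustrative.
alice = honcho.peer("alice")
bob = honcho.peer("bob")

# Assumed configuration flag: let Charlie build his own local models of other
# peers from only the messages he has actually observed.
charlie = honcho.peer("charlie", config={"observe_others": True})

# Charlie queries his local representation of Alice, which knows nothing about
# Alice's private conversations with anyone else.
charlie.chat("Can I trust that Alice won't attack me", target=alice)
```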
Honcho can now serve the dual purposes of containing the source of truth on a
`Peer`'s identity and imbuing a `Peer` with social cognition, all without
`Peer`'s identity and imbuing a `Peer` with continual learning, all without
duplicating data between different `Apps` or `Workspaces`.
## Get_Context
We make mapping the Peer Paradigm back to the User-Assistant paradigm trivial
@ -225,8 +224,7 @@ anthropic_messages = context.to_anthropic(assistant=alice)
```
Developers no longer need to meticulously curate their context windows. Honcho will automatically summarize the conversation and provide
the most salient information to let conversations continue endlessly.
Developers no longer need to meticulously curate their context windows. Honcho will automatically summarize the conversation and provide the most salient information to let conversations continue endlessly.
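
A minimal usage sketch of that mapping; `to_anthropic(assistant=alice)` is taken from the post above, while the `session` object and its `get_context()` method are assumptions inferred from the section title.

```python
# `session` and `alice` are assumed to come from the Honcho setup sketched earlier.

# Hypothetical: fetch a summarized, token-bounded context for the session.
context = session.get_context()

# Collapse the Peer Paradigm back into user/assistant roles: `alice` is cast as
# the assistant, and every other peer's messages become user turns.
anthropic_messages = context.to_anthropic(assistant=alice)

# `anthropic_messages` can then be passed as the `messages` argument of an
# Anthropic messages.create(...) request.
```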
# What's Now Possible
The Peer Paradigm provides the essential primitives—persistent identity and direct communication—that make it possible to build truly sophisticated multi-agent systems:
@ -18,35 +18,35 @@ description: Plastic Labs announces $5.4M pre-seed funding & launches Honcho as
*We're granting early access to power personal context management for AI agents & applications starting today!*
*Honcho is now a simple, complete, hosted solution for adaptive agent memory, social cognition, & personalization.*
*Honcho is now a simple, complete, hosted solution for adaptive agent memory, reasoning, & personalization.*
2. **Our pre-seed raise of $5.4M to solve personal identity for the agentic world.**
# Individual Alignment
Most AI products focus on being palatable to the average user. This neglects the potential for personalization their generative nature affords. It limits the scope of personally useful behaviors and results in poor UX, high churn, and handicapped abilities.
AI systems need mechanisms to understand each of us on an individual level. They need methods for cohering to our psychology and personality. They need social cognition to eliminate cold starts and build long-term relationships.
AI systems need mechanisms to understand each of us on an individual level. They need methods for cohering to our psychology and personality. They need continual learning to eliminate cold starts and build long-term relationships.
They need Honcho.
# Honcho Platform Early Access
Today we're launching early access to the hosted [Honcho](https://honcho.dev) platform.
It's the most powerful personal identity and social cognition solution for AI apps and agents.
It's the most powerful continual learning memory solution for AI apps and agents.
Honcho is a cloud-based API that enables more personalized and contextually aware user experiences. It simplifies the process of maintaining context across conversations and interactions, allowing developers to create more responsive and customized agents without managing complex infrastructure.
Honcho combines flexible memory, [[ARCHIVED; Theory of Mind Is All You Need|theory of mind]] inference, self-improving user representations, and a [[ARCHIVED; Introducing Honcho's Dialectic API|dialectic API]] to get your application the context it needs about each user for every inference.
Honcho combines [[Memory as Reasoning|reasoning]], self-improving [[Beyond the User-Assistant Paradigm; Introducing Peers|peer]] representations, and both custom and opinionated retrieval methods to get your application the context it needs about each user for every inference.
All this happens ambiently, with no additional overhead to your users--no surveys, no hard coded questions, no BYO data requirements needed to get started. Honcho learns about each of your users in the background as they interact with your application.
When your agent needs information about a user, it simply asks and Honcho responds with the right personal context--in natural language--which can be injected into any part of your architecture.
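
As a rough sketch of that flow, borrowing the peer-based `chat(..., target=...)` call from the Peers post earlier in this diff; the peer objects and prompt assembly here are illustrative assumptions.

```python
# `assistant_peer` and `user_peer` are assumed Honcho peer objects for the
# agent and the end user respectively.
insight = assistant_peer.chat(
    "What learning style does this user respond to best?",
    target=user_peer,
)

# Inject the natural-language answer anywhere in the architecture, e.g. the
# system prompt of the next completion call.
system_prompt = f"You are a tutor. What Honcho knows about this user: {insight}"
```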
Context from Honcho is far richer than simply retrieving over session data or cramming it into the context window because Honcho is always running theory of mind inference over that organic data. It's expert at deriving everything there is to infer about a user from their inputs.
Context from Honcho is far richer than simply retrieving over session data or cramming it into the context window because Honcho is always reasoning over that organic data. It's expert at reasoning toward everything there is to conclude about a user from their inputs.
The result is a living, thinking reservoir of synthetic data about each user. Honcho gets to the bottom of up-to-date user preferences, history, psychology, personality, values, beliefs, and desires. It maps personal identity. This creates a self-improving representation of each user that transcends the raw data in information density and furnishes much more robust and useful context to your app when it needs it.
To put it simply, this creates magical experiences for users that they don't even know to expect from AI applications. Honcho-powered agents retain state, adapt over time, build relationships, and evolve with their users.
That's why Honcho needed to be built. It's memory infrastructure that goes way deeper than anything else on offer. Theory of mind and identity mapping are tasks to optimize for, requiring serious machine learning, expertise in the cognitive sciences, and AI-native solutions.
That's why Honcho needed to be built. It's memory infrastructure that goes way deeper than anything else on offer. Continual learning and identity mapping are tasks to optimize for, requiring serious machine learning, expertise in the cognitive sciences, and AI-native solutions.
If you want to deliver best-in-class personalization, memory, time-to-value, trust, and unlock truly novel experiences to your users, we want to work with you.
@ -57,7 +57,7 @@ We're giving early access to teams & developers today.
^d958ce
The release of Honcho as a platform is just the start; the next step is Honcho as a network.
An engine for social cognition and deeply grokking personal identity is a game changing tool for AI apps, but owning your personal Honcho representation and taking it with you to every agent in your growing stack is world changing.
An engine for deeply grokking personal identity is a game changing tool for AI apps, but owning your personal Honcho representation and taking it with you to every agent in your growing stack is world changing.
It's what's required to truly realize Plastic's mission to decentralize alignment--to give every human access to personally aligned, scalable intelligence.
@ -97,7 +97,7 @@ Instead of imposing opaque alignment schemes, we should be subverting the proble
If you want to work [IRL in NYC](https://www.therefineryatdomino.com/) on the most impactful and important work in artificial intelligence, we're hiring for four more roles immediately:
- [[Founding ML Engineer]] - Shape the future of AI at Plastic, tackle challenges across the ML stack, train cutting edge theory of mind models
- [[Founding ML Engineer]] - Shape the future of AI at Plastic, tackle challenges across the ML stack, train cutting edge reasoning models
- [[Platform Engineer]] - Build & scale Honcho's infrastructure, define performance & security for the future of AI personalization
@ -1,21 +1,20 @@
---
title: Benchmarking Honcho
subtitle: Honcho Achieves SOTA Scores on Benchmarks–So What?
date: 12.19.25
tags:
- announcements
- dev
- honcho
- benchmarks
- evals
- state-of-the-art
author: Ben McCormick & Courtland Leer
subtitle: Honcho Achieves SOTA Scores on Benchmarks–So What?
description: Discussing the latest Honcho Benchmark results and what they mean for state of the art agent memory+reasoning.
description: Honcho achieves state-of-the-art performance and Pareto dominance across the LongMem, LoCoMo, and BEAM memory benchmarks.
---
# TL;DR
*Honcho achieves state-of-the-art performance across the LongMem, LoCoMo, and BEAM memory benchmarks. 90.4% on LongMem S (92.6% with Gemini 3 Pro), 89.9% on LoCoMo ([beating our previous score of 86.9%](https://blog.plasticlabs.ai/research/Introducing-Neuromancer-XR)), and top scores across all BEAM tests. We do so while maintaining competitive token efficiency.
**TL;DR: Honcho achieves state-of-the-art performance across the LongMem, LoCoMo, and BEAM memory benchmarks. 90.4% on LongMem S (92.6% with Gemini 3 Pro), 89.9% on LoCoMo ([beating our previous score of 86.9%](https://blog.plasticlabs.ai/research/Introducing-Neuromancer-XR)), and top scores across all BEAM tests. We do so while maintaining competitive token efficiency. But recall tested in benchmarks which fit within a context window is no longer particularly meaningful. Beyond simple recall, Honcho reasons over memory and empowers frontier models to reason across more tokens than their context windows support. Go to [evals.honcho.dev](https://evals.honcho.dev) for charts and comparisons.**
## 1. A primer on Honcho's architecture
But recall tested in benchmarks which fit within a context window is no longer particularly meaningful. Beyond simple recall, Honcho reasons over memory and empowers frontier models to reason across more tokens than their context windows support. Go to [evals.honcho.dev](https://evals.honcho.dev) for charts and comparisons.*
# 1. A primer on Honcho's architecture
Read [Honcho's documentation](https://docs.honcho.dev) for a full understanding of how Honcho works, but a brief overview is important for understanding our benchmarking methodology and how Honcho achieves state-of-the-art results:
@ -30,15 +29,11 @@ Read [Honcho's documentation](https://docs.honcho.dev) for a full understanding
For the sake of reproducibility, all benchmark results published here were generated using **gemini-2.5-flash-lite** as the ingestion model and **claude-haiku-4-5** as the chat endpoint model. In practice, Honcho uses a variety of models for these roles as well as within the dreaming processes.
We also tune Honcho for various use cases. For example, the message batch size when ingesting messages and the amount of tokens spent on dreaming both have an effect on performance. Notes on the configuration for each benchmark are included, and the [full configuration for each run is included in the data](https://github.com/plastic-labs/honcho-benchmarks).
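
For intuition, the two run configurations reported below might be sketched as follows; the field names are illustrative rather than Honcho's actual configuration schema, and the values are the ones quoted in the LongMem and LoCoMo sections.

```python
# Hypothetical shape of a per-benchmark run configuration; the authoritative
# versions live in the plastic-labs/honcho-benchmarks repository linked above.
LONGMEM_RUN = {
    "ingestion_model": "gemini-2.5-flash-lite",
    "chat_model": "claude-haiku-4-5",
    "message_batch_tokens": 16_384,  # large batches, dreaming disabled
    "dreaming": False,
}

LOCOMO_RUN = {
    "ingestion_model": "gemini-2.5-flash-lite",
    "chat_model": "claude-haiku-4-5",
    "message_batch_tokens": 128,     # 1-5 messages per batch in practice
    "dreaming": True,
}
```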
## 2. Memory Benchmarks
# 2. Memory Benchmarks
We currently use 3 different benchmarks to evaluate Honcho: [LongMem](https://arxiv.org/abs/2410.10813), [LoCoMo](https://arxiv.org/abs/2402.17753), and [BEAM](https://arxiv.org/abs/2510.27246).
### **LongMem**
**LongMem S is a data set containing 500 "needle in a haystack" questions, each with about 550 messages distributed over 50 sessions, totaling ~115,000 tokens of context per question.**
## LongMem
LongMem S is a data set containing 500 "needle in a haystack" questions, each with about 550 messages distributed over 50 sessions, totaling ~115,000 tokens of context per question.
After ingesting this context, a single query is made and judged. The correct answer hinges on information divulged in one or a handful of messages: these are the "needles." Everything else is "hay." The questions come in six flavors:
@ -73,16 +68,11 @@ Full results:
| Multi-Session | 113 | 133 | 85.0% |
Configuration: 16,384 tokens per message batch, dreaming OFF. [Full data](https://github.com/plastic-labs/honcho-benchmarks/tree/main/a1d689b).
#### LongMem M is the big brother to S. Each question has roughly 500 sessions, equivalent to over 1M tokens.
### LongMem M is the big brother to S. Each question has roughly 500 sessions, equivalent to over 1M tokens.
We asked Honcho 98 questions from LongMem M, and scored 88.8% using the same configuration that we used for S. That's a less than 2% dropoff when injecting about a million extra tokens of "hay" into the source material: real evidence that Honcho is effectively expanding the ability of a model to reason over tokens beyond context window limits.
But just adding extra noise to a conversation history isn't really getting at what we think of when we use the word "memory." Eliminating irrelevant data mostly comes down to optimizing RAG strategy and designing good search tools for an agent. True memory involves processing everything, even the "irrelevant" data, and using it to form a mental model of the author. The retrieval questions in LongMem don't get more nuanced with more data, and Honcho can easily eliminate noise to find the answer while doing much more behind the scenes.
#### A note on model selection
### A note on model selection
LongMem has been fashionable over the past year as a benchmark for anyone releasing an agent memory system. It's important to remember that when the benchmark was first released, GPT-4o scored 60.6% on LongMem S without augmentation. It was a clear demonstration that token-space memory augmentation had a place even at the scale of 100,000 tokens or less, even before questions of cost-efficiency.
After over a year, this is no longer the case. Gemini 3 Pro can run LongMem S, easily fitting the per-question ~115k tokens into its context window, and score **92.0%**. By itself. This score is higher than any published LongMem score by a memory framework project, including two that *actually used* Gemini 3 Pro as their response-generating model for the eval. Their systems are *degrading* the latent capability of the model[^6]. Honcho with Gemini 3 Pro scores 92.6%. We're not impressed by that marginal improvement, though it's good to know we're not actively impeding the model. All these results reveal is that from here on out, memory frameworks cannot merely announce scores on low-token-count tests. There are two ways to prove a memory framework is useful:
@ -92,10 +82,7 @@ After over a year, this is no longer the case. Gemini 3 Pro can run LongMem S, e
2. Demonstrate *cost efficiency*: calculate the cost of ingesting a certain number of tokens with a top-tier model to produce a correct answer, then get the same answer by spending less money on input tokens using a memory tool.
Honcho passes both of these tests. Running LongMem S directly with Gemini 3 Pro costs about \$115 for input tokens alone (the relevant part for retrieval; output tokens don't really change). Honcho with the same model had a mean token efficiency of 16% -- bringing ingestion cost down to \$18.40. Adding the cost of running Honcho's ingestion system with Gemini 2.5 flash-lite, a model quite effective for the task, brings total cost up to \$47.15 -- a **60% cost reduction**. The Honcho managed service *does not charge for ingestion* -- we operate our own fine-tuned models for the task. For more discussion of cost efficiency, see section 3.
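
The arithmetic behind those figures, as a quick check (the \$115 baseline, 16% token efficiency, and \$47.15 total are quoted above; the flash-lite ingestion cost is inferred as the difference):

```python
baseline_input_cost = 115.00          # Gemini 3 Pro reading LongMem S directly
token_efficiency = 0.16               # share of those input tokens Honcho actually sends

honcho_input_cost = baseline_input_cost * token_efficiency
print(round(honcho_input_cost, 2))    # 18.4

total_with_ingestion = 47.15          # quoted total once flash-lite ingestion is added
flash_lite_ingestion = total_with_ingestion - honcho_input_cost
print(round(flash_lite_ingestion, 2)) # 28.75 (inferred, not stated explicitly in the post)

savings = 1 - total_with_ingestion / baseline_input_cost
print(f"{savings:.0%}")               # 59%, i.e. the quoted ~60% cost reduction
```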
### **LoCoMo**
## LoCoMo
We stated regarding LongMem that it "does not test a memory system's ability to recall across truly large quantities of data": this is even more the case for LoCoMo. It takes a similar format to LongMem, but instead of 115,000 tokens per question, it provides a meager 16,000 tokens on average of context. Then, each of these 16k token conversations has a battery of 100 or more questions applied to it.
Given that models routinely offer a context window of 200,000 or more tokens nowadays, a 16,000 token conversation really isn't useful at all for evaluating a memory framework.
@ -110,10 +97,7 @@ Even still, Honcho ekes out better performance on the test than a model acting a
| Temporal | 74 | 96 | 77.1% |
Configuration: 128 tokens per message batch (1-5 messages per batch, in practice), dreaming ON. [Full data](https://github.com/plastic-labs/honcho-benchmarks/tree/main/a1d689b).
### **BEAM**
## BEAM
At this point, you might be asking yourself: can *any* so-called memory benchmark really test a memory framework?
[BEAM](https://arxiv.org/abs/2510.27246), "BEyond A Million" Tokens, is your answer.
@ -139,7 +123,6 @@ Notably, there's no dropoff in recall performance until 10 million tokens (thoug
---------
Some patterns emerge across all benchmarks. Questions that simply require recall of an entity's preference or a biographical fact about them are easy: Honcho pretty much aces these, and baseline tests fare well too. Across single-session-user and single-session-assistant questions in LongMem, for example, we surpass 95%. We score 0.95 (nearly perfect) on BEAM 500K's preference-following section.
Questions that ask about temporal reasoning are trickier: 88.7% in LongMem, 77% in LoCoMo, 0.49 in BEAM 500K. Frustratingly, these are some of the most common types of questions that a user comes across when subjectively evaluating an agent's memory. While Honcho significantly improves an LLM's ability to deal with questions about time, this is a genuine weak point of all models available today. It's part of what leads many users to continually underestimate the intellect of various models. Many models will refuse to believe the current date even when told, instead insisting that their training cutoff defines the current moment. We will continue to research this flaw and apply best-in-class solutions.
@ -147,9 +130,7 @@ Questions that ask about temporal reasoning are trickier: 88.7% in LongMem, 77%
No benchmark is perfect. Across all three, we've noticed a scattering of questions that are either outright incorrect or trigger high variance in models. These are especially prevalent in temporal reasoning questions: if a user has a discussion with an assistant in 2025 about having first met their spouse in 2018, and having been together for five years, there's meaningful ambiguity about how long the user knew their spouse before dating. Ambiguity arises both in measurements of time (when in 2018 did they meet?) and semantics (did they start *dating* when they first met, and have been *married* for five years, or did they meet and then actually start dating two years later?). Each benchmark has dozens of questions with ambiguous answers, with at least a couple outright wrong answers. These are the perils of synthetic data.
We also find that the best answer for a benchmark does not always align with the best answer for an interactive tool. Like a multiple-choice test, benchmarks reward confidently guessing and moving on if the answer is unclear. In the real world, we would prefer Honcho to interact with the user or agent and prompt them to clarify what they meant, and we've stuck to this behavior even in the configurations of Honcho that we run benchmarks on.
## 3. Benchmarking cost efficiency
# 3. Benchmarking cost efficiency
Honcho demonstrates excellent cost efficiency and can be used to significantly reduce the cost of using expensive LLMs in production applications.
@ -11,19 +11,19 @@ description: Meet Neuromancer XR--our custom reasoning model that achieves state
---
![[opengraph_neuromancer.png]]
# TL;DR
*Memory is a foundational pillar of social cognition. As a key component of [Honcho](https://honcho.dev), we approach it as a combined reasoning and retrieval problem. In this post, we introduce Neuromancer XR, the first in a series of custom reasoning models that works by extracting and scaffolding atomic conclusions from user messages across two strictly defined levels of logical certainty: explicit and deductive. It's the result of fine-tuning Qwen3-8B on a manually curated dataset mapping conversation turns to atomic conclusions. Using Neuromancer XR as the reasoning engine behind our core product Honcho leads to 86.9% accuracy on the [LoCoMo](https://snap-research.github.io/locomo/) benchmark, compared to 69.6% using the base Qwen3-8B model, and 80.0% when using Claude 4 Sonnet as baseline, to achieve state of the art results. The next model in the series, Neuromancer MR will extract and scaffold observations at two further levels along the spectrum of certainty: inductive and abductive. This will allow us to front-load most of the inference needed to improve LLMs' social cognition skills, powering AI-native products that truly understand any peer in a system, be it a user or an agent.*
*Memory is a foundational pillar of social cognition. As a key component of [Honcho](https://honcho.dev), we approach it as a combined reasoning and retrieval problem. In this post, we introduce Neuromancer XR, the first in a series of custom reasoning models that works by extracting and scaffolding atomic conclusions from user messages across two strictly defined levels of logical certainty: explicit and deductive. It's the result of fine-tuning Qwen3-8B on a manually curated dataset mapping conversation turns to atomic conclusions. Using Neuromancer XR as the reasoning engine behind our core product Honcho leads to 86.9% accuracy on the [LoCoMo](https://snap-research.github.io/locomo/) benchmark, compared to 69.6% using the base Qwen3-8B model, and 80.0% when using Claude 4 Sonnet as baseline, to achieve state of the art results. The next model in the series, Neuromancer MR will extract and scaffold conclusions at two further levels along the spectrum of certainty: inductive and abductive. This will allow us to front-load most of the inference needed to improve LLMs' social cognitive skills, powering AI-native products that truly understand any peer in a system, be it a user or an agent.*
# Table Stakes
At Plastic, we want to enable builders to create AI applications and agents with exceptional social intelligence: tools that are able to understand who you are and what you mean, whether it's an AI tutor that adapts to your learning style or a multi-agent system that anticipates your needs. These applications all require something fundamental that's only recently begun to draw attention: memory.
Most approaches treat memory as an end product or top-level [[Memory as Reasoning#Memory is ~~Storage~~ Prediction|feature]], enabling information to persist across chatbot sessions, but we consider it the foundation of something much bigger: the ability for LLMs to build mental models of their users and one another and draw from those representations in real time. This capability is essential for personalization, engagement, and retention. Not to mention multi-agent systems, individual alignment, and the trust required for agentic behavior. It's the difference between an AI that merely responds to queries and one that genuinely understands and adapts to the person it's talking to; the difference between out-of-the-box experiences and ones cohered to a user’s personal identity.
To do anything approaching the social cognition required, Honcho must be state-of-the-art in memory: able to recall observations about users across conversations with superhuman fidelity. Today, we're sharing our approach and early results from training a specialized model that treats [[Memory as Reasoning|memory as a reasoning task]] rather than simple static storage.
To do anything approaching the social cognition required, Honcho must be state-of-the-art in memory: able to recall conclusions about users across conversations with superhuman fidelity. Today, we're sharing our approach and early results from training a specialized model that treats [[Memory as Reasoning|memory as a reasoning task]] rather than simple static storage.
# Memory as Reasoning
Reasoning models continue to surge in capability and popularity. And with them, our approach to memory. Why not design it as a reasoning task concerned with deliberating over the optimal context to synthesize and remember? We turned to formal logic to develop four methods of reasoning, along a spectrum of certainty, toward conclusions to derive from conversational data:

- **Explicit**: Information directly stated by a participant.
- **Deductive**: Certain conclusions that necessarily follow from explicit information.
- **Inductive**: Connective patterns and generalizations that are likely to be true based on multiple observations.
- **Inductive**: Connective patterns and generalizations that are likely to be true based on multiple conclusions.
- **Abductive**: Probable explanations for observed behaviors that are reasonable hypotheses given available information, but not guaranteed to be true.

> [!example]+ Example Conversations and Conclusions
@ -85,7 +85,7 @@ Reasoning models continue to surge in capability and popularity. And with them,
> > > - Erin likely values education and intellectual engagement (participates in book club, attends parent-teacher conferences, reads while exercising)
> > > - Erin probably has a growth mindset (transformed health concern into athletic goal, combines activities like reading while running)
Having clear definitions for these four types of reasoning and their corresponding levels of certainty also allows us to establish how different kinds of observations relate to one another. Specifically, we require observations to scaffold only on top of observations with higher certainty: an abduction (e.g. "Erin values her health proactively") can use a deduction (e.g. "Erin exercises regularly") or induction (e.g. "Erin prioritizes healthy eating during weekdays") as one of its premises, but not the other way around. That is, one can speculate given a certain conclusion, but one cannot attempt to conclude something logically from prediction. Implied in this is that the model must show its work. A conclusion must include its premises, its evidence and support.
Having clear definitions for these four types of reasoning and their corresponding levels of certainty also allows us to establish how different kinds of conclusions relate to one another. Specifically, we require conclusions to scaffold only on top of conclusions with higher certainty: an abduction (e.g. "Erin values her health proactively") can use a deduction (e.g. "Erin exercises regularly") or induction (e.g. "Erin prioritizes healthy eating during weekdays") as one of its premises, but not the other way around. That is, one can speculate given a certain conclusion, but one cannot attempt to conclude something logically from prediction. Implied in this is that the model must show its work. A conclusion must include its premises, its evidence and support.
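
A minimal sketch of that scaffolding constraint as plain data plus a validation check; the names and structure are illustrative, not Honcho's internal model.

```python
from dataclasses import dataclass, field

# Lower rank = more certain.
CERTAINTY_RANK = {"explicit": 0, "deductive": 1, "inductive": 2, "abductive": 3}

@dataclass
class Conclusion:
    text: str
    level: str                                   # a key of CERTAINTY_RANK
    premises: list["Conclusion"] = field(default_factory=list)

def is_valid_scaffold(c: Conclusion) -> bool:
    """Premises must be strictly more certain than the conclusion they support."""
    return all(CERTAINTY_RANK[p.level] < CERTAINTY_RANK[c.level] for p in c.premises)

exercises = Conclusion("Erin exercises regularly", "deductive")
proactive = Conclusion("Erin values her health proactively", "abductive", [exercises])
backwards = Conclusion("Erin exercised today", "deductive", [proactive])

assert is_valid_scaffold(proactive)      # abduction built on a deduction: allowed
assert not is_valid_scaffold(backwards)  # deduction built on an abduction: rejected
```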
# Neuromancer XR: Training a Logical Reasoning Specialist for Memory
To implement this vision, we need a model that can reliably extract and categorize conclusions from conversations. Our initial focus for the memory task, given its emphasis on factual recall, is on the first two certainty levels: explicit and deductive knowledge--that is, conclusions we know to be true given what users (or agents) state in their messages.
@ -108,7 +108,7 @@ LoCoMo consists of samples, each involving multiple conversations between two sp
The evaluation consisted of the following steps:
- **Ingestion**: for each session in the LoCoMo dataset, we create [[Beyond the User-Assistant Paradigm; Introducing Peers|peers]] in Honcho. For each conversation in the session, we store each message in the conversation, linking it to its respective peer.
- As part of the ingestion, the evaluated model is used for **conclusion derivation**, producing a series of explicit and deductive observations that are stored in Honcho's peer-specific storage.
- As part of the ingestion, the evaluated model is used for **conclusion derivation**, producing a series of explicit and deductive conclusions that are stored in Honcho's peer-specific storage.
- **Evaluation**: each question in the LoCoMo dataset is run through Honcho's dialectic endpoint. Honcho's answers are compared to the ground truth answers in the LoCoMo dataset using an LLM-as-judge that outputs a binary 1/0 correctness score, using a prompt available in Appendix A. We measure mean accuracy (percentage of correctly answered questions) across question categories, as well as overall (across the entire dataset).
The independent variable in our experiment is the model used in the observation extraction step: Qwen3-8B, Claude 4 Sonnet, and Neuromancer XR. The dependent variable is the mean accuracy in answering the questions. Regardless of what model we're evaluating for conclusion derivation, in order to isolate the effect of the conclusion derivation model, the model used for the final question-answering inference is always Claude 4 Sonnet, which is the model we use in production for this generation step.
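
A sketch of the scoring step described above; the LLM-as-judge call itself is omitted, and `verdict` stands in for its binary 1/0 output.

```python
from collections import defaultdict
from statistics import mean

def score_run(judged):
    """judged: list of (category, verdict) pairs, verdict in {0, 1}."""
    by_category = defaultdict(list)
    for category, verdict in judged:
        by_category[category].append(verdict)
    per_category = {cat: mean(vs) for cat, vs in by_category.items()}
    overall = mean(v for vs in by_category.values() for v in vs)
    return per_category, overall
```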
@ -130,9 +130,9 @@ The results also show that fine-tuning Qwen3-8B on our dataset of curated conclu
After inspecting the ingestion and evaluation traces, we can see that the base Qwen3-8B model exhibits several failure modes that are not present in Neuromancer XR after the fine-tune. These include:
- Outputting multiple atomic facts in a single explicit conclusion, e.g. "Joanna provides care for her dog. - Joanna has a dog. - Joanna has a dog bed" in a single conclusion.
- Generating observations that lack enough knowledge to be self-contained, e.g. "Joanna is responding to Nate's comment about the turtles".
- Generating conclusions that lack enough knowledge to be self-contained, e.g. "Joanna is responding to Nate's comment about the turtles".
- Not respecting the provided definition of "deductive" by going beyond what can be certainly concluded based on explicitly stated information, and veering into speculation, e.g. "Joanna is likely seeking reassurance or validation about the feasibility of pet ownership”.
- Occasionally generating verbose observations in excess of 500 characters and that span various different concepts.
- Occasionally generating verbose conclusions in excess of 500 characters that span several different concepts.
This can lead to poor embedding quality, making retrieval more difficult, or add noise at generation time. We hypothesize that all of the failure modes described above would lead to considerably high loss during the fine-tuning process when provided with training examples that were curated to be under a specific length, follow a specific syntax, and avoid specific words that suggest speculation, making them somewhat easy to address via fine-tuning.
@ -147,8 +147,8 @@ One of the advantages of this memory framework is that it allows us to front-loa
Most other LLM frameworks store atomic, low-level "facts" about users and include them as context at generation time. This, in theory, and with enough carefully prompted inference-time compute, would allow a good enough model to develop abstract theories about the user's mental state as it tries to answer a query about the user. However, it would have to happen implicitly in the model's thought process, which in turn means that the theories about the user's mental state are ephemeral, opaque and unpredictable. Such approaches are therefore inconsistent and inefficient, and would further struggle to meet the challenges of true social cognition.
Our approach, on the other hand, shifts most of the load of reasoning about the peer from generation time to the earlier stages of the process, when messages are processed and ingested. By the time observations are retrieved for generation, low-level messages have already been distilled and scaffolded into a hierarchical, certainty-labeled, and easy to navigate tree containing a high-fidelity user representation.
## Beyond recall: toward social intelligence
Our approach, on the other hand, shifts most of the load of reasoning about the peer from generation time to the earlier stages of the process, when messages are processed and ingested. By the time conclusions are retrieved for generation, low-level messages have already been distilled and scaffolded into a hierarchical, certainty-labeled, and easy to navigate tree containing a high-fidelity user representation.
## Beyond recall: toward continual learning
Evaluations and benchmarks are essential tools on our path to better frameworks for building AI-native tools. However, they don't tell the whole story: no evaluation is perfect, and hill-climbing can easily mislead us into optimizing for higher scores rather than the true north star: the overall quality of our product. For us, that means treating memory not as a hill to die on, but as table-stakes in our pursuit of social cognition that can truly transform the way AI-native tools understand us. Although success at this broader goal is much harder to quantify in conventional benchmarks, given the complex and under-specified nature of social cognition, we will continue to implement the evaluations that we find the most helpful for our agile development process.
In that spirit, we have our sights set on the remaining two levels of certainty we introduced at the beginning of this blog post: inductive and abductive. In our manual, preliminary testing, including all four levels of reasoning resulted in incredibly rich user representations being extracted from even the simplest interactions. What lies ahead of us is the exciting task of harnessing these representations and delivering them via Honcho in the fastest, most flexible and most agentic way.