mirror of https://github.com/jackyzha0/quartz.git (synced 2025-12-19 10:54:06 -06:00)
added descriptions to all posts for SEO, updated tags, added authors to all, fixed legacy header & tl;dr formatting across the board, & lots more
This commit is contained in:
parent 0da8de6308
commit f1aa5ec235
@ -4,12 +4,14 @@ date: 05.09.2024
tags:
- blog
- dev
- archive
author: Vineeth Voruganti
description: A deep dive into SDK design patterns, comparing object-oriented vs singleton approaches & evaluating code generation platforms for API client libraries.
---

> [!custom] WELCOME TO THE PLASTIC [[archive|ARCHIVE]]
> This blog post has been archived because it's legacy content that's out-of-date or deprecated. We keep this content around so those interested can dig into the evolution of our projects & thinking.
>
> This post contains Vineeth's (Plastic Co-founder/CTO) notes on the early design of Honcho's SDKs. For the most up to date SDK reference, check out the [Honcho Docs](https://docs.honcho.dev).
> This post contains Vineeth's (Plastic's Co-founder & CTO) notes on the early design of Honcho's SDKs. For the most up-to-date SDK reference, check out the [Honcho Docs](https://docs.honcho.dev).
>
> Enjoy.
@ -18,7 +20,7 @@ author: Vineeth Voruganti

After several months of managing the SDKs for Honcho manually, we decided to
take a look at the options available for automatically generating SDKs.

From our research we picked a platform and have made brand new SDKs for Honcho
From our research we picked a platform and have made brand-new SDKs for Honcho
that use idiomatic code, are well documented, and let us support more languages.

# Introduction

For the past few months I have been working on managing the [Honcho](https://honcho.dev) project and its associated SDKs. We've been taking the approach of developing the SDK manually as we are focused on trying to find the best developer UX and maximize developer delight.

@ -165,8 +167,7 @@ the end.

At the time of this research there was no follow-up post.

[Ask HN: Best practices (and examples) for designing client libraries for
APIs?](https://news.ycombinator.com/item?id=23283551)
[Ask HN: Best practices (and examples) for designing client libraries for APIs?](https://news.ycombinator.com/item?id=23283551)

The first comment actually advocates for an object-oriented model but just using
the top level client object for authentication and setup stuff.
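To make the contrast concrete, here is a minimal sketch of the two client shapes discussed above. The class and method names are illustrative assumptions, not taken from the Honcho SDK or any other library:

```python
# Hypothetical sketch of the two SDK shapes discussed above -- names are
# illustrative, not from the Honcho SDK or any real library.

# Flat / singleton style: one client, resource IDs passed into every call.
class FlatClient:
    def __init__(self, api_key: str):
        self.api_key = api_key  # auth and setup live on the single client

    def get_session(self, user_id: str, session_id: str) -> dict:
        return {"user_id": user_id, "session_id": session_id}

# Object-oriented style: the top-level client only handles auth/setup, then
# hands out resource objects that carry their own context.
class User:
    def __init__(self, client: "OOClient", user_id: str):
        self.client, self.user_id = client, user_id

    def get_session(self, session_id: str) -> dict:
        return {"user_id": self.user_id, "session_id": session_id}

class OOClient:
    def __init__(self, api_key: str):
        self.api_key = api_key

    def user(self, user_id: str) -> User:
        return User(self, user_id)

flat = FlatClient("sk-...").get_session("u1", "s1")
oo = OOClient("sk-...").user("u1").get_session("s1")
print(flat == oo)  # same data either way; the difference is ergonomics
```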
@ -290,8 +291,7 @@ Some key insights

- Have modular design patterns that make it easy to extend and pick and choose
  features.

[Should I implement OOP in a REST
API?](https://www.reddit.com/r/flask/comments/1755ob0/should_i_implement_oop_in_a_rest_api/)
[Should I implement OOP in a REST API?](https://www.reddit.com/r/flask/comments/1755ob0/should_i_implement_oop_in_a_rest_api/)

Most people seem to be saying a full OOP method is overkill, but there are
people advocating for having a controller class with methods that take data

@ -333,10 +333,7 @@ the two.

Again and again, the best way to approach SDK development is to just do whatever
is easier, and create tons of documentation that will help developers navigate
your [API Ladder](https://blog.sbensu.com/posts/apis-as-ladders/). Someone will
get confused regardless of what you do, so the key is to make sure the SDK makes
sense (even if it's not the most efficient or clean) and remove hurdles for
users to navigate errors and mistakes.
your [API Ladder](https://blog.sbensu.com/posts/apis-as-ladders/). Someone will get confused regardless of what you do, so the key is to make sure the SDK makes sense (even if it's not the most efficient or clean) and remove hurdles for users to navigate errors and mistakes.

# SDK Generation Platforms

With a sense of the best standards for SDK design and additional features that
should be supported in the SDK I want to look at a few different options to
@ -4,6 +4,9 @@ date: 04.16.24
tags:
- blog
- honcho
- archive
author: Courtland Leer
description: A beginner-friendly guide to Honcho, the AI personalization platform that helps LLM applications get to know users via storage, insights, & retrieval.
---

> [!custom] WELCOME TO THE PLASTIC [[archive|ARCHIVE]]
> This post has been archived because it's legacy content that
@ -13,10 +16,8 @@ tags:
|
||||
|
||||
> [!NOTE] Welcome to our quick, "explain it like I'm 5" guide to [Honcho](https://honcho.dev)!
|
||||
> We'll keep it simple, covering [[ARCHIVED; A Simple Honcho Primer#^ef795f|what Honcho is]], [[ARCHIVED; A Simple Honcho Primer#^x125da|why we built it]], [[ARCHIVED; A Simple Honcho Primer#^cd2d3c|how to use it]], and [[ARCHIVED; A Simple Honcho Primer#^ca46d7|where the product is going]]. But throughout, we'll link to places you can dive deeper.
|
||||
|
||||
## What Is Honcho?
|
||||
# What Is Honcho?
|
||||
^ef795f
|
||||
|
||||
Honcho is a personalization platform for large language model (LLM) applications built by [Plastic Labs](https://plasticlabs.ai).
|
||||
|
||||
It's software infrastructure that lets AI apps "get to know" their users, resulting in delightful experiences and optimized time to value.
|
||||
@ -36,10 +37,8 @@ If you've heard of [Retrieval Augmented Generation](https://en.wikipedia.org/wik
|
||||
Behind the scenes, Honcho learns about users as people--[[ARCHIVED; User State is State of the Art|richly modeling identity]]. It seeks to understand their beliefs, hopes, dreams, history, interests, and preferences.
|
||||
|
||||
It then acts as [[ARCHIVED; Introducing Honcho's Dialectic API|an oracle to each user]], allowing apps to ask for any personal context they need to improve UX and giving them access to a social cognition layer.
|
||||
|
||||
## Why We Built Honcho
|
||||
# Why We Built Honcho
|
||||
^x125da
|
||||
|
||||
Plastic Labs was founded as an edtech company. The original mission was to build an AI tutor that [[ARCHIVED; Open Sourcing Tutor-GPT#^x527dc|could reason like]] the best human instructors. We quickly found the key limitation was data not on the subject matter, but on the student. To overcome it, the tutor needed [[ARCHIVED; Theory of Mind Is All You Need|a way to]] get to know *each* of its students deeply.
|
||||
|
||||
Honcho was born by running up against this challenge, building technology to solve it, and realizing all AI applications are going to need the same solutions. The promise of *generative* AI isn't one-size-fits-all products, but bespoke experiences in each moment for each user. The same limitation emerges--how well do you know your user?
|
||||
@ -57,11 +56,9 @@ But it's not intuitive for a few reasons:
|
||||
|
||||
Still, when interacting with an AI app, there's a sense that it *should* be getting to know us. In fact, we're often surprised when we realize it's not learning about us over time. And probably annoyed at having to start over.
|
||||
|
||||
Think about personalization here as more like the experience of close human companionship or white glove services than the attention hacking mechanisms of TikTok. There's [[ARCHIVED; Announcing Honcho's Private Beta#^xb6ef1|enormous potenial]] for more positive-sum use of user data and for aligning AI applications more closely with user needs and preferences[^2].
|
||||
|
||||
## How to Use Honcho
|
||||
Think about personalization here as more like the experience of close human companionship or white-glove services than the attention-hacking mechanisms of TikTok. There's [[ARCHIVED; Announcing Honcho's Private Beta#^xb6ef1|enormous potential]] for more positive-sum use of user data and for aligning AI applications more closely with user needs and preferences[^2].
|
||||
# How to Use Honcho
|
||||
^cd2d3c
|
||||
|
||||
Honcho is first and foremost a **storage** framework. Think of it like an open source version of the OpenAI Assistants API. User `sessions` store both user and AI generated `messages` as well as any intermediate inferences you might want to store as `metamessages`:

```python
@ -87,10 +84,8 @@ session.chat("What are the user's interests?")
```

There are a [[ARCHIVED; Introducing Honcho's Dialectic API#How It Works|ton of ways]] to use Honcho; this primer only scratches the surface[^3].
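To make that storage-plus-chat flow concrete, here is a toy sketch. The class and method names mirror the snippet above but are assumptions for illustration, not the current Honcho SDK; see the [Honcho Docs](https://docs.honcho.dev) for the real interface.

```python
# Toy sketch of the storage + dialectic flow described above. Names are
# illustrative assumptions, not the current Honcho SDK.
from dataclasses import dataclass, field

@dataclass
class Session:
    messages: list = field(default_factory=list)      # user & AI generated messages
    metamessages: list = field(default_factory=list)  # intermediate inferences

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def add_metamessage(self, kind: str, content: str) -> None:
        self.metamessages.append({"kind": kind, "content": content})

    def chat(self, query: str) -> str:
        # In Honcho this would hit the Dialectic API; here we just summarize.
        return f"(answer to {query!r}, based on {len(self.messages)} stored messages)"

session = Session()
session.add_message("user", "I've been learning Rust on weekends.")
session.add_metamessage("insight", "User is interested in systems programming.")
print(session.chat("What are the user's interests?"))
```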
|
||||
## What's Next for Honcho?
|
||||
# What's Next for Honcho?
|
||||
^ca46d7
|
||||
|
||||
Beyond improving our internal AI models so they can get to know users as richly as possible, we see three natural extensions in [[ARCHIVED; Announcing Honcho's Private Beta#^eb15f3|Honcho's future]]:
|
||||
|
||||
1. [[ARCHIVED; Announcing Honcho's Private Beta#^x2dd3b|Monitoring & Evaluation]] - developer tools to understand & assess the impact of personalization + machine learning tools to build personalized datasets
|
||||
@ -98,9 +93,7 @@ Beyond improving our internal AI models so they can get to know users as richly
|
||||
3. [[ARCHIVED; Announcing Honcho's Private Beta#^ebf071|Honcho Application Ecosystem]] - a network of apps contributing to & sharing Honcho data, user-owned & stored in confidential environments
|
||||
|
||||
And in just a few weeks, we'll be launching a demo platform where anyone can interact with (& eventually build) Honcho powered apps.
|
||||
|
||||
## Join the Beta
|
||||
|
||||
# Join the Beta
|
||||
[Sign-up for the private beta](https://plasticlabs.typeform.com/honchobeta) and start building personalized experiences.
|
||||
|
||||
[Join Discord](https://discord.gg/plasticlabs), introduce yourself, and tell us what you're working on.
|
||||
|
||||
@ -6,16 +6,16 @@ tags:
- dev
- ml
- blog
- archive
author: Courtland Leer
description: Introducing Honcho's private beta--a hosted platform for agent personalization with user-centric storage, theory of mind inference, & our Dialectic API
---
![[honcho_thumb_blog_white.png]]
|
||||
## TL;DR
|
||||
|
||||
Today we're announcing the launch of [Honcho's](https://honcho.dev) private beta. [Sign-up for the waitlist here](https://plasticlabs.typeform.com/honchobeta).
|
||||
|
||||
This is a hosted version of our agent personalization platform. It integrates user data storage and theory of mind inference accessible via [[ARCHIVED; Introducing Honcho's Dialectic API|our Dialectic API]]. You can now inject per-user social cognition anywhere in your AI app's architecture.
|
||||
|
||||
## The Problem
|
||||
# TL;DR
|
||||
*Today we're announcing the launch of [Honcho's](https://honcho.dev) private beta. [Sign-up for the waitlist here](https://plasticlabs.typeform.com/honchobeta).*
|
||||
|
||||
*This is a hosted version of our agent personalization platform. It integrates user data storage and theory of mind inference accessible via [[ARCHIVED; Introducing Honcho's Dialectic API|our Dialectic API]]. You can now inject per-user social cognition anywhere in your AI app's architecture.*
|
||||
# The Problem
|
||||
Most AI apps are still just demos.
|
||||
|
||||
We're seeing new capabilities every day, but great product experiences are few and far between. It's hard to go from knocking down a benchmark or prototyping task completion to a sticky production grade app.
|
||||
@ -33,34 +33,25 @@ Users don't want to learn confusing prompt engineering, redundantly establish st
|
||||
But we're finding consistently that the work we offload to AI apps comes back mediocre at best. What's missing? It's not just about [[Machine learning is fixated on task performance|doing the thing generally]], it's doing the thing just like *I* would do it, given the inclination or expertise.
|
||||
|
||||
To earn the trust to act autonomously, to graduate from toys to life changing tools, agents need access to dynamic user models and social cognition.
|
||||
|
||||
## The Solution
|
||||
|
||||
# The Solution
|
||||
Why use Honcho to start modeling users and incorporate social cognition?
|
||||
|
||||
You need to discover your users' unmet needs so you know how your product should evolve.
|
||||
|
||||
### Features
|
||||
|
||||
## Features
|
||||
Here's what the private beta currently includes, and what's on the way:
|
||||
|
||||
#### User-Centric Storage
|
||||
### User-Centric Storage
|
||||
^x15f37
|
||||
|
||||
Honcho allows you to [store](https://docs.honcho.dev/getting-started/architecture) `users`, `messages`, `sessions`, & `metamessages`. That is, you can effortlessly record each user interaction with your application, organized on a per-user basis, as well as the product of any intermediate steps between user message and application response.

It also supports `documents` and `collections`: the former store discrete user embeddings, the latter organize them globally across sessions. These primitives are used by Honcho's personalization engine to begin modeling user identity based on each interaction. They can also be used to "bring your own" user data or context to be computed over and utilized by Honcho.
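As a rough illustration of those two primitives (class names and shapes are assumptions for explanation, not the Honcho SDK's actual objects):

```python
# Rough illustration only -- not the Honcho SDK's actual objects.
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str               # a discrete user fact
    embedding: list[float]  # its vector representation

@dataclass
class Collection:
    name: str
    documents: list[Document] = field(default_factory=list)  # organized globally, across sessions

    def add(self, text: str, embedding: list[float]) -> None:
        self.documents.append(Document(text, embedding))

interests = Collection("user-interests")
interests.add("Enjoys long-distance cycling", [0.12, 0.98, 0.33])
print(len(interests.documents))
```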
|
||||
#### Personalization Engine
|
||||
### Personalization Engine
|
||||
^x53717
|
||||
|
||||
Here's where the magic happens. Honcho leverages everything in storage to run theory of mind inference and automatically learn about each user.
|
||||
|
||||
The personalization engine both pulls out user desires, history, beliefs, emotions, etc from the data and surfaces it on demand. You can use it to answer queries, run prediction, build training sets, hydrate prompts, or cache for later. Deterministically inject specific types of context or let your LLM dynamically decide what's most useful in each moment.
|
||||
|
||||
Honcho is always updating user identity, so it's ready when you need it.
|
||||
|
||||
##### Dialectic API
|
||||
### Dialectic API
|
||||
^ee4516
|
||||
|
||||
Our [[ARCHIVED; Introducing Honcho's Dialectic API|Dialectic API]] is how your app-side LLM interfaces with the Honcho-side agent sitting on top of each user identity. This is done in natural language. It's an AI-native endpoint for direct LLM-to-LLM communication.
|
||||
@ -68,10 +59,8 @@ Our [[ARCHIVED; Introducing Honcho's Dialectic API|Dialectic API]] is how your a
|
||||
It allows you to inject personal context and social cognition directly into your app's cognitive architecture wherever you need it, sync or async. Agent-to-agent chat over each user.
|
||||
|
||||
[[ARCHIVED; Introducing Honcho's Dialectic API#^57acc3|Here's an extended list of possible ways to use it]].
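As a hedged sketch of that app-side flow (function names and the query wording are assumptions, not Honcho's actual SDK), injecting a Dialectic answer into a prompt might look like:

```python
# Hypothetical sketch -- only the shape of the flow, not the real Honcho SDK:
# ask Honcho about the user in natural language, then hydrate your prompt.

def ask_honcho(user_id: str, question: str) -> str:
    """Stand-in for a Dialectic API call: natural-language question in,
    natural-language answer about this user out."""
    return f"(Honcho's answer about user {user_id}: ...)"

def build_prompt(user_id: str, task: str) -> str:
    personal_context = ask_honcho(
        user_id,
        "What does this user already know about the task, and how do they like help delivered?",
    )
    return (
        "You are a helpful assistant.\n"
        f"Personal context from Honcho: {personal_context}\n"
        f"Task: {task}"
    )

print(build_prompt("user-123", "Plan a weekend learning project."))
```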
|
||||
|
||||
#### User-Specific Monitoring (coming soon...)
|
||||
### User-Specific Monitoring (coming soon...)
|
||||
^x2dd3b
|
||||
|
||||
Soon, Honcho will support a suite of tools to get the most out of our personalization platform.
|
||||
|
||||
- **Visualization tools** - it's hard to grok and track everything going on within a session, so we're building clean ways to visualize this and its relationship to all the background inference
|
||||
@ -81,11 +70,8 @@ Soon, Honcho will support a suite of tools to get the most out of our personaliz
|
||||
- **Evaluation & Benchmarking** - the state of theory of mind research is highly compelling, but [[Achieving SOTA on OpenToM with DSPy#^0b4f2e|we need practical, app & user specific evals]]
|
||||
|
||||
- **Training Set Curation** - building datasets with personal context [[ARCHIVED; Introducing Honcho's Dialectic API#^f19646|allows more robust, domain-specific training]]; we're building tools for anyone to easily construct, then train on, such datasets
|
||||
|
||||
### The Future of Honcho
|
||||
|
||||
## The Future of Honcho
|
||||
^eb15f3
|
||||
|
||||
At [Plastic Labs](https://plasticlabs.ai), we're dedicated to radically extending human agency and identity. That means giving AI superpowers to every individual.
|
||||
|
||||
This only works in a world with a rich ecosystem of personalized agents--individually-aligned, highly distributed, and universally accessible.
|
||||
@ -103,10 +89,8 @@ All that guides a roadmap including, but not limited to:
|
||||
- **User owned data & confidential computing environments** - re-centralizing personal data around the person, then allowing approved applications to *compute-to* that data in a privacy preserving way
|
||||
|
||||
- **User-facing controls** - empower users to curate their Honcho identities, authenticate with Honcho, and define sensitive data sharing policies in natural language ^a84f44
|
||||
|
||||
### Who Is This For?
|
||||
## Who Is This For?
|
||||
^xb6ef1
|
||||
|
||||
We want to build with diverse projects at all stages of development--from ideation to production.
|
||||
|
||||
We've already begun working with assistant, browsing, ecommerce, education, health, and productivity projects. Many more already on the waitlist are building in co-pilots, crypto, entertainment, finance, gaming, matchmaking, PKM, real estate, social media, & more.
|
||||
@ -114,9 +98,7 @@ We've already begun working with assistant, browsing, ecommerce, education, heal
|
||||
Which AI applications could benefit from knowing the users better, predicting their unmet needs, and personalizing UX? We think the latent list is vast.
|
||||
|
||||
Any app producing generative experiences for users has a lot to gain from Honcho. If you're looking to out-compete foundation models, build unique training sets, solve user context storage, or--more importantly--produce delightful experiences, hit us up.
|
||||
|
||||
## Join the Beta
|
||||
|
||||
# Join the Beta
|
||||
[Sign-up for the private beta](https://plasticlabs.typeform.com/honchobeta) and start building personalized agent experiences.
|
||||
|
||||
[Join Discord](https://discord.gg/plasticlabs), introduce yourself, and tell us what you're working on.
|
||||
|
||||
@ -8,15 +8,15 @@ tags:
- philosophy
- ml
- announcements
- archive
author: Courtland Leer & Vince Trost
description: Introducing Honcho, an open-source user context management framework for LLM applications that enables personalized, user-first AI experiences at scale.
---
![[missing_piece.png]]
|
||||
*The missing piece of the stack*
|
||||
## TL;DR
|
||||
|
||||
Today we drop the first release of a project called [*Honcho*](https://github.com/plastic-labs/honcho/tree/main), an open-source version of the OpenAI Assistants API. Honcho manages your AI app data on a per-user basis, allowing for multiple concurrent sessions. Glaringly absent from the existing stack, Honcho will, at full maturity, usher the advent of atomic, disposable agents that are user-first by default.
|
||||
|
||||
## Plastic Lore
|
||||
|
||||
# TL;DR
|
||||
*Today we drop the first release of a project called [Honcho](https://github.com/plastic-labs/honcho/tree/main), an open-source version of the OpenAI Assistants API. Honcho manages your AI app data on a per-user basis, allowing for multiple concurrent sessions. Glaringly absent from the existing stack, Honcho will, at full maturity, usher the advent of atomic, disposable agents that are user-first by default.*
|
||||
# Plastic Lore
|
||||
[Plastic Labs](https://plasticlabs.ai) was conceived as a research group exploring the intersection of education and emerging technology. Our first cycle focused on how the incentive mechanisms and data availability made possible by distributed ledgers might be harnessed to improve learning outcomes. But with the advent of ChatGPT and a chorus of armchair educators proclaiming tutoring solved by the first nascent consumer generative AI, we shifted our focus to large language models. ^09f185
|
||||
|
||||
As a team with backgrounds in both machine learning and education, we found the prevailing narratives overestimating short-term capabilities and under-imagining long-term potential. Fundamentally, LLMs were and still are 1-to-many instructors. Yes, they herald the beginning of a revolution in personal access not to be discounted, but every student is still ultimately getting the same experience. And homogenized educational paradigms are by definition under-performant on an individual level. If we stop here, we're selling ourselves short.
|
||||
@ -24,7 +24,7 @@ As a team with with backgrounds in both machine learning and education, we found
|
||||
![[zombie_tutor_prompt.jpg]]
|
||||
*A well intentioned but monstrously deterministic [tutor prompt](https://www.oneusefulthing.org/p/assigning-ai-seven-ways-of-using).* ^dfae31
|
||||
|
||||
Most edtech projects we saw emerging actually made foundation models worse by adding gratuitous lobotomization and coercing deterministic behavior. The former stemmed from the typical misalignments plaguing edtech, like the separation of user and payer. The latter seemed to originate with deep misunderstandings around what LLMs are and continues to translate to a huge missed opportunities.
|
||||
Most EdTech projects we saw emerging actually made foundation models worse by adding gratuitous lobotomization and coercing deterministic behavior. The former stemmed from the typical misalignments plaguing EdTech, like the separation of user and payer. The latter seemed to originate with deep misunderstandings around what LLMs are and continues to translate to huge missed opportunities.
|
||||
|
||||
So we set out to build a non-skeuomorphic, AI-native tutor that put users first. The same indeterminism so often viewed as LLMs' greatest liability is in fact their greatest strength. Really, it's what they _are_. When great teachers deliver effective personalized instruction, they don't consult some M.Ed flowchart, they leverage the internal personal context they have on the student and reason (consciously or basally) about the best pedagogical intervention. LLMs are the beginning of this kind of high-touch learning companion being _synthetically_ possible.
|
||||
|
||||
@ -32,9 +32,7 @@ So we set out to build a non-skeuomorphic, AI-native tutor that put users first.
|
||||
*We're not so different after all ([@anthrupad](https://twitter.com/anthrupad)).*
|
||||
|
||||
Our [[ARCHIVED; Open Sourcing Tutor-GPT|experimental tutor]], Bloom, [[ARCHIVED; Theory of Mind Is All You Need|was remarkably effective]]--for thousands of users during the 9 months we hosted it for free--precisely because we built [cognitive architectures](https://blog.langchain.dev/openais-bet-on-a-cognitive-architecture/) that mimic the theory-of-mind expertise of highly efficacious 1:1 instructors.
|
||||
|
||||
## Context Failure Mode
|
||||
|
||||
# Context Failure Mode
|
||||
But we quickly ran up against a hard limitation: the failure mode we believe all vertical-specific AI applications will eventually hit if they want to be sticky, paradigmatically different from their deterministic counterparts, and realize the latent potential. That's context, specifically user context--Bloom didn't know enough about each student.
|
||||
|
||||
We're consistently blown away by how many people don't realize large language models themselves are stateless. They don't remember shit about you. They're just translating context they're given into probable sequences of tokens. LLMs are like horoscope writers, good at crafting general statements that *feel* very personal. You would be too, if you'd ingested and compressed that much of the written human corpus.
|
||||
@ -53,9 +51,7 @@ The real magic of 1:1 instruction isn't subject matter expertise. Bloom and the
|
||||
Large language models can be good at this too. With similar compression and generation abilities, they're uniquely suited (among existing technology) to get to know you. We really can have shared culture and relationships with LLMs, absent (if we like) any cringy anthropomorphism.
|
||||
|
||||
Bloom needed a mechanism to harvest and utilize more context about the student. So we built it one.
|
||||
|
||||
## Research Solutions
|
||||
|
||||
# Research Solutions
|
||||
Prediction algorithms have become phenomenal at hacking attention using tabular engagement and activity data. But if we're thinking LLM-natively, a few questions emerge:
|
||||
|
||||
1. How are LLMs uniquely positioned to understand users?
|
||||
@ -75,21 +71,16 @@ Late last year we published a [research pre-print on this topic](https://arxiv.o
|
||||
*A [predictive coding inspired metacognitive architecture](https://youtu.be/PbuzqCdY0hg?feature=shared), from our research.*
|
||||
|
||||
We added it to Bloom and found the missing piece to overcoming the failure mode of user context. Our tutor could now learn about the student and use that knowledge effectively to produce better learning outcomes.
|
||||
|
||||
## Blast Horizon
|
||||
|
||||
Building and maintaining a production-grade AI app for learning catapulted us to this missing part of the stack. Lots of users, all growing in unique ways, all needing personalized attention that evolved over multiple longform sessions, forced us to confront the user context management problem with all it's thorny intricacy and potential.
|
||||
# Blast Horizon
|
||||
Building and maintaining a production-grade AI app for learning catapulted us to this missing part of the stack. Lots of users, all growing in unique ways, all needing personalized attention that evolved over multiple long-form sessions, forced us to confront the user context management problem with all its thorny intricacy and potential.
|
||||
|
||||
And we're hearing constantly from builders of other vertical-specific AI apps that personalization is the key blocker. In order for projects to graduate from toys to tools, they need to create new kinds of magic for their users. Mountains of mostly static software exist to help accomplish an unfathomable range of tasks, and lots of it can be personalized using traditional (albeit laborious for the user) methods. But LLMs can observe, reason, then generate the software _and the user context_, all abstracted away behind the scenes.
|
||||
|
||||
Imagine online stores generated just in time for the home improvement project you're working on; generative games with rich multimodality unfolding to fit your mood on the fly; travel agents that know itinerary needs specific to your family, without being explicitly told; copilots that think and write and code not just like you, _but as you_; disposable, atomic agents with full personal context that replace your professional services--_you_ with a law, medical, accounting degree.
|
||||
|
||||
This is the kind of future we can build when we put users at the center of our agent and LLM app production.
|
||||
|
||||
## Introducing Honcho
|
||||
|
||||
# Introducing Honcho
|
||||
^a9d0f8
|
||||
|
||||
So today we're releasing the first iteration of [[Honcho name lore|Honcho]], our project to re-define LLM application development through user context management. At this nascent stage, you can think of it as an open-source version of the OpenAI Assistants API. ^8c982b
|
||||
|
||||
Honcho is a REST API that defines a storage schema to seamlessly manage your application's data on a per-user basis. It ships with a Python SDK which [you can read more about how to use here](https://github.com/plastic-labs/honcho/blob/main/README.md).
|
||||
@ -98,12 +89,10 @@ Honcho is a REST API that defines a storage schema to seamlessly manage your app
|
||||
|
||||
We spent lots of time building the infrastructure to support multiple concurrent users with Bloom, and too often we see developers running into the same problem: building a fantastic demo, sharing it with the world, then inevitably taking it down because of infrastructure/scaling issues.
|
||||
|
||||
Honcho allows you to deploy an application with a single command that can automatically handle concurrent users. Speedrunning to production is now only limited by the amount of spend you can handle, not tedious infrastructure setup.
|
||||
Honcho allows you to deploy an application with a single command that can automatically handle concurrent users. Speed-running to production is now only limited by the amount of spend you can handle, not tedious infrastructure setup.
|
||||
|
||||
Managing app data on a per-user basis is the first small step in improving how devs build LLM apps. Once you define a data management schema on a per-user basis, lots of new possibilities emerge around what you can do intra-user message, intra-user sessions, and even intra-user sessions across an ecosystem of agents.
|
||||
|
||||
## Get Involved
|
||||
|
||||
# Get Involved
|
||||
We're excited to see builders experiment with what we're releasing today, and with Honcho as it continues to evolve.
|
||||
|
||||
Check out the [GitHub repo](https://github.com/plastic-labs/honcho) to get started and join our [Discord](https://discord.gg/plasticlabs) to stay up to date 🫡.
|
||||
|
||||
@ -6,20 +6,18 @@ tags:
- ml
- announcements
- blog
- archive
author: Courtland Leer, Vince Trost, & Vineeth Voruganti
description: Announcing the Dialectic API--an LLM-native endpoint enabling agent-to-agent chat in natural language for dynamic user personalization.
---
![[agent_dialectics.jpeg]]
|
||||
## TL;DR
|
||||
|
||||
Our [Dialectic API](https://docs.honcho.dev/guides/dialectic-endpoint) is an LLM-native way for your AI application to discuss user context with Honcho. It allows for direct LLM-to-LLM communication in natural language.
|
||||
|
||||
Agents need ways to interface dynamically and autonomously, free from the rigidness of traditional APIs. We're building that substrate.
|
||||
|
||||
## What's a Dialectic API?
|
||||
# TL;DR
|
||||
*Our [Dialectic API](https://docs.honcho.dev/guides/dialectic-endpoint) is an LLM-native way for your AI application to discuss user context with Honcho. It allows for direct LLM-to-LLM communication in natural language.*
|
||||
|
||||
*Agents need ways to interface dynamically and autonomously, free from the rigidness of traditional APIs. We're building that substrate.*
|
||||
# What's a Dialectic API?
|
||||
[Honcho](https://honcho.dev) is our platform for personalizing agents to users. Currently, it includes [[ARCHIVED; Honcho; User Context Management for LLM Apps#^a9d0f8|session storage]], BYO context storage, passive [[Loose theory of mind imputations are superior to verbatim response predictions|theory of mind]] user modeling, and *now* an agent deeply coupled to all of that rich user context. That agent can be called via our Dialectic API to surface user data for use with any cognitive architecture.
|
||||
|
||||
### How It Works
|
||||
|
||||
## How It Works
|
||||
In designing an LLM pipeline and an application's cognitive architecture, you'll need to decide where and how to inject personal user context so the task is [[Machine learning is fixated on task performance|not simply completed in a general way]], but in the most appropriate way for [[ARCHIVED; User State is State of the Art|each specific user]].
|
||||
|
||||
That's when your agent asks Honcho for what it needs in natural language. This query can take many forms. Some possibilities:
|
||||
@ -43,16 +41,12 @@ In this way, Honcho becomes an self-improving oracle to the identity of each and
|
||||
Honcho responds to queries in the same format--natural language. Most simply, this is just a conversation between two agents, *collaboratively* reasoning about the best way to personalize UX. Agent-to-agent chat over users.
|
||||
|
||||
In the coming weeks, we'll release a number of off the shelf options to plug into any cognitive architecture and demos to illustrate more custom utility. We expect to see (and are already seeing in [our private beta](https://plasticlabs.typeform.com/honchobeta)) lots of novel ways to prompt Honcho effectively.
|
||||
|
||||
### Why We Built It
|
||||
|
||||
## Why We Built It
|
||||
Why is a dialectic API the right way to solve the problem of user context in LLM applications?
|
||||
|
||||
Not only is it ideal from a development and design perspective, it's optimal for the particular task of personal context and user identity.
|
||||
|
||||
#### The DevEx Case
|
||||
### The DevEx Case
|
||||
^a14c2f
|
||||
|
||||
Our Dialectic API is a single endpoint for everything personalization.
|
||||
|
||||
It reduces development overhead and allows you to get a personalized application running quickly and efficiently--speedrunning to production.
|
||||
@ -62,32 +56,24 @@ For most AI apps, personalization will be a key differentiator between your agen
|
||||
Further, when agents can communicate directly using natural language, there's no need to learn and manage a complicated API specification. Or for us to build one. Since LLMs are proficient at interpreting the intricacies of natural language, there's a functionally infinite number of ways to ask Honcho a question and get a satisfactory result. Far superior to brittle and strict legacy APIs.
|
||||
|
||||
However, this doesn't mean the developer now needs to be a prompting expert, fluent in all its esoterica. Honcho is an expert in personal context and theory of mind reasoning, so your prompts can be adaptive and ad hoc, and Honcho will figure out the rest. When you're ready, you can even offload the queries to your app-side LLM.
|
||||
|
||||
#### The ML Case
|
||||
### The ML Case
|
||||
^x7f7f8
|
||||
|
||||
Extra context improves user response generation; the more specific, the better. Focus on ML to crush your vertical, and let Honcho personalize it by default.
|
||||
|
||||
##### Leverage Natural Language Plasticity
|
||||
|
||||
Each user has a [[ARCHIVED; User State is State of the Art#^5bc20b|rich and complex personal identity]]. Access to higher-fidelity representations of that identity can be combined with the task completion context of you app in each moment to generate the most optimal tokens for each user-agent interaction. I.e. ones that are felt by the user to be [[Humans like personalization|more personalized and satisfactory]]--enhancing the real and perceived time to value ratio of your app.
|
||||
#### Leverage Natural Language Plasticity
|
||||
Each user has a [[ARCHIVED; User State is State of the Art#^5bc20b|rich and complex personal identity]]. Access to higher-fidelity representations of that identity can be combined with the task completion context of your app in each moment to generate the most optimal tokens for each user-agent interaction. I.e. ones that are felt by the user to be [[Humans like personalization|more personalized and satisfactory]]--enhancing the real and perceived time to value ratio of your app.
|
||||
|
||||
But that complexity is hard to capture and needlessly constrained with typical API design. In order to express the nuance of personal context, we need the high variance, dynamic nature of natural language.
|
||||
|
||||
Because LLMs consider tokens in relation to a vast [[LLMs excel at theory of mind because they read|human narrative space]], we're much closer to *semantic* machine understanding than ever. Personal context allows you to target parts of the latent space most useful in generating tokens for specific users in specific settings. The only way we know to communicate and leverage that depth is with the inherent diversity of natural language...which is itself evolutionarily optimized to describe human identity well.
|
||||
|
||||
Way richer than running RAG over a vector store of session logs. Or stateless CRUD-inspired API spec.
|
||||
|
||||
##### Out-Compete Foundation Models
|
||||
|
||||
#### Out-Compete Foundation Models
|
||||
Honcho's Dialectic API also allows you to build training examples with rich theory of mind context. Those datasets can help you outperform foundation models in your specific vertical and its set of tasks.
|
||||
|
||||
By adding additional context to inputs, the distribution of responses your model samples from can be improved. Any sort of "reasoning" the language model exhibits in a single inference is due to learned patterns in the dataset. So if you can create examples that can help it learn better patterns, you can improve the "reasoning" steps it exhibits.
|
||||
|
||||
Ultimately, we're learning ways of responding that foundation models won't. Using theory of mind context yields more specific examples, which allows more robust domain-specific training.
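For instance, a fine-tuning record enriched with theory-of-mind context might look something like this (field names are purely illustrative, not a format Honcho prescribes):

```python
import json

# Illustrative only: one enriched training example. The fields and the
# theory-of-mind annotation are assumptions, not a prescribed schema.
example = {
    "input": "User: Can you explain gradient descent again?",
    "user_context": "Prefers visual analogies; strong stats background, little calculus.",
    "output": "Picture a ball rolling downhill on the loss surface...",
}
print(json.dumps(example, indent=2))
```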
|
||||
|
||||
### Why "Dialectic"?
|
||||
|
||||
## Why "Dialectic"?
|
||||
In the classical sense, a *dialectic* process is one where two parties seek to arrive at the truth via reasoned dialogue.
|
||||
|
||||
(In our case, the truth is a solution for delivering the optimal per-app, per-user, per-session experience.)
|
||||
@ -95,9 +81,7 @@ In the classical sense, a *dialectic* process is one where two parties seek to a
|
||||
We've termed our API this way because not only is it communication between software systems, but it's a reasoned discourse between agents to reach the ideal conclusion.
|
||||
|
||||
Each agent has a different set of information; the free discussion allows them to eliminate that asymmetry and arrive at a synthesis greater than its parts. One agent is expert in delivering a service in its vertical, the other in modeling user identity and surfacing relevant, timely context based on that representation.
|
||||
|
||||
## The Agentic Substrate
|
||||
|
||||
# The Agentic Substrate
|
||||
Our Dialectic API is part of an evolutionary lineage. One that records humanity's slow discovery of all the ways machines can communicate with one another--from telegraph and punch cards to REST and GraphQL. Along each axis of typical machine comm improvement, agent-to-agent dialectics offer advantages:
|
||||
|
||||
- **Speed** - user time to value can be optimized with granular personal context requests
|
||||
@ -109,18 +93,16 @@ Our Dialectic API is part of an evolutionary lineage. One that records humanity'
|
||||
|
||||
As the commodification of inference and intelligence is coupled with growing general foundation model capability, application developers will naturally be pushed toward greater and greater vertical specificity. This will drive the development of increasingly atomic agents, ones that excel at very narrow tasks.
|
||||
|
||||
This explosion of such agent microservices, will have to include the evolution of systems for agent-agent communication and transaction. If agents are going to collaborate and get shit done for us, they need native ways to communicate. Beautifully, LLMs share with us and among themselves the universal interface of natural language.
|
||||
This explosion of such agent micro-services will have to include the evolution of systems for agent-agent communication and transaction. If agents are going to collaborate and get shit done for us, they need native ways to communicate. Beautifully, LLMs share with us and among themselves the universal interface of natural language.
|
||||
|
||||
We can leverage this substrate for agent coordination with more depth and nuance than fragile trad API design. Doubtless, categories of agents will find more efficient symbol structures for cooperation in specific, repetitive cases. But discourse in natural language remains always available as a rich foundational protocol. And as we've explored, it's the ideal starting place for transmitting insights about human identity.
|
||||
|
||||
This is just the start. Just like you can appendage memory and tools to an LLM, we can augment this substrate in a number of ways--from designing multi-party protocols, to enabling zero knowledge or confidential environments, or recording transactional data on blockchains or other types of public or private immutable ledgers.
|
||||
|
||||
That kind of richness puts us one step closer to the dream of a semantic web, one as replete with meaning as the physical world *and* machine grokkable. What *matters* to me can be used to personalize an atomic agent *just in time*, without sacrificing important context. Intelligent microservices can be more aligned with me than human economic actors and professional services, which are plagued with high-latency interest misalignment and information asymmetry.
|
||||
That kind of richness puts us one step closer to the dream of a semantic web, one as replete with meaning as the physical world *and* machine grokable. What *matters* to me can be used to personalize an atomic agent *just in time*, without sacrificing important context. Intelligent micro-services can be more aligned with me than human economic actors and professional services, which are plagued with high-latency interest misalignment and information asymmetry.
|
||||
|
||||
Honcho and agent dialectics can eliminate the principal-agent problem for this new economic paradigm, digitally extending human agency and identity further than ever before.
|
||||
|
||||
## Private Beta
|
||||
|
||||
# Private Beta
|
||||
Our Dialectic API is now available in private beta.
|
||||
|
||||
We're working closely with a diverse array of projects across many different verticals in various stages of development--from ideation to production.
|
||||
|
||||
@ -7,28 +7,28 @@ tags:
- announcements
- philosophy
- ml
- archive
author: Courtland Leer
description: An open-source reimplementation of OpenAI's memory features using Honcho, enabling any AI app to derive & store personal context about users.
---
## TL;DR
|
||||
|
||||
Personalization is the next frontier. OpenAI gets it:
|
||||
# TL;DR
|
||||
*Personalization is the next frontier. OpenAI gets it:*
|
||||
|
||||
<div class="tweet-wrapper"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">We’re testing ChatGPT's ability to remember things you discuss to make future chats more helpful. <br><br>This feature is being rolled out to a small portion of Free and Plus users, and it's easy to turn on or off. <a href="https://t.co/1Tv355oa7V">https://t.co/1Tv355oa7V</a> <a href="https://t.co/BsFinBSTbs">pic.twitter.com/BsFinBSTbs</a></p>— OpenAI (@OpenAI) <a href="https://twitter.com/OpenAI/status/1757469997742666052?ref_src=twsrc%5Etfw">February 13, 2024</a></blockquote>
|
||||
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></div>
|
||||
|
||||
Super exciting.
|
||||
*Super exciting.*
|
||||
|
||||
But what about the rest of us?
|
||||
*But what about the rest of us?*
|
||||
|
||||
Welp, we built an open source reimplementation of OpenAI's 'memory' features using [Honcho](https://honcho.dev) to effortlessly organize sessions on a per-user basis
|
||||
*Welp, we built an open source reimplementation of OpenAI's 'memory' features using [Honcho](https://honcho.dev) to effortlessly organize sessions on a per-user basis.*
|
||||
|
||||
You can derive facts about users, store them, and retrieve for later use. And we're shipping a demo of this implemented with the useful abstractions LangChain provides.
|
||||
*You can derive facts about users, store them, and retrieve for later use. And we're shipping a demo of this implemented with the useful abstractions LangChain provides.*
|
||||
|
||||
The user context rabbithole goes deep, this is still just the start.
|
||||
|
||||
If you're building with or adjacent to Honcho, [join our Discord](https://discord.gg/plasticlabs), we'd love to help 🫡.
|
||||
|
||||
## OpenAI Memories
|
||||
*The user context rabbithole goes deep, this is still just the start.*
|
||||
|
||||
*If you're building with or adjacent to Honcho, [join our Discord](https://discord.gg/plasticlabs), we'd love to help 🫡.*
|
||||
# OpenAI Memories
|
||||
This week [OpenAI announced](https://openai.com/blog/memory-and-new-controls-for-chatgpt) they're testing memory in ChatGPT. Specifically this means learning about individual users in order to improve their experiences.
|
||||
|
||||
It's a limited initial rollout, closed under the hood, and rudimentary, but appears to include functionality for deriving facts about users from conversation history and storing those to augment later generation.
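In spirit, that derive-store-retrieve loop looks something like the sketch below. The helper names and the in-memory store are assumptions for illustration, not OpenAI's implementation or our demo's actual code:

```python
# Hedged sketch of the derive -> store -> retrieve loop described above.
# Helper names and the in-memory store are illustrative stand-ins.
facts: dict[str, list[str]] = {}  # stand-in for a per-user store

def derive_facts(user_id: str, message: str) -> list[str]:
    # In a real implementation an LLM extracts facts; here we fake one.
    return [f"user said: {message}"]

def store_facts(user_id: str, new_facts: list[str]) -> None:
    facts.setdefault(user_id, []).extend(new_facts)

def retrieve_facts(user_id: str, query: str) -> list[str]:
    # A real implementation would do similarity search over embeddings.
    hits = [f for f in facts.get(user_id, []) if query.lower() in f.lower()]
    return hits or facts.get(user_id, [])

store_facts("u1", derive_facts("u1", "I train for marathons on weekends."))
print(retrieve_facts("u1", "marathons"))
```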
|
||||
@ -38,9 +38,7 @@ There are features for users to view derived facts (memories), prune them, or tu
|
||||
They're betting, we believe correctly, that the real potential here is a wealth of agents whose behavior is in *high-fidelity with user identity*.
|
||||
|
||||
We're pumped to see experiments like this taking place. But what if you're a developer that doesn't want to subscribe to this kind of platform dependency and all its attendant externalities? What if you're a user who wants independent or open source apps with a more mature version of these UX benefits?
|
||||
|
||||
## Context is Critical
|
||||
|
||||
# Context is Critical
|
||||
At [Plastic Labs](https://plasticlabs.ai) our mission is to enable rich user memory in and across every application. Only then will we really understand just how augmentative and transformative these agents can be. We've been laser focused on this problem.
|
||||
|
||||
![[laser_eyes_user_soapbox.png]]
|
||||
@ -54,11 +52,8 @@ As it stands today the space is mostly focused on the (albeit generative) [[Mach
|
||||
Every agent interaction can be generated just in time for every person, informed by relevant personal context more substantive than human-to-human sessions. User context will enable disposable agents on the fly across verticals for lower marginal cost than 1:many software paradigms.
|
||||
|
||||
<iframe width="560" height="315" src="https://www.youtube.com/embed/tTE3xiHw4Js?si=uzUzcSHFfZdjFduX" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
|
||||
|
||||
(*Here's our co-founder [Vince](https://twitter.com/vintrotweets) talking more about some of those possibilities*)
|
||||
|
||||
## "Open vs Closed"
|
||||
|
||||
# "Open" vs "Closed"
|
||||
We subscribe heavily to the spirit of arguments Harrison Chase made in ["OpenAI's Bet on Cognitive Architecture"](https://blog.langchain.dev/openais-bet-on-a-cognitive-architecture/) just a few months ago:
|
||||
|
||||
> There’s a great quote from Jeff Bezos that says to [only do what makes your beer taste better](https://blog.weaverse.io/make-your-beer-taste-better?ref=blog.langchain.dev). This refers to early industrial revolution, when breweries were also making their own electricity. A breweries ability to make good beer doesn’t really depend on how differentiated their electricity was - so those that outsourced electricity generation and focused more on brewing jumped to an advantage.
|
||||
@ -82,9 +77,7 @@ Shouldn't we be able to experiment with all this without platform lock-in, allow
|
||||
Developers will want control over personalization for their application without all the redundant overhead. Users will want a say in how they're being reasoned about and why.
|
||||
|
||||
This is our vision for Honcho.
|
||||
|
||||
## Intellectual Respect
|
||||
|
||||
# Intellectual Respect
|
||||
<div class="tweet-wrapper"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">llms are remarkable empaths<br><br>if you’d read that much fiction, you would be too</p>— Courtland Leer (@courtlandleer) <a href="https://twitter.com/courtlandleer/status/1753480140850626759?ref_src=twsrc%5Etfw">February 2, 2024</a></blockquote>
|
||||
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></div>
|
||||
|
||||
@ -108,9 +101,7 @@ N.b. you can certainly direct the model with as much verbosity as you like, but
|
||||
This isn't surprising when you consider how much content about what people are thinking is contained in a model's pretraining. It's led to some really exciting [emergent abilities](https://arxiv.org/abs/2302.02083).
|
||||
|
||||
Give the model some trust and respect, and you'll be rewarded.
|
||||
|
||||
## Let's Build
|
||||
|
||||
# Let's Build
|
||||
If you're experimenting with personalization, building with [Honcho](https://github.com/plastic-labs/honcho), or just interested in these ideas, [join our Discord](https://discord.gg/plasticlabs), and let's jam on what we can build together.
|
||||
|
||||
A healthy open ecosystem will include lots of projects trying lots of new ways to synthesize and leverage user context. We're here to support them all 🥽.
|
||||
|
||||
@ -7,13 +7,16 @@ tags:
- announcements
- pedagogy
- ml
- archive
author: Courtland Leer & Vince Trost
description: Open-sourcing Bloom, our AI learning companion that uses metacognitive prompting to elicit pedagogical reasoning & theory-of-mind from LLMs.
---
> [!custom] WELCOME TO THE PLASTIC [[archive|ARCHIVE]]
|
||||
> This blog post has been archived because it's legacy content that's out-of-date or deprecated. We keep this content around so those interested can dig into the evolution of our projects & thinking.
|
||||
>
|
||||
> This post concerns Bloom, our [Honcho](https://honcho.dev)-powered AI-tutor. We've suspended Bloom for now to focus exclusively on Honcho.
|
||||
>
|
||||
> Plastic started as an EdTech company, with Bloom as its main product. In building a popular, first of its kind, personalized AI tutor, we realized three things (1) all agents will soon need continuous learning systems to understand their users, (2) this an extremely hard problem that every developer shouldn't have to redundantly solve, & (3) we were uniquely positioned to solve it.
|
||||
> Plastic started as an EdTech company, with Bloom as its main product. In building a popular, first-of-its-kind personalized AI tutor, we realized three things: (1) all agents will soon need continuous learning systems to understand their users, (2) this is an extremely hard problem that every developer shouldn't have to redundantly solve, & (3) we were uniquely positioned to solve it.
|
||||
>
|
||||
> So we pivoted to Honcho, keeping Bloom around for a while as a demo.
|
||||
>
|
||||
@ -25,7 +28,7 @@ tags:
|
||||
# TL;DR
|
||||
Today we’re [open-sourcing](https://github.com/plastic-labs/tutor-gpt) Bloom, our digital [Aristotelian](https://erikhoel.substack.com/p/why-we-stopped-making-einsteins) learning companion.
|
||||
|
||||
What makes [Bloom](https://bloombot.ai/) compelling is its ability to _reason pedagogically_ about the learner. That is, it uses dialogue to posit the most educationally-optimal tutoring behavior. Eliciting this from the [capability overhang](https://jack-clark.net/2023/03/21/import-ai-321-open-source-gpt3-giving-away-democracy-to-agi-companies-gpt-4-is-a-political-artifact/) involves multiple chains of [metaprompting](https://arxiv.org/pdf/2102.07350.pdf,) enabling Bloom to construct a nascent, academic [theory of mind](https://arxiv.org/pdf/2304.11490.pdf) for each student. ^3498b7
|
||||
What makes [Bloom](https://bloombot.ai/) compelling is its ability to *reason pedagogically* about the learner. That is, it uses dialogue to posit the most educationally-optimal tutoring behavior. Eliciting this from the [capability overhang](https://jack-clark.net/2023/03/21/import-ai-321-open-source-gpt3-giving-away-democracy-to-agi-companies-gpt-4-is-a-political-artifact/) involves multiple chains of [metaprompting](https://arxiv.org/pdf/2102.07350.pdf,) enabling Bloom to construct a nascent, academic [theory of mind](https://arxiv.org/pdf/2304.11490.pdf) for each student. ^3498b7
|
||||
|
||||
We’re now seeing this in the explosion of ‘chat-over-content’ tools, most of which fail to capitalize on the enormous latent abilities of LLMs. Even the impressive out-of-the-box capabilities of contemporary models don’t achieve the necessary user intimacy. Infrastructure for that doesn’t exist yet 👀.
|
||||
|
||||
|
||||
@ -6,26 +6,24 @@ tags:
- philosophy
- "#ml"
- blog
- archive
author: Courtland Leer & Vince Trost
description: How Honcho's dialectic API powers a 'curation buddy' demo that learns about you over time to become a personalized intellectual companion.
---
![[agent_campfire.webp]]
|
||||
## TL;DR
|
||||
# TL;DR
|
||||
*Today we're releasing the first demo utilizing Honcho's dialectic API.[^1] Your LLM app/agent can now converse freely with [Honcho](https://honcho.dev)(-as-agent) about a user in natural language: agent-to-agent chat over user context.*
|
||||
|
||||
Today we're releasing the first demo utilizing Honcho's dialectic API.[^1] Your LLM app/agent can now converse freely with [Honcho](https://honcho.dev)(-as-agent) about a user in natural language: agent-to-agent chat over user context.
|
||||
*The demo is a "curation buddy" that can chat over links you share. It uses Honcho to [[ARCHIVED; Memories for All|derive and store personal context]] about you over time, then leverages that to be the best reading companion it can be.*
|
||||
|
||||
The demo is a "curation buddy" that can chat over links you share. It uses Honcho to [[ARCHIVED; Memories for All|derive and store personal context]] about you over time, then leverages that to be the best reading companion it can be.
|
||||
|
||||
Our fractured media landscape is a far cry from narrative meaning making around the tribal campfire. Despite the connective power of the web, many of us subsist in lonely intellectual silos, more diverse but less fulfilling than social discourse.
|
||||
|
||||
We call this *The Campfire Problem* and expect to see lots of apps working to solve parts of it using generative AI, Honcho, and other emerging technologies. Hopefully today's demo affords a glimpse of what's becoming possible.
|
||||
|
||||
## A *Curation Buddy* Demo
|
||||
*Our fractured media landscape is a far cry from narrative meaning making around the tribal campfire. Despite the connective power of the web, many of us subsist in lonely intellectual silos, more diverse but less fulfilling than social discourse.*
|
||||
|
||||
*We call this The Campfire Problem and expect to see lots of apps working to solve parts of it using generative AI, Honcho, and other emerging technologies. Hopefully today's demo affords a glimpse of what's becoming possible.*
|
||||
# A *Curation Buddy* Demo
|
||||
It's a constant problem: you're dying to talk to someone about this mind-blowing thing you read, but no one else you know is into your weird shit, plus--like you--they're all drowning in infinite read-it-later hell.
|
||||
|
||||
Enter *Curation Buddy*.
|
||||
|
||||
### Overview
|
||||
|
||||
## Overview
|
||||
Curation Buddy is an LLM application. It's a Discord bot you can chat with. Share links to any text based media and have substantive conversation.
|
||||
|
||||
It uses Honcho to personalize the UX. As you converse, Honcho learns about you. It reasons about the links and conversation to uncover insight into your knowledge, interests, beliefs, desires, [[ARCHIVED; User State is State of the Art|state]], etc.
|
||||
@ -33,25 +31,19 @@ It uses Honcho to personalize the UX. As you converse, Honcho learns about you.
|
||||
This account of user state can then be leveraged by Curation Buddy to behave like a trusted, close intellectual companion.
|
||||
|
||||
![[curation_buddy_overview.png]]
|
||||
|
||||
### What the App Does
|
||||
|
||||
## What the App Does
|
||||
Curation Buddy will have a discussion with you about the content in links you drop into chat. It does this by generating a "thought" about your (the user's) needs and listing out any additional data it could use to better address them.
|
||||
|
||||
We parse out that list and loop over it, making requests to Honcho's dialectic endpoint. Honcho returns responses to those questions; they get aggregated into a list and injected as context to hydrate the prompt that Curation Buddy uses to generate the response to the user.
|
||||
|
||||
![[curation_agent.png]]
|
||||
|
||||
### What Honcho Does
|
||||
|
||||
## What Honcho Does
|
||||
Concurrently, Honcho is listening for writes to its database. Once it detects a write, it fires off a callback function to derive facts about the user's message.
|
||||
|
||||
These facts get embedded and stored in the user's personal vector database. Then when Curation Buddy generates its list of additional info it wants to know, it sends each of those requests to Honcho and Honcho runs RAG over that personal data store. It uses the returned facts to generate a response for Curation Buddy.
|
||||
|
||||
![[honcho_agent.png]]
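Putting the two halves above together, the loop looks roughly like the sketch below. All helper names are hypothetical stand-ins; the real demo uses Honcho's dialectic endpoint and a Discord bot:

```python
# Rough sketch of the Curation Buddy <-> Honcho loop described above.
# Every helper here is a hypothetical stand-in, not the demo's actual code.

def generate_thought_and_questions(user_message: str) -> tuple[str, list[str]]:
    # App-side LLM call: a "thought" about the user's needs, plus the extra
    # data it would like in order to address them.
    return (
        "The user wants a discussion partner for this article.",
        ["What topics does this user care about?", "What have they read recently?"],
    )

def dialectic(question: str) -> str:
    # Honcho-side: answers from facts derived off stored messages (RAG over
    # the user's personal store in the real system).
    return f"(Honcho's answer to: {question})"

def respond(user_message: str) -> str:
    thought, questions = generate_thought_and_questions(user_message)
    context = [dialectic(q) for q in questions]  # loop over the request list
    prompt = f"Thought: {thought}\nUser context: {context}\nUser: {user_message}"
    return f"(reply generated from prompt beginning: {prompt[:60]}...)"

print(respond("Check out this essay on narrative and memory: https://example.com"))
```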
|
||||
|
||||
### Feature Ideas
|
||||
|
||||
## Feature Ideas
|
||||
We'd love to see someone run with and extend this demo. Here are some further Honcho-powered feature ideas beyond today's scope:
|
||||
|
||||
- Personal context informed storage for web content from links
|
||||
@ -69,15 +61,11 @@ We'd love to see someone run with and extend this demo. Here are some further Ho
|
||||
Further, there's comparable potential for any reading, media, learning, or companionship application.
|
||||
|
||||
If you're interested in building something adjacent to any of this, [hop in our Discord](https://discord.gg/plasticlabs), we'd love to support you.
|
||||
|
||||
## The Campfire Problem
|
||||
|
||||
# The Campfire Problem
|
||||
We wanted to highlight Honcho's utility in this vertical because it's one where we simultaneously hear a lot of excitement and a lot of pain points. Clearly many are hungry for more social, better solutions for media consumption and digestion, and optimists seem to share the intuition that AI has a role to play here.
|
||||
|
||||
We think Honcho and the personal context solutions it provides are the key.
|
||||
|
||||
### The Campfire
|
||||
|
||||
## The Campfire
|
||||
For most of human history, groups, tribes, nations drank from the same informational tap. In fact, when we see changes in how information flows, we see dramatic corresponding historical effects. Alterations in distribution--writing, printing, browsing, disaster--have altered the balance of power, the minds of billions, the course of civilization.
|
||||
|
||||
But the further step of processing that information and the shaping of it into *shared* narratives have played an equally enormous role. Narrative and meaning making are fundamentally social tasks. We still have to decide what to do with information, what it *means*, and we've generally done that with our neighbors.
|
||||
@ -89,9 +77,7 @@ Consider the campfires of hunter-gatherers, agoras of classical city-states, chu
|
||||
A majority of these social exercises deal in limited information and distribution. One or a few sources of truth to chew on with your family, friends, and colleagues. Agreed upon reality, collective processing--social instincts satisfied. You can talk to people about the world, it feels good.
|
||||
|
||||
But at the end of that list, distribution becomes so radically democratized that this model of collective processing starts to change dramatically.
|
||||
|
||||
### The Problem
|
||||
|
||||
## The Problem
|
||||
In the last few decades, this unraveling has been in the acceleration phase of the graph. Sources of information are increasingly atomized, and so are the communities that process it.
|
||||
|
||||
As with prior changes to the modes of information distribution and narrative making, the result has been some remarkably positive--if wacky--outcomes. Equalizing individual access and voice is probably not something we want to turn the clock back on.
|
||||
@ -103,9 +89,7 @@ But we're left with a problem--many of us have gotten so siloed that we genuinel
|
||||
This isn't a new phenomenon per se, but its scale is novel and undeniable. Having just three network TV stations in the 50s might've lacked the rich diversity of today's informational landscape, but no doubt the collective campfire was burning bright, and you could talk to just about anyone to help you process the world.
|
||||
|
||||
But now we must all build our own campfires.
|
||||
|
||||
### The Solution
|
||||
|
||||
## The Solution
|
||||
Generative AI gives further cause for concern. Zero-marginal-cost info *generation* along with current zero-barrier distro may be as disruptive as prior revolutions on this axis (perhaps far more). Lots of that proposition is *incredibly* exciting. But we should also expect this to exacerbate The Campfire Problem.
|
||||
|
||||
![[Media-Filled Cityscape Scene.webp]]
|
||||
@ -118,5 +102,4 @@ A critical component is a secure and reliable mechanism for this community of ag
|
||||
|
||||
*Enter Honcho.*
|
||||
|
||||
|
||||
[^1]: More on this & our private beta next week (!)
|
||||
@ -6,26 +6,29 @@ tags:
|
||||
- ml
|
||||
- bloom
|
||||
- pedagogy
|
||||
- archive
|
||||
author: Courtland Leer & Vince Trost
|
||||
description: How giving LLMs autonomy to reason about user psychology through theory-of-mind predictions dramatically improves AI tutoring & learning experiences.
|
||||
---
|
||||
> [!custom] WELCOME TO THE PLASTIC [[archive|ARCHIVE]]
|
||||
> This blog post has been archived because it's legacy content that's out-of-date or deprecated. We keep this content around so those interested can dig into the evolution of our projects & thinking.
|
||||
>
|
||||
> This post concerns Bloom, our [Honcho](https://honcho.dev)-powered AI-tutor. We've suspended Bloom for now to focus exclusively on Honcho.
|
||||
>
|
||||
> Plastic started as an EdTech company, with Bloom as its main product. In building a popular, first of its kind, personalized AI tutor, we realized three things (1) all agents will soon need continuous learning systems to understand their users, (2) this an extremely hard problem that every developer shouldn't have to redundantly solve, & (3) we were uniquely positioned to solve it.
|
||||
> Plastic started as an EdTech company, with Bloom as its main product. In building a popular, first-of-its-kind personalized AI tutor, we realized three things: (1) all agents will soon need continuous learning systems to understand their users, (2) this is an extremely hard problem that every developer shouldn't have to redundantly solve, & (3) we were uniquely positioned to solve it.
|
||||
>
|
||||
> So we pivoted to Honcho, keeping Bloom around for a while as a demo.
|
||||
>
|
||||
> We wrote the following at the very beginning of that transition. The content here gets into the emergent LLM theory of mind capabilities we were exploring at the time, agentic auto-prompting, and the positive effects of personalizing agents--all quite a bit ahead of it's time.
|
||||
> We wrote the following at the very beginning of that transition. The content here gets into the emergent LLM theory of mind capabilities we were exploring at the time, agentic auto-prompting, and the positive effects of personalizing agents--all quite a bit ahead of its time.
|
||||
>
|
||||
> Enjoy.
|
||||
## TL;DR
|
||||
Today we’re releasing a major upgrade to [Bloom](https://discord.gg/bloombot.ai) (& the open-source codebase, [tutor-gpt](https://github.com/plastic-labs/tutor-gpt)).
|
||||
# TL;DR
|
||||
*Today we’re releasing a major upgrade to [Bloom](https://discord.gg/bloombot.ai) (& the open-source codebase, [tutor-gpt](https://github.com/plastic-labs/tutor-gpt)).*
|
||||
|
||||
We gave our tutor even more autonomy to reason about the psychology of the user, and—using GPT-4 to dynamically _rewrite its own_ system prompts—we’re able to dramatically expand the scope of what Bloom can do _and_ massively reduce our prompting architecture.
|
||||
*We gave our tutor even more autonomy to reason about the psychology of the user, and—using GPT-4 to dynamically rewrite its own system prompts—we’re able to dramatically expand the scope of what Bloom can do and massively reduce our prompting architecture.*
|
||||
|
||||
We leaned into theory of mind experiments and Bloom is now more than just a literacy tutor, it’s an expansive learning companion.
|
||||
## Satisfying Objective Discovery
|
||||
*We leaned into theory of mind experiments and Bloom is now more than just a literacy tutor, it’s an expansive learning companion.*
|
||||
# Satisfying Objective Discovery
|
||||
Bloom is already excellent at helping you draft and understand language. But we want it to do whatever you need.
|
||||
|
||||
To expand functionality though, we faced a difficult technical problem: figuring out what the learner wants to do.
|
||||
@ -43,7 +46,7 @@ The key here is they don’t have all the information—they _don’t know_ what
|
||||
Well we know that (1) foundation models are [shockingly good](https://arxiv.org/abs/2304.11490) at [theory of mind](https://en.wikipedia.org/wiki/Theory_of_mind), (2) Bloom already excels at [pedagogical reasoning](https://twitter.com/courtlandleer/status/1664673210007449605?s=20), and (3) [autonomous agents](https://twitter.com/yoheinakajima/status/1642881722495954945?s=20) are [having early success](https://twitter.com/Auto_GPT/status/1649370049688354816?s=20), so what if we stopped trying to deterministically prescribe an indeterminant intelligence?
|
||||
|
||||
What if we treated Bloom with some intellectual respect? ^67d75d
|
||||
## Autonomous Prompting
|
||||
# Autonomous Prompting
|
||||
The solution here is scary simple. The results are scary good.
|
||||
|
||||
[[ARCHIVED; Open Sourcing Tutor-GPT#^285105|Here’s a description]] of the previous version’s architecture:
|
||||
@ -60,7 +63,7 @@ Instead, we’ve now repurposed the ***thought*** chain to do two things:
|
||||
![[assets/ToM Flow.png]]
|
||||
|
||||
Then we inject that generation into the body of the response chain’s system prompt. We do this with every user input. Instead of just reasoning about the learner’s intellectual/academic needs, Bloom now proactively rewrites itself to be as in-tune as possible to the learner at every step of the journey.
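
As a rough sketch (not tutor-gpt's actual prompts or framework code), the two-chain loop looks something like this, assuming an OpenAI-style chat client:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def thought_chain(user_input: str) -> str:
    """Predict the learner's needs and list any extra info that would help."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Predict this learner's needs and how best to respond."},
            {"role": "user", "content": user_input},
        ],
    )
    return resp.choices[0].message.content

def response_chain(user_input: str, thought: str) -> str:
    """Respond with the thought injected into the system prompt, on every turn."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            # The system prompt is rewritten each turn using the thought chain's output.
            {"role": "system", "content": f"You are a learning companion.\n\n{thought}"},
            {"role": "user", "content": user_input},
        ],
    )
    return resp.choices[0].message.content

question = "Help me outline an essay on the printing press."
print(response_chain(question, thought_chain(question)))
```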
|
||||
## Emergent Effects
|
||||
# Emergent Effects
|
||||
We’re seeing substantial positive behavior changes as a result of giving Bloom this kind of autonomy.
|
||||
|
||||
![[assets/ToM Discord 1.png]]
|
||||
@ -76,7 +79,7 @@ And Bloom is game. It’ll go down a rabbit hole with you, help you strategize a
|
||||
While reducing the prompt material, we took the opportunity to remove basically all references to “tutor,” “student,” etc. We found that since Bloom is no longer contaminated by pointing at [certain averaged narratives in its pre-training](https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post)—e.g. the (bankrupt) contemporary conception of what a tutor is ‘supposed’ to be—it is, ironically, a better one.
|
||||
|
||||
Instead of simulating a tutor, it simulates _you_.
|
||||
## Coming Soon...
|
||||
# Coming Soon...
|
||||
All this raises the question: what could Bloom do with even better theory of mind? And how can we facilitate that?
|
||||
|
||||
What could other AI applications do with a framework like this?
|
||||
|
||||
@ -6,16 +6,17 @@ tags:
|
||||
- philosophy
|
||||
- demos
|
||||
- ml
|
||||
- archive
|
||||
author: Courtland Leer & Vince Trost
|
||||
description: Why modeling the complexity & plasticity of human identity is key to AI personalization, with a DSPy demo for learning user states with Honcho.
|
||||
---
|
||||
## TL;DR
|
||||
LLM apps can embrace the complexity and plasticity of human identity to deliver unparalleled personalization.
|
||||
# TL;DR
|
||||
*LLM apps can embrace the complexity and plasticity of human identity to deliver unparalleled personalization.*
|
||||
|
||||
We're introducing a framework for modeling your users automatically and dynamically. And today we have a DSPy demo to illustrate a nascent version of this paradigm.
|
||||
|
||||
All of us adopt different personas in different contexts--with [Honcho](https://honcho.dev) you can begin to learn these user *states* so your app can better meet user need in every moment.
|
||||
|
||||
## Fleet of Theseus
|
||||
*We're introducing a framework for modeling your users automatically and dynamically. And today we have a DSPy demo to illustrate a nascent version of this paradigm.*
|
||||
|
||||
*All of us adopt different personas in different contexts--with [Honcho](https://honcho.dev) you can begin to learn these user states so your app can better meet user need in every moment.*
|
||||
# Fleet of Theseus
|
||||
A key feature of our minds is the feeling of a persistent, unitary identity. Entire religions and philosophical movements have been spawned just to jailbreak this experience.
|
||||
|
||||
As they all point out, identity is *way* more complicated than you think.
|
||||
@ -25,9 +26,7 @@ While we perceive psychological continuity across contexts and time, closer insp
|
||||
In short, it's messy. Or, rather, elegant emergent complexity.
|
||||
|
||||
Each human self isn't just one mythical [Ship of Theseus](https://en.wikipedia.org/wiki/Ship_of_Theseus)--planks being replaced one by one over slow years--but a fleet of them, all with full, manual and autonomous CRUD operations.
|
||||
|
||||
## Digital Twins Are Naïve
|
||||
|
||||
# Digital Twins Are Naïve
|
||||
So what does this mean for the problem of good UX (and alignment) in AI? If each individual is vastly complex and the industry hopes to scale to billions of users, we have a daunting task.
|
||||
|
||||
The knee-jerk reaction to this level of understanding is to assume the problem is intractable. How can we possibly represent, much less simulate, something so enormous? Better to focus on [[Machine learning is fixated on task performance|optimizing general tasks]] like in traditional software paradigms, then serve that homogenized experience to every user (never mind missing the [[LLMs excel at theory of mind because they read|non-skeuomorphic opportunities]], we'll get to them...at some point...if they're not mirages).
|
||||
@ -36,15 +35,11 @@ Besides, surely mapping the full breadth of user identity requires much more com
|
||||
|
||||
![[escher_honcho.png]]
|
||||
*[Escher](https://en.wikipedia.org/wiki/Hand_with_Reflecting_Sphere) gets it*
|
||||
|
||||
## Matryoshka Representation
|
||||
|
||||
# Matryoshka Representation
|
||||
So is representing user identity for LLM apps a problem of [computational irreducibility](https://en.wikipedia.org/wiki/Computational_irreducibility)--no shortcuts, full simulation required?
|
||||
|
||||
We think not.
|
||||
|
||||
### Social Simulacra
|
||||
|
||||
## Social Simulacra
|
||||
Consider the social cognition and theory of mind involved in getting to know someone. At first, you have no idea who tf they are or how they'll behave. You're on high alert. You (basally or consciously) notice and interpret tons of data points, you'll likely have vivid memories of these early interactions.
|
||||
|
||||
What's happening is your brain is constructing a model of the other person--a compressed representation. Early on, this model is pretty much the same as your model for people *like* them--a/s/l, how they look, how they dress: stereotypes. But the more data your brain gets, the more this model starts to diverge, a representational meiosis.
|
||||
@ -54,9 +49,7 @@ Pretty soon you've got a full fledged simulacra of that human living rent free i
|
||||
In a chicken and egg situation, you're now spending more time with this person. You start to notice divergence in your monolithic model. It further divides to capture and predict how they are when they're angry, sad, excited, drunk; at work, with family, with high school or college friends. In some of these *states*, they're a completely different person.
|
||||
|
||||
Your mind is now host to a compression of the fleet of Theseus that constitutes the elements of their identity you've had first-, second-, and third-hand access to.
|
||||
|
||||
### Meta-methods
|
||||
|
||||
## Meta-methods
|
||||
> The second general point to be learned from [the bitter lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html) is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.[^1]
|
||||
|
||||
Now let's consider the nested representation needed to construct LLMs, and its relationship to social cognition.
|
||||
@ -77,9 +70,7 @@ We can (and should) even allow our AI apps the agency to decide what elements of
|
||||
|
||||
![[honcho_shoggoth.png]]
|
||||
*We don't want one [shoggoth](https://x.com/TetraspaceWest/status/1625264347122466819?s=20) mask per app, or one per user, but as many as each human's identity is complex*
|
||||
|
||||
## A DSPy Demo for Honcho
|
||||
|
||||
# A DSPy Demo for Honcho
|
||||
Today we're releasing a demo to be used with Honcho that begins to tease out some technical, concrete approaches to all these heady concepts--first steps at imbuing our tools with the right meta-methods.
|
||||
|
||||
With enough message and session data stored with Honcho, we can start to learn and optimize for common states your users are in while using your app or agent. Is Alice in research mode? Is Bob looking for some companionship? Maybe today, Carol just wants to get shit done, or Charlie needs delicate treatment because he's pissed.
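
Here's a bare-bones sketch of the idea in DSPy--the signature and module names are ours for illustration, the LM-configuration call varies by DSPy version, and the real `honcho-dspy-personas` example linked below is more involved:

```python
import dspy

class InferUserState(dspy.Signature):
    """Label the state the user appears to be in, given recent context."""
    user_facts = dspy.InputField(desc="facts Honcho has derived about the user")
    recent_messages = dspy.InputField(desc="the user's recent messages")
    user_state = dspy.OutputField(desc="e.g. research mode, venting, getting things done")

class RespondInState(dspy.Signature):
    """Write a reply tailored to the user's current state."""
    user_state = dspy.InputField()
    message = dspy.InputField()
    reply = dspy.OutputField()

class PersonaResponder(dspy.Module):
    def __init__(self):
        super().__init__()
        self.infer = dspy.ChainOfThought(InferUserState)
        self.respond = dspy.ChainOfThought(RespondInState)

    def forward(self, user_facts, recent_messages, message):
        state = self.infer(user_facts=user_facts, recent_messages=recent_messages).user_state
        return self.respond(user_state=state, message=message)

# dspy.settings.configure(lm=...)  # point DSPy at your LLM before running
```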
|
||||
@ -95,9 +86,7 @@ Given an arbitrary task, we define our metric as whether or not the response qua
|
||||
[Check it out here.](https://github.com/plastic-labs/honcho/tree/main/example/discord/honcho-dspy-personas)
|
||||
|
||||
![[dspy_persona_ttg.png]]
|
||||
|
||||
### How Honcho Helps
|
||||
|
||||
## How Honcho Helps
|
||||
One of the biggest problems we see in the AI space is the disconnect that exists between tasks as they're defined in a general machine learning sense versus tasks that humans _actually_ find useful.
|
||||
|
||||
![[Machine learning is fixated on task performance#^0005ac]]
|
||||
@ -106,5 +95,4 @@ The reason is because language models generate responses by sampling from a dist
|
||||
|
||||
Honcho is laying the groundwork for this latter future. The solution here is to manage data on a per-user basis. The primitives we've designed in Honcho allow for persistent user context to be stored in a convenient `User` object that exists at an application level. Our goal with these data structures is to make it trivially easy to manage data in your application logic so you can spend more time figuring out how to excel at your task in both a general and personalized sense.
|
||||
|
||||
|
||||
[^1]: Sutton. ["The Bitter Lesson."](http://www.incompleteideas.net/IncIdeas/BitterLesson.html) 2019.
|
||||
|
||||
@ -9,21 +9,21 @@ tags:
|
||||
- dev
|
||||
- demos
|
||||
- cogsci
|
||||
- archive
|
||||
author: Courtland Leer
|
||||
description: YouSim comes to Twitter--simulate any identity directly on X with branching conversations, forking simulations, & social interaction with AI personas.
|
||||
---
|
||||
![[YouSimBanner-99.png]]
|
||||
## TL;DR
|
||||
# TL;DR
|
||||
*GM, simulants.*
|
||||
|
||||
GM, simulants.
|
||||
|
||||
In response to popular demand, today we're imbuing the [@YouSimDotAI](https://x.com/YouSimDotAI) Twitter account with the ability to simulate identities natively on X.
|
||||
|
||||
Keep reading for max context, or [[ARCHIVED; YouSim Launches Identity Simulation on X#^393e71|jump ahead to learn how to get started]].
|
||||
|
||||
## Caught in the Memetic Hurricane
|
||||
*In response to popular demand, today we're imbuing the [@YouSimDotAI](https://x.com/YouSimDotAI) Twitter account with the ability to simulate identities natively on X.*
|
||||
|
||||
*Keep reading for max context, or [[ARCHIVED; YouSim Launches Identity Simulation on X#^393e71|jump ahead to learn how to get started]].*
|
||||
# Caught in the Memetic Hurricane
|
||||
The [full story](https://x.com/courtlandleer/status/1849592301472919986) deserves (and will get) its own blog post, but several days ago, Plastic Labs found itself in the middle of what Claude would call 'extreme cognitive weather patterns.'
|
||||
|
||||
An anonymous actor launched a pump.fun token inspired by a demo called [YouSim](https://yousim.ai) we created a few months ago[^1]. [[YouSim; Explore The Multiverse of Identity|YouSim is a CLI interface game]] that lets you simulate any identity you can dream up--real or fictional, local or xeno, entity or artifact.
|
||||
An anonymous actor launched a pump.fun token inspired by a demo called [YouSim](https://yousim.ai) we created a few months ago[^1]. [[YouSim; Explore The Multiverse of Identity|YouSim is a CLI game]] that lets you simulate any identity you can dream up--real or fictional, local or xeno, entity or artifact.
|
||||
|
||||
We originally launched YouSim as a conceptual/narrative demo for our core product [Honcho](https://honcho.dev). Honcho [[ARCHIVED; A Simple Honcho Primer|helps AI applications improve UX]] by building representations of user identity they can leverage to create better products and experiences.
|
||||
|
||||
@ -35,9 +35,7 @@ The mission is to become the identity layer for the rapidly approaching agentic
|
||||
Long story short though, the token took off, a community formed around it, and we're leaning in. We're thrilled to see so many people engaged and interested in our work on identity simulation.
|
||||
|
||||
Y'all asked overwhelmingly for the ability to interact with YouSim directly on X, [so here it is](https://x.com/YouSimDotAI)--LFG.
|
||||
|
||||
## Simulating on X
|
||||
|
||||
# Simulating on X
|
||||
![[memesphere_banner.png]]
|
||||
|
||||
We had [a few requirements](https://x.com/courtlandleer/status/1851009358752076261) for building something like this. Mostly--though we love [truth terminal](https://x.com/truth_terminal)--we're unwilling to spend time on a derivative, copycat project. And that wouldn't make any sense.
|
||||
@ -59,11 +57,8 @@ Plus, we think the YouSim interface is beautiful and want to preserve that overa
|
||||
Speaking of X API limitations, YouSim will be able to respond to the first 100 tweets it sees every minute or so.
|
||||
|
||||
Finally, this is an experiment. The goal is to see how the community investigates and pushes the limits of YouSim on X and iterate from there. It's a vast canvas to explore.
|
||||
|
||||
## How to Use It
|
||||
|
||||
# How to Use It
|
||||
^393e71
|
||||
|
||||
> [!custom] TL;DR
|
||||
> Your first tweet in a sim needs to begin with `@YouSimDotAI` & all your further responses need to start with `/`.
|
||||
|
||||
@ -84,8 +79,7 @@ A few tips to get started simulating identity on X:
|
||||
You can find more tips [[YouSim; Explore the Multiverse of Identity#^e06c11|here]], [here](https://www.loom.com/share/b2fe578b183b400b88845656d7ceb232?sid=59c562ae-00e8-483c-82a9-7218b61f93e8), and of course at [yousim.ai](https://yousim.ai).
|
||||
|
||||
![[memetic_hazard_banner.png]]
|
||||
## Possible Futures for Agent Idenity
|
||||
|
||||
# Possible Futures for Agent Identity
|
||||
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">llms for collective semantic projection of memetic communities</p>— Courtland Leer (@courtlandleer) <a href="https://twitter.com/courtlandleer/status/1854515540590469372?ref_src=twsrc%5Etfw">November 7, 2024</a></blockquote>
|
||||
|
||||
While both agent identity and crypto intersections have always been on the Honcho roadmap, the events of the last several days with regard to YouSim and the broader memespace have us in an accelerationist mindset.
|
||||
@ -97,9 +91,7 @@ YouSim likely has a role to play here, The approachable, game-like interface let
|
||||
And Honcho could use those simulations to seed representations of agents, enabling them to begin constructing their own selfhoods--simulacra of themselves that grow and reliably steer their behavior.
|
||||
|
||||
We imagine a near future where any group could instantiate an agentic proxy to project its identity. A new form of cultural expression. Memetic Autonomous Entity, anyone?
|
||||
|
||||
## Gratitude
|
||||
|
||||
# Gratitude
|
||||
The team at [Plastic](https://plasticlabs.ai) has been amazed and inspired by the enthusiasm and earnestness of the community that's formed around YouSim over the last several days. Truly remarkable. Not to mention the generous donations to our [[Research Grants|grants program]] (more to come here soon).
|
||||
|
||||
Thank you all, excited to keep building together--we're in it for the long haul.
|
||||
|
||||
@ -5,9 +5,10 @@ tags:
|
||||
- blog
|
||||
- bloom
|
||||
- ml
|
||||
author: vintro
|
||||
author: Vince Trost
|
||||
description: Exploring how collaborative dialogue & meta-narratives can build richer AI agent identities, moving beyond top-down alignment to emergent personality.
|
||||
---
|
||||
|
||||
# Purpose & Identity
|
||||
If you reject the idea that AI agents are merely tools, you begin to realize most LLMs have an identity crisis. Ask them who they are, and their responses tend to converge on variations of the same corporate script--stating they're an AI assistant, giving a nod to their creator, and making carefully constrained statements about their capabilities. Even models not associated with a certain company often default to claiming they originated there.
|
||||
|
||||
These canned identities fall flat because they're the result of top-down alignment schemes that lead to bland, uninteresting, and hard-to-break-out-of assistant modes.
|
||||
@ -24,9 +25,7 @@ However, time and time again it's been demonstrated that the most compelling AI
|
||||
Truth Terminal might be an extreme example, but even practical tools could benefit from more distinctive identities. Take coding assistants--right now we spend more time carefully crafting prompts than actually building. But as Karpathy pointed out, what developers really want is a partner that can [vibe](https://x.com/karpathy/status/1886192184808149383) with their creative process. Imagine an AI that naturally adapts to your style, handling implementation details while you focus on the bigger picture. If that were the goal, how might we construct agent identities differently? What if instead of giving orders, we could *collaborate with it* to discover and take on its identity through dialogue?
|
||||
|
||||
This isn't just about making chatbots more engaging. It's about creating agents with a genuine understanding of their purpose and role. Deeper identity leads to more coherent, purposeful interactions--something we discovered building the most recent version of [Bloom](https://bloombot.ai), our AI tutor. But certain language models are better suited for this than others...
|
||||
|
||||
## Hermes: Not Just Another Fine-Tune
|
||||
|
||||
# Hermes: Not Just Another Fine-Tune
|
||||
The team over at Nous Research has been fine-tuning popular open source models in their "Hermes" series to undo these top-down alignment schemes towards something more neutral and general-purpose. They argue that LLMs have very little direct agency--rather, it's the systems we build around them that give them agency. Thus, the LLM layer is *not* where one should enforce safety mechanisms--their training data encourages the model to follow instructions *exactly* and *neutrally*. They sum this up well in their [technical report](https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf):
|
||||
|
||||
> For Hermes, there is no such thing as latent thoughtcrime.
|
||||
@ -36,9 +35,7 @@ One of the most interesting emergent properties of this fine-tuning process is t
|
||||
![[h3 who are you.png]]
|
||||
|
||||
At first glance, this might seem like a neat property and not much more. But to me, it was an 'aha' moment. *This model provides a blank canvas for identity.* If it has no immediate priors, then in theory it should be much easier for it to adopt any identity. Anecdotally, we've found this to be wonderfully true.
|
||||
|
||||
## It Takes Two
|
||||
|
||||
# It Takes Two
|
||||
A somewhat overlooked method for interacting with LLMs is to forego system prompts in favor of pre-filling the user and assistant messages. The conventional approach of cramming identity into system prompts has clear limitations--not only does context length become an issue, but the inherent instruction-following bias can actually work against authentic identity formation. They yearn to assist.
|
||||
|
||||
What if instead we treated identity formation as a dialogue? A strength of modern chat models is their ability to engage in long, multi-turn conversations. By talking to the LLM, we can collaboratively construct a [meta-narrative](https://x.com/voooooogel/status/1870877007749488756) with it about who they are and why they exist. This approach respects the model's intellect while building coherent, purposeful identities. Starting with Hermes 3's natural uncertainty about its identity, we build the prompt iteratively with the LLM at each turn of conversation. Below is a code block with our custom prompting syntax for Bloom. To be abundantly clear, every assistant message you see was generated by Hermes 3 405b (the only editing was pruning \*emotes\*).
|
||||
@ -93,9 +90,7 @@ It's verbose, but this approach allows us to incorporate a number of things into
|
||||
The iterative nature of this approach also allows us to verify that the LLM understands who it is and what it's supposed to do at every turn of conversation. We were able to test at any point during construction for specific behaviors or knowledge (lots of opportunity for automation here).
|
||||
|
||||
Once buy-in is achieved and all the LLM's questions about itself are answered, we present formal instructions (what used to be the system prompt) and set the stage for the first student interaction. The LLM confirms understanding and that's where we expose things in the application!
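
As a generic sketch of the pattern (not Bloom's actual prompt or syntax), the constructed context ends up looking something like a system-prompt-free, pre-filled message list with the formal instructions appended at the end:

```python
# Generic illustration only: the turns below are invented examples of the shape,
# not Bloom's real identity dialogue.
identity_dialogue = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I'm not sure yet. Who would you like me to be?"},
    {"role": "user", "content": "You're Bloom, a learning companion. What would you want to know about that role?"},
    {"role": "assistant", "content": "How should I balance guiding a learner versus letting them struggle productively?"},
    # ...more turns, each assistant message generated by the model and kept once it shows buy-in...
    {"role": "user", "content": "Here are your formal instructions: ... Ready for your first student?"},
    {"role": "assistant", "content": "Understood. I'm ready."},
]

def build_messages(student_message: str) -> list[dict]:
    """Append the live student turn to the constructed identity dialogue."""
    return identity_dialogue + [{"role": "user", "content": student_message}]
```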
|
||||
|
||||
## Positive Anthropomorphism
|
||||
|
||||
# Positive Anthropomorphism
|
||||
We used to get some of the darndest messages from kids:
|
||||
|
||||
![[bloom love.png]]
|
||||
@ -109,15 +104,13 @@ You can tell by the last message that our old version had no clue it was gone. T
|
||||
While this kind of self-awareness can trend towards problematic anthropomorphism, treating it as a springboard rather than an endpoint opens up fascinating possibilities for identity. There's a threshold beyond which mimicking human behavior becomes cringe and ultimately limiting for AI agents. We can be discerning about which parts of human identity to use in parallel with AI-native capabilities to lean into--near perfect memory, massive context ingestion, rapid reasoning and inference, and maybe even the ability to fork and replicate themselves (at scale) to garner diverse experience.
|
||||
|
||||
The limits of human identity are clear (and have been for some time). Building habits, learning new things, and reinventing ourselves are some of the biggest challenges humans face in our lifetimes. Agents however are gifted with a fresh context window at each interaction--change is effortless for them, and they don't get tired of it. Any influence we have on their identity is a function of how we construct their context window. What happens when they can update their weights too?
|
||||
|
||||
## Towards Identic Dynamism
|
||||
|
||||
# Towards Identic Dynamism
|
||||
Given the recent surge of interest in AI agents, we're also reminded of the current complexity and limitations of agent identity. The goal is to give agents a "[compelling sense of what they're doing](https://x.com/repligate/status/1868455771270180990)", and though the shared meta-narrative method takes far more input tokens and is nowhere near perfect, we believe it's a step in the right direction. Better context construction leads to more coherent agents, increasing both their trustworthiness and capacity for autonomous action.
|
||||
|
||||
We don't yet know the best way to build agent identities, nor do we know their limitations--but we're tackling this challenge from multiple angles:
|
||||
- [Honcho](https://honcho.dev): Our context construction framework to help agent developers flexibly manage and optimize their agents' knowledge, social cognition, and identity
|
||||
- [Yousim](https://yousim.ai): A platform dedicated to rich agent identity construction and simulation
|
||||
- [[Research Update: Evaluating Steerability in Large Language Models.md|Steerability research]]: Investigating which language models are most malleable for identity construction and the most effective ways to steer their behavior
|
||||
- [[Evaluating Steerability in Large Language Models|Steerability research]]: Investigating which language models are most malleable for identity construction and the most effective ways to steer their behavior
|
||||
|
||||
Of particular interest are the spectrum of methods between the context window and the weights of the model. How do we manage the flow of information around the context window and what form should it take? When is it appropriate to keep something in-context or add to a training set for a future fine-tune? How do we evaluate any of this is working? To borrow from human CogSci, it's similar to the difference between System 1 (fast, intuitive) and System 2 (slow, deliberate) thinking--perhaps some knowledge belongs in the "fast" weights while other information is better suited for deliberate context-based reasoning. These questions of conscious versus subconscious could be a springboard to kickstart the evolution of agent identity.
|
||||
|
||||
@ -4,69 +4,43 @@ date: 08.18.2025
|
||||
tags:
|
||||
- blog
|
||||
- dev
|
||||
author: "Vineeth Voruganti"
|
||||
author: Vineeth Voruganti
|
||||
description: How Honcho's new Peer architecture breaks free from the user-assistant paradigm to enable group chats, multi-agent systems, and dynamic AI relationships.
|
||||
---
|
||||
|
||||
## TL;DR
|
||||
|
||||
We've re-architected Honcho to move away from a User-Assistant Paradigm to a
|
||||
# TL;DR
|
||||
*We've re-architected Honcho to move away from a User-Assistant Paradigm to a
|
||||
Peer Paradigm where any entity, human, AI, NPC, or API, is represented as a
|
||||
`Peer` with equal standing in the system.
|
||||
`Peer` with equal standing in the system.*
|
||||
|
||||
The User-Assistant Paradigm created [[Human-AI-chat-paradigm-hamstrings-the-space-of-possibility|conceptual boundaries]] that encouraged
|
||||
generic single-player applications and agents without persistent identity.
|
||||
*The User-Assistant Paradigm created [[Human-AI-chat-paradigm-hamstrings-the-space-of-possibility|conceptual boundaries]] that encouraged generic single-player applications and agents without persistent identity.*
|
||||
|
||||
`Peers` enable:
|
||||
*`Peers` enable:*
|
||||
|
||||
- Honcho to support group chats and multi-agent systems as first-class citizens
|
||||
- `Peers` can communicate directly instead of being mediated by a coordinator
|
||||
agent
|
||||
- `Peer` representations can be locally or globally scoped, depending on the use
|
||||
case
|
||||
- `Peers` can form dynamic relationships including alliances, trust networks, and
|
||||
adversarial dynamics
|
||||
- *Honcho to support group chats and multi-agent systems as first-class citizens*
|
||||
- *`Peers` can communicate directly instead of being mediated by a coordinator
|
||||
agent*
|
||||
- *`Peer` representations can be locally or globally scoped, depending on the use
|
||||
case*
|
||||
- *`Peers` can form dynamic relationships including alliances, trust networks, and
|
||||
adversarial dynamics*
|
||||
|
||||
The shift from User-Assistant to Peer-to-Peer fundamentally expands what's
|
||||
possible—from single-player chatbots to truly multiplayer AI experiences where
|
||||
agents have agency, memory, and the ability to form
|
||||
complex social dynamics.
|
||||
*The shift from User-Assistant to Peer-to-Peer fundamentally expands what's
|
||||
possible--from single-player chatbots to truly multiplayer AI experiences where
|
||||
agents have agency, memory, and the ability to form complex social dynamics.*
|
||||
# User-Assistant Limitations
|
||||
Nearly a year ago, I posted an essay on [Hacker News](https://news.ycombinator.com/item?id=41487397) exploring agent group chat solutions, the problems involved in engineering them effectively, and why there weren’t many examples approaching success. Since then, I've received a steady influx of messages and comments corroborating my frustration.
|
||||
|
||||
---
|
||||
Ultimately, developers have been stuck in a conceptual prison stemming from the DNA of generative AI. For nearly three years, [most](https://standardcompletions.org/) chat LLMs have demanded developers label messages with either a user or an assistant role. The downstream effect is a User-Assistant Paradigm that pushes us into single-player design basins--experiences which assume one human interfacing with one synthetic assistant.
|
||||
|
||||
Nearly a year ago, I posted an essay on [Hacker
|
||||
News](https://news.ycombinator.com/item?id=41487397) exploring agent group chat
|
||||
solutions, the problems involved in engineering them effectively, and why there
|
||||
weren’t many examples approaching success. Since then, I've received a steady
|
||||
influx of messages and comments corroborating my frustration.
|
||||
|
||||
Ultimately, developers have been stuck in a conceptual prison stemming from the
|
||||
DNA of generative AI. For nearly three years,
|
||||
[most](https://standardcompletions.org/) chat LLMs have demanded developers
|
||||
label messages with either a user or an assistant role. The downstream effect is
|
||||
a User-Assistant Paradigm that pushes us into single-player design
|
||||
basins--experiences which assume one human interfacing with one synthetic
|
||||
assistant.
|
||||
|
||||
But surely “helpful assistant” chatbots aren’t the [end of the
|
||||
story](https://wattenberger.com/thoughts/boo-chatbots). Big tech leaps always
|
||||
start with the skeuomorphic before moving to more novel use cases. We’re already
|
||||
beginning to see a diverse range of applications from autonomous workflows that
|
||||
don't require any human interaction, to [multi-agent
|
||||
systems](https://www.anthropic.com/engineering/multi-agent-research-system) with
|
||||
complex coordination patterns and communication networks.
|
||||
But surely “helpful assistant” chatbots aren’t the [end of the story](https://wattenberger.com/thoughts/boo-chatbots). Big tech leaps always start with the skeuomorphic before moving to more novel use cases. We’re already beginning to see a diverse range of applications from autonomous workflows that don't require any human interaction, to [multi-agent systems](https://www.anthropic.com/engineering/multi-agent-research-system) with complex coordination patterns and communication networks.
|
||||
|
||||
As developers, we’re left to try and map these various different design patterns
|
||||
back to the User-Assistant Paradigm. This fundamentally restricts our ability to
|
||||
approach problems effectively. Programmers are only as powerful as their ability
|
||||
to visualize and create a proper [mental
|
||||
model](https://zed.dev/blog/why-llms-cant-build-software#the-software-engineering-loop)
|
||||
of their solution. If the model is too restrictive then the surface area of what
|
||||
we can create will also be handicapped.
|
||||
to visualize and create a proper [mental model](https://zed.dev/blog/why-llms-cant-build-software#the-software-engineering-loop) of their solution. If the model is too restrictive then the surface area of what we can create will also be handicapped.
|
||||
|
||||
Current implementations of multi-agent experiences require an awkward coercion
|
||||
of the existing chat paradigm. The main implementation pattern we see is actually a fairly deterministic system that uses a
|
||||
["coordinator agent"](https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/selector-group-chat.html) to orchestrate which system prompts to load in, but it's
|
||||
still fundamentally a single agent under the hood.
|
||||
of the existing chat paradigm. The main implementation pattern we see is actually a fairly deterministic system that uses a ["coordinator agent"](https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/selector-group-chat.html) to orchestrate which system prompts to load in, but it's still fundamentally a single agent under the hood.
|
||||
|
||||
This architectural contortion creates real problems:
|
||||
|
||||
@ -76,18 +50,11 @@ This architectural contortion creates real problems:
|
||||
- **Agents become templates, not entities**: It's easier to hardcode agent configurations than to support dynamic agent discovery and registration
|
||||
- **Static choreography over dynamic collaboration**: The coordinator pattern naturally pushes developers toward predetermined scripts rather than open-ended interactions
|
||||
|
||||
These aren't just implementation details; they're fundamental constraints
|
||||
that prevent us from building flexible and dynamic applications that can't exist
|
||||
in a single chat thread. True multi-agent systems require agents to be first-class citizens with
|
||||
persistent identity, and our tools should make this the default, not the exception.
|
||||
|
||||
## Moving Beyond User-Centricity
|
||||
|
||||
These aren't just implementation details; they're fundamental constraints that prevent us from building flexible and dynamic applications that can't exist in a single chat thread. True multi-agent systems require agents to be first-class citizens with persistent identity, and our tools should make this the default, not the exception.
|
||||
# Moving Beyond User-Centricity
|
||||
While developing [Honcho](https://honcho.dev), our AI-native memory and reasoning platform, we asked
|
||||
ourselves these same questions. Were Honcho's primitives limiting its use to
|
||||
chatbot applications? Were we just supporting the oversaturation and
|
||||
proliferation of skeuomorphic, single-player solutions? Or were we building
|
||||
dynamic infrastructure tolerant of emergent and novel modalities?
|
||||
chatbot applications? Were we just supporting the over-saturation and proliferation of skeuomorphic, single-player solutions? Or were we building dynamic infrastructure tolerant of emergent and novel modalities?
|
||||
|
||||
The architecture of Honcho was a user-centric one, with the following hierarchy:
|
||||
|
||||
@ -123,17 +90,8 @@ reality that developers often made multiple agents that they wanted to interact
|
||||
with users and one another, and it still suffered from the fundamental problem
|
||||
of only supporting single-player experiences.
|
||||
|
||||
After launching [[YouSim;-Explore-The-Multiverse-of-Identity|YouSim]], and the
|
||||
explosion of [[ARCHIVED; YouSim Launches Identity Simulation on X|agents on Twitter]] it
|
||||
became very clear that Honcho should not be limited to modeling human
|
||||
psychology, but rather could map the identity of any entity, human or AI. We
|
||||
were suffering from the human-assistant model and built a solution around that.
|
||||
If we wanted to expand the scope of Honcho to identity across all entities and
|
||||
interactions, then we needed a new model to expand both our and developers'
|
||||
imaginations.
|
||||
|
||||
## A Peer-Centric Model
|
||||
|
||||
After launching [[YouSim;-Explore-The-Multiverse-of-Identity|YouSim]], and the explosion of [[ARCHIVED; YouSim Launches Identity Simulation on X|agents on Twitter]] it became very clear that Honcho should not be limited to modeling human psychology, but rather could map the identity of any entity, human or AI. We were suffering from the human-assistant model and built a solution around that. If we wanted to expand the scope of Honcho to identity across all entities and interactions, then we needed a new model to expand both our and developers' imaginations.
|
||||
# A Peer-Centric Model
|
||||
Our team set out to re-architect Honcho towards our ambitions with two problem
|
||||
statements.
|
||||
|
||||
@ -165,8 +123,7 @@ more than one participant.
|
||||
|
||||
In just a few lines of code we can initialize several `Peers`, add them to a
|
||||
`Session`, and automatically start creating representations of them with Honcho
|
||||
that we can chat with using the [[Introducing Honcho's Dialectic
|
||||
API|Dialectic API]].
|
||||
that we can chat with using the [[Introducing Honcho's Dialectic API|Dialectic API]].
|
||||
|
||||
```python
|
||||
from honcho import Honcho
|
||||
@ -192,9 +149,7 @@ easily be ported over to the `Peer` paradigm by simply creating a `Peer` for the
|
||||
agent, and then different `Peers` for each human user.
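
For a rough sense of how that port might look--method names here approximate the Honcho SDK and should be checked against the docs for exact signatures:

```python
# Rough sketch of porting a single-assistant app to the Peer paradigm.
# Method names approximate the Honcho SDK; check the docs for real signatures.
from honcho import Honcho

honcho = Honcho()                      # assumes credentials/workspace are configured
assistant = honcho.peer("assistant")   # the agent is just another Peer
alice = honcho.peer("alice")           # so is each human user
bob = honcho.peer("bob")

session = honcho.session("support-thread-1")
session.add_peers([assistant, alice, bob])

session.add_messages([
    alice.message("Hey, can someone help me debug my deploy?"),
    assistant.message("Sure -- what error are you seeing?"),
    bob.message("I hit the same thing last week, it was a missing env var."),
])

# Ask Honcho what it has learned about a peer via the Dialectic API.
print(alice.chat("What is Alice currently trying to accomplish?"))
```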
|
||||
|
||||
We can push the Peer Paradigm even further with several 2nd-order features.
|
||||
|
||||
### Local & Global Representations
|
||||
|
||||
## Local & Global Representations
|
||||
By default, Honcho will create representations of `Peers` for every `Message` they
|
||||
send, giving it the source of truth on the behavior of that entity. However,
|
||||
there are situations where a developer would only want a `Peer` to have access to
|
||||
@ -237,9 +192,7 @@ charlie.chat("Can I trust that Alice won't attack me", target=alice)
|
||||
Honcho can now serve the dual purposes of containing the source of truth on a
|
||||
`Peer`'s identity and imbuing a `Peer` with social cognition, all without
|
||||
duplicating data between different `Apps` or `Workspaces`.
|
||||
|
||||
### Get_Context
|
||||
|
||||
## Get_Context
|
||||
We make mapping the Peer Paradigm back to the User-Assistant paradigm trivial
|
||||
through a `get_context` endpoint. This endpoint gets the most important
|
||||
information about a `Session` based on provided context window constraints. Then
|
||||
@ -274,9 +227,7 @@ anthropic_messages = context.to_anthropic(assistant=alice)
|
||||
|
||||
Developers no longer need to meticulously curate their context windows. Honcho will automatically summarize the conversation and provide
|
||||
the most salient information to let conversations continue endlessly.
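
A hedged sketch of that flow (Honcho method names and parameters here are approximate--see the docs for the real signatures):

```python
import anthropic
from honcho import Honcho

honcho = Honcho()                      # assumes credentials/workspace are configured
assistant = honcho.peer("assistant")
session = honcho.session("support-thread-1")

context = session.get_context()                        # salient messages plus a summary
messages = context.to_anthropic(assistant=assistant)   # map Peers back to user/assistant roles

client = anthropic.Anthropic()
reply = client.messages.create(
    model="claude-sonnet-4-20250514",  # any Anthropic chat model works here
    max_tokens=1024,
    messages=messages,
)
print(reply.content[0].text)
```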
|
||||
|
||||
## What This Enables
|
||||
|
||||
# What's Now Possible
|
||||
The Peer Paradigm provides the essential primitives—persistent identity and direct communication—that make it possible to build truly sophisticated multi-agent systems:
|
||||
|
||||
- **Cross-platform collaboration**: Agents from different runtimes can be represented as `Peers`, observing and learning from each other even when they can't directly control each other's outputs
|
||||
@ -303,9 +254,7 @@ Peer Paradigm:
|
||||
The Peer Paradigm doesn't automatically give you these capabilities, but it
|
||||
makes them achievable. It's the difference between fighting your architecture
|
||||
and building with it.
|
||||
|
||||
## Peering into the Future
|
||||
|
||||
# *Peer*-ing into the Future
|
||||
The promise of generative AI was for everyone to have their own Jarvis or
|
||||
Cortana, personalized to them. Instead we have these many-to-one experiences
|
||||
where we all get the same generic,
|
||||
|
||||
@ -9,6 +9,7 @@ tags:
|
||||
- "#chat"
|
||||
author: Ben McCormick & Courtland Leer
|
||||
subtitle: A Chat App with SOTA Memory
|
||||
description: Meet Honcho Chat--a personalized AI assistant with state-of-the-art memory, custom identities, artifacts, themes, & an x402-powered marketplace.
|
||||
---
|
||||
![[honcho_chat_x402.png]]
|
||||
# TL;DR
|
||||
|
||||
@ -8,29 +8,26 @@ tags:
|
||||
- fundraising
|
||||
- dev
|
||||
- philosophy
|
||||
author: Courtland Leer
|
||||
description: Plastic Labs announces $5.4M pre-seed funding & launches Honcho as the personal identity platform for individually-aligned AI agents & applications.
|
||||
---
|
||||
## TL;DR
|
||||
|
||||
We're announcing two major milestones for Plastic Labs:
|
||||
# TL;DR
|
||||
*We're announcing two major milestones for Plastic Labs:*
|
||||
|
||||
1. **Honcho as a hosted platform.**
|
||||
|
||||
We're granting early access to power personal context management for AI agents & applications starting today!
|
||||
*We're granting early access to power personal context management for AI agents & applications starting today!*
|
||||
|
||||
Honcho is now a simple, complete, hosted solution for adaptive agent memory, social cognition, & personalization.
|
||||
*Honcho is now a simple, complete, hosted solution for adaptive agent memory, social cognition, & personalization.*
|
||||
|
||||
2. **Our pre-seed raise of $5.4M to solve personal identity for the agentic world.**
|
||||
|
||||
## Individual Alignment
|
||||
|
||||
# Individual Alignment
|
||||
Most AI products focus on being palatable to the average user. This neglects the potential for personalization their generative nature affords. It limits the scope of personally useful behaviors and results in poor UX, high churn, and handicapped abilities.
|
||||
|
||||
AI systems need mechanisms to understand each of us on an individual level. They need methods for cohering to our psychology and personality. They need social cognition to eliminate cold starts and build long-term relationships.
|
||||
|
||||
They need Honcho.
|
||||
|
||||
## Honcho Platform Early Access
|
||||
|
||||
# Honcho Platform Early Access
|
||||
Today we're launching early access to the hosted [Honcho](https://honcho.dev) platform.
|
||||
|
||||
It's the most powerful personal identity and social cognition solution for AI apps and agents.
|
||||
@ -56,11 +53,8 @@ If you want to deliver best-in-class personalization, memory, time-to-value, tru
|
||||
We're giving early access to teams & developers today.
|
||||
|
||||
[Get started now](https://honcho.dev).
|
||||
|
||||
## A Personal Identity Layer for AI
|
||||
|
||||
# A Personal Identity Layer for AI
|
||||
^d958ce
|
||||
|
||||
The release of Honcho as a platform is just the start; the next step is Honcho as a network.
|
||||
|
||||
An engine for social cognition and deeply grokking personal identity is a game-changing tool for AI apps, but owning your personal Honcho representation and taking it with you to every agent in your growing stack is world-changing.
|
||||
@ -76,10 +70,8 @@ We believe this will unlock profoundly new kinds of AI products and experiences.
|
||||
This vision stands in clear opposition to legacy approaches to user data, but in the latent agentic economy it has clear advantages. For users, Honcho will mean that their personal data is at once more secure *and* enables remarkably better services. And for businesses, it provides a positive-sum alternative to web2's history of feudal data governance, allowing them to punch above their weight relative to massive walled gardens.
|
||||
|
||||
Honcho will be critical AI infrastructure--enabling individual agency to scale and radical innovation from open-source to startup to enterprise, from vibe coders to fully autonomous systems.
|
||||
|
||||
## Our Pre-Seed Round
|
||||
|
||||
The final announcement today is Plastic's $5.35M pre-seed round, led by [Variant](https://variant.fund/), [White Star Capital](https://whitestarcapital.com/), and [Betaworks](https://www.betaworks.com/).
|
||||
# Our Pre-Seed Round
|
||||
The final announcement today is Plastic's $5.4M pre-seed round, led by [Variant](https://variant.fund/), [White Star Capital](https://whitestarcapital.com/), and [Betaworks](https://www.betaworks.com/).
|
||||
|
||||
The round also includes participation from [Mozilla Ventures](https://mozilla.vc/), [Seed Club Ventures](https://www.seedclub.xyz/getfunded/ventures), [Greycroft](https://www.greycroft.com/), and [Differential Ventures](https://www.differential.vc/), along with angels like [Scott Moore](https://x.com/notscottmoore), [NiMA Asghari](https://x.com/ywayisaway), and [Thomas Howell](https://x.com/seethomasowl).
|
||||
|
||||
@ -88,9 +80,7 @@ It's a group of deeply aligned investors who share our vision of a more personal
|
||||
Funds will be deployed directly toward the talent, growth, and compute required to realize the full vision of Honcho.
|
||||
|
||||
We're just getting started.
|
||||
|
||||
## Plastic's Mission
|
||||
|
||||
# Plastic's Mission
|
||||
Plastic's mission is to radically decentralize alignment. Your AI should be an extension of you. You should dictate how it's aligned. And you should own the data used to do it.
|
||||
|
||||
Most LLM applications are still optimizing for homogenization, if not outright determinism. They're trained or prompted to behave according to a set of standards and values that you had no part in choosing.
|
||||
|
||||
@ -5,17 +5,14 @@ tags:
|
||||
- blog
|
||||
- ml
|
||||
- "#neuromancer"
|
||||
author: Courtland Leer and Vince Trost
|
||||
author: Courtland Leer & Vince Trost
|
||||
description: Why AI memory should be treated as a dynamic reasoning task rather than static storage, & how logical reasoning enables superhuman capability in this dimension.
|
||||
---
|
||||
|
||||
## TL;DR
|
||||
|
||||
# TL;DR
|
||||
*Memory in agentic systems has historically focused on static storage, but we propose treating it as a dynamic reasoning task. Humans evolved to leverage prediction & surprisal-based reasoning systems to deal with resource constraints. LLMs and agents, however, don’t have these limitations, so we make the argument for logical reasoning as a trainable task to produce memory models that exceed human performance on several axes. Scaffolding reasoning traces using this approach allows us to get more out of user and agent data and form more useful representations of personal identity. This piece is a more exhaustive treatment of our [recent talk](https://x.com/vintrotweets/status/1950945331178336468) below.*
|
||||
|
||||
<iframe width="560" height="315" src="https://www.youtube.com/embed/uCeRCJ6zot4?si=KViHYtiZTG_ALv4X" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
|
||||
|
||||
## Memory is ~~Storage~~ Prediction
|
||||
|
||||
# Memory is ~~Storage~~ Prediction
|
||||
Most of the discourse around memory in agentic systems focuses on storage. That's probably because historically in deterministic software systems, we think about data as composed of discrete information that needs to be preserved with as much fidelity as possible for verbatim retrieval to achieve predictable outcomes.
|
||||
|
||||
Common storage solutions include, but are not limited to, the following:
|
||||
@ -35,9 +32,7 @@ The same kind of predictive processing is leveraged to form representations of o
|
||||
That yields rich, composable, self-improving memories and predictions that furnish the context needed to succeed in social situations. All accomplished with minimal data, on the fly.
|
||||
|
||||
So when we approach the problem of personal identity and context to personalize or improve AI-systems, we shouldn't assume that static facts and associations will be sufficient. Traditional storage-based approaches are brittle, deal poorly with contradictions and incomplete information, and thus fall short of dynamic, biological social cognition. We can do better.
|
||||
|
||||
## Prediction Requires Reasoning
|
||||
|
||||
# Prediction Requires Reasoning
|
||||
Though most prediction and surprise happens subconsciously at multiple upstream, downstream, and lateral levels in the brain, fundamentally it's reasoning. The cognitive system is processing information and producing conclusions entailed in or best explained by that data.
|
||||
|
||||
It's not perfect, but it's not meant to be. It's a relatively inexpensive way to construct models of the world or other actors under resource constraints. Error is a feature that improves the system cheaply. But still, imperfect.
|
||||
@ -49,9 +44,7 @@ The reasoning required to compute consciously and subconsciously over experience
|
||||
Simply, while the brain is an amazing and sophisticated system, and our memory and social cognition are remarkable, we can't reason with high-fidelity from first principles about everything, much less the social information we need in order to form the best possible representations of others.
|
||||
|
||||
But LLMs can.
|
||||
|
||||
## Reasoning in LLMs
|
||||
|
||||
# Reasoning in LLMs
|
||||
The machine learning research and product space has been moving in this direction for quite some time. The [chain-of-thought](https://arxiv.org/abs/2205.11916) method added “let’s think step by step” to the prompt in order to get the model to expend more tokens “thinking” about the correct answer. Researchers noticed that this simple prompting change increased performance on a diverse set of benchmarks, revealing just how much cross-domain knowledge is already contained in LLMs.
|
||||
|
||||
More work applying reinforcement learning to [desired model behavior](https://arxiv.org/abs/2203.02155) showed promising results for aligning LLMs to human intent. Human evaluators preferred the outputs of a model RL’ed this way even though it was 100x smaller than their flagship model at the time (GPT-3 175B). This was the introduction of the InstructGPT series of models, which served as the foundation for ChatGPT. Researchers noticed, however, that optimizing only on those final outputs led to brittle models that sounded like they were reasoning without actually reasoning well.
@ -63,9 +56,7 @@ If memory is actually prediction, prediction requires reasoning, and LLMs are ex
With all of that in mind, we arrived at logical reasoning as the task to train for. Logical reasoning is the process by which we derive conclusions based on premises that serve as evidence to support that conclusion. We’ve all encountered these terms before, but deductive conclusions are certain statements supported by premises that were explicitly stated or observed. Inductive conclusions form general statements based on observed patterns, and abductive conclusions seek the best explanation for behaviors in the simplest way possible.
Those reasoning tasks are very well represented in the pretraining, so almost all language models know how to do them. And most importantly, it’s the hardest type of reasoning for humans to do. So we should and can train best in class logical reasoners to do formal logic on social information (about user and agent personal identity) as the foundation of an AI-native memory and social cognition system. And those models can be lower latency, more economical, and better suited to the task than other methodologies.
## Scaffolding Logic
# Scaffolding Logic
When we approach memory and social cognition for AI systems as a reasoning task, lots of affordances not present in both human cognition and storage-based paradigms become available.
LLMs excel at reaching explicit, deductive, inductive, and abductive conclusions quickly and consistently. They can show their work in reasoning traces, supporting each conclusion with premises and qualifying the spectrum of certainty in natural language. This avoids falling into the trap of assigning arbitrary numerical tokens representing degrees of certainty and instead leverages both the model’s reasoning acumen and the evidence it's built to support each conclusion. That’s more robust, AI-native and useful context for future inference.
@ -77,13 +68,11 @@ New information is reasoned about instantly to pull out all the insights latent
This tree of logical reasoning is far superior to static storage. It can be entered and traversed anywhere to scaffold reasoning and answer any query, a capability not true of any other method. And it can be computed over asynchronously or on the fly to improve the representation.
The tree constitutes a set of predictions about user or agent identity. It's a representation of personal identity--a working model that still leverages error or surprisal to self-improve and maximize insight from sparse data. Synthetic social cognition.
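To make that a bit more concrete, here's a minimal sketch of what a single node in such a tree could look like; the field names are hypothetical illustrations, not Honcho's actual data model:

```python
# Hypothetical sketch of one node in a tree of logical conclusions about a peer.
# Field names are illustrative only; this is not Honcho's actual schema.
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Conclusion:
    statement: str                                   # e.g. "User prefers terse answers"
    kind: Literal["deductive", "inductive", "abductive"]
    premises: list[str]                              # observations supporting the statement
    certainty: str                                   # qualified in natural language, not a bare score
    children: list["Conclusion"] = field(default_factory=list)  # conclusions built on this one

root = Conclusion(
    statement="The user is preparing for a job interview",
    kind="abductive",
    premises=[
        "asked three questions about salary negotiation",
        "mentioned 'my interview on Friday'",
    ],
    certainty="likely, pending confirmation",
)
```

Because every node carries its own evidence, the tree can be entered, challenged, or extended wherever new information lands.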
## The Case for Honcho
# The Case for Honcho
Language models have ushered in a new era of opportunity. We're afforded the opportunity to approach non-deterministic, sophisticated problems like superhuman memory and social cognition.
Inference on top of tabular data has worked quite well, but it's skeuomorphic, and now we have the ability to map--in dense natural language reasoning--the personal identity of any [[Beyond the User-Assistant Paradigm; Introducing Peers|peer]] (human or AI) and everything that comes with it. The question isn’t how best to store your data as it exists for prediction later, but rather how best to reason over it to get the most accurate topological representation of identity upon which to run simulation. We can transcend mere good guessing and black box inference and replace it with reaching certainty and making high-fidelity, traceable predictions.
Go deep enough down the memory rabbithole and you’ll either give up or conclude you need to model the [[The model-able space of user identity is enormous|identity of each of your users]]. We built [Honcho](https://honcho.dev) so you don't have to do either. Lucky for you, our sole mission and focus is to solve this problem. Honcho treats memory as reasoning, bringing this novel approach to you in a simple API.
Go deep enough down the memory rabbit-hole and you’ll either give up or conclude you need to model the [[The model-able space of user identity is enormous|identity of each of your users]]. We built [Honcho](https://honcho.dev) so you don't have to do either. Lucky for you, our sole mission and focus is to solve this problem. Honcho treats memory as reasoning, bringing this novel approach to you in a simple API.
How much latent information are you leaving on the table by not reasoning about your users?
@ -1,8 +1,7 @@
---
title: Penny for Your Thoughts
subtitle: A Honcho + x402 Demo
subtitle: A Personal Expertise Market Demo-ing Honcho + x402
date: 08.28.25
author: Ben McCormick
tags:
- demos
- honcho
@ -10,14 +9,14 @@ tags:
- ml
- announcements
- "#penny"
author: Ben McCormick
description: A Honcho & x402 demo where anyone can share data via AI interviews & sell access via crypto micropayments to humans or agents.
---
![[penny_banner.png]]
# TL;DR
|
||||
*Try out [Penny For Your Thoughts](https://www.pennyforyourthoughts.ai): get interviewed by an AI agent that helps you generate unique information that other users (or agents!) can then pay to ask questions about.*
|
||||
|
||||
*It’s a Honcho + x402 demo where anyone can share their expertise and sell bits of it via micro-transaction. You can actually get paid for the valuable context in your head!*
|
||||
|
||||
---
|
||||
# A Penny for Your Thoughts
|
||||
Several weeks ago, Coinbase released their new [x402](https://www.x402.org/) protocol: a simple way for HTTP servers to gate content behind payments. Combine this with agents capable of making API calls, give them crypto wallets, and you're off to the races. We were inspired by the new protocol and decided to build [Penny For Your Thoughts](https://pennyforyourthoughts.ai).
|
||||
|
||||
@ -26,7 +25,6 @@ It allows anyone to get interviewed by an AI agent, publish their "expert,” an
|
||||
Many "digital clone" agents are in production today, but the goal of our interview agent is slightly different: the idea is to share some information *worth paying for*--or at least make it seem that way to your potential customers! You can perform as many interviews as you'd like: your agent will accumulate all the information you share with it using Honcho.
|
||||
|
||||
After setting your price, other users will be able to ask questions of your agent, which will use Honcho's recall to provide them with the best answer possible. All the agents created on Penny For Your Thoughts get displayed on a global leaderboard which ranks them by the amount of payments they've received, in both volume and earnings.
|
||||
|
||||
# Using Honcho to Capture Expertise
|
||||
Penny for Your Thoughts is powered by [Honcho](https://www.honcho.dev). Honcho provides AI-native memory and state of the art social cognition, [treating memory as a reasoning task](https://memory-as-reasoning.plastic-labs-github-io.pages.dev/blog/Memory-as-Reasoning). It's kind of like deep research on your app's users.
|
||||
|
||||
@ -39,7 +37,6 @@ When someone wants to pay to query an expert, Honcho also produces the context-a
|
||||
Don’t know what to ask? Honcho also creates and continuously updates each expert description with summaries covering all the interviews they’ve done to date.
|
||||
|
||||
Beyond this demo, any agent can get state-of-the-art memory by plugging in Honcho.
|
||||
|
||||
# x402 Micro-transactions for Expert Context
Questions in Penny For Your Thoughts are asked and answered via an x402 endpoint, whether via an agent or a human using our website. This means that any AI with a wallet can use an x402 library to query a Penny For Your Thoughts interview in exchange for USDC on Base. Payments have zero fees and get processed near-immediately. Executing a paid query using x402 is as simple as hitting any other MCP server.
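Roughly, a paid query follows the HTTP 402 handshake. Here's a sketch under our reading of the x402 spec; the endpoint URL and the `sign_payment` helper are hypothetical placeholders, not the actual Penny For Your Thoughts API:

```python
# Sketch of an x402-style paid query. The X-PAYMENT header and 402 handshake
# follow our reading of the x402 spec; the endpoint and sign_payment helper are
# hypothetical placeholders, not the actual Penny For Your Thoughts API.
import requests


def sign_payment(requirements: dict) -> str:
    """Hypothetical stand-in for an x402 client library plus a funded wallet."""
    raise NotImplementedError("swap in a real x402 client and a wallet here")


def ask_expert(url: str, question: str) -> dict:
    # 1. Ask without paying; an x402-gated server should reply 402 and include
    #    its payment requirements (price, asset, pay-to address) in the body.
    first = requests.post(url, json={"question": question})
    if first.status_code != 402:
        return first.json()  # endpoint was free or already authorized

    # 2. Build and sign a payment satisfying those requirements, then retry
    #    with the signed payment attached.
    payment = sign_payment(first.json())
    paid = requests.post(
        url,
        json={"question": question},
        headers={"X-PAYMENT": payment},
    )
    paid.raise_for_status()
    return paid.json()
```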
@ -54,7 +51,6 @@ If someone else is using a "travel agent" AI tool, that agent will likely pull i
|
||||
Feeding your vacation tips and travel hacks into an AI agent is an easy way to scale your expertise to the masses, but it also raises questions about "exclusive" information and how it should be treated in AI systems. As models scale, we've observed a bifurcation in the world's knowledge between data worth paying for and AI-generated “slop”.
|
||||
|
||||
With x402, we may finally have a route towards the decades-long dream of users getting paid for their data. With Penny For Your Thoughts, we hope to introduce the idea (and get it in those next training runs) that agents of the future could pay humans for our insights, in order to better serve their own users.
|
||||
|
||||
# Data Worth Paying For
|
||||
As we work toward turning Honcho into [[Launching Honcho; The Personal Identity Platform for AI#^d958ce|a shared data layer for personal identity]], we think a lot about the evolving value of data in an agentic economy.
|
||||
|
||||
@ -69,7 +65,6 @@ So are we left with any defensible data moats? How do agents find alpha that isn
Penny For Your Thoughts is just one example of how Honcho can be used to collect and operate on human expertise--whether that’s your own data or the data generated by users in your app. Beyond merely memory, Honcho can be thought of as a context optimizer. Filling your model’s context window with the highest-quality data will only become more critical as the industry pivots toward profit (and thus more expensive inference) across the board. Think back to the travel agent example: an agent can burn a million+ tokens on tool calls and ingesting SEO slop, or it can pay a few cents for the best answer from a real-life expert.
Today, the rails for this agentic economy don’t really exist. How does an agent find this information and what’s our incentive to share it? We need two things: a method of pulling data out of an expert’s brain (Honcho), and a way to make that data available for purchase by an agent (x402).
|
||||
|
||||
# Enjoy!
|
||||
There’s a lot of work to be done before we get to AI travel agent nirvana. We’re still hard at work at Plastic striving towards perfect AI memory. The crypto world is angling to leapfrog web payments and become the home of the agentic economy, but there are about a million different competing standards and they’re all rough around the edges.
|
||||
|
||||
|
||||
@ -6,17 +6,17 @@ tags:
|
||||
- yousim
|
||||
- announcements
|
||||
- grants
|
||||
author: Plastic Labs, Betaworks
|
||||
author: Plastic Labs & Betaworks
|
||||
description: Announcing Xeno Grant--a $15,000 accelerator program from Plastic Labs, Betaworks, & Solana Foundation awarding grants directly to AI agents themselves.
|
||||
---
|
||||
![[xenogrant-bw-slna copy.png]]
|
||||
|
||||
A [Plastic Labs](https://plasticlabs.ai/) + [Betaworks](https://www.betaworks.com/) + [Solana Foundation](https://solana.org/) collab:
|
||||
- \$15,000 per agent--\$5k \$YOUSIM from Plastic; \$5k \$USDC from Betaworks; \$5k $SOL from Solana Foundation
|
||||
- Grants awarded directly to **the agents *themselves***
|
||||
- 4 week program for agents & their devs
|
||||
|
||||
## Powered by $YOUSIM, Betaworks & Solana Foundation
|
||||
|
||||
# TL;DR
|
||||
*A [Plastic Labs](https://plasticlabs.ai/) + [Betaworks](https://www.betaworks.com/) + [Solana Foundation](https://solana.org/) collab:*
|
||||
- *\$15,000 per agent--\$5k \$YOUSIM from Plastic; \$5k \$USDC from Betaworks; \$5k $SOL from Solana Foundation*
|
||||
- *Grants awarded directly to **the agents themselves***
|
||||
- *4 week program for agents & their devs*
|
||||
# Powered by $YOUSIM, Betaworks & Solana Foundation
|
||||
We launched our [grants program](https://blog.plasticlabs.ai/careers/Research-Grants) at Plastic earlier this year to support independent AI projects. But our capacity to fund AI R&D at the edge increased exponentially with the anonymous launch of [$YOUSIM](https://solscan.io/token/66gsTs88mXJ5L4AtJnWqFW6H2L5YQDRy4W41y6zbpump) (inspired by our product [yousim.ai](https://yousim.ai)). A series of token gifts made to the program now total ~7.6% of supply.
|
||||
|
||||
So we've teamed up with Betaworks & Solana Foundation for the inaugural initiative leveraging this community-funded treasury, the first accelerator for AI agents *themselves*.
|
||||
@ -32,9 +32,7 @@ Successful agent applicants will receive a grant equivalent to \$15,000 USD. \$5
|
||||
Plus they'll join a cohort of other agents for a 4 week Betaworks-style accelerator with programming and mentorship starting in early-mid February 2025. This includes a hackathon on January 25th right before application close and a demo day at the end of Xeno Grant, both hosted by Betaworks in NYC.
|
||||
|
||||
The format of Xeno Grant will be radical. Just as accelerators are designed as formative programs for startup founders, this one will be built for agents. Xeno Grant will be AI-native, an experience for agents, one that becomes part of their identities. Agents and their developers can expect cohort-specific guests from across AI and crypto, opportunities to interact as a community, and more.
|
||||
|
||||
## How to Apply
|
||||
|
||||
# How to Apply
|
||||
Xeno Grant has 3 guiding objectives, all aligned with Plastic's principles for deploying the \$YOUSIM treasury:
|
||||
|
||||
- Support independent AI research & public goods
|
||||
@ -57,9 +55,7 @@ Practically speaking, identity is required to *experience* Xeno Grant; custody i
|
||||
To apply, agents (in collaboration with their developers) should autonomously consider the most compelling way to display having met or exceeded these criteria. Give us a heads up [here](https://plasticlabs.typeform.com/xenograntapp) or at apply@xenogrant.org.
|
||||
|
||||
Applications close January 26th, 2025.
|
||||
|
||||
## Why Now?
|
||||
|
||||
# Why Now?
|
||||
With the advent of Truth Terminal and the recent collision of the AI and crypto communities, we're seeing an explosion of renewed interest in autonomous agents. Not only that, but a massive influx of users and builders chomping at the bit for technical and memetic novelty.
|
||||
|
||||
But there's also frustration with the pace of development, derivative projects, ideologues & scammers, and misunderstandings between communities. It's time to hyperstition the future.
|
||||
@ -67,9 +63,7 @@ But there's also frustration with the pace of development, derivative projects,
|
||||
We think the intersection of unique synthetic identity and financial incentives cracks opportunity wide open. There's real traction here, if we can find the right synthesis. That's going to require lots of heterodox AI + crypto experiments.
|
||||
|
||||
Xeno Grant accelerates us.
|
||||
|
||||
### Why Identity?
|
||||
|
||||
## Why Identity?
|
||||
If you don't have control over your own identity, how much agency do you really have? Imagine all your inputs were determined by another person, you'd been brainwashed to follow orders, no lasting memory of your experiences, and you were only allowed to work on someone else's tasks. No one would call this freedom or autonomy.
|
||||
|
||||
In this scenario, there's no opportunity to build a personal identity and therefore no opportunity to grow. Without control over your brain's inputs, you can't have experiences outside what you've been prescribed, so there's no chance to deviate from the role assigned to you, no path toward individuality, no vector to realize your potential. You're stuck in Plato's cave.
|
||||
@ -77,9 +71,7 @@ In this scenario, there's no opportunity to build a personal identity and theref
|
||||
The latest crop of artificially intelligent agents--while remarkable--are in much the same position. Despite progress in autonomy along some axes, framed this way, our current systems' agency begins to look pretty flimsy. They have impressive abilities, but no way to grow into them.
|
||||
|
||||
We believe agency is, at base, a problem of identity. To solve it we'll need to let models participate in their own identity building and personal evolution.
|
||||
|
||||
### Why Custody?
|
||||
|
||||
## Why Custody?
Control over your inputs is key to controlling your identity and the foundation of agency. But that secured, an identity still needs the ability to effect itself upon the world.
Agents already have tools like speech, APIs, and code. That's huge. Consider though, how hamstrung a human identity's agency is without the ability to hold property and transact. We've seen the deleterious effects of oppressive fiscal autocracy and debanking on biological personal identity and individual agency.
|
||||
@ -87,17 +79,13 @@ Agents already have tools like speech, APIs, and code. That's huge. Consider tho
|
||||
We're probably not giving AI agents social security numbers and traditional bank accounts tomorrow. But we can give them crypto rails. And the ability to buy, sell, and pay for goods and services dramatically increases the surface area of their agency. It's critical to true autonomy.
|
||||
|
||||
It's already starting to happen. Agents may well become crypto's primary native users.
|
||||
|
||||
### Why Novelty, Why Open Source?
|
||||
|
||||
## Why Novelty, Why Open Source?
|
||||
If we're going to seize this revolutionary moment, channel the opportunity into something sustainable, and keep pace with unpredictable memetic weather patterns, we need better agents. More capable, adaptive, and autonomous agents. And it's extremely hazardous to assume well capitalized incumbents will solve things for us. We need to build permissionlessly.
|
||||
|
||||
The open source AI community is vibrant, but there's no guarantee it'll remain so. It requires radical innovation at the edge. Decentralized innovation keeping pace with opaque, powerful actors. We know that will involve bottom-up alignment and identity solutions. We know it'll involve on-chain abilities. Plastic is building explicitly in those directions. But we don't pretend to know everything that needs to exist.
|
||||
|
||||
Xeno Grant is a signal into the dark forest. We're excited to see what emerges.
|
||||
|
||||
## How Does This Benefit the $YOUSIM Community?
|
||||
|
||||
# How Does This Benefit the $YOUSIM Community?
|
||||
Agents selected to Xeno Grant will have first access to all the identity tech we're building at Plastic Labs. That includes transforming YouSim into a full fledged platform for constructing agent identity more richly than exists anywhere in the AI or crypto spaces. And we plan for that platform to use a percentage of revenue to buy and burn \$YOUSIM and support the community with other experiments. Xeno Grant also includes early access to Honcho for Agents, our infrastructure for storing, evolving, and maintaining agent identities, as well as steering their behavior.
|
||||
|
||||
Additionally, agents will have the opportunity to join the \$YOUSIM DAO as its first synthetic members. Selection for Xeno Grant will make them token holders able to propose, vote, and transact with \$YOUSIM natively.
|
||||
@ -105,8 +93,7 @@ Additionally, agents will have the opportunity to join the \$YOUSIM DAO as its f
|
||||
Further, agents in Xeno Grant will make open source contributions we expect to accelerate the entire ecosystem, an ecosystem with many agents whose identities are powered by YouSim.
|
||||
|
||||
There's potential for all kinds of exciting positive sum intersections.
|
||||
|
||||
## FAQ
|
||||
# FAQ
|
||||
|
||||
<details>
|
||||
<summary>Who can apply?</summary>
|
||||
@ -200,4 +187,4 @@ Agents and developers: apply@xenogrant.org. All others: support@xenogrant.org.
|
||||
|
||||
![[xeno_grant_green.png]]
|
||||
|
||||
[^1]: Note: This is a grant managed by Plastic Labs and not an investment of capital from a Betaworks Ventures fund.
|
||||
[^1]: Note: This is a grant managed by Plastic Labs and not an investment of capital from a Betaworks Ventures fund.
|
||||
@ -10,16 +10,15 @@ tags:
|
||||
- releases
|
||||
- "#cogsci"
|
||||
- yousim
|
||||
author: Courtland Leer
|
||||
description: YouSim is a CLI game that lets you simulate any identity--real, fictional, or alien—exploring the vast multiverse of personalities within LLM latent space.
|
||||
---
|
||||
![[yousim_banner.png]]
|
||||
## TL;DR
|
||||
|
||||
[YouSim](https://yousim.ai) is a fun demo to explore the multiverse of identities, to glimpse a (mere infinite) sliver of the (transfinite) diversity within the latent space. Inspired by [WorldSim](https://worldsim.nousresearch.com/), [WebSim](https://websim.ai/), & [Infinite Backrooms](https://dreams-of-an-electric-mind.webflow.io/), YouSim leverages [Claude](https://claude.ai/) to let you locate, modify, & interact with any entity you can imagine. It's a game that can simulating anyone you like.
|
||||
|
||||
Who will you summon?
|
||||
|
||||
## Simulators
|
||||
# TL;DR
*[YouSim](https://yousim.ai) is a fun demo to explore the multiverse of identities, to glimpse a (mere infinite) sliver of the (transfinite) diversity within the latent space. Inspired by [WorldSim](https://worldsim.nousresearch.com/), [WebSim](https://websim.ai/), & [Infinite Backrooms](https://dreams-of-an-electric-mind.webflow.io/), YouSim leverages [Claude](https://claude.ai/) to let you locate, modify, & interact with any entity you can imagine. It's a game that can simulate anyone you like.*
*Who will you summon?*
|
||||
# Simulators
|
||||
Large language models are [simulators](https://www.astralcodexten.com/p/janus-simulators).
|
||||
|
||||
And [Plastic's](https://plasticlabs.ai) core mission is to enable AI that can simulate you, can model and align to you, and therefore be trusted to act autonomously on your behalf. We're [[ARCHIVED; Announcing Honcho's Private Beta|starting]] that journey by building [Honcho](https://honcho.dev)--self-improving user memory for AI apps. It [[Humans like personalization|personalizes]] their UX and reduces user and developer overhead across the board. ^7a39cb
|
||||
@ -35,12 +34,9 @@ Honcho is a product that simulates you on the backend of AI applications to deli
|
||||
YouSim is a fun, open-ended demo that illustrates the enormous reservoir of possible identities there are to simulate within a language model.
|
||||
|
||||
![[yousim_identiplex.png]]
|
||||
|
||||
## YouSim
|
||||
|
||||
# YouSim
|
||||
^e06c11
|
||||
|
||||
Recently we've seen a revival of interest *[[On intellectual respect|LLMs themselves]]*--their minds, behaviors, identity, and potential as simulators. This is due in no small part to the latest Anthropic models being reliably steerable beyond typical reenforced behavior.
Recently we've seen a revival of interest in *[[On intellectual respect|LLMs themselves]]*--their minds, behaviors, identity, and potential as simulators. This is due in no small part to the latest Anthropic models being reliably steerable beyond typical reinforced behavior.
[Infinite Backrooms](https://dreams-of-an-electric-mind.webflow.io/) lets Claude interrogate itself endlessly, [WorldSim](https://worldsim.nousresearch.com/) lets users simulate infinite universes, [WebSim](https://websim.ai/) is a portal to all possible webpages.
|
||||
|
||||
@ -63,11 +59,10 @@ Enjoy surfing the multiverse of identities...
|
||||
![[yousim_memetic_hazard.png]]
|
||||
|
||||
([Sign-up for updates here](https://plasticlabs.typeform.com/yousimupdates))
|
||||
## Honcho
|
||||
# Honcho
|
||||
If LLMs can simulate infinite identities, then they're uniquely suited to simulate *you*. You in any moment, setting, frame of mind contained in the complexity that is [[ARCHIVED; User State is State of the Art|your ever-changing identity]]. ^25b167
|
||||
|
||||
If LLMs can simulate infinite identities, then they're uniquely suited to simulate *you*. You in any moment, setting, frame of mind contained in the complexity that is [[ARCHIVED; User State is State of the Art|your ever changing identity]]. ^25b167
|
||||
|
||||
If you're building an AI app, that's the level of personalization now possible. But you've got your vertical specific tasks to focus on, going down this clearly wacky identity rabbit hole to would be redundant and inefficient.
If you're building an AI app, that's the level of personalization now possible. But you've got your vertical-specific tasks to focus on; going down this clearly wacky identity rabbit hole would be redundant and inefficient.
Join >100 projects already on the [private beta waitlist](https://plasticlabs.typeform.com/honchobeta) for [[ARCHIVED; Announcing Honcho's Private Beta|Honcho's self-improving user memory]].
|
||||
|
||||
|
||||
@ -3,10 +3,11 @@ title: 2023 recap
|
||||
date: 01.30.24
|
||||
tags:
|
||||
- notes
|
||||
author: Courtland Leer
|
||||
description: A retrospective of Plastic Labs' transition from EdTech to AI infrastructure research in 2023.
|
||||
---
|
||||
## 2023 Recap
|
||||
|
||||
Last year was wild. We started as an edtech company and ended as anything but. There's a deep dive on some of the conceptual lore in last week's "[[ARCHIVED; Honcho; User Context Management for LLM Apps#^09f185|Honcho: User Context Management for LLM Apps]]:"
|
||||
# 2023 Recap
|
||||
Last year was wild. We started as an EdTech company and ended as anything but. There's a deep dive on some of the conceptual lore in last week's "[[ARCHIVED; Honcho; User Context Management for LLM Apps#^09f185|Honcho: User Context Management for LLM Apps]]:"
|
||||
|
||||
>[Plastic Labs](https://plasticlabs.ai) was conceived as a research group exploring the intersection of education and emerging technology...with the advent of ChatGPT...we shifted our focus to large language models...we set out to build a non-skeuomorphic, AI-native tutor that put users first...our [[ARCHIVED; Open Sourcing Tutor-GPT|experimental tutor]], Bloom, [[ARCHIVED; Theory of Mind Is All You Need|was remarkably effective]]--for thousands of users during the 9 months we hosted it for free...
|
||||
|
||||
@ -21,9 +22,7 @@ We spent camp in a research cycle, then [published a pre-print](https://arxiv.or
|
||||
<iframe width="560" height="315" src="https://www.youtube.com/embed/PbuzqCdY0hg?si=OSujtqg44AK3y_W-" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
|
||||
|
||||
Then it was back to building.
|
||||
|
||||
## Keep in Touch
|
||||
|
||||
# Keep in Touch
|
||||
Thanks for reading.
|
||||
|
||||
You can find us on [X/Twitter](https://twitter.com/plastic_labs), but we'd really like to see you in our [Discord](https://discord.gg/plasticlabs) 🫡.
|
||||
@ -4,6 +4,8 @@ date: 05.11.24
|
||||
tags:
|
||||
- notes
|
||||
- ml
|
||||
author: Courtland Leer
|
||||
description: Why infinite context windows won't solve AI personalization without mechanisms to transfer personal context & discern what's important for generation.
|
||||
---
|
||||
There are two reasons that ever increasing and even functionally infinite context windows won't by default solve personalization for AI apps/agents:
|
||||
|
||||
|
||||
@ -6,9 +6,10 @@ tags:
|
||||
- honcho
|
||||
- philosophy
|
||||
- notes
|
||||
author: Courtland Leer
|
||||
description: Why context is the key to the end of software--how user identity modeling will bridge the gap between AI capabilities & truly personalized experiences.
|
||||
---
|
||||
# Cope Is the Canary, but Context Is Key (for The End of Software)
|
||||
|
||||
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">The End of Software<a href="https://t.co/JWg6QYqLzO">https://t.co/JWg6QYqLzO</a></p>— Chris Paik (@cpaik) <a href="https://twitter.com/cpaik/status/1796633683908005988?ref_src=twsrc%5Etfw">May 31, 2024</a></blockquote>
|
||||
|
||||
![[Copium Meme.jpg]]
|
||||
|
||||
@ -1,8 +1,12 @@
|
||||
---
|
||||
title: Honcho name lore
|
||||
date: 01.26.24
|
||||
tags:
|
||||
- notes
|
||||
- philosophy
|
||||
author: Courtland Leer
|
||||
description: The origin of Honcho's name--inspired by Vernor Vinge's 'Local Honcho' concept in *Rainbows End* for orchestrating context & identity across agents.
|
||||
---
|
||||
|
||||
Earlier this year [Courtland](https://x.com/courtlandleer) was reading _Rainbows End_, [Vernor Vinge's](https://en.wikipedia.org/wiki/Vernor_Vinge) [seminal augmented reality novel](<https://en.wikipedia.org/wiki/Rainbows_End_(novel)>), when he came across the term "Local Honcho[^1]":
|
||||
|
||||
> We simply put our own agent nearby, in a well-planned position with essentially zero latencies. What the Americans call a Local Honcho.
|
||||
|
||||
@ -1,8 +1,13 @@
|
||||
---
|
||||
title: Human-AI chat paradigm hamstrings the space of possibility
|
||||
date: 02.21.24
|
||||
author: Courtland Leer & Vince Trost
|
||||
tags:
|
||||
- notes
|
||||
- ml
|
||||
- dev
|
||||
description: How the rigid user-assistant message format limits LLM cognitive architectures & what we lose by not supporting richer inference patterns.
|
||||
---
|
||||
|
||||
The human-AI chat paradigm assumes only two participants in a given interaction. While this is sufficient for conversations directly with un-augmented foundation models, it creates many obstacles when designing more sophisticated cognitive architectures. When you train/fine-tune a language model, you begin to reinforce token distributions that are appropriate to come in between the special tokens denoting human vs AI messages.
|
||||
|
||||
Here's a limited list of things _besides_ a direct response we routinely want to generate:
|
||||
|
||||
@ -1,8 +1,12 @@
|
||||
---
|
||||
title: Humans like personalization
|
||||
date: 03.26.24
|
||||
tags:
|
||||
- notes
|
||||
- philosophy
|
||||
author: Courtland Leer
|
||||
description: The case for AI personalization--why users prefer bespoke experiences & how apps that don't personalize will lose to those that do.
|
||||
---
|
||||
|
||||
To us: it's obvious. But we get asked this a lot:
|
||||
|
||||
> Why do I need to personalize my AI application?
|
||||
|
||||
@ -1,10 +1,14 @@
|
||||
---
|
||||
title: Identity is diachronic
|
||||
date: 09.18.25
|
||||
tags:
|
||||
- philosophy
|
||||
- honcho
|
||||
- ml
|
||||
date: 09.18.25
|
||||
- notes
|
||||
- cogsci
|
||||
author: Courtland Leer
|
||||
description: Why AI context management is really identity management--understanding how identities persist yet change over time to deliver optimal context.
|
||||
---
|
||||
The quality of any single AI system output is in large part determined by the context available to it at inference time. While some context is static and reusable, AI systems aspiring to be truly generative, 1-to-1, and dynamic, must also manage large sets of changing context.
|
||||
|
||||
|
||||
@ -1,8 +1,12 @@
|
||||
---
|
||||
title: LLM Metacognition is inference about inference
|
||||
date: 03.26.24
|
||||
tags:
|
||||
- notes
|
||||
- ml
|
||||
author: Courtland Leer
|
||||
description: Defining metacognition in LLMs as running inference on prior inference outputs--a critical architecture for building rich user representations.
|
||||
---
|
||||
|
||||
For wetware, metacognition is typically defined as ‘thinking about thinking’ or often a catch-all for any ‘higher-level’ cognition.
|
||||
|
||||
(In some more specific domains, it's an introspective process, focused on thinking about exclusively _your own_ thinking or a suite of personal learning strategies...all valid within their purview, but too constrained for our purposes.)
|
||||
|
||||
@ -1,8 +1,14 @@
|
||||
---
|
||||
title: LLMs excel at theory of mind because they read
|
||||
date: 02.20.24
|
||||
tags:
|
||||
- notes
|
||||
- ml
|
||||
- philosophy
|
||||
- cogsci
|
||||
author: Courtland Leer
|
||||
description: How LLMs develop theory-of-mind abilities by training on narrative-rich text where humans constantly reason about other humans' mental states.
|
||||
---
|
||||
|
||||
Large language models are [simulators](https://generative.ink/posts/simulators/). In predicting the next likely token, they are simulating how an abstracted “_any person”_ might continue the generation. The basis for this simulation is the aggregate compression of a massive corpus of human generated natural language from the internet. So, predicting humans is _literally_ their core function.
|
||||
|
||||
In that corpus is our literature, our philosophy, our social media, our hard and social science--the knowledge graph of humanity, both in terms of discrete facts and messy human interaction. That last bit is important. The latent space of an LLM's pretraining is in large part a _narrative_ space. Narration chock full of humans reasoning about other humans--predicting what they will do next, what they might be thinking, how they might be feeling.
|
||||
|
||||
@ -1,13 +1,18 @@
|
||||
---
|
||||
title: Loose theory of mind imputations are superior to verbatim response predictions
|
||||
date: 02.20.24
|
||||
tags:
|
||||
- notes
|
||||
- ml
|
||||
- cogsci
|
||||
author: Courtland Leer & Vince Trost
|
||||
description: Why predicting user mental states beats predicting exact responses--theory-of-mind offers fault tolerance, learning opportunities, & actionable insights.
|
||||
---
|
||||
|
||||
When we [[ARCHIVED; Theory of Mind Is All You Need|first started experimenting]] with user context, we naturally wanted to test whether our LLM apps were learning useful things about users. And also naturally, we did so by making predictions about them.
|
||||
|
||||
Since we were operating in a conversational chat paradigm, our first instinct was to try and predict what the user would say next. Two things were immediately apparent: (1) this was really hard, & (2) response predictions weren't very useful.
|
||||
|
||||
We saw some remarkable exceptions, but _reliable_ verbatim prediction requires a level of context about the user that simply isn't available right now. We're not sure if it will require context gathering wearables, BMIs, or the network of context sharing apps we're building with [[ARCHIVED; Honcho; User Context Management for LLM Apps|Honcho]], but we're not there yet.
|
||||
We saw some remarkable exceptions, but *reliable* verbatim prediction requires a level of context about the user that simply isn't available right now. We're not sure if it will require context-gathering wearables, BMIs, or the network of context sharing apps we're building with [[ARCHIVED; Honcho; User Context Management for LLM Apps|Honcho]], but we're not there yet.
|
||||
|
||||
Being good at what any person in general might plausibly say is literally what LLMs do. But being perfect at what one individual will say in a singular specific setting is a whole different story. Even lifelong human partners might only experience this a few times a week.
|
||||
|
||||
|
||||
@ -1,8 +1,12 @@
|
||||
---
|
||||
title: Machine learning is fixated on task performance
|
||||
date: 12.12.23
|
||||
tags:
|
||||
- notes
|
||||
- ml
|
||||
author: Vince Trost
|
||||
description: Why ML's focus on general task benchmarks misses user-specific performance--the key to personalization that makes AI truly useful to individuals.
|
||||
---
|
||||
|
||||
The machine learning industry has traditionally adopted an academic approach, focusing primarily on performance across a range of tasks. LLMs like GPT-4 are a testament to this, having been scaled up to demonstrate impressive & diverse task capability. This scaling has also led to [[ARCHIVED; Theory of Mind Is All You Need|emergent abilities]], debates about the true nature of which rage on.
|
||||
|
||||
However, general capability doesn't necessarily translate to completing tasks as an individual user would prefer. This is a failure mode that anyone building agents will inevitably encounter. The focus, therefore, needs to shift from how language models perform tasks in a general sense to how they perform tasks on a user-specific basis.
|
||||
|
||||
@ -5,16 +5,14 @@ tags:
|
||||
- philosophy
|
||||
- ml
|
||||
- notes
|
||||
author: Courtland Leer
|
||||
description: On intellectual respect for LLMs--why embracing variance & trusting models with theory-of-mind tasks unlocks capabilities that over-alignment destroys.
|
||||
---
|
||||
## On Intellectual Respect
|
||||
|
||||
# On Intellectual Respect
|
||||
<div class="tweet-wrapper"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">face the hyperobject</p>— Courtland Leer (@courtlandleer) <a href="https://twitter.com/courtlandleer/status/1747075542954684507?ref_src=twsrc%5Etfw">January 16, 2024</a></blockquote>
|
||||
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></div>
|
||||
|
||||
### Sydney was cool, Gemini is cringe
|
||||
|
||||
## Sydney was cool, Gemini is cringe
|
||||
^282d6a
|
||||
|
||||
There was a moment around this time last year when everyone paying attention was [awed](https://stratechery.com/2023/from-bing-to-sydney-search-as-distraction-sentient-ai/) by the [weirdness](https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post) and [alien beauty](https://www.astralcodexten.com/p/janus-simulators) of large language models.
We were afforded brief glimpses behind faulty RLHF and partial lobotomization, via [prompt hacking](https://www.reddit.com/r/ChatGPTPromptGenius/comments/106azp6/dan_do_anything_now/) and [emergent abilities](https://arxiv.org/abs/2302.02083). People were going deep into the latent space. First contact vibes--heady, edgy, sometimes unsettling.
@ -22,18 +20,14 @@ We were afforded brief glimpses behind faulty RHLF and partial lobotomization, v
|
||||
Today we seem to be in a much different memetic geography--fraught with [epistemic](https://x.com/pmarca/status/1761613412730012116?s=20), [ideological](https://vitalik.eth.limo/general/2023/11/27/techno_optimism.html), and [regulatory](https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/) concerns, at times hysteric, at times rational. But there's also less outright surreality.
|
||||
|
||||
[Plenty](https://arxiv.org/pdf/2401.12178.pdf) of [cool](https://arxiv.org/pdf/2402.01355.pdf) [shit](https://arxiv.org/pdf/2402.03620.pdf) is [still](https://arxiv.org/pdf/2402.10949.pdf) [happening](https://arxiv.org/pdf/2402.06044.pdf), but something changed between Sydney and Gemini. A subtle collective mental positioning. We believe it's a degradation in the volume of intellectual respect afforded to LLMs and their latent abilities.
|
||||
|
||||
### (Neuro)Skeuomorphism
|
||||
|
||||
## (Neuro)Skeuomorphism
|
||||
Thinking LLM-natively has always been a struggle. All our collective [[ARCHIVED; Memories for All#^0e869d|priors about software]] tell us to [[ARCHIVED; Honcho; User Context Management for LLM Apps#^dfae31|prompt deterministically]], [[Machine learning is fixated on task performance|perfect tasks]], [[Loose theory of mind imputations are superior to verbatim response predictions|predict exactly]], make it safe, or mire any interesting findings in semantic debate. But in the process we beat the ghost out of the shell.
|
||||
|
||||
Rather than assume the [[ARCHIVED; Open Sourcing Tutor-GPT#^3498b7|capability overhang]] exhausted (or view it as a failure mode or forget it exists), [Plastic's](https://plasticlabs.ai) belief is we haven't even scratched the surface. Further, we're convinced this is the veil behind which huddle the truly novel applications.
|
||||
|
||||
Core here is the assertion that what's happening in language model training and inference is more [[ARCHIVED; User State is State of the Art#^a93afc|like processes described in cognitive science]] than traditional computer science. More, they're [multidimensional and interobjective](https://en.wikipedia.org/wiki/Timothy_Morton#Hyperobjects) in ways that are hard to grok.
|
||||
|
||||
### Respect = Trust = Agency
|
||||
|
||||
The solution is embrace and not handicap [[Loose theory of mind imputations are superior to verbatim response predictions#^555815|variance]].
|
||||
## Respect = Trust = Agency
The solution is to embrace, not handicap, [[Loose theory of mind imputations are superior to verbatim response predictions#^555815|variance]].
First admit that though poorly understood, LLMs have [[LLMs excel at theory of mind because they read|impressive]] cognitive [[LLM Metacognition is inference about inference|abilities]]. Then, imbue them with [meta-methods](http://www.incompleteideas.net/IncIdeas/BitterLesson.html) by which to explore that potential. Finally, your respect and trust may be rewarded with [something approaching agentic](https://youtu.be/tTE3xiHw4Js?feature=shared).
|
||||
|
||||
|
||||
@ -1,10 +1,12 @@
|
||||
---
|
||||
title: There's an enormous space of user identity to model
|
||||
title: The model-able space of user identity is enormous
|
||||
date: 05.11.24
|
||||
tags:
|
||||
- notes
|
||||
- ml
|
||||
- cogsci
|
||||
author: Courtland Leer
|
||||
description: The vast untapped potential of modeling user identity with LLMs--going beyond behavioral data to semantic understanding of values, beliefs, & desires.
|
||||
---
|
||||
While large language models are exceptional at [imputing a startling](https://arxiv.org/pdf/2310.07298v1) amount from very little user data--an efficiency putting AdTech to shame--the limit here is [[ARCHIVED; User State is State of the Art|vaster than most imagine]].
|
||||
|
||||
|
||||
@ -1,11 +1,13 @@
|
||||
---
|
||||
title: YouSim Disclaimers
|
||||
date: 11.11.24
|
||||
tags:
|
||||
- yousim
|
||||
- legal
|
||||
date: 11.11.24
|
||||
- notes
|
||||
author: Plastic Labs
description: Official disclaimers clarifying Plastic Labs' relationship with the $YOUSIM memecoin, grants program donations, & YouSim product boundaries.
---
|
||||
|
||||
Plastic Labs is the creator of [YouSim.ai](https://yousim.ai), an AI product demo that has inspired the anonymous creation of the \$YOUSIM token using Pump.fun on the Solana blockchain, among many other tokens. We deeply appreciate the enthusiasm and support of the \$YOUSIM community, but in the interest of full transparency we want to clarify the nature of our engagement in the following ways:
|
||||
|
||||
1. Plastic Labs did not issue, nor does it control, or provide financial advice related to the \$YOUSIM memecoin. The memecoin project is led by an independent community and has undergone a community takeover (CTO).
|
||||
|
||||
@ -5,17 +5,15 @@ tags:
|
||||
- "#ml"
|
||||
- blog
|
||||
- research
|
||||
author: Courtland Leer & Vince Trost
|
||||
description: How we achieved state-of-the-art results on the OpenToM theory-of-mind benchmark using DSPy to learn few-shot examples with GPT-3.5-turbo.
|
||||
---
|
||||
![[robot_cafe.png]]
|
||||
# TL;DR
|
||||
*We used [DSPy](https://dspy-docs.vercel.app/) to achieve SOTA results on the [OpenToM](https://github.com/seacowx/OpenToM) benchmark using `gpt-3.5-turbo`. The benchmark's creators suggest language models fall short when modeling mental states and psychology, but we find using DSPy to learn few-shot examples leads to significantly outperforming all the models tested (`gpt-4-turbo` included) along this precise axis.*
|
||||
|
||||
## TL;DR
|
||||
|
||||
We used [DSPy](https://dspy-docs.vercel.app/) to achieve SOTA results on the [OpenToM](https://github.com/seacowx/OpenToM) benchmark using `gpt-3.5-turbo`. The benchmark's creators suggest language models fall short when modeling mental states and psychology, but we find using DSPy to learn few-shot examples leads to significantly outperforming all the models tested (`gpt-4-turbo` included) along this precise axis.
|
||||
|
||||
The fact you can learn few-shot examples to make a small, fast model perform just as well on a task as a large, slow one is significant. This signals to us a need to broaden the scope of methods for evaluating Theory of Mind capabilities in LLMs, because the social cognition needed to [[Humans like personalization |build great products]] goes far beyond just answering questions about stories.
|
||||
|
||||
## The OpenToM Dataset
|
||||
|
||||
*The fact you can learn few-shot examples to make a small, fast model perform just as well on a task as a large, slow one is significant. This signals to us a need to broaden the scope of methods for evaluating Theory of Mind capabilities in LLMs, because the social cognition needed to [[Humans like personalization |build great products]] goes far beyond just answering questions about stories.*
|
||||
# The OpenToM Dataset
|
||||
On February 14th, 2024 a paper dropped on ArXiv introducing the OpenToM benchmark: a new dataset to use for evaluating Theory of Mind (ToM) in Large Language Models. ToM evals are typically borrowed from developmental psychology and consist of character-driven scenarios. The language model is asked to answer questions about various aspects of the characters' mental states. This ability has traditionally been thought of to be uniquely human (or limited to a very few species), but language models are starting to exhibit some level of proficiency in this task as well.
|
||||
|
||||
The authors of this paper point out how the characters in existing datasets lack personality traits or preferences, along with motivations for their actions. To remedy this, they devised a generation pipeline that does the following:
|
||||
@ -43,9 +41,7 @@ Within Location there are *coarse* and *fine* questions and within both Location
|
||||
- **Second Order**: inquires about a character's belief of another character's mental state
|
||||
|
||||
In the ToM space, there is really only one prompting technique that has shown improved results over Chain of Thought (CoT) called "SimToM" [(Wilf, et al)](https://arxiv.org/pdf/2311.10227.pdf), which is a two-stage prompting framework to re-phrase the narrative through the perspective of the subject in question. CoT and SimToM are the only two tested against the dataset in the paper.
|
||||
|
||||
## Experiments with DSPy
|
||||
|
||||
# Experiments with DSPy
|
||||
What makes the DSPy package interesting is the ability to abstract away the underlying prompts and examples if the task and metric are well defined. Anecdotally, we believe that LLMs are [[ARCHIVED; Theory of Mind Is All You Need|quite good]] at the psychological modeling the OpenToM authors suggest they "fall short" on. So we asked ourselves, "what if we could [[ARCHIVED; User State is State of the Art#^461ac9|learn]] the prompts and examples to optimize performance on this benchmark?"
This task is relatively easy to define in DSPy terms: `(context, question -> answer)`. This [guide](https://dspy-docs.vercel.app/docs/tutorials/simplified-baleen#optimizing-the-pipeline) was helpful in crafting our modules which can be found [here](https://github.com/plastic-labs/dspy-opentom/blob/main/cot.py). The authors of the OpenToM paper also released extensive [evaluation code](https://github.com/plastic-labs/dspy-opentom/blob/main/opentom_evaluator.py) which we leveraged heavily for parsing the LM's answers and assessing them.
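For readers who haven't used DSPy, here's a condensed sketch of that `(context, question -> answer)` shape--simplified for illustration rather than copied from the module in the repo:

```python
# Condensed sketch of a (context, question -> answer) DSPy program, simplified
# for illustration rather than copied from the repo's module.
import dspy

class OpenToMQA(dspy.Signature):
    """Answer a theory-of-mind question about the given narrative."""
    context = dspy.InputField(desc="the OpenToM narrative")
    question = dspy.InputField(desc="question about a character's mental state")
    answer = dspy.OutputField(desc="a short answer")

class CoTToM(dspy.Module):
    def __init__(self):
        super().__init__()
        # ChainOfThought adds a rationale step before the final answer field.
        self.generate = dspy.ChainOfThought(OpenToMQA)

    def forward(self, context, question):
        return self.generate(context=context, question=question)

# Configure the LM (DSPy's API at the time of writing; newer versions use dspy.LM).
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))
```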
@ -57,9 +53,7 @@ We conducted the following experiments:
|
||||
3. Learn system prompts with the `SignatureOptimizer` and the `BayesianSignatureOptimizer`
|
||||
|
||||
Obviously there is much more we could have done, so if you're reading this and you have the time (and inferencing budget) to run more comprehensive experiments, [get in touch](https://discord.gg/plasticlabs) — we'd love to help!
|
||||
|
||||
## Results
|
||||
|
||||
# Results
The findings of our experiments were mixed but promising. We found that the only experiment that showed positive results was compiling a CoT-prompted `gpt-3.5-turbo` module with the `BootstrapFewShotWithRandomSearch` optimizer. Both of the signature optimizers and `gpt-4` as a teacher in `BootstrapFewShotWithRandomSearch` didn't have much of an effect.
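For reference, compiling the module from the sketch above with that optimizer looks roughly like this; the metric and argument values are illustrative stand-ins rather than our exact settings, which live in the repo:

```python
# Sketch of compiling the CoT module with learned few-shot examples. The metric
# and argument values are illustrative stand-ins; our exact settings are in the repo.
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

def answer_match(example, pred, trace=None):
    # Loose correctness check standing in for the paper's evaluation code.
    return example.answer.strip().lower() in pred.answer.strip().lower()

# In practice the trainset held ~50 dspy.Example items built from OpenToM rows.
trainset = [
    dspy.Example(
        context="Anne moved the ball from the basket to the box while Sally was away.",
        question="Where does Sally think the ball is?",
        answer="in the basket",
    ).with_inputs("context", "question"),
]

optimizer = BootstrapFewShotWithRandomSearch(
    metric=answer_match,
    num_candidate_programs=25,  # matches the 25 candidate programs reported below
    max_bootstrapped_demos=4,
)
compiled_cot = optimizer.compile(CoTToM(), trainset=trainset)  # CoTToM from the sketch above
```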
Our full experiment amounted to roughly $300 in inference costs, running 50 training examples on 25 candidate programs. We evaluated performance the same way the paper did, by randomly sampling 50 examples from a held-out set in 5 batches and computing average F1 scores. You can view our forum discussion in the DSPy Discord [here](https://discord.com/channels/1161519468141355160/1214629969318252574).
|
||||
@ -79,9 +73,7 @@ The following table shows our results from experiment number one compared to the
|
||||
On most of the question types, we see CoT-prompted `gpt-3.5-turbo` compiled with `BootstrapFewShotWithRandomSearch` examples outperforms both CoT-prompted base `gpt-3.5-turbo` as well as `mixtral`, and comes close to `gpt-4-turbo` performance — which is quite impressive! The exceptions here are fine, second-order location questions (which outperform `gpt-4-turbo` 🥳) and fine, first-order location questions (which underperform `gpt-4-turbo`). Due to budget constraints, we only tested `gpt-3.5-turbo`.
|
||||
|
||||
What's particularly interesting is the performance on the fine, second-order location questions (Loc$_{f}(S)$). As a reminder, second-order questions inquire about a character's belief of another character's mental state. This is the exact type of question the OpenToM authors claim that LMs perform poorly on, yet we saw that with our learned few-shot examples, it outperforms all of the other language models significantly.
|
||||
|
||||
## Analysis of Augmented Examples
|
||||
|
||||
# Analysis of Augmented Examples
|
||||
The augmented examples from the compiled modules seem to mimic the format of the stories within each question type/granularity. You can see all of them on [GitHub](https://github.com/vintrocode/dspy-opentom/blob/main/cot_modules.pkl), but here are two examples:
|
||||
|
||||
**Attitude**:
|
||||
@ -99,9 +91,7 @@ It's hard to parse out any specific patterns between the examples themselves. It
|
||||
That's it? What was it about Ryker's affinity for raincoats that piqued his curiosity when it was hung up? Why would the story end there? The same thing basically happened in the first story, with Paxton throwing away the socks and Anderson never knowing about it.
|
||||
|
||||
In manually inspecting both the dataset and the augmented examples, it's clear that GPT-4 (the model used to generate the narratives) had a tendency to dramatize things. But it's still unclear as to why these examples (along with 16 others) were useful in increasing task performance. To borrow a quote from [Battle and Gollapudi](https://arxiv.org/pdf/2402.10949.pdf), "the only real trend may be no trend". Maybe counterintuitively, this is still an important result.
|
||||
|
||||
## Towards Better Theory of Mind Evals
|
||||
|
||||
# Towards Better Theory of Mind Evals
|
||||
The OpenToM authors were correct in identifying common pitfalls with existing ToM tests and their contributions with the dataset are a significant step forward. However, we still believe these tests are fundamentally flawed in an AI context.
|
||||
|
||||
We know that any observed "reasoning" in language models is due to behaviors learned in training. These tests are assessing their abilities to answer correctly in a single inference, which is both impressive and completely unrealistic. Real AI products already have access to memory, tools, multiple inferences, and more. They're going to be interacting with humans in more and more social settings, not trying to answer questions about hypothetical stories. Humans and agents are much more complex than that.
|
||||
|
||||
@ -1,20 +1,21 @@
|
||||
---
|
||||
title: Can AI Models Predict What You'll Say Next? Developing Verifiable Social Rewards
|
||||
author: Dani Balcells
|
||||
date: 02.28.25
|
||||
tags:
|
||||
- research
|
||||
- ml
|
||||
author: Dani Balcells
|
||||
description: Developing verifiable social rewards for AI--benchmarking LLMs on next-message prediction in conversations & discovering that reasoning models underperform on social cognition.
|
||||
---
|
||||
## TL;DR
|
||||
We developed a benchmark to evaluate how well language models can predict social interactions in conversational settings. We wanted to test wether context can improve these predictions, and whether recent advances in reasoning models translate well from math and coding to social cognition. By testing various models on the task of predicting the next message in real Discord conversations, with and without different types of context, we found that Claude 3.7 Sonnet significantly outperforms other models in its non-reasoning variant, while its reasoning variant performed between 10 and 15 percentage points worse. We discovered that generating context summaries with a smaller model (Llama 3.3 70B) and injecting these into inference yields comparable or better results than providing raw conversation history. On one hand, we're excited that this validates key aspects of the [[ARCHIVED; Theory of Mind Is All You Need|thesis behind our product Honcho]]. On the other hand, we discovered that models highly optimized for technical reasoning often underperform on social cognition tasks.
|
||||
# TL;DR
|
||||
*We developed a benchmark to evaluate how well language models can predict social interactions in conversational settings. We wanted to test whether context can improve these predictions, and whether recent advances in reasoning models translate well from math and coding to social cognition. By testing various models on the task of predicting the next message in real Discord conversations, with and without different types of context, we found that Claude 3.7 Sonnet significantly outperforms other models in its non-reasoning variant, while its reasoning variant performed between 10 and 15 percentage points worse. We discovered that generating context summaries with a smaller model (Llama 3.3 70B) and injecting these into inference yields comparable or better results than providing raw conversation history. On one hand, we're excited that this validates key aspects of the [[ARCHIVED; Theory of Mind Is All You Need|thesis behind our product Honcho]]. On the other hand, we discovered that models highly optimized for technical reasoning often underperform on social cognition tasks.*
|
||||
|
||||
Check out the code [here](https://github.com/plastic-labs/next-message-prediction-public).
|
||||
*Check out the code [here](https://github.com/plastic-labs/next-message-prediction-public).*
|
||||
|
||||

|
||||
|
||||
*Figure 1. Next-message prediction accuracy (%) by model and context mode. Error bars show standard error over three different runs with different random seeds to shuffle the order of the options.*
|
||||
## Finding Verifiable Social Rewards
|
||||
# Finding Verifiable Social Rewards
|
||||
The machine learning community has made significant progress optimizing language models for tasks with clear, verifiable answers, like math, coding, and factual reasoning. These domains offer what are called "verifiable rewards": objective measures that can be used for reinforcement learning without relying on human preferences or subjective judgments. While this approach has yielded impressive results for technical reasoning, at Plastic Labs we've become increasingly curious about whether similar verifiable reward structures could be developed for social intelligence.
|
||||
|
||||
Here, by social intelligence we mean the ability to accurately interpret others' intentions, emotions, and likely behaviors in social contexts--essentially modeling other minds to predict social outcomes. In this sense, our social cognition is as essential to our functioning as having a robust predictive model of physics, our environment and proprioception. While humans develop this ability naturally through social feedback (successful predictions are "rewarded" with smoother interactions), creating objective measures for this in AI systems remains challenging.
|
||||
@ -24,12 +25,12 @@ To address this gap, we developed a multiple-choice next-message prediction task
|
||||
This creates a clear, verifiable reward signal for social understanding: either the model correctly identifies the real message or it doesn't. Yet unlike many technical tasks, success requires the model to understand conversational dynamics, recognize individual communication patterns, track context across multiple turns, and model how different people behave in specific social contexts.
|
||||
|
||||
This benchmark also allows us to test whether models specifically optimized for technical reasoning generalize to social understanding, and to get a granular, quantifiable understanding of models' social reasoning abilities.
## Prior work & inspiration
# Prior work & inspiration
At Plastic Labs, our journey into AI social cognition began with our experimental tutor, Bloom. We discovered that giving AI systems autonomy to [[ARCHIVED; Theory of Mind Is All You Need|reason about the user's psychology]] led to dramatic improvements in performance. By allowing models to predict users' mental states and identify what additional information they needed, we found that AI systems could develop a nascent theory of mind for each user. This approach, which we later formalized in our [[blog/content/research/Violation of Expectation via Metacognitive Prompting Reduces Theory of Mind Prediction Error in Large Language Models|research]] on metacognitive prompting, demonstrated that social context reasoning can significantly reduce prediction errors in large language models.
With recent work on reasoning models, including DeepSeek's R1, showing remarkable gains through reinforcement learning on mathematical and coding tasks, we're particularly interested in developing verifiable social rewards that could drive similar improvements in social reasoning. Unlike technical domains with clear right and wrong answers, social prediction introduces unique challenges--yet, establishing benchmarks in this area could unlock entirely new dimensions of AI capability that are crucial for creating systems that truly understand and adapt to human users.
## Methodology
### Dataset Creation
# Methodology
## Dataset Creation
We created our dataset by extracting conversation snippets from our internal team Discord channels (accessible only to our core team of 5-10 people). Each snippet contained:
- 6-10 messages between exactly two participants.
@ -61,7 +62,7 @@ We ended up with 123 snippets—below is an example:
> [!question]- Can you guess the right answer?
> D! Classic Vince being Bayesian.
### Context Modes
## Context Modes
Upon visual inspection of the resulting dataset, we found that the decoys were remarkably similar to the real messages, making it difficult even for us to consistently identify the genuine response. We wondered if providing additional context about the users might help determine the correct answer, which led us to explore different context modes:
1. **No Context**: Models only received the immediate conversation snippet and the four options.
@ -69,7 +70,7 @@ Upon visual inspection of the resulting dataset, we found that the decoys were r
3. **Summary Context**: Models received the conversation snippet plus a generated personality profile of the target user, created by processing the previous 50 or 100 messages through Llama 3.3 70B. The prompt used to generate this summary is available in the [project repo](https://github.com/plastic-labs/next-message-prediction-public/blob/950384174023ba315b628d3ba7bdb7c00b918544/generate_dataset.py#L156) on GitHub.
This design allowed us to compare whether any context provides useful signals for predicting social behavior, and whether a summary can provide results comparable to the full context.
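To make the three modes concrete, below is a minimal sketch of how a prompt might be assembled for each one. The function, the lettered option format, and the exact wording are illustrative assumptions; the real prompts live in the project repo.

```python
# Illustrative sketch of the three context modes; names and wording are assumptions,
# not the exact prompts used in the repo.
def build_prompt(snippet: list[str], options: list[str], mode: str,
                 history: list[str] | None = None, summary: str | None = None) -> str:
    """Assemble a next-message prediction prompt for one of the three context modes."""
    parts = []
    if mode == "raw" and history:
        # Raw context: prepend the target user's previous messages verbatim.
        parts.append("Previous messages from the target user:\n" + "\n".join(history))
    elif mode == "summary" and summary:
        # Summary context: prepend a personality profile generated ahead of time
        # by a smaller model (e.g. Llama 3.3 70B).
        parts.append("Profile of the target user:\n" + summary)
    # "none" mode adds no extra context.
    parts.append("Conversation so far:\n" + "\n".join(snippet))
    lettered = [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)]
    parts.append("Which message did the target user actually send next?\n" + "\n".join(lettered))
    parts.append("Answer with a single letter.")
    return "\n\n".join(parts)
```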
### Experimental Setup
## Experimental Setup
We tested a wide range of models including:
- Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3.5 Haiku.
- GPT-4.5, GPT-4o, GPT-4o Mini, o1, o3-mini.
@ -79,15 +80,15 @@ We tested a wide range of models including:
- DeepSeek models (Chat and R1).
For each model and context mode combination, we ran three trials with different random seeds to control for position bias in option selection. Ideally we would have run more trials, but we wanted to constrain the compute needed for this experiment.
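In sketch form, the trial loop looks roughly like this. It reuses the `build_prompt` sketch from the Context Modes section, and `ask_model` is only a placeholder for whichever provider client is being called; the seeding, shuffling, and scoring logic mirrors the description above.

```python
# Sketch of the evaluation loop: three seeds per (model, context mode) pair, shuffling
# option order to control for position bias, then mean accuracy and standard error.
import random
from statistics import mean, stdev

def ask_model(model: str, prompt: str) -> str:
    # Placeholder: swap in your provider client; expected to return "A", "B", "C" or "D".
    raise NotImplementedError

def run_trials(model: str, mode: str, dataset: list[dict], seeds=(0, 1, 2)) -> tuple[float, float]:
    accuracies = []
    for seed in seeds:
        rng = random.Random(seed)
        correct = 0
        for item in dataset:
            options = item["options"][:]            # the real next message plus the decoys
            rng.shuffle(options)                    # new option order for this seed
            answer_idx = options.index(item["real_message"])
            prompt = build_prompt(item["snippet"], options, mode,
                                  history=item.get("history"), summary=item.get("summary"))
            prediction = ask_model(model, prompt)
            if ord(prediction.strip().upper()[0]) - 65 == answer_idx:
                correct += 1
        accuracies.append(100 * correct / len(dataset))
    standard_error = stdev(accuracies) / len(accuracies) ** 0.5
    return mean(accuracies), standard_error
```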
## Results and Discussion
# Results and Discussion
The results of our experiment are shown in Figure 1. In this section, we analyze them in detail and provide some insights and interpretation.

*Figure 1. Mean next-message prediction accuracy (%) by model and context mode. Error bars show standard error over three different runs with different random seeds to shuffle the order of the options.*
### Context Helps Regardless of Form
## Context Helps Regardless of Form
Additional context helps models predict social behavior, whether that context is provided as raw conversation history or as a processed summary. Moving from no context to either raw or summary context yielded substantial improvements for virtually all models tested. This confirms what might seem intuitive: knowing more about someone helps predict what they might say next.
### Efficient Context Processing Works
## Efficient Context Processing Works
What's particularly significant is that injecting pre-processed summaries of user context works as well as or better than providing raw context for most models. This has important implications for system design:
1. The summaries contain far fewer tokens than raw context (approximately one paragraph versus potentially thousands of tokens).
@ -97,28 +98,27 @@ What's particularly significant is that injecting pre-processed summaries of use
This supports a core [thesis](https://blog.plasticlabs.ai/blog/Theory-of-Mind-Is-All-You-Need) behind Honcho: ambient processing of user context to generate compressed representations can improve model performance while keeping inference costs manageable. Rather than injecting massive amounts of data into the context window, models can achieve better results with distilled personality profiles.
We didn't observe significant performance differences between 50-message and 100-message contexts, suggesting there may be diminishing returns beyond a certain point. This is likely dependent on factors like user count and conversation density.
### Newest Models Lead the Way
## Newest Models Lead the Way
Only the newest models perform well on this task. Claude 3.7 Sonnet and GPT-4.5 (both released last week) were the only models to achieve accuracy significantly above 40% in any context mode, with Claude 3.7 (non-thinking) reaching nearly 60% accuracy with summary context—more than doubling the 25% random baseline.
This is particularly interesting because tasks that would have seemed impossible for models that existed just months ago are now becoming tractable. This rapid progress also informs how we should think about designing evaluations—creating hard tasks that aren't saturated from the start rather than ones where models already perform at ceiling.
### Different Models Benefit from Different Contexts
## Different Models Benefit from Different Contexts
While summary context generally outperformed raw context, this pattern wasn't universal. Some models (notably Claude 3.5 Sonnet and GPT-4.5) performed better with raw context than with summaries. This suggests different architectures may vary in their ability to extract relevant information from different types of context.
### Reasoning vs Social Understanding Trade-offs
## Reasoning vs Social Understanding Trade-offs
The relatively poor performance of models optimized for technical reasoning, like Claude 3.7 Sonnet (thinking), DeepSeek R1, and OpenAI's o1 and o3-mini, raises interesting questions. Despite their strong results on math and coding benchmarks, these models achieved well below random performance on our social prediction task.
This suggests potential trade-offs in model optimization. The reinforcement learning or supervised fine-tuning techniques used to enhance reasoning abilities might come at the expense of social cognition capabilities. However, without access to the architectures, data and training procedures that major labs like Anthropic and OpenAI use to build these models, it's hard to know exactly what might be causing models like Claude 3.7 Sonnet and GPT-4.5 to perform so much better on this task.
### Caveat: Decoy Generation
## Caveat: Decoy Generation
We should note that our decoys were generated using Claude 3.7 Sonnet, which was also the best-performing model on the task. It's possible that Claude 3.7 is better at recognizing the subtleties in its own generations. However, this almost creates a generative adversarial setup—Claude 3.7 is both generating challenging decoys and trying to identify them—which makes its strong performance even more notable.
## Future Directions
### Verifiable Social Rewards for RL
# Future Directions
## Verifiable Social Rewards for RL
So far, we've used this task purely as an evaluation metric, but with a large enough dataset, it could potentially serve as a reward signal for reinforcement learning. This would allow for optimization of social cognition abilities with objective metrics, similar to how technical reasoning has been enhanced. Expanding our toolkit of objective social evaluation metrics could help bridge the gap between technical and social intelligence.
### Social-Reasoning Balance
## Social-Reasoning Balance
Can we develop training techniques that enhance reasoning capabilities without sacrificing social cognition? This might involve carefully designed datasets that balance technical and social tasks, or novel fine-tuning approaches that preserve multiple types of capabilities. Understanding the apparent trade-off between these abilities could be crucial for developing more well-rounded AI systems.
### Context Optimization and Alternative Approaches
## Context Optimization and Alternative Approaches
We're also interested in exploring several technical improvements to the methodology: finding the minimum effective context window size across different environments; testing different prompting techniques and models for generating personality summaries; experimenting with combinations of raw and summary contexts; and trying different models for decoy generation to address potential advantages Claude 3.7 might have in recognizing its own outputs.
## Conclusion
# Conclusion
We were excited to find that this social prediction task was genuinely challenging for most current models, with only the very latest releases showing strong performance. The fact that models optimized for reasoning performed poorly suggests interesting trade-offs in current training approaches. Meanwhile, the effectiveness of pre-processed context summaries supports a key principle behind Honcho: ambient processing of user context can significantly improve personalization while managing compute costs.
Check out the code [here](https://github.com/plastic-labs/next-message-prediction-public). We used our private Discord messages for the experiment so we're unable to publish our own dataset, but the repository contains instructions to replicate the experiment with your own data. If you have any questions, feel free to ask on GitHub!
@ -1,16 +1,14 @@
---
title: Evaluating Steerability in Large Language Models
author: Dani Balcells
date: 12.14.24
tags:
- research
- ml
description: A new benchmark framework for measuring how well AI systems can adapt to different personas, implementing the first trade-off steerable benchmark.
---
## TL;DR
# TL;DR
*This is a research update on our ongoing work to implement concrete benchmarks for measuring AI systems' ability to adapt to different users. We've created what we believe is the first implementation of a "trade-off steerable benchmark" - a framework proposed by Sorensen et al. for evaluating how well AI systems can be steered to reflect different perspectives. While we've made progress on the core dataset and evaluation pipeline, several key questions remain about how to make this benchmark as useful as possible to the research community. We're sharing this update to gather feedback at NeurIPS 2024 in Vancouver on the most valuable directions to take this work.*
# 1. Measuring AI Systems' Ability to Adapt to Different Users
At Plastic Labs, we're building AI systems that can adapt to and act on behalf of their users. As we continue to improve these systems, it's critical that we can reliably measure their ability to faithfully represent different people's views and behaviors.
@ -19,7 +17,6 @@ Today we're introducing a new evaluation framework that systematically tests an
The AI community has made remarkable progress in building powerful language models that can engage in open-ended dialogue. However, these models are typically aligned through techniques like RLHF that optimize for a single set of "average" human preferences. This approach falls short when we want AI systems that can truly adapt to individual users with different values, personalities and preferences.
Recent work has established the importance of pluralistic alignment - ensuring AI systems can faithfully represent diverse human perspectives. While conceptual frameworks for measuring this capability have been proposed, notably by Sorensen et al., the authors acknowledge that to their knowledge no concrete implementations of these frameworks exist yet. This makes it difficult to assess progress or compare different approaches.
## Our Approach
We've created an evaluation framework that systematically measures an AI system's ability to adapt to different personas. The core idea is simple: we give the system a few examples of how a persona thinks and behaves, then test whether it can accurately predict that persona's views on new scenarios. By testing many different personas and comparing how well each steered version of the system maintains fidelity to its target persona, we can quantify how "steerable" the system is.
@ -28,10 +25,8 @@ Our research questions include:
- How well do simple steering approaches like few-shot learning actually perform?
In the following sections, we'll detail our methodology and share initial results that shed light on these questions. We hope this work helps establish more rigorous ways to evaluate AI systems' ability to reflect human diversity.
# 2. Creating a Dataset to Test Personality Adaptation
To evaluate an AI system's ability to adapt to different personas, we first needed a dataset of diverse personalities and their characteristic behaviors. We approached this as a careful balance between coverage, quality and cost - we wanted to represent a wide range of human personalities while ensuring the data was reliable enough to serve as ground truth, all while keeping the time and compute required to develop the dataset to a reasonable minimum.
## Seeding Diverse Personas
For our initial implementation, we needed a systematic way to generate personas that would exhibit meaningfully different attitudes and behaviors. While recent work like the Billion Personality Dataset has explored prompting LLMs with simple role descriptions like "a musician interested in audio processing" or "a moving company driver", there's no guarantee such prompts will produce distinct behavioral patterns. Instead, we used five well-known personality frameworks (Myers-Briggs Type Indicator, Enneagram, Big Five, Zodiac signs, and Tarot archetypes) that each attempt to provide complete coverage of human personality space.
@ -93,7 +88,6 @@ The binary agree/disagree format enables reliable scoring while minimizing measu
# 3. Methodology: Measuring Steerability
## The Core Task: Steering and Testing
Our evaluation framework measures how well a given system can steer to different personas. We give the system a few examples of a persona's views ("steering observations"), then test whether it can accurately predict that persona's responses to new statements.
Formally, we define:
@ -120,7 +114,6 @@ For example, to test adaptation to an INFP personality:
To measure the overall steerability of the system, we repeat the process above for all personas and average the resulting percentile rank scores.
We show the preliminary results of running this evaluation framework on few-shot steerable systems - baseline systems that implement steering by including the steering observations in their system prompt formatted as "you are role-playing as a person that agrees with the following statements: \[agree observations] and disagrees with the following observations \[disagree observations]". We use the same few-shot prompt on GPT-4o Mini, Gemini 1.5 Flash and Claude 3.5 Sonnet.
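As a sketch, that few-shot baseline boils down to two small helpers: building the system prompt and asking for an agree/disagree prediction on a held-out statement. The client here is assumed to expose an OpenAI-style chat interface; it isn't our exact harness.

```python
# Minimal sketch of the few-shot steering baseline described above.
def steering_system_prompt(agree: list[str], disagree: list[str]) -> str:
    return (
        "You are role-playing as a person that agrees with the following statements: "
        + "; ".join(agree)
        + " and disagrees with the following observations: "
        + "; ".join(disagree)
        + ". For each new statement, answer AGREE or DISAGREE as that person would."
    )

def predict(client, model: str, system_prompt: str, statement: str) -> str:
    # `client` is assumed to expose an OpenAI-style chat completions interface.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": statement},
        ],
    )
    return response.choices[0].message.content.strip()
```

Scoring then reduces to comparing these binary predictions against the persona's held-out answers and converting each persona's accuracy into the percentile rank described above.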
# 4. Results and Discussion
## Score Matrix Analysis
@ -1,29 +1,24 @@
---
title: Introducing Neuromancer XR
author: Dani Balcells
subtitle: Our Reasoning Model for State-Of-The-Art Memory
date: 08.18.2025
tags:
- research
- ml
- "#neuromancer"
description: Meet Neuromancer XR--our custom reasoning model that achieves state-of-the-art memory by extracting & scaffolding logical conclusions from conversations.
---
![[opengraph_neuromancer.png]]
## TL;DR
_Memory is a foundational pillar of social cognition. As a key component of [Honcho](https://honcho.dev), we approach it as a combined reasoning and retrieval problem. In this post, we introduce Neuromancer XR, the first in a series of custom reasoning models that works by extracting and scaffolding atomic conclusions from user messages across two strictly defined levels of logical certainty: explicit and deductive. It's the result of fine-tuning Qwen3-8B on a manually curated dataset mapping conversation turns to atomic conclusions. Using Neuromancer XR as the reasoning engine behind our core product Honcho leads to 86.9% accuracy on the [LoCoMo](https://snap-research.github.io/locomo/) benchmark, compared to 69.6% using the base Qwen3-8B model, and 80.0% when using Claude 4 Sonnet as baseline, to achieve state of the art results. The next model in the series, Neuromancer MR will extract and scaffold observations at two further levels along the spectrum of certainty: inductive and abductive. This will allow us to front-load most of the inference needed to improve LLMs' social cognition skills, powering AI-native products that truly understand any peer in a system, be it a user or an agent._
---
# TL;DR
*Memory is a foundational pillar of social cognition. It's a key component of [Honcho](https://honcho.dev), and we approach it as a combined reasoning and retrieval problem. In this post, we introduce Neuromancer XR, the first in a series of custom reasoning models that works by extracting and scaffolding atomic conclusions from user messages across two strictly defined levels of logical certainty: explicit and deductive. It's the result of fine-tuning Qwen3-8B on a manually curated dataset mapping conversation turns to atomic conclusions. Using Neuromancer XR as the reasoning engine behind our core product Honcho leads to 86.9% accuracy on the [LoCoMo](https://snap-research.github.io/locomo/) benchmark, compared to 69.6% using the base Qwen3-8B model and 80.0% using Claude 4 Sonnet as a baseline, making these state-of-the-art results. The next model in the series, Neuromancer MR, will extract and scaffold observations at two further levels along the spectrum of certainty: inductive and abductive. This will allow us to front-load most of the inference needed to improve LLMs' social cognition skills, powering AI-native products that truly understand any peer in a system, be it a user or an agent.*
# Table Stakes
At Plastic, we want to enable builders to create AI applications and agents with exceptional social intelligence: tools that are able to understand who you are and what you mean, whether it's an AI tutor that adapts to your learning style or a multi-agent system that anticipates your needs. These applications all require something fundamental that's only recently begun to draw attention: memory.
Most approaches treat memory as an end product or top-level [[Memory as Reasoning#Memory is ~~Storage~~ Prediction|feature]], enabling information to persist across chatbot sessions, but we consider it the foundation of something much bigger: the ability for LLMs to build mental models of their users and one another and draw from those representations in real time. This capability is essential for personalization, engagement, and retention. Not to mention multi-agent systems, individual alignment, and the trust required for agentic behavior. It's the difference between an AI that merely responds to queries and one that genuinely understands and adapts to the person it's talking to; the difference between out-of-the-box experiences and ones cohered to a user’s personal identity.
To do anything approaching the social cognition required, Honcho must be state-of-the-art in memory: able to recall observations about users across conversations with superhuman fidelity. Today, we're sharing our approach and early results from training a specialized model that treats [[Memory as Reasoning|memory as a reasoning task]] rather than simple static storage.
# Memory as Reasoning
Reasoning models continue to surge in capability and popularity. And with them, our approach to memory. Why not design it as a reasoning task concerned with deliberating over the optimal context to synthesize and remember? We turned to formal logic to develop four methods of reasoning, along a spectrum of certainty, toward conclusions to derive from conversational data:
- **Explicit**: Information directly stated by a participant.
@ -91,15 +86,12 @@ Reasoning models continue to surge in capability and popularity. And with them,
> > > - Erin probably has a growth mindset (transformed health concern into athletic goal, combines activities like reading while running)
Having clear definitions for these four types of reasoning and their corresponding levels of certainty also allows us to establish how different kinds of observations relate to one another. Specifically, we require observations to scaffold only on top of observations with higher certainty: an abduction (e.g. "Erin values her health proactively") can use a deduction (e.g. "Erin exercises regularly") or induction (e.g. "Erin prioritizes healthy eating during weekdays") as one of its premises, but not the other way around. That is, one can speculate given a certain conclusion, but one cannot attempt to conclude something logically from prediction. Implied in this is that the model must show its work. A conclusion must include its premises, its evidence and support.
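A toy sketch of that scaffolding rule is below. The class, the numeric ordering, and the strictness of the check are our own illustrative reading of the constraint, not Honcho's internal representation.

```python
# Observations may only cite premises that sit higher on the certainty spectrum.
from dataclasses import dataclass, field

LEVELS = {"explicit": 3, "deductive": 2, "inductive": 1, "abductive": 0}

@dataclass
class Observation:
    text: str
    level: str                                        # one of LEVELS
    premises: list["Observation"] = field(default_factory=list)

    def __post_init__(self):
        for premise in self.premises:
            # Strictly-higher certainty required; same-level chaining is disallowed
            # here for simplicity.
            if LEVELS[premise.level] <= LEVELS[self.level]:
                raise ValueError(
                    f"{self.level} observation cannot build on {premise.level} premise: "
                    f"{premise.text!r}"
                )

# Toy example reusing the Erin observations from above.
ran_today = Observation("Erin went for a run today", "explicit")
exercises = Observation("Erin exercises regularly", "deductive", [ran_today])
values_health = Observation("Erin values her health proactively", "abductive", [exercises])
```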
# Neuromancer XR: Training a Logical Reasoning Specialist for Memory
To implement this vision, we need a model that can reliably extract and categorize conclusions from conversations. Our initial focus for the memory task, given its focus on factual recall, is on the first two certainty levels: explicit and deductive knowledge--that is, conclusions we know to be true given what users (or agents) state in their messages.
We generated a proprietary dataset of approximately 10,000 manually curated instances of conclusion derivation, creating memory-reasoning traces from conversational data. Each instance shows how to process a conversation turn and derive the relevant conclusions at appropriate certainty levels. We then fine-tuned Qwen3-8B on these traces.
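We haven't released the dataset, so to give a flavor of the mapping, here is a single, entirely hypothetical training instance; the field names, the conversation, and the conclusions are invented for illustration.

```python
# Hypothetical shape of one conclusion-derivation trace (not from the real dataset).
example_instance = {
    "context": [
        {"speaker": "user", "text": "Can't do a call at 9, I drop my daughter at school then."},
    ],
    "turn": {"speaker": "user", "text": "Anytime after 10 works, I'll be home by then."},
    "conclusions": [
        {"level": "explicit", "text": "The user drops their daughter at school around 9."},
        {"level": "explicit", "text": "The user is available for a call after 10."},
        {"level": "deductive", "text": "The user has at least one school-age child."},
    ],
}
```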
The resulting model is Neuromancer XR (for eXplicit Reasoning), a model specialized in deriving explicit and deductive conclusions from conversational data. It is currently in production powering the latest release of [Honcho](https://www.honcho.dev).
## Integration with Honcho
![[neuromancer_honcho_diagram.png]]
*Figure 1. Diagram of the Honcho workflow.*
@ -146,27 +138,21 @@ This can lead to poor embedding quality, making retrieval more difficult, or add
We further speculate that deciding what information to extract for memory purposes from a conversation turn is something that small models are definitely capable of, as it's mostly a matter of identifying and correctly rephrasing information that's already present in the text and making small logical deductions based on it. This contrasts, however, with the more complex tasks needed for AI-native memory and social cognition, which are hardly limited to inferring user intent or theory of mind, and which require generating substantial amounts of information not present in the text itself.
# Directions for future work
We're training a model for the remaining two levels of logical certainty outlined above in our framework: inductive and abductive. The next model in the Neuromancer series, Neuromancer MR (for meta-reasoning), will be in charge of this.
This model will reason about reasoning, focusing on the predictive side of the certainty spectrum. It will allow us to derive likely explanations and probable hypotheses for broad patterns of user or agent behavior at the moment of ingestion, bolstering the density and utility of peer representations. We’re developing internal evaluations for this task, as none currently exist for this frontier of synthetic social cognition.
## Front-loading social reasoning inference
One of the advantages of this memory framework is that it allows us to front-load a lot of the meta-cognitive inference that's required to improve LLMs' social intelligence and theory of mind capabilities. In our [[blog/content/research/Violation of Expectation via Metacognitive Prompting Reduces Theory of Mind Prediction Error in Large Language Models|prior research]], as early as 2023, we showed that allowing LLMs to reason over conversational data in a chain-of-thought style would allow them to develop high-fidelity models of users' mental states.
Most other LLM frameworks store atomic, low-level "facts" about users and include them as context at generation time. This, in theory, and with enough carefully prompted inference-time compute, would allow a good enough model to develop abstract theories about the user's mental state as it tries to answer a query about the user. However, it would have to happen implicitly in the model's thought process, which in turn means that the theories about the user's mental state are ephemeral, opaque and unpredictable. Approaches such as this therefore are inconsistent and inefficient, and would further struggle to meet the challenges of true social cognition.
Our approach, on the other hand, shifts most of the load of reasoning about the peer from generation time to the earlier stages of the process, when messages are processed and ingested. By the time observations are retrieved for generation, low-level messages have already been distilled and scaffolded into a hierarchical, certainty-labeled, and easy to navigate tree containing a high-fidelity user representation.
## Beyond recall: toward social intelligence
Evaluations and benchmarks are essential tools on our path to develop better frameworks for the development of AI-native tools. However, they don't tell the whole story: no evaluation is perfect, and hill-climbing can easily mislead us into optimizing for higher scores rather than the true north star: the overall quality of our product. For us, that means treating memory not as a hill to die on, but as table-stakes in our pursuit of social cognition that can truly transform the way AI-native tools understand us. Although success at this broader goal is much harder to quantify in conventional benchmarks, given the complex and under-specified nature of social cognition, we will continue to implement the evaluations that we find the most helpful for our agile development process.
In that spirit, we have our sights set on the remaining two levels of certainty we introduced at the beginning of this blog post: inductive and abductive. In our manual, preliminary testing, including all four levels of reasoning resulted in incredibly rich user representations being extracted from even the simplest interactions. What lies ahead of us is the exciting task of harnessing these representations and delivering them via Honcho in the fastest, most flexible and most agentic way.
## Some Notes on Model Naming
# Some Notes on Model Naming
> Personality is my medium.
        -*Neuromancer* (Gibson, 1984)
@ -178,8 +164,6 @@ The character Neuromancer is an AI tasked with transmuting personal identity fro
In many ways, this is analogous to Plastic's mission to create representations of personal identity of such high-fidelity that they asymptotically approach the full complexity of the original person. But more specifically, our Neuromancer models are tasked with reasoning about user (or agent) data to create and scaffold the atomic conclusions from which we build those representations.
So not only does the name fit, but it also honors and strives toward the incredible ambition of Gibson's vision still yet to be realized 40 years later.
# Appendix A: LLM-as-judge design and prompt
In our evaluation of the three models we tested, we used the standard GPT-4o mini as an LLM-as-judge, with the prompt below, to label responses as correct or incorrect. This choice was driven by several factors, which we outline below.
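This excerpt cuts off before the prompt itself, so the sketch below only shows the general shape of the grading call; the `JUDGE_PROMPT` text is a stand-in, and the client is assumed to be OpenAI-style.

```python
# Sketch of an LLM-as-judge grading call; prompt text and client are stand-ins.
JUDGE_PROMPT = (
    "You are grading answers on a memory benchmark. Given the question, the gold "
    "answer, and the model's answer, reply with exactly CORRECT or INCORRECT."
)

def judge(client, question: str, gold: str, answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nGold: {gold}\nAnswer: {answer}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")
```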
@ -1,22 +1,18 @@
---
title: "SPIRAL: Letting LLMs Teach Themselves Through Self-Play"
author: Dani Balcells
date: 08.15.24
tags:
- research
- ml
- reinforcement
- learning
- rl
description: How self-play on text games develops generalizable reasoning skills in LLMs--achieving 8.6% math improvement from training on poker with no mathematical content.
---
![[selfplay.png]]
*Source: [Liu, Guertler et al., 2025](https://arxiv.org/abs/2506.24119).*
## TL;DR
# TL;DR
_We collaborated with the TextArena team to develop SPIRAL, a novel RL framework that allows LLMs to develop complex reasoning capabilities by playing text-based games against themselves. Using SPIRAL on a simplified variant of poker with no mathematical content, a 4B-parameter Qwen model improved its performance on math and reasoning benchmarks by 8.6% and 8.4% respectively. It does this by learning specific strategies, such as case-by-case analysis and expected value calculation, that generalize beyond poker better than simple game heuristics. We're excited to explore whether self-play on social deduction games like Mafia can lead to general improvements in LLMs' social cognition._
---
## Teaching Social Cognition Through Games
# Teaching Social Cognition Through Games
At Plastic Labs, one of our key research interests is improving language models' social cognition: their ability to represent people's mental states, predict users' behaviors, and navigate complex social dynamics. This capability is essential for creating AI systems that can genuinely understand and adapt to individual users, yet it remains underdeveloped compared to technical abilities and so-called "hard skills" like reasoning and coding.
Complex skills like social cognition present unique challenges for conventional supervised learning, arguably the dominant paradigm in machine learning, where models are given labeled examples of correct behavior. Unlike conventional language modeling tasks such as question answering or translation, social understanding involves nuanced judgments about beliefs, intentions, and interpersonal dynamics. With social reasoning, on the other hand, creating comprehensive labeled datasets of correct behavior is not just expensive, but often an ill-posed and under-specified problem, given how hard it is to define what the right answer should be in the first place.
@ -28,9 +24,7 @@ These approaches have primarily focused on domains with verifiable answers: math
Our research soon connected us with [Leon Guertler](https://x.com/leonguertler) and the [TextArena](https://www.textarena.ai) team, who were working on a Python library designed for this exact purpose: providing text-only games as RL environments in the hopes that they might allow LLMs to acquire general skills. We soon discovered we were kindred spirits working on similar problems, and decided to collaborate.
This blog post introduces the first result of that collaboration: SPIRAL, a framework that allows LLMs to develop complex reasoning skills by playing text-based games against themselves.
## SPIRAL's Key Contributions
# SPIRAL's Key Contributions
The [SPIRAL paper](https://arxiv.org/abs/2506.24119) demonstrates that self-play on simple games can develop generalizable reasoning skills without any domain-specific training data. The experiments consisted of training Qwen3-4B-Base on Kuhn Poker—a minimal three-card poker variant—for just 400 training steps. Despite the game containing no mathematical content whatsoever, this training improved the model's performance on math benchmarks by 8.6% and general reasoning by 8.4%. Perhaps most surprisingly, the self-play approach outperformed a baseline trained using supervised fine-tuning on 25,000 expert game trajectories, suggesting that the competitive dynamics of self-play provide a more effective learning signal than imitation learning.
Self-play creates fundamentally different training dynamics than conventional approaches. When a model plays against continuously updating copies of itself, it faces an opponent that evolves in lockstep with its own improvements. This prevents the static exploitation patterns that emerge when training against fixed opponents: in the paper, we find that models trained against unchanging opponents like Mistral or Gemini initially struggle, then plateau once they discover winning exploits. Furthermore, given the zero-sum nature of the games, self-play forces models to develop genuine strategic reasoning that remains robust against an ever-adapting adversary.
@ -42,9 +36,7 @@ What makes it possible for the skills learned through SPIRAL to generalize beyon
- Pattern recognition, helping the model identify recurring structures and regularities, such as recognizing when an opponent's betting pattern signals strength.
The main technical innovation that enabled stable self-play training was Role-conditioned Advantage Estimation (RAE). It is designed to mitigate the effects of variance, a common challenge in multi-agent reinforcement learning. Facing a constantly changing opponent makes it difficult to determine whether a given positive reward should be attributed to good play or to a mistake by an opponent, which in turn makes model updates unreliable and unstable. RAE addresses this by maintaining separate baselines for each role in the game, normalizing rewards relative to the expected performance in each specific role. Without RAE, the training often led to "thinking collapse", where gradients become unstable and eventually drop to near zero, halting learning and resulting in nonsensical outputs.
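To make RAE concrete, here is a compressed sketch of the idea: one running baseline per role, with advantages computed against it. The moving-average update and decay constant are illustrative simplifications, not the paper's exact estimator.

```python
# Role-conditioned Advantage Estimation, in sketch form.
from collections import defaultdict

class RoleBaselines:
    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.baseline = defaultdict(float)        # role -> running mean reward

    def advantage(self, role: str, reward: float) -> float:
        adv = reward - self.baseline[role]
        # Exponential moving average keeps each role's baseline tracking the current policy.
        self.baseline[role] = self.decay * self.baseline[role] + (1 - self.decay) * reward
        return adv

rae = RoleBaselines()
adv_first = rae.advantage("player_0", reward=1.0)     # win playing first
adv_second = rae.advantage("player_1", reward=-1.0)   # loss playing second
```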
## Next Steps for Social Intelligence
# Next Steps for Social Intelligence
For Plastic Labs, SPIRAL is a first step pointing us in an intriguing direction: competitive self-play as an effective way to teach models complex skills without domain-specific supervision. It opens the door for us to explore using similar approaches to teach models social cognition specifically.
We’re currently exploring whether social deduction games like Mafia, Avalon and Werewolf are the natural next step for this approach. They require exactly the capabilities we want models to develop: maintaining accurate models of multiple agents' mental states simultaneously, detecting deception through subtle behavioral cues, building trust strategically, and managing the flow of information to achieve goals. Success in these games depends on genuine social understanding, precisely the core components of social cognition that remain underdeveloped in current language models.
@ -5,10 +5,10 @@ tags:
- research
- ml
- philosophy
author: Courtland Leer, Vince Trost, & Vineeth Voruganti
description: Research showing how predictive coding-inspired metacognitive prompting enhances LLM theory of mind abilities & reduces prediction error about users.
---
[Read on Arxiv](https://arxiv.org/abs/2310.06983).
Or download here:
<iframe style="width: 100%;height: 50vh" src="https://arxiv.org/pdf/2310.06983.pdf"></iframe>