honcho UCM blog & assets draft

This commit is contained in:
Courtland Leer 2024-01-17 20:52:17 -05:00
parent 354b653b6f
commit be78ac8aa8
12 changed files with 103 additions and 104 deletions

Five binary image assets added (256 KiB, 282 KiB, 1.1 MiB, 361 KiB, 209 KiB); previews not shown.

View File

@ -0,0 +1,93 @@
## TL;DR
Today we drop the first release of a project called *Honcho* (LINK), an open-source version of the OpenAI Assistants API. Honcho manages your AI app's data on a per-user basis, allowing for multiple concurrent identities. This layer is glaringly absent from the existing stack; at full maturity, Honcho will usher in atomic, disposable agents that are user-first by default.
## Plastic Lore
[Plastic Labs](https://plasticlabs.ai) was conceived as a research group exploring the intersection of education and emerging technology. Our first cycle focused on how the incentive mechanisms and data availability made possible by distributed ledgers might be harnessed to improve learning outcomes. But with the advent of ChatGPT and a chorus of armchair educators proclaiming tutoring solved by the first nascent consumer generative AI, we shifted our focus to large language models.
As a team with backgrounds spanning machine learning and education, we found the prevailing narratives overestimating short-term capabilities and under-imagining long-term potential. Fundamentally, LLMs were and still are 1-to-many instructors. Yes, they herald the beginning of a revolution in personal access not to be discounted, but every student is still ultimately getting the same experience. And homogenized educational paradigms are by definition under-performant on an individual level. If we stop here, we're selling ourselves short.
![[zombie_tutor_prompt.jpg]]
*A well-intentioned but monstrously deterministic [tutor prompt](https://www.oneusefulthing.org/p/assigning-ai-seven-ways-of-using)*.
Most edtech projects we saw emerging actually made foundation models worse by adding gratuitous lobotomization and coercing deterministic behavior. The former stemmed from the typical misalignments plaguing edtech, like the separation of user and payer. The latter seemed to originate in deep misunderstandings about what LLMs are, and it translated into a huge missed opportunity.
So we set out to build a non-skeuomorphic, AI-native tutor that put users first. The same indeterminism so often viewed as LLMs' greatest liability is in fact their greatest strength. Really, it's what they _are_. When great teachers deliver effective personalized instruction, they don't consult some M.Ed flowchart, they leverage the internal personal context they have on the student and reason (consciously or basally) about the best pedagogical intervention. LLMs are the beginning of this kind of high-touch learning companion being _synthetically_ possible.
![[teacher_shoggot.png]]
*We're not so different after all* ([@anthrupad](https://twitter.com/anthrupad)).
Our [[Open-Sourcing Tutor-GPT|experimental tutor]], Bloom, [[Theory-of-Mind Is All You Need|was remarkably effective]]--for thousands of users during the 9 months we hosted it for free--precisely because we built [cognitive architectures](https://blog.langchain.dev/openais-bet-on-a-cognitive-architecture/) that mimic the theory-of-mind expertise of highly efficacious 1:1 instructors.
## Context Failure Mode
But we quickly ran up against a hard limitation: the failure mode we believe all vertical-specific AI applications will eventually hit if they want to be sticky, paradigmatically different from their deterministic counterparts, and able to realize the full potential here. That's context, specifically user context--Bloom didn't know enough about each student.
We're always blown away by how many people don't realize that large language models themselves are stateless. They don't remember shit about you. They're just translating the context they're given into a probable sequence of tokens. LLMs are like horoscope writers, good at writing general statements that feel very personal. You would be too if you'd ingested and compressed that much of the written human corpus.
![[geeked_dory.png]]
There are lots of developer tricks to give the illusion of state about the user, mostly injecting conversation history or some personal digital artifact into the context window. Another is running inference on that limited recent user context to derive new insights. This was the game changer for our tutor, and we still can't believe how under-explored that solution space is (more on this soon 👀).
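To make those tricks concrete, here's a minimal sketch of both patterns--injecting recent history into the context window, and running a second inference pass over that history to derive a durable insight about the user. The `complete()` function is a stand-in for whatever LLM call your stack uses; none of this is Bloom's actual code.

```python
# Illustrative sketch only -- `complete()` stands in for any LLM call
# (hosted API, local model, etc.); it is not a real library function.

def complete(prompt: str) -> str:
    """Hypothetical wrapper around your chat/completion endpoint."""
    raise NotImplementedError

def respond_with_history(history: list[str], user_message: str) -> str:
    # Trick #1: fake "state" by stuffing recent conversation into the prompt.
    transcript = "\n".join(history + [f"USER: {user_message}"])
    return complete(f"Continue this tutoring conversation:\n{transcript}\nTUTOR:")

def derive_user_insight(history: list[str]) -> str:
    # Trick #2: run inference *on* that limited context to produce a new,
    # durable insight about the user that can be injected into later prompts.
    transcript = "\n".join(history)
    return complete(
        "Based on this conversation, what does the student seem to "
        f"misunderstand, prefer, or care about?\n{transcript}"
    )
```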
To date, machine learning has been [[The machine learning industry is too focused on general task performance|far more focused on]] optimizing for general task completion than personalization. This is natural, although many of these tasks are still probably better suited to deterministic code. The field has also historically prized papers over products--it takes a while for research to morph into tangible utility. Put these together and you end up with a big blindspot over individual users and what they want.
The real magic of 1:1 instruction isn't subject matter expertise. Bloom and the foundation models it leveraged had plenty of that (despite what clickbait media would have you believe about hallucination in LLMs). Instead, it's personal context. Good teachers and tutors get to know their charges--their history, beliefs, values, aesthetics, knowledge, preferences, hopes, fears, interests, etc. They compress all that and generate customized instruction, emergent effects of which are the relationships and culture necessary for positive feedback loops.
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Human intelligent agency depends more on the intricate sphere of ideas and the cultural intellect that we have grown over thousands of years than on the quirks of our biological brains. The minds of modern humans have more in common with chatGPT than with humans 10000 years ago.</p>&mdash; Joscha Bach (@Plinz) <a href="https://twitter.com/Plinz/status/1735427295937020177?ref_src=twsrc%5Etfw">December 14, 2023</a></blockquote>
Large language models can be good at this too. With similar compression and generation abilities, they're uniquely suited (among existing technology) to get to know you. We really can have shared culture and relationships with LLMs, absent (if we like) any cringy anthropomorphism.
Bloom needed a mechanism to harvest and utilize more context about the student. So we built it one.
## Research Solutions
Prediction algorithms have become phenomenal at hacking attention using tabular engagement and activity data. But if we're thinking LLM-natively, a few questions emerge:
1. How are LLMs uniquely positioned to understand users?
2. What new affordances does this enable for modeling users?
3. Can that improve agent design, DX, & UX?
4. Does this enable more positive sum user data opportunities?
Every day human brains do incredibly sophisticated things with sorta-pejoratively labelled 'soft' insights about others. But social cognition is part of the same evolutionarily optimized framework we use to model the rest of the world.
We run continuous active inference on wetware to refine our internal world models. This helps us make better predictions about the world by minimizing the difference between our expectation and reality. That's more or less what learning is. And we use the same set of mechanisms to model other humans, i.e. get to know them.
In LLMs we have remarkable predictive reasoning engines with which we can begin to build the foundations of social cognition, and therefore model users with much more nuance and granularity--not just from their logged behavior, but by reasoning between the lines about that behavior's motivation and its grounding in the full account of their identity.
Late last year we published a [pre-print of research on this topic](https://arxiv.org/abs/2310.06983), and we've shown that these kinds of biologically-inspired frameworks can construct models of users that improve an LLM's ability to reason and make predictions about that individual user:
![[honcho_powered_bloom_paper_fig.png]]
*A [predictive-coding-inspired metacognitive architecture](https://youtu.be/PbuzqCdY0hg?feature=shared) from our research.*
We added it to Bloom and found it to be the missing piece for overcoming the user context failure mode. Our tutor could now learn about the student and use that knowledge effectively to produce better learning outcomes.
## Blast Horizon
Building and maintaining a production-grade AI app for learning catapulted us to this missing part of the stack. Lots of users, all growing in unique ways and all needing personalized attention that evolved over multiple longform sessions, forced us to confront the user context management problem with all its thorny intricacy and potential.
And we're hearing constantly from builders of other vertical-specific AI apps that personalization is the key blocker. In order for projects to graduate from toys to tools, they need to create new kinds of magic for their users. Mountains of mostly static software exist to help accomplish an unfathomable range of tasks, and lots of it can be personalized using traditional (albeit laborious for the user) methods. But LLMs can observe, reason, then generate the software _and the user context_, all abstracted away behind the scenes.
Imagine online stores generated just in time for the home improvement project you're working on; generative games with rich multimodality unfolding to fit your mood on the fly; travel agents that know itinerary needs specific to your family without being explicitly told; copilots that think and write and code not just like you, _but as you_; disposable, atomic agents with full personal context that replace your professional services--_you_ with a law, medical, accounting degree.
This is the kind of future we can build when we put users at the center of our agent and LLM app production.
## Introducing Honcho
So today we're releasing the first iteration of [[Honcho name lore|Honcho]], our project to redefine LLM application development through user context management. At this nascent stage, you can think of it as an open-source version of the OpenAI Assistants API.
Honcho is a REST API that defines a storage schema to seamlessly manage your application's data on a per-user basis. It ships with a Python SDK; you can read more about how to use it here.
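To make that concrete, here's a rough sketch of what storing data per user might look like from application code. The endpoint paths and payloads below are illustrative assumptions, not Honcho's documented API--see the repo and SDK docs for the real interface.

```python
# Hypothetical REST sketch -- routes and payloads are assumptions for
# illustration, not Honcho's actual API surface.
import requests

BASE = "http://localhost:8000"  # wherever your Honcho instance runs
APP = "bloom-demo"              # your application's namespace

# Each end user of your app gets their own bucket of sessions and messages.
user_id = "user-123"
session = requests.post(f"{BASE}/apps/{APP}/users/{user_id}/sessions").json()

# Store both sides of the conversation under that user's session.
requests.post(
    f"{BASE}/apps/{APP}/users/{user_id}/sessions/{session['id']}/messages",
    json={"is_user": True, "content": "Can you explain eigenvalues?"},
)
requests.post(
    f"{BASE}/apps/{APP}/users/{user_id}/sessions/{session['id']}/messages",
    json={"is_user": False, "content": "Sure -- let's start with a 2x2 matrix."},
)

# Later, retrieve that user's history to build context for the next turn.
history = requests.get(
    f"{BASE}/apps/{APP}/users/{user_id}/sessions/{session['id']}/messages"
).json()
```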
We spent lots of time building the infrastructure to support multiple concurrent users with Bloom, and too often we see developers running into the same problem: building a fantastic demo, sharing it with the world, then inevitably having to take it down due to infrastructure/scaling issues.
Honcho allows you to deploy an application with a single command that can automatically handle concurrent users. Going from prototype to production is now only limited by the amount of spend you can handle, not tedious infrastructure setup.
Managing app data on a per-user basis is the first small step in improving how devs build LLM apps. (And this framework in its full maturity will push the boundaries of what's possible.) Once you define a data management schema on a per-user basis, lots of new possibilities emerge around what you can do within a user's messages, across a user's sessions, and even across a user's sessions with different agents--see the sketch below.
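Here's a loose sketch of the hierarchy that schema implies--app → user → session → message--and why it makes intra-user queries natural, even across sessions and agents. The types are ours for illustration, not Honcho's actual storage model.

```python
# Illustrative data model only -- not Honcho's actual storage schema.
from dataclasses import dataclass, field

@dataclass
class Message:
    is_user: bool
    content: str

@dataclass
class Session:
    agent: str  # which agent/surface this session belongs to
    messages: list[Message] = field(default_factory=list)

@dataclass
class User:
    id: str
    sessions: list[Session] = field(default_factory=list)

def everything_a_user_said(user: User) -> list[str]:
    # Because data is keyed per user, queries can span *all* of that user's
    # sessions -- even sessions with different agents.
    return [m.content for s in user.sessions for m in s.messages if m.is_user]
```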
![[tron_bike.gif]]
## Get Involved
We're excited to see builders experiment with what we're releasing today and with Honcho as it continues to evolve.
Check out the GitHub repo (LINK) to get started and join our [Discord](https://discord.gg/plasticlabs) to stay up to date 🫡.

View File

@ -1,66 +0,0 @@
---
title: hold
date: Dec 19, 2023
---
### Scratch
meme ideas:
-something about statelessness (maybe guy in the corner at a party "they don't know llms are stateless")
-oppenheimer meme templates
excalidraw ideas:
## TL;DR
## Toward AI-Native Metacognition
At Plastic, we've been thinking hard for nearly a year about [cognitive architectures](https://blog.langchain.dev/openais-bet-on-a-cognitive-architecture/) for large language models. Much of that time was focused on developing [[Theory-of-Mind Is All You Need|a production-grade AI-tutor]], which we hosted experimentally as a free and permissionless learning companion.
The rest has been deep down the research rabbit hole on a particularly potent, synthetic subset of LLM inference--[[Metacognition in LLMs is inference about inference|metacognition]]:
> For wetware, metacognition is typically defined as "thinking about thinking" or often a catch-all for any "higher-level" cognition.
>...
>In large language models, the synthetic corollary of cognition is inference. So we can reasonably define a metacognitive process in an LLM as any that runs inference on the output of prior inference. That is, inference itself is used as context--*inference about inference*. It might be instantly injected into the next prompt, stored for later use, or leveraged by another model.
Less verbose, it boils down to this: if metacognition in humans is *thinking about thinking*, then **metacognition in LLMs is *inference about inference***.
We believe this definition helps frame an exciting design space for several reasons:
- Unlocks regions of the latent space unapproachable by humans
- Leverages rather than suppresses the indeterministic nature of LLMs
- Allows models to generate their own context
- Moves the research and development scope of focus beyond tasks and toward identity
- Affords LLMs the requisite intellectual respect to realize their potential
- Enables any agent builder to quickly escape the gravity well of foundation model behavior
## Research Foundations
(Def wanna give this a more creative name)
(@vintro, should we reference some of the papers that explicitly call out "metacognition"? or maybe we get into some of that below)
- historically, machine learning research has consisted of researchers intelligently building datasets with hard problems in them to evaluate models' ability to predict the right answer for, whatever that looks like
- someone comes along, builds a model that generalizes well on the benchmarks, and the cycle repeats itself, with a new, harder dataset being built and released
- this brings us to today, with datasets like [MMLU](https://arxiv.org/abs/2009.03300), [HumanEval](https://arxiv.org/abs/2107.03374v2), and the hilariously named [HellaSwag](https://arxiv.org/abs/1905.07830)
- what they all have in common is they're trying to explore a problem space as exhaustively as possible, providing a large number of diverse examples to evaluate on (MMLU - language understanding, HumanEval - coding, HellaSwag - reasoning)
- high performance on these datasets demonstrates incredible *general* abilities
- and in fact their performance on these diverse datasets suggests their capabilities are probably much vaster than we think they are
- but they're not given the opportunity to query these diverse capabilities in current user-facing systems
## Designing Metacognition
how to architect it
inference - multiple
storage - of prior inference
between inference, between session, between agents
examples from our research
## Selective Metacog Taxonomy
a wealth of theory on how cognition occurs in humans
but no reason to limit ourselves to biological plausibility
### Metamemory
### Theory of Mind
### Imaginative Metacognition
## The Future/Potential/Importance
-intellectual respect
-potential features

View File

@ -1,18 +0,0 @@
*the post to accompany basic user context management release*
mtg notes
-us building bloom
-issues we discovered
-llms don't have memory
-complications associated --ballooning of context management
-so we made this
-OpenAI Assistants API, general version of this
-machine learning over fixation of llm performance
-toys v tools
-research v product
next mtg notes
-ran up against the problem of completing the task for a specific person
-entire space has been fixated on tasks
-story we told to investors over and over
-llm uniquely positioned to do this

View File

@ -1,4 +1,4 @@
Earlier this year I was reading *Rainbows End*, [Vernor Vinge's](https://en.wikipedia.org/wiki/Vernor_Vinge) [seminal augmented reality novel](https://en.wikipedia.org/wiki/Rainbows_End_(novel)), when I came across the term "Local Honcho[^1]":
Earlier this year [Courtland](https://x.com/courtlandleer) was reading *Rainbows End*, [Vernor Vinge's](https://en.wikipedia.org/wiki/Vernor_Vinge) [seminal augmented reality novel](https://en.wikipedia.org/wiki/Rainbows_End_(novel)), when he came across the term "Local Honcho[^1]":
>We simply put our own agent nearby, in a well-planned position with essentially zero latencies. What the Americans call a Local Honcho.
@ -10,10 +10,10 @@ Highlighting this, a major narrative arc in the novel involves intelligence agen
>Altogether it was not as secure as Vaz's milnet, but it would suffice for most regions of the contingency tree. Alfred tweaked the box, and now he was getting Parker's video direct. At last, he was truly a Local Honcho.
Plastic had been deep into the weeds on how to harvest, retrieve, and leverage user context with large language models for months. First to enhance the UX of our AI tutor (Bloom), then in thinking about how to solve this horizontally for all vertical-specific AI applications. It struck me that we faced similar challenges to the characters in _Rainbows End_ and were converging on a similar solution.
For months, Plastic had been deep into the weeds around harvesting, retrieving, & leveraging user context with LLMs. First to enhance the UX of our AI tutor (Bloom), then in thinking about how to solve this horizontally for all vertical-specific AI applications. It struck us that we faced similar challenges to the characters in _Rainbows End_ and were converging on a similar solution.
As you interface with the entire constellation of AI applications, you shouldn't have to redundantly provide context and oversight for every interaction. You need a single source of truth that can do this for you. You need a Local Honcho.
But as we've discovered, LLMs are remarkable at theory of mind tasks, and thus at reasoning about user need. So unlike in the book, this administration can be offloaded to an AI. And your Honcho can orchestrate the relevant context and identities on your behalf, whatever the operation.
[^1]: American English, from [Japanese](https://en.wikipedia.org/wiki/Japanese_language)_[班長](https://en.wiktionary.org/wiki/%E7%8F%AD%E9%95%B7#Japanese)_ (hanchō, “squad leader”), from 19th c. [Mandarin](https://en.wikipedia.org/wiki/Mandarin_Chinese)[班長](https://en.wiktionary.org/wiki/%E7%8F%AD%E9%95%B7#Chinese) (_bānzhǎng_, “team leader”). Probably entered English during World War II: many apocryphal stories describe American soldiers hearing Japanese prisoners-of-war refer to their lieutenants as _[hanchō](https://en.wiktionary.org/wiki/hanch%C5%8D#Japanese)_. ([Wiktionary](https://en.wiktionary.org/wiki/honcho))
[^1]: American English, from [Japanese](https://en.wikipedia.org/wiki/Japanese_language)_[班長](https://en.wiktionary.org/wiki/%E7%8F%AD%E9%95%B7#Japanese)_ (hanchō, “squad leader”)...probably entered English during World War II: many apocryphal stories describe American soldiers hearing Japanese prisoners-of-war refer to their lieutenants as _[hanchō](https://en.wiktionary.org/wiki/hanch%C5%8D#Japanese)_. ([Wiktionary](https://en.wiktionary.org/wiki/honcho))

View File

@ -1,9 +0,0 @@
TL;DR: they aren't very flexible for intermediate metacognition steps
It's interesting that the machine learning community has converged on this training paradigm, because it assumes only two participants in a conversation. Just thinking intuitively about what happens when you train/fine-tune a language model, you begin to reinforce token distributions that are appropriate to come in between the special tokens denoting human vs AI messages.
The issue we see here is that oftentimes there are a lot of intermediate reasoning steps you want to take in order to serve a more socially-aware answer. It's almost like the current state of inference is the equivalent of saying the first thing that comes to mind -- the quickness of one's wit can vary, but usually we think for a second before responding. We saw the advantages of doing this with Bloom (see [[Theory-of-Mind Is All You Need]]) and continue to be interested in exploring how much better this can get.
In order to assess a model's efficacy in this regard, I usually want to prompt it to generate as if it were the user--which is usually very hard, given that those types of responses never come after the special AI message token.
We're already anecdotally seeing very well-trained completion models follow instructions well, likely because instruction data is incorporated in their pre-training. Is chat the next thing to be subsumed by general completion models? If so, flexibility in the types of inferences you can make would be very beneficial. Metacognition becomes something you can do at any step in a conversation. Same with instruction following and chat. Maybe this is what starts to move language models in a much more general direction.
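A small sketch of that flexibility argument, assuming a raw completion endpoint behind a placeholder `complete()` function; the transcript format and speaker labels are invented for illustration.

```python
# Sketch only -- `complete()` is a stand-in for any raw completion endpoint.

def complete(prompt: str) -> str:
    raise NotImplementedError  # call your completion model here

transcript = (
    "STUDENT: I still don't get why the derivative of x^2 is 2x.\n"
    "TUTOR (thinking): They may be missing the limit definition; check that first.\n"
    "TUTOR: Let's revisit what a derivative measures before the formula.\n"
    "STUDENT:"  # a completion model can continue *as the user* from here --
                # something chat-templated models resist, since user-style text
                # never follows the special assistant token during fine-tuning
)
prediction_of_user = complete(transcript)
```

The intermediate "thinking" turn and the user simulation both sit outside the two-role template that chat fine-tuning reinforces, which is exactly the flexibility the note above is after.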

View File

@ -1,5 +1,5 @@
For wetware, metacognition is typically defined as "thinking about thinking" or often a catch-all for any "higher-level" cognition.
For wetware, metacognition is typically defined as thinking about thinking or often a catch-all for any higher-level cognition.
(In some more specific domains, it's an introspective process, focused on thinking about exclusively *your own* thinking or a suite of personal learning strategies...all valid within their purview, but too constrained for our purposes.)
(In some more specific domains, it's an introspective process, focused on thinking about exclusively _your own_ thinking or a suite of personal learning strategies...all valid within their purview, but too constrained for our purposes.)
In large language models, the synthetic corollary of cognition is inference. So we can reasonably define a metacognitive process in an LLM as any that runs inference on the output of prior inference. That is, inference itself is used as context--*inference about inference*. It might be instantly injected into the next prompt, stored for later use, or leveraged by another model. Experiments here will be critical to overcome [[The machine learning industry is too focused on general task performance|the machine learning community's fixation on task completion]].
In large language models, the synthetic corollary of cognition is inference. So we can reasonably define a metacognitive process in an LLM as any that runs inference on the output of prior inference. That is, inference itself is used as context--_inference about inference_. It might be instantly injected into the next prompt, stored for later use, or leveraged by another model. Experiments here will be critical to overcome [[The machine learning industry is too focused on general task performance|the machine learning community's fixation on task completion]].
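For concreteness, the pattern reduces to something like this sketch, where `complete()` is a placeholder for any model call and nothing is specific to a particular stack.

```python
# Minimal "inference about inference" sketch; `complete()` is a placeholder.

def complete(prompt: str) -> str:
    raise NotImplementedError

first_pass = complete("Explain photosynthesis to a curious 12-year-old.")

# Metacognitive step: the prior output itself becomes context for new inference.
second_pass = complete(
    "Here is a response you produced:\n"
    f"{first_pass}\n"
    "What does this response assume about the reader, and what should be "
    "checked or adjusted before the next turn?"
)
# The result can be injected into the next prompt, stored for later use,
# or handed to another model -- per the definition above.
```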

View File

@ -1,8 +1,7 @@
The machine learning industry has traditionally adopted an academic approach, focusing primarily on performance across a range of tasks. LLMs like GPT-4 are a testament to this, having been scaled up to demonstrate impressive & diverse task capability. This scaling has also led to emergent abilities, debates about the true nature of which rage on.
The machine learning industry has traditionally adopted an academic approach, focusing primarily on performance across a range of tasks. Large Language Models (LLMs) like GPT-4 are a testament to this approach, having been scaled up to demonstrate impressive capabilities across numerous tasks. This scaling has led to the emergence of new abilities, although debates about the true nature of these emergent abilities continue.
However, general capability doesn't necessarily translate to completing tasks as an individual user would prefer. This is a failure mode that anyone building agents will inevitably encounter. The focus, therefore, needs to shift from how language models perform tasks in a general sense to how they perform tasks on a user-specific basis.
However, the issue that arises with this approach is that while these models are generally capable, they may not perform tasks in the way an individual user would prefer. This is a failure mode that anyone building agents will inevitably encounter. The focus, therefore, needs to shift from how language models perform tasks in a general sense to how they perform tasks on a user-specific basis.
Take summarization. It's a popular machine learning task at which models have become quite proficient, at least from a benchmark perspective. However, when models summarize for users with a pulse, they fall short. The reason is simple: the models don't know this individual. The key takeaways for a specific user differ dramatically from the takeaways _any possible_ internet user _would probably_ note.
Take the task of summarization as an example. It's a popular machine learning task and models have become quite proficient at it, at least from a benchmark perspective. However, when these models summarize for users, the results often fall short. The reason for this is simple: the models don't summarize things the way an individual user would. The key takeaways for a user would differ from the takeaways of the average internet user.
Therefore, a shift in focus towards user-specific task performance would provide a much more dynamic and realistic approach. This would not only cater to the individual needs of the user but also pave the way for more personalized and effective machine learning applications.
So a shift in focus toward user-specific task performance would provide a much more dynamic & realistic approach. Catering to individual needs & paving the way for more personalized & effective ML applications.
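As a toy illustration of the summarization example above, the difference is simply whether user context makes it into the prompt; `complete()` and the profile fields are assumptions for the sketch.

```python
# Toy sketch: generic vs. user-specific summarization. `complete()` is a
# placeholder for any LLM call; the profile below is invented for illustration.

def complete(prompt: str) -> str:
    raise NotImplementedError

article = "..."  # some long document

generic_summary = complete(f"Summarize the following:\n{article}")

user_profile = (
    "Role: staff ML engineer. Cares about: inference cost, latency. "
    "Already knows: transformer basics. Skips: marketing framing."
)
personal_summary = complete(
    f"Summarize the following for this specific reader:\n{user_profile}\n\n"
    f"{article}\n\n"
    "Emphasize only what this reader would consider a key takeaway."
)
```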