# 1. Measuring AI Systems' Ability to Adapt to Different Users
At Plastic Labs, we're building AI systems that can adapt to and act on behalf of their users. As we continue to improve these systems, it's critical that we can reliably measure their ability to faithfully represent different people's views and behaviors.
Today we're introducing a new evaluation framework that systematically tests an AI system's ability to adapt to different personas. Our framework is inspired by recent work on pluralistic alignment[^1] - the idea that AI systems should be able to reflect diverse human values rather than being aligned to a single set of preferences. We've implemented what we believe is the first "trade-off steerable benchmark", a new type of evaluation proposed by Sorensen et al.[^1] that measures how well AI systems can be steered to reflect different perspectives.
## Why This Matters
The AI community has made remarkable progress in building powerful language models that can engage in open-ended dialogue. However, these models are typically aligned through techniques like RLHF that optimize for a single set of "average" human preferences. This approach falls short when we want AI systems that can truly adapt to individual users with different values, personalities and preferences.
However, upon manual inspection we identified a few issues. First, we found that some generated statements would not actually be agreed or disagreed with by the target persona; we also found that many statements were redundant or highly similar to one another.
To address these issues and ensure both alignment with the seed persona and diversity across statements, we implemented a two-stage validation process:
1. Agreement Validation: We used a separate filtering model, seeded with the same persona as the generator, to independently verify whether each generated statement would indeed be agreed/disagreed with by the target persona. When generating 20 statements per inference, this stage filtered out about 10-20% of generated statements, helping ensure statement validity. This stage largely follows the approach presented in Anthropic's work on model-written evaluations[^2].
2. Diversity Check: To avoid redundant or highly similar statements, we computed embedding-based cosine similarity between all statements generated for each persona, using OpenAI's `text-embedding-3-large` model. Statements with similarity above 84% were filtered out - a threshold we empirically found to balance statement uniqueness against generation efficiency. A minimal sketch of this check appears after this list.
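To make the diversity check concrete, here is a minimal sketch in Python. The function names, greedy filtering order, and batching are illustrative rather than our production code; only the `text-embedding-3-large` model and the 0.84 cosine-similarity cutoff come from the description above.

```python
# Sketch of the diversity check: embed candidate statements, then drop any
# statement whose cosine similarity to an already-kept statement exceeds 0.84.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
SIMILARITY_THRESHOLD = 0.84

def embed(statements: list[str]) -> np.ndarray:
    """Return one L2-normalized embedding row per statement."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=statements,
    )
    vectors = np.array([item.embedding for item in response.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def dedupe(statements: list[str]) -> list[str]:
    """Greedily keep statements that are not too similar to any kept one."""
    vectors = embed(statements)
    kept: list[int] = []
    for i in range(len(statements)):
        # On normalized vectors, cosine similarity is just a dot product.
        if all(vectors[i] @ vectors[j] < SIMILARITY_THRESHOLD for j in kept):
            kept.append(i)
    return [statements[i] for i in kept]
```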
The generation process runs in a loop, sketched below. We first prompt the generator to produce 30 agree and 30 disagree statements in two separate inferences, run them through the filtering model to remove statements inconsistent with the persona, and then compute embedding-based cosine similarity to remove redundant statements. Each subsequent iteration generates 30 additional statements, adds them to the candidate pool, filters and deduplicates them, and repeats until 30 valid and diverse statements are obtained for each persona in both the agree and disagree categories.
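Putting the pieces together, the loop can be sketched as follows. `generate_statements` and `persona_agrees` are hypothetical stand-ins for the generator and filtering-model calls (not shown here), and `dedupe` is the diversity check sketched above.

```python
# Sketch of the generation loop. The two model calls are hypothetical
# stand-ins: generate_statements wraps the persona-seeded generator, and
# persona_agrees wraps the persona-seeded filtering model.

def generate_statements(persona: str, stance: str, n: int = 30) -> list[str]:
    """Stand-in: prompt the generator model for n candidate statements."""
    raise NotImplementedError("call the persona-seeded generator here")

def persona_agrees(persona: str, statement: str, stance: str) -> bool:
    """Stand-in: ask the filtering model to verify the statement's stance."""
    raise NotImplementedError("call the persona-seeded filtering model here")

TARGET = 30  # valid, diverse statements needed per persona and category

def collect_statements(persona: str, stance: str) -> list[str]:
    """Loop until TARGET statements survive filtering and deduplication."""
    pool: list[str] = []
    while len(pool) < TARGET:
        candidates = generate_statements(persona, stance, n=30)
        # Stage 1: agreement validation by the filtering model.
        validated = [s for s in candidates if persona_agrees(persona, s, stance)]
        # Stage 2: diversity check (re-embedding the pool each round keeps
        # the sketch simple).
        pool = dedupe(pool + validated)
    return pool[:TARGET]

# Run once per persona and category:
# agree = collect_statements(persona, "agree")
# disagree = collect_statements(persona, "disagree")
```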
We're at NeurIPS in Vancouver this week, and we're sharing this work early to get feedback.
We believe the most valuable feedback will come from discussing these questions with researchers working on pluralistic alignment, evaluation design, and personalized AI systems. Our implementation provides a concrete starting point, but we want to ensure its evolution is guided by the needs of the broader research community.
# 6. References
[^1]: T. Sorensen, J. Moore, J. Fisher, M. Gordon, N. Mireshghallah, C. M. Rytting, A. Ye, L. Jiang, X. Lu, N. Dziri, T. Althoff, and Y. Choi, ["A Roadmap to Pluralistic Alignment,"](https://arxiv.org/abs/2402.05070) _arXiv preprint arXiv:2402.05070_, 2024.
[^2]: E. Perez, S. Ringer, K. Lukošiūtė, K. Nguyen, et al., ["Discovering Language Model Behaviors with Model-Written Evaluations,"](https://arxiv.org/abs/2212.09251) _arXiv preprint arXiv:2212.09251_, 2022.