Editor's Note: This blog article is a distillation of an in-person event held in San Francisco on 2024-05-12, facilitated by in partnership with . Quotes are paraphrased from the original conversation.

The alignment and ethical development of AI agents is a multifaceted issue. It involves navigating the conflicting values of multiple stakeholders, as well as grappling with the question of whether AI agents represent a fundamentally new kind of challenge that requires novel approaches to evaluation and control. The Ai Salon recently held a discussion inspired by the paper "The Ethics of Advanced AI Assistants".
👉 To jump directly to the takeaways and open questions, see the "Notes from the conversation" section below.
The Novelty of AI Agents: A Difference in Degree or Kind?
One of the key tensions that emerged from the conversation was whether AI agents represent a fundamentally new challenge requiring novel approaches to evaluation and control, or if they are simply an extension of existing technologies that can be managed using established frameworks. This question strikes at the heart of the AI agent alignment problem and has significant implications for how we approach the development and governance of these systems.
On one side of the debate, some participants argued that AI agents are not entirely unprecedented and that many of their potential risks and challenges have been encountered before in other contexts. The idea is that existing methods for evaluating and controlling complex technological systems may be applicable to AI agents, albeit with some adaptations. By drawing on the lessons learned from dealing with other autonomous or semi-autonomous systems, we may be able to develop effective strategies for managing the risks and challenges posed by AI agents without starting from scratch. As one participant said:
We've encountered these sorts of semi-autonomous digital systems before, like software viruses and worms. We have a whole set of technical and regulatory responses to that. It's not a silver bullet, it's not zero risk, but we know how to think about that problem and what kind of technical and legal measures we need to take in response.
An important component of this approach is a use-case-centric view of ethics and safety. In contrast to general-purpose alignment techniques like RLHF, which seek to create broad-scale value alignment, use-case centrism recognizes that safeguards, certifications, evaluations, and other aspects of governance and safety should be applied with respect to the context in which the AI agent is operating and the task it is performing. An extreme version of this view holds that the fact that the system is an AI agent is actually irrelevant; as long as we understand the use case, we can set acceptability parameters that should apply to any system performing the task, whether human, AI, or other software. An advantage of this view is that we can build on the work society has already done to define acceptability in many areas of public importance, be it healthcare, housing, or education.
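As a toy illustration of this use-case-centric framing, the sketch below attaches acceptability criteria to tasks rather than to the kind of actor performing them, so the same check applies whether the performer is a human, an AI agent, or other software. The use cases, thresholds, and fields are hypothetical, chosen only to show the shape of the idea.

```python
# A toy sketch of the use-case-centric view: acceptability criteria are attached
# to the task/context, not to whether the performer is a human, an AI agent, or
# other software. The use cases, thresholds, and checks below are invented for
# illustration only.
from dataclasses import dataclass

@dataclass
class AcceptabilityCriteria:
    min_accuracy: float           # e.g., drawn from existing professional standards
    requires_certification: bool  # e.g., licensing in healthcare or education
    requires_human_review: bool   # escalation rule for high-stakes decisions

# Criteria are defined per use case, independent of who or what performs the task.
USE_CASES = {
    "medical_triage": AcceptabilityCriteria(0.99, True, True),
    "loan_screening": AcceptabilityCriteria(0.95, True, True),
    "essay_feedback": AcceptabilityCriteria(0.80, False, False),
}

def is_acceptable(use_case: str, measured_accuracy: float,
                  certified: bool, human_review_available: bool) -> bool:
    """Return True if a performer (human, AI, or other software) meets the bar."""
    c = USE_CASES[use_case]
    return (measured_accuracy >= c.min_accuracy
            and (certified or not c.requires_certification)
            and (human_review_available or not c.requires_human_review))

# The same check applies to an AI agent or a human practitioner.
print(is_acceptable("medical_triage", measured_accuracy=0.992,
                    certified=True, human_review_available=True))
```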
However, other participants countered that AI agents may cross certain thresholds of capability and autonomy that fundamentally distinguish them from other technologies. This isn’t “business as usual” now, and certainly won’t be in the future. As one participant put it:
We may be underselling the emerging dynamics that come through [from AI advancements]. And, at a certain point, a difference of degree becomes one of kind. In particular, it's important to remember we don't yet have really powerful AI agents. We don't have systems that are capable of the kinds of long term strategic reasoning that will be available in just a couple of years.
This perspective treats our world as a complex system shaped by the interactions of many agents. In such a system, we should expect phase transitions, where the entire system shifts into regimes of radically different behavior. Predicting these phase transitions is notoriously difficult:
You can ask the question: when do you get phase transitions in systems? Whether it's the climate with tipping points, or a lake that all of a sudden starts mixing and turns into a dead, eutrophic lake. I think one of the core pieces comes back to that point: what can we predict based on what we know?
No easy answers were proposed for how to predict phase transitions or what their consequences would be. However, it was suggested that a reasonable place to start in characterizing systems of agents is the power of any individual agent in the system. A group of microbes acts differently than a colony of ants, which acts differently than a city of humans. The capabilities of the base "agent" aren't the only difference here, but they seem to be a critical one. AI agents may introduce yet another level of individual agent, one we should assume will have great impact on the overall system.
The upshot of this argument is that as AI agents become more advanced and capable of long-term strategic reasoning, their impact on society and the challenges they pose for alignment may become fundamentally different from those of other technologies or agents of the past.
One proposal for marrying these two views was a focus on an abundance of evaluations: massively multi-dimensional evaluation, where AI systems are measured on thousands of dimensions. The participant noted that when we say something is "generally" intelligent, we may just be using a shorthand for tens of thousands of tasks: so many tasks that to humans they seem undifferentiated. If we could enumerate these many tasks, we could measure performance on each of them. Connecting to the use-case-specific view, massively multi-dimensional evaluation could cover use-case-specific assessments while also giving a broader view of an AI system's capabilities. If we have advanced AI assistants, we don't just have new systems to steer and control, but new tools with which to support that steering and controlling. One participant pointed out that we should question a scarcity view of oversight, noting that the future may be quite different:
[with respect to] oversight within specific tasks… we have a scarcity mindset. We're restricted by the amount of people, the expertise that could be doled out, the number of different institutions we have, and how we parse out the different tasks we want to have oversight over. But in this future of these very powerful AGI systems and an advancing theory of evaluations and oversight, I don't see why we don't turn some of these assistants, some of these evaluations, to give ourselves a dimensional perspective on tasks that is just far beyond anything we've done before.
The participant went on to connect evaluations of agents to psychometric approaches for evaluating humans. Thanks to decades of research, there are expectations of how different forms of knowledge correlate. For instance, software engineering interviews often test data structures and algorithms, not because those specific algorithms are likely to be important on the job, but because:
[employers] care that that kind of knowledge [e.g., algorithms] correlates with a huge set of knowledge that is important for being a software engineer and generally seems to be somewhat predictive.
We are currently taking a similar approach to evaluating AI systems: testing on relatively few benchmarks and assuming that performance on that subset implies broader capabilities. This may turn out to be the case, but it seems far too early to assume it. A larger set of evaluations is therefore important both to connect more directly to use-case-specific approaches and to improve our understanding of how agent capabilities evolve and correlate.
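As a rough illustration of what massively multi-dimensional evaluation might look like in code, the sketch below scores a few hypothetical systems on a large catalogue of placeholder tasks, reports a use-case-specific slice of the results, and then looks at how task scores correlate with one another (the psychometric intuition above). The `evaluate` function, model names, and task catalogue are all invented; this is a shape-of-the-idea sketch, not a real benchmark suite.

```python
# A minimal sketch of "massively multi-dimensional" evaluation, assuming a
# hypothetical evaluate(model, task) -> float scoring function and a large
# task catalogue. Nothing here is a real benchmark; it only illustrates the
# shape of the idea: score many systems on many tasks, slice by use case,
# then examine how task scores correlate, in the spirit of psychometrics.
import numpy as np

def evaluate(model: str, task: str) -> float:
    """Placeholder scorer: returns a made-up value in [0, 1] for `model` on `task`."""
    rng = np.random.default_rng(hash((model, task)) % (2**32))
    return float(rng.uniform())

models = ["agent-a", "agent-b", "agent-c"]         # hypothetical systems
tasks = [f"task-{i:05d}" for i in range(10_000)]   # stand-in for tens of thousands of tasks

# Score matrix: one row per model, one column per task dimension.
scores = np.array([[evaluate(m, t) for t in tasks] for m in models])

# Use-case-specific view: report only the dimensions relevant to one context
# (here, an arbitrary subset standing in for, say, healthcare-related tasks).
healthcare_tasks = list(range(0, len(tasks), 100))
print("healthcare profile per model:", scores[:, healthcare_tasks].mean(axis=1))

# Broader view: which task dimensions rise and fall together across systems?
# With only three toy models this is purely illustrative; with real data this
# is where factor-analysis-style structure in capabilities would appear.
task_correlations = np.corrcoef(scores.T[:50])     # correlate the first 50 dimensions
print("task-by-task correlation matrix shape:", task_correlations.shape)
```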
Ultimately, the question of whether AI agents represent a difference in degree or kind remains unresolved. While existing frameworks and methods may offer valuable insights and tools for managing the challenges posed by AI agents, it is also important to remain open to the possibility that these systems may require fundamentally new approaches as they continue to evolve and become more sophisticated. Striking the right balance between adapting established methods and developing novel ones will be crucial for ensuring the responsible development and alignment of AI agents in the future.
Multi-dimensional Alignment: Navigating Conflicting Stakeholder Values
The Ethics of Advanced AI Assistants emphasized the multi-stakeholder nature of AI steering:
To be beneficial and value-aligned, we argue that assistants must be appropriately responsive to the competing claims and needs of users, developers and society.
Of course, none of these groups are monoliths: in particular, the discussion emphasized the difference between “developer” and “company”. Developers may be focused on technical capabilities, performance, and personal mission, while companies are driven by market demands and profitability. Balancing these competing interests is a formidable task, complicated by the power dynamics at play between different stakeholder groups. The power dynamics between users and companies can have a significant impact on the alignment of AI agents, as one participant pointed out:
"Power dynamics between users and companies can influence which values take precedence in AI agent alignment," suggesting that the interests of more powerful stakeholders may dominate the alignment process.
Further complicating the challenge of multi-dimensional alignment is the difficulty of defining and operationalizing values in the context of AI agent alignment. Values are often abstract and subjective, making them hard to translate into concrete, actionable objectives that can be implemented in AI systems.
One participant referenced a previous Ai Salon on the topic of Common Sense, noting that some belief systems, particularly those with a strong tradition of codifying values into explicit rules or prescriptions, may have an advantage when it comes to aligning AI systems with their values. For example, a religion that has a clear set of commandments or a well-defined ethical framework may find it easier to translate those values into a format that can be implemented in an AI system. We’ve universalized some of these perspectives, but much is still not written down:
Human rights took time for us to make prescriptive, but the things that are not written down… need to be observed, or the work hasn't yet been done to transform them [into an alignable form]
To navigate the challenges of multi-dimensional alignment, participants in the discussion proposed various approaches. One suggestion was to involve diverse stakeholders in the development process, ensuring that a wide range of perspectives and values are represented. This could involve establishing clear guidelines and standards for value alignment, as well as creating mechanisms for ongoing monitoring and adjustment of AI agents to ensure they remain aligned with stakeholder values over time.
Another idea was the use of game theory and simulations to identify stable value alignments that balance the interests of different stakeholders. By modeling the interactions between AI agents and various stakeholders, it may be possible to find equilibrium points where the values and objectives of all parties are satisfactorily met.
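To make the game-theoretic suggestion slightly more concrete, the toy simulation below lets stakeholders with different preferred value weightings and different levels of influence pull on an agent's value weights until the weights stop moving. The stakeholders, value names, and influence scores are invented, and the resulting "equilibrium" is just the fixed point of a simple adjustment rule, not a model of real alignment negotiations.

```python
# A toy sketch of searching for a stable compromise between stakeholder value
# preferences. Stakeholders, their preferred value weights, and their influence
# are all hypothetical; the "equilibrium" found here is the fixed point of a
# simple adjustment dynamic, offered only to illustrate the idea.
import numpy as np

values = ["privacy", "helpfulness", "profitability", "transparency"]

# Each stakeholder has an ideal weighting over values and an influence score.
stakeholders = {
    "users":      {"ideal": np.array([0.4, 0.4, 0.0, 0.2]), "influence": 1.0},
    "developers": {"ideal": np.array([0.2, 0.5, 0.1, 0.2]), "influence": 0.8},
    "company":    {"ideal": np.array([0.1, 0.3, 0.5, 0.1]), "influence": 1.2},
    "society":    {"ideal": np.array([0.4, 0.2, 0.0, 0.4]), "influence": 0.6},
}

def utility(weights: np.ndarray, ideal: np.ndarray) -> float:
    """Stakeholder satisfaction: higher when the agent's weights match their ideal."""
    return -float(np.sum((weights - ideal) ** 2))

# Adjustment dynamic: repeatedly nudge the agent's value weights toward each
# stakeholder's ideal, in proportion to that stakeholder's influence, and stop
# when the weights no longer move (a crude stand-in for an equilibrium search).
weights = np.full(len(values), 1 / len(values))
for step in range(1000):
    pull = sum(s["influence"] * (s["ideal"] - weights) for s in stakeholders.values())
    new_weights = weights + 0.05 * pull
    if np.max(np.abs(new_weights - weights)) < 1e-9:
        break
    weights = new_weights

print({v: round(w, 3) for v, w in zip(values, weights)})
print({name: round(utility(weights, s["ideal"]), 3) for name, s in stakeholders.items()})
```

Under this toy dynamic, the stable point is simply the influence-weighted average of stakeholder ideals; richer simulations would need to model strategic behavior, shifting coalitions, and values that cannot be reduced to a weight vector.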
However, some participants questioned whether perfect alignment is even possible given the inherent trade-offs and conflicts between different stakeholder values. They argued that the goal should not be to eliminate all tensions and conflicts, but rather to find ways to manage and navigate them in a transparent and accountable manner.
One participant drew an analogy to the way that conflicting values are managed in human society through democratic processes and institutions. Just as in human society we have mechanisms like elections, courts, and public deliberation to navigate conflicts between individual and collective values, we may need similar mechanisms for AI agents. The goal should not be perfect alignment, but rather creating a system where conflicts can be surfaced, discussed, and resolved in a legitimate and transparent way.
The concept of multi-dimensional alignment underscores the need for a collaborative and inclusive approach to AI agent development, along with the processes and institutions needed to safeguard that culture. Alignment is not a one-time problem to be solved, but an ongoing process of negotiation and collaboration that will remain a central challenge for humanity.
Conclusion
As is often the case with these kinds of discussions, it is clear that the alignment of AI agents is a complex and multifaceted challenge. The key issues we discussed here are apparent in almost all safety conversations. The topic of “advanced agents” simply renews this conversation. The path forward will require navigating the conflicting values and interests of multiple stakeholders, as well as grappling with the question of whether AI agents represent a fundamentally new kind of challenge. The latter will play out soon enough as AI agents become more advanced and integrated into various aspects of society. We will see if our existing approaches can adapt without significant overhaul.
Notes from the conversation
Alignment of AI agents is a complex issue involving multiple stakeholders with potentially conflicting values and incentives.
Developers, companies, users, and society at large may have different perspectives on what values AI agents should be aligned with.
Power dynamics between users and companies can influence which values take precedence in AI agent alignment.
Government intervention in AI agent alignment is a contentious issue, with some arguing for its necessity and others cautioning against overreach.
Evaluating AI agents based on human standards may not be appropriate due to fundamental differences between human and AI cognition.
AI agents may require task-specific evaluations and certifications to ensure their safety and reliability in different contexts.
The generalization abilities of AI agents may be less predictable than those of humans, necessitating caution when making inferences about their capabilities.
AI agents should be transparent about their non-human nature to avoid misleading users.
Defining and aligning AI agents with values is complicated by the lack of a universal understanding of what constitutes a value.
The legibility of values and the inscrutability of AI models pose challenges for value alignment.
Foundational models and fine-tuning approaches have trade-offs in terms of efficiency, performance, and scalability.
The development of AI agents may lead to a proliferation of niche applications tailored to specific needs and constituencies.
Game theory and simulations could help identify stable value alignments for AI agents.
AI agents that act on behalf of users in a transparent and trustworthy manner are desirable.
AI assistants that help users reason about their experiences and personal growth are an exciting prospect.
The potential for AI agents to perpetuate biases and inequities is a concern that requires careful consideration in their design and deployment.
The long-term risks of powerful AI agents, such as their ability to operate autonomously and influence society, should be taken seriously.
Community building and collaboration among AI researchers and stakeholders are important for navigating the challenges of AI agent development.
The adoption of AI agents by companies may lead to a competitive advantage, potentially driving the further integration of AI into decision-making processes.
Developing appropriate evaluation frameworks for AI agents that consider their impact on various stakeholders and societal outcomes is crucial.
Questions
How can the conflicting values and incentives of different stakeholders be reconciled in AI agent alignment?
What mechanisms can be put in place to ensure that user values are not overshadowed by corporate interests in AI agent development?
Under what circumstances, if any, is government intervention in AI agent alignment justified, and how can it be implemented without stifling innovation?
What alternative evaluation methods can be developed to assess AI agents' capabilities and potential risks accurately?
How can task-specific evaluations and certifications for AI agents be standardized and enforced across different industries and jurisdictions?
What safeguards can be implemented to mitigate the risks associated with the unpredictable generalization abilities of AI agents?
How can transparency about the non-human nature of AI agents be ensured without compromising their effectiveness or user experience?
Is it possible to develop a universal framework for defining and aligning AI agents with human values, given the diversity of cultural and individual perspectives?
What research directions should be pursued to improve the legibility of AI models and the alignment of AI agents with human values?
How can the trade-offs between foundational models and fine-tuning approaches be navigated to optimize AI agent performance while maintaining scalability?
What policies or regulations, if any, are needed to ensure that the development of niche AI applications does not lead to the exclusion or marginalization of certain groups?
How can game theory and simulations be applied to real-world AI agent deployment, and what are the limitations of these approaches?
What technical and ethical standards should be established for AI agents that act on behalf of users to ensure their trustworthiness and accountability?
How can the potential benefits of AI assistants for personal growth and self-reflection be maximized while minimizing the risks of over-reliance or misuse?
What strategies can be employed to detect and mitigate biases and inequities perpetuated by AI agents, particularly in sensitive domains such as healthcare, education, and criminal justice?
How can the long-term risks of powerful AI agents be anticipated and addressed proactively, given the uncertainty surrounding their future capabilities and impact?
What role should community building and collaboration play in shaping the development and governance of AI agents, and how can diverse perspectives be effectively incorporated?
How can the competitive advantage conferred by AI agents be harnessed to promote responsible and beneficial AI development rather than a race to the bottom?
What are the key components of an evaluation framework for AI agents that balances technical performance with societal impact and stakeholder interests?
How can public trust in AI agents be fostered through transparency, accountability, and the demonstration of their alignment with human values?
Some of the authors from DeepMind have written a paper on the "justified trust" of AI assistants, a good follow-up to this discussion: https://facctconference.org/static/papers24/facct24-79.pdf