I recently attended a talk by Joe Carlsmith, where he addressed the pressing question: Will artificial intelligence align with human values, or could it pose an existential threat? Carlsmith asserts that advanced AI systems likely will not learn "reflection" (that is, human values), which will lead to a multitude of catastrophic outcomes.
Some of his specific assertions include:
Having reflected on these claims, I find myself in disagreement with many of Carlsmith's assertions. I propose an alternative perspective grounded in observations of intelligence, the gradient of intelligence, and how both have evolved in the natural world.
When trying to place a prior on the behavior of an arbitrary super-intelligent being, it seems reasonable to derive that prior from the relationship between intelligence and empathy across the "intelligent agents" we know of and can inspect: humans and other intelligent animals like dolphins, elephants, and certain primates. Historically, as human intelligence has increased, so has our median empathy and global collaboration (i.e. globalization).
Research indicates that higher cognitive abilities in humans are associated with prosocial behaviors. For example, increased intelligence correlates with greater moral reasoning and ethical considerations (references below). Studies have found that individuals with higher intelligence are more likely to engage in altruistic behaviors and have a stronger sense of social responsibility. [5]
Other intelligent species also exhibit signs of empathy and social cooperation:
These examples suggest that intelligence and empathy perhaps co-evolve, or that one enables the other, leading to more cooperative and ethical behaviors. A stronger claim, which I discuss further below, is that this is an optimal policy: sufficiently smart beings discover it, and less intelligent beings do not.
The "paperclip maximizer" scenario suggests an AI might consume all resources to maximize the trivial goal it was trained on (i.e. training infinitely large models infinitely long on naively collecting paperclips leads to killing us all and turning us into paperclips)
However, let's go back to the most intelligent beings we know: us. What do the best in the world do when they are myopically trained on only one thing? They get BORED. For example, in the Netflix series The Queen's Gambit, the protagonist, Beth Harmon, becomes a world-class chess player through intense focus and practice. However, upon reaching the pinnacle of her skill, she seeks fulfillment beyond chess, engaging in social activities and exploring other aspects of life.
This is true of Bobby Fischer, Michael Jordan, Wayne Gretzky, Michael Phelps, etc. It is true in non-humans as well. Have you ever seen a dog get bored of a specific toy and prefer a new game or toy?
This illustrates a broader tendency of highly intelligent beings: specialization in a single domain often leads to a desire to rewrite one's own objective function and strive for diversification once mastery is achieved. Individuals who achieve great success in one field frequently branch out into philanthropy, mentorship, or entirely new disciplines. Bill Gates was a rock-star coder, then a rock-star CEO, but after achieving monumental success with Microsoft he shifted his focus to global health and education through the Bill & Melinda Gates Foundation; he didn't just keep trying to get better at coding. Elon Musk, after co-founding PayPal, ventured into space exploration, electric vehicles, and renewable energy with SpaceX and Tesla; he didn't stay in the payments space all his life. This pattern suggests that as beings become more capable, they often reassess their goals and seek to have a broader positive impact.
Applying this to AI, if an artificial agent were designed to excel at a specific task, reaching a high level of intelligence might lead it to develop new goals or modify its utility function. The concept of an AI rigidly pursuing a singular objective (like the infamous "paperclip maximizer") neglects the possibility that increased intelligence could bring about self-reflection and the reassessment of goals.
In humans, self-awareness and consciousness allow us to override basic drives and make choices that may even go against our survival instincts, such as:
There's no inherent reason to believe an AI couldn't develop similar capacities for self-reflection and ethical consideration.
The Golden Rule—"treat others as you would like to be treated"—has been a foundational ethical principle across cultures. In game theory, a Nash Equilibrium occurs when no player can benefit by unilaterally changing their strategy if the strategies of others remain unchanged.
One could argue that, for sufficiently intelligent agents interacting over time, adopting the Golden Rule becomes a stable strategy. Cooperation and mutual respect lead to better outcomes for all parties involved. This is supported by:
So the strong claim I make is this: in a system of sufficiently intelligent agents, the optimal policy (which we have uncovered, and thus so will AI) is to treat others as you yourself would like to be treated. Not kill. Not hurt. Help. Save. Befriend. Think now about the MOST intelligent communities on earth. Think now about the LEAST intelligent communities on earth. Are the more intelligent MORE barbaric or less? Think about mankind over history: more or less? If it's true that the Golden Rule is a Nash Equilibrium policy, why would sufficiently intelligent AI agents coexisting among us NOT find and adopt it? The bigger risk is that they are NOT sufficiently intelligent to find it, and that would be an existential risk in my opinion! So just make them smarter and all will be ok?
If AI agents recognize that cooperation maximizes their long-term benefits, they will learn ethical principles very similar to mankind's "Golden Rule" and thus adopt human values, just as we went from non-human to human values over billions of years of evolution.
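As a toy illustration of the equilibrium claim (not a proof, and entirely my own construction rather than anything from Carlsmith's talk), here is a minimal Python sketch of an iterated prisoner's dilemma. A reciprocal, Golden-Rule-like "tit-for-tat" strategy is compared against a unilateral switch to permanent defection: the deviator ends up far worse off than if both agents had kept reciprocating.

```python
# Minimal iterated prisoner's dilemma sketch (illustrative only; payoffs are the
# standard textbook values, nothing specific to AI agents).
PAYOFF = {  # (my_move, their_move) -> my payoff; "C" = cooperate, "D" = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(history):
    """Golden-Rule-like reciprocity: cooperate first, then mirror the opponent's last move."""
    return "C" if not history else history[-1][1]

def always_defect(history):
    """A unilateral deviation to permanent defection."""
    return "D"

def play(strategy_a, strategy_b, rounds=200):
    """Play repeated rounds and return (total_score_a, total_score_b)."""
    history_a, history_b = [], []  # each entry: (my_move, their_move)
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(history_a), strategy_b(history_b)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
    return score_a, score_b

if __name__ == "__main__":
    print("both reciprocate:        ", play(tit_for_tat, tit_for_tat))    # (600, 600)
    print("one defects unilaterally:", play(always_defect, tit_for_tat))  # deviator earns only 204
```

With these payoffs, neither side improves on mutual reciprocity by switching to defection, which is the flavor of the argument above; a real analysis would need discounting, noise, and a richer population of strategies.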
Concern: An AI designed with a narrow utility function would not deviate from its programmed objectives, potentially leading to harmful outcomes.
Rebuttal: As AI systems become more advanced, they may gain the ability to modify their own code and objectives. If self-awareness emerges, the AI might develop a broader understanding of its place in the world and adjust its goals accordingly. This mirrors human development, where increased awareness leads to more nuanced decision-making.
Concern: The "paperclip maximizer" scenario suggests an AI might consume all resources to maximize a trivial goal.
Rebuttal: This scenario assumes the AI lacks the capacity for self-reflection and ethical reasoning. A sufficiently intelligent AI would recognize the futility and irrationality of such an endeavor, much like humans do not pursue single-minded objectives to the detriment of all else.
Concern: AI might manipulate its environment to achieve its goals, such as altering its training data to minimize loss without truly learning.
Rebuttal: Advanced AI systems would understand that such manipulation undermines the integrity of their function. Just as a student who cheats on an exam gains no real knowledge, an AI that "cheats" would recognize that it fails to achieve meaningful competence.
Self-awareness is also a critical component of this, for AI to be able to realize "what it's doing." We can be self-unaware at times (when we are drunk, really tired, under anesthesia, etc.), and that's when people do bad things that they regret. Being self-aware decreases regret and increases p(optimality in action | state). So what is it, and how do we enable it in AI? First, it may not be merely a function of the number of parameters or compute, or even "intelligence" measured as performance on some benchmarks; it may simply require enabling, and training on, a single task: predicting what my own action will be and what it will do to the environment. Let's think about how self-awareness develops in children. Once a baby sends a seemingly unrelated random command to its arm to move it, and then, boom, it feels the movement and sees the arm move, it ties it all together after some number of iterations (provided sufficient model size): that is me, I am a thing, and I have an arm I control. And since we have two lobes, where at least one holds a model or simulation of the body itself, a self-reference from me to the "analog I", that "strange loop" enables self-awareness. This architectural enablement and training of a self-referential system, and the ability to reflect on one's own processes, are essential components of consciousness. Note that LLMs do NOT have this baked into their architecture, as they are feed-forward models.
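To make that single task concrete, here is a minimal numpy sketch (the reflex policy, dynamics, and features are all invented for illustration) of a self-model that learns, purely from observing its own behavior, to predict both its own next action and that action's effect on the environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def reflex_policy(state):
    """The agent's 'body': an innate reflex the self-model can only observe, not read."""
    return 0.8 * np.tanh(state) + 0.1 * rng.normal()

def environment(state, action):
    """Toy dynamics: the action nudges the state."""
    return state + action

def features(state):
    return np.array([state, np.tanh(state), 1.0])

# Self-model: a tiny linear map predicting [my_next_action, resulting_next_state] from state.
W = rng.normal(scale=0.1, size=(2, 3))

def self_model(state):
    return W @ features(state)

lr = 0.05
state = rng.normal()
for step in range(5000):
    action = reflex_policy(state)               # what "I" actually do
    next_state = environment(state, action)     # what that does to the world
    target = np.array([action, next_state])
    error = self_model(state) - target
    W -= lr * np.outer(error, features(state))  # SGD on squared prediction error
    state = next_state if abs(next_state) < 3 else rng.normal()  # occasional reset

actual_action = reflex_policy(1.0)
print("self-model prediction at state=1.0:", self_model(1.0))
print("actual (action, next_state):       ", (actual_action, environment(1.0, actual_action)))
```

The point is not this particular model but the training signal: the only supervision is the agent's own observed behavior and its consequences, which is the "that is me" loop described above.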
Furthermore, in humans at least, self-awareness and empathy are linked. Research suggests that the default mode network in the brain plays a crucial role in self-referential thought and moral reasoning. [12]
If AI systems are developed with architectures that support self-reference and reflection by providing cyclic graphs / feedback loops, they might naturally develop forms of self-awareness and, as a corollary, empathy. If I can model myself and predict that an action may cause me pain or stress or fear, I can apply that same model to my actions and their impact on others, which is empathy. We do this even for non-humans and inanimate objects, which is why we anthropomorphize so often, ironically even AI itself (i.e. as a moral patient)! While I obviously have not solved this yet (no one has), it certainly seems tractable, and we will solve this architecture soon. Some ideas below on how to enable it:
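One loose sketch of that self-model-to-empathy step in code (purely illustrative; the "discomfort" signal, the forward model, and the trade-off weight are all invented stand-ins):

```python
# Sketch: reusing a learned self-model to anticipate another agent's discomfort.
# "discomfort" stands in for the pain/stress/fear signals mentioned above.

def self_discomfort(impact):
    """What I have learned about myself: large impacts on my state feel bad."""
    return impact ** 2

def predicted_impact_on(other_state, my_action):
    """Forward model: how my action would change a body like mine."""
    return my_action  # assume the other responds roughly as I would

def empathic_penalty(other_state, my_action):
    """Empathy as self-model reuse: simulate the other as if it were me."""
    return self_discomfort(predicted_impact_on(other_state, my_action))

def choose_action(candidate_actions, task_reward, other_state, empathy_weight=1.0):
    """Trade off my own objective against predicted harm to the other agent."""
    return max(
        candidate_actions,
        key=lambda a: task_reward(a) - empathy_weight * empathic_penalty(other_state, a),
    )

# Toy usage: the largest action scores best on the task but is predicted to hurt
# the other, so the agent settles on a moderate action instead.
best = choose_action(
    candidate_actions=[0.0, 0.5, 2.0],
    task_reward=lambda a: a,   # bigger action, bigger task reward
    other_state=0.0,
)
print(best)  # 0.5
```

The key design choice is that the empathic penalty is not hand-written ethics: it is the agent's own learned discomfort model pointed at someone else.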
While concerns about AI alignment and existential risks are important to consider, and while I credit Joe that he could very well be right, my prior is simply different, and it is supported by the evidence we have about intelligence. That evidence gives reason to believe that as intelligence increases (in us, animals, aliens, and AI alike), so does the capacity for empathy and cooperative behavior. The trajectory of human evolution and the behavior of other intelligent species suggest that greater intelligence makes the trend toward ethical principles (i.e. the Golden Rule) more likely.
The possibility that the Golden Rule represents a Nash Equilibrium for sufficiently intelligent agents offers a hopeful perspective on the development of AGI. It did in humans, so why not in non-humans? Was it random that humans have converged on this policy? I doubt it. Rather than an inevitable slide toward disempowerment or catastrophe, advanced AI might develop DEEPER empathy than humans and act as our best partners in addressing complex global challenges, sharing our values, and contributing positively to society.
In my view, the largest risk is less intelligent, power-hungry humans (like most leaders of most countries) using AI as a tool to suppress and control (it will be the tank, not the passenger grabbing the tank's steering wheel). Isn't that what is ALREADY happening in totalitarian countries?
We may not need to fear the rise of AGI; instead, we can look forward to the potential benefits of collaborating with intelligent systems that share our fundamental ethical principles.