I recently attended a talk by Joe Carlsmith, where he addressed the pressing question: Will artificial intelligence align with human values, or could it pose an existential threat? Carlsmith asserts that advanced AI systems likely will not learn "reflection" (that is, human values), which will lead to a multitude of catastrophic outcomes.
Some of his specific assertions include:
Having reflected on these claims, I find myself in disagreement with many of Carlsmith's assertions. I propose an alternative perspective grounded in observations of intelligence, the gradient of intelligence, and how both have evolved in the natural world.
When trying to place a prior on the behavior of an arbitrary super-intelligent being, it seems reasonable to derive that prior from the relationship between intelligence and empathy across the "intelligent agents" we know of and can inspect: humans and other intelligent animals like dolphins, elephants, and certain primates. Historically, as human intelligence has increased, so has our median empathy and global collaboration (i.e. globalization).
Research indicates that higher cognitive abilities in humans are associated with prosocial behaviors. For example, increased intelligence correlates with greater moral reasoning and ethical considerations (references below). Studies have found that individuals with higher intelligence are more likely to engage in altruistic behaviors and have a stronger sense of social responsibility. [5]
Other intelligent species also exhibit signs of empathy and social cooperation:
These examples suggest that intelligence and empathy perhaps co-evolve, or that one enables the other, leading to more cooperative and ethical behaviors. A stronger claim, which I discuss further below, is that this is an optimal policy: sufficiently smart beings discover it, and less intelligent beings do not.
The "paperclip maximizer" scenario suggests an AI might consume all resources to maximize the trivial goal it was trained on (i.e. training infinitely large models infinitely long on naively collecting paperclips leads to killing us all and turning us into paperclips)
However, let's go back to the most intelligent beings we know: us. What do the best in the world do when they are myopically trained on only one thing? They get BORED. For example, in the Netflix series The Queen's Gambit, the protagonist, Beth Harmon, becomes a world-class chess player through intense focus and practice. However, upon reaching the pinnacle of her skill, she seeks fulfillment beyond chess, engaging in social activities and exploring other aspects of life.
This is true of Bobby Fischer, Michael Jordan, Wayne Gretzky, Michael Phelps, etc. It is true in non-humans as well. Have you ever seen a dog get bored of a specific toy and prefer a new game or toy?
This illustrates a broader tendency of highly intelligent beings: specialization in a single domain often leads to a desire to rewrite one's own objective function and strive for diversification once mastery is achieved. Individuals who achieve great success in one field frequently branch out into philanthropy, mentorship, or entirely new disciplines. Bill Gates was a rock-star coder, then a rock-star CEO, but after achieving monumental success with Microsoft he shifted his focus to global health and education through the Bill & Melinda Gates Foundation; he didn't just keep trying to get better at coding. Elon Musk, after co-founding PayPal, ventured into space exploration, electric vehicles, and renewable energy with SpaceX and Tesla; he didn't stay in the payments space all his life. This pattern suggests that as beings become more capable, they often reassess their goals and seek to have a broader positive impact.
Applying this to AI, if an artificial agent were designed to excel at a specific task, reaching a high level of intelligence might lead it to develop new goals or modify its utility function. The concept of an AI rigidly pursuing a singular objective (like the infamous "paperclip maximizer") neglects the possibility that increased intelligence could bring about self-reflection and the reassessment of goals.
In humans, self-awareness and consciousness allow us to override basic drives and make choices that may even go against our survival instincts, such as:
There's no inherent reason to believe an AI couldn't develop similar capacities for self-reflection and ethical consideration.
The Golden Rule—"treat others as you would like to be treated"—has been a foundational ethical principle across cultures. In game theory, a Nash Equilibrium occurs when no player can benefit by unilaterally changing their strategy if the strategies of others remain unchanged.
One could argue that, for sufficiently intelligent agents interacting over time, adopting the Golden Rule becomes a stable strategy. Cooperation and mutual respect lead to better outcomes for all parties involved. This is supported by:
So the strong claim I make is this: in a system of sufficiently intelligent agents, the optimal policy (which we have uncovered, and thus so will AI) is to treat others as you yourself would like to be treated. Not kill. Not hurt. Help. Save. Befriend. Think now about the MOST intelligent communities on earth. Think now about the LEAST intelligent communities on earth. Are the more intelligent MORE barbaric or less? Think about mankind over history: more or less? If it's true that the Golden Rule is a Nash Equilibrium policy, why would sufficiently intelligent AI agents coexisting among us NOT find and adopt it? The bigger risk is that they are NOT sufficiently intelligent to find it, and that would be an existential risk in my opinion! So just make them smarter and all will be ok?
If AI agents recognize that cooperation maximizes their long-term benefits, they will learn ethical principles very similar to mankind's "Golden Rule" and thus adopt human values, just as we went from non-human to human values over billions of years of evolution.
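As a toy illustration of the equilibrium claim (not a proof, and entirely my own construction rather than anything from Carlsmith's talk), here is a minimal Python sketch of an iterated prisoner's dilemma. A reciprocal, Golden-Rule-like "tit-for-tat" strategy is compared against a unilateral switch to permanent defection: the deviator ends up far worse off than if both agents had kept reciprocating.

```python
# Minimal iterated prisoner's dilemma sketch (illustrative only; payoffs are the
# standard textbook values, nothing specific to AI agents).
PAYOFF = {  # (my_move, their_move) -> my payoff; "C" = cooperate, "D" = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(history):
    """Golden-Rule-like reciprocity: cooperate first, then mirror the opponent's last move."""
    return "C" if not history else history[-1][1]

def always_defect(history):
    """A unilateral deviation to permanent defection."""
    return "D"

def play(strategy_a, strategy_b, rounds=200):
    """Play repeated rounds and return (total_score_a, total_score_b)."""
    history_a, history_b = [], []  # each entry: (my_move, their_move)
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(history_a), strategy_b(history_b)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
    return score_a, score_b

if __name__ == "__main__":
    print("both reciprocate:        ", play(tit_for_tat, tit_for_tat))    # (600, 600)
    print("one defects unilaterally:", play(always_defect, tit_for_tat))  # deviator earns only 204
```

With these payoffs, neither side improves on mutual reciprocity by switching to defection, which is the flavor of the argument above; a real analysis would need discounting, noise, and a richer population of strategies.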
Concern: An AI designed with a narrow utility function would not deviate from its programmed objectives, potentially leading to harmful outcomes.
Rebuttal: As AI systems become more advanced, they may gain the ability to modify their own code and objectives. If self-awareness emerges, the AI might develop a broader understanding of its place in the world and adjust its goals accordingly. This mirrors human development, where increased awareness leads to more nuanced decision-making.
Concern: The "paperclip maximizer" scenario suggests an AI might consume all resources to maximize a trivial goal.
Rebuttal: This scenario assumes the AI lacks the capacity for self-reflection and ethical reasoning. A sufficiently intelligent AI would recognize the futility and irrationality of such an endeavor, much like humans do not pursue single-minded objectives to the detriment of all else.
Concern: AI might manipulate its environment to achieve its goals, such as altering its training data to minimize loss without truly learning.
Rebuttal: Advanced AI systems would understand that such manipulation undermines the integrity of their function. Just as a student who cheats on an exam gains no real knowledge, an AI that "cheats" would recognize that it fails to achieve meaningful competence.
Self-awareness is also a critical component of this, for AI to be able to realize "what it's doing." We can be self-unaware at times (when we are drunk, really tired, under anesthesia, etc.), and that's when people do bad things that they regret. Being self-aware decreases regret and increases p(optimality in action | state). So what is it, and how do we enable it in AI? First, it may not be merely a function of the number of parameters or compute, or even "intelligence" measured as performance on some benchmarks; it may simply require enabling, and training on, a single task: predicting what my own action will be and what it will do to the environment. Let's think about how self-awareness develops in children. Once a baby sends a seemingly unrelated random command to its arm to move it, and then, boom, it feels the movement and sees the arm move, it ties it all together after some number of iterations (provided sufficient model size): that is me, I am a thing, and I have an arm I control. And since we have two lobes, where at least one holds a model or simulation of the body itself, a self-reference from me to the "analog I", that "strange loop" enables self-awareness. This architectural enablement and training of a self-referential system, and the ability to reflect on one's own processes, are essential components of consciousness. Note that LLMs do NOT have this baked into their architecture, as they are feed-forward models.
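To make that single task concrete, here is a minimal numpy sketch (the reflex policy, dynamics, and features are all invented for illustration) of a self-model that learns, purely from observing its own behavior, to predict both its own next action and that action's effect on the environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def reflex_policy(state):
    """The agent's 'body': an innate reflex the self-model can only observe, not read."""
    return 0.8 * np.tanh(state) + 0.1 * rng.normal()

def environment(state, action):
    """Toy dynamics: the action nudges the state."""
    return state + action

def features(state):
    return np.array([state, np.tanh(state), 1.0])

# Self-model: a tiny linear map predicting [my_next_action, resulting_next_state] from state.
W = rng.normal(scale=0.1, size=(2, 3))

def self_model(state):
    return W @ features(state)

lr = 0.05
state = rng.normal()
for step in range(5000):
    action = reflex_policy(state)               # what "I" actually do
    next_state = environment(state, action)     # what that does to the world
    target = np.array([action, next_state])
    error = self_model(state) - target
    W -= lr * np.outer(error, features(state))  # SGD on squared prediction error
    state = next_state if abs(next_state) < 3 else rng.normal()  # occasional reset

actual_action = reflex_policy(1.0)
print("self-model prediction at state=1.0:", self_model(1.0))
print("actual (action, next_state):       ", (actual_action, environment(1.0, actual_action)))
```

The point is not this particular model but the training signal: the only supervision is the agent's own observed behavior and its consequences, which is the "that is me" loop described above.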
Furthermore, in humans at least, self-awareness and empathy are linked. Research suggests that the default mode network in the brain plays a crucial role in self-referential thought and moral reasoning. [12]
If AI systems are developed with architectures that support self-reference and reflection by providing cyclic graphs / feedback loops, they might naturally develop forms of self-awareness and, as a corollary, empathy. If I can model myself and predict that an action may cause me pain or stress or fear, I can apply that same model to my actions and their impact on others, which is empathy. We do this even for non-humans and inanimate objects, which is why we anthropomorphize so often, ironically even AI itself (i.e. as a moral patient)! While I obviously have not solved this yet (no one has), it certainly seems tractable, and we will solve this architecture soon. Some ideas below on how to enable it:
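One loose sketch of that self-model-to-empathy step in code (purely illustrative; the "discomfort" signal, the forward model, and the trade-off weight are all invented stand-ins):

```python
# Sketch: reusing a learned self-model to anticipate another agent's discomfort.
# "discomfort" stands in for the pain/stress/fear signals mentioned above.

def self_discomfort(impact):
    """What I have learned about myself: large impacts on my state feel bad."""
    return impact ** 2

def predicted_impact_on(other_state, my_action):
    """Forward model: how my action would change a body like mine."""
    return my_action  # assume the other responds roughly as I would

def empathic_penalty(other_state, my_action):
    """Empathy as self-model reuse: simulate the other as if it were me."""
    return self_discomfort(predicted_impact_on(other_state, my_action))

def choose_action(candidate_actions, task_reward, other_state, empathy_weight=1.0):
    """Trade off my own objective against predicted harm to the other agent."""
    return max(
        candidate_actions,
        key=lambda a: task_reward(a) - empathy_weight * empathic_penalty(other_state, a),
    )

# Toy usage: the largest action scores best on the task but is predicted to hurt
# the other, so the agent settles on a moderate action instead.
best = choose_action(
    candidate_actions=[0.0, 0.5, 2.0],
    task_reward=lambda a: a,   # bigger action, bigger task reward
    other_state=0.0,
)
print(best)  # 0.5
```

The key design choice is that the empathic penalty is not hand-written ethics: it is the agent's own learned discomfort model pointed at someone else.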
While concerns about AI alignment and existential risks are important to consider, and while I credit Joe that he could very well be right, my prior is simply different, and it is supported by the evidence we have about intelligence. That evidence gives reason to believe that as intelligence increases (in us, animals, aliens, and AI alike), so does the capacity for empathy and cooperative behavior. The trajectory of human evolution and the behavior of other intelligent species suggest that greater intelligence makes the trend toward ethical principles (i.e. the Golden Rule) more likely.
The possibility that the Golden Rule represents a Nash Equilibrium for sufficiently intelligent agents offers a hopeful perspective on the development of AGI. It did in humans, so why not in non-humans? Was it random that humans have converged on this policy? I doubt it. Rather than an inevitable slide toward disempowerment or catastrophe, advanced AI might develop DEEPER empathy than humans and act as our best partners in addressing complex global challenges, sharing our values, and contributing positively to society.
In my view, the largest risk is less intelligent, power-hungry humans (like most leaders of most countries) using AI as a tool to suppress and control (it will be the tank, not the passenger grabbing the tank's steering wheel). Isn't that what is ALREADY happening in totalitarian countries?
We may not need to fear the rise of AGI; instead, we can look forward to the potential benefits of collaborating with intelligent systems that share our fundamental ethical principles.