We (Alfred and Jeremy) started a Dovetail project on Natural Latents in order to get some experience with the proofs. Originally we were going to take a crack at this bounty, but just before we got started, John and David published a proof, closing the bounty. The proof involves a series of transformations that can take any stochastic latent and turn it into a deterministic latent, where each step can increase the error by at most a small multiple. We decided to work through the proof, understand each step, and attempt to improve the bound. After 20 hours or so of working through it, we found a fatal flaw in one step of the proof, and spent the next several days understanding and verifying it. With David's help, we strengthened the counterexamples we had found to show that this transformation step has unbounded error in some cases. We started using sympy for arbitrary-precision evaluation of counterexamples, and along the way found a bug in sympy that occasionally caused us problems. We switched to Mathematica.
This post is about one of the results described in the 2004 paper 'Information-theoretic approach to the study of control systems' by Hugo Touchette and Seth Lloyd.[1] The paper compares 'open-loop' and 'closed-loop' controllers (which we here call 'blind' and 'sighted' policies) for the task of reducing entropy and quantifies the difference between them using the mutual information between the controller and the environment. This post is a pedagogical guide to this result and includes discussion of it in light of the selection theorems agenda and the agent structure problem.[2] The proof, along with more worked-out examples, can be found in the next post.
When studying for agent foundations research, I kept finding that I wanted a good general formalism of "stuff happening over time". Applications include...
The classical model of the scientific process is that its purpose is to find a theory that explains an observed phenomenon. Once you have any model whose outputs match your observations, you have a valid candidate theory. Occam's razor says it should be simple. And if your theory can make correct predictions about observations that hadn't previously been made, then the theory is validated...
I was interested in the IMP because I wanted to know if it could be considered a selection theorem. A selection theorem is a result which tells us something about the structure of a system, given that certain behaviours are selected for. In particular, in Agent Foundations, we are interested in circumstances under which 'agent-like structure' is selected for...
This post was written during the agent foundations fellowship with Alex Altair, funded by the LTFF. This is the second part of a two-post series explaining the Internal Model Principle and how it might relate to AI Safety, particularly to Agent Foundations research. In the first post, we constructed a simplified version of the IMP that was easier to understand and focused on building intuition about the theorem's assumptions. In this second post, we explain the general version of the theorem as stated by Cai & Wonham[1] and discuss how it relates to alignment-relevant questions such as the agent-structure problem and selection theorems.
The Internal Model Principle (IMP) is often stated as "a feedback regulator must incorporate a dynamic model of its environment in its internal structure", which is one of those sentences where every word needs a footnote. I have written this post to summarise what I understand of the Internal Model Principle, and I have tried to emphasise intuitive explanations.
This is the first part of a two-post series about the Internal Model Principle, which could be considered a selection theorem, and how it might relate to AI Safety, particularly to Agent Foundations research. In this first post, we will construct a simplified version of IMP that is easier to explain compared to the more general version and focus on the key ideas, building intuitions about the theorem's assumptions.
I am working on a project about ontology identification. I've found conversations to be a good way to discover inferential gaps when explaining ideas, so I'm experimenting with using dialogues as the main way of publishing progress during the fellowship. We can frame ontology identification as a robust bottleneck for a wide variety of problems in agent foundations & AI alignment...
In our discussions of the Agent-like Structure problem, lookup tables often come up as a useful counter-example or intuition pump for how a system could exhibit agent-like behaviour without agent-like structure. It is fairly intuitive that, in the limit of a large number of entries, a lookup table requires a longer program to implement than a program which 'just' computes a function.
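The lookup-table intuition can be made concrete with a crude sketch (our construction, not the post's; the names and the "description length = source-string length" proxy are assumptions): an explicit table for f(n) = n² grows without bound as entries are added, while the program that computes f stays a fixed size.

```python
# Illustrative sketch: compare the description length of an explicit lookup
# table for f(n) = n*n against the short program that computes it.
# We use the length of a serialized Python object as a crude proxy for
# program length; this is an assumption for illustration, not a real
# Kolmogorov-complexity measurement.

def table_description(n_entries):
    """Serialize an explicit lookup table for f on 0..n_entries-1."""
    table = {n: n * n for n in range(n_entries)}
    return repr(table)

# The 'just compute it' program, as a source string of constant length.
PROGRAM_SOURCE = "def f(n):\n    return n * n\n"

# The table's description grows with the number of entries;
# the program's length does not.
for n in (10, 100, 1000):
    print(n, len(table_description(n)), len(PROGRAM_SOURCE))
```

In the limit of many entries, the table's description length dominates, which is the sense in which a lookup table is a "longer program" than the function it tabulates.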
We prove a version of the Good Regulator Theorem for a regulator with imperfect knowledge of its environment aiming to minimize the entropy of an output.
This is a post explaining the proof of the paper "Robust Agents Learn Causal World Models" in detail. Check the previous post in the sequence for a higher-level summary and discussion of the paper, including an explanation of the basic setup (terminology and assumptions), which this post assumes from here on.
The selection theorems agenda aims to prove statements of the following form: "agents selected under criteria X have property Y," where Y is something like world models, general-purpose search, modularity, etc. We're going to focus on world models. But what is the intuition that makes us expect to be able to prove such things in the first place? Why expect world models?
The Good Regulator Theorem, as published by Conant and Ashby in their 1970 paper, claims to show that 'every good regulator of a system must be a model of that system', though it is a matter of debate whether this is actually what the paper shows. It is a fairly simple mathematical result which is worth knowing about for people who care about agent foundations and selection theorems.
I think that the statement of the Natural Abstractions Hypothesis is not true and that whenever cognitive systems converge on using the same abstractions this is almost entirely due to similarities present in the systems themselves, rather than any fact about the world being 'naturally abstractable'. I tried to explain my view in a conversation and didn't do a very good job, so this is a second attempt.
A choice of variable in causal modeling is good if its causal effect is consistent across all the different ways of implementing it in terms of the low-level model. This notion can be made precise as a relation among causal models, giving us conditions for when we can ground the causal meaning of high-level variables in terms of their low-level representations. A distillation of (Rubenstein et al., 2017).
This is an edited transcription of the final presentation I gave for the AI safety camp cohort of early 2024. It describes some of what the project is aiming for, and some motivation. Here's a link to the slides. See this post for a more detailed and technical overview of the problem. This is the presentation for the project that is described as "does sufficient optimization imply agent structure". That's what we call the "agent structure problem", which was posed by John Wentworth, and that's what we spent the project working on. But mostly for this presentation I'm going to talk about what we mean by "structure" (or what we hope to mean by structure) and why I think it matters for AI safety.
I've noticed that when trying to understand a math paper, there are a few different ways my skill level can be the blocker. Some of these ways line up with typical levels of organization in math papers: understanding a piece of math will require understanding each of these things in order. It can be very useful to identify which type of thing I'm stuck on, because the different types can require totally different strategies. Beyond reading papers, I'm also trying to produce new and useful mathematics. Each of these three levels has another associated skill of generating them. But it seems to me that the generating skills go in the opposite order.
In Clarifying the Agent-Like Structure Problem (2022), John Wentworth describes a hypothetical instance of what he calls a selection theorem. In Scott Garrabrant's words, the question is, does agent-like behavior imply agent-like architecture? That is, if we take some class of behaving things and apply a filter for agent-like behavior, do we end up selecting things with agent-like architecture (or structure)? Of course, this question is heavily under-specified. So another way to ask this is, under which conditions does agent-like behavior imply agent-like structure? And, do those conditions feel like they formally encapsulate a naturally occurring condition?
A major concession of the introduction post was limiting our treatment to finite sets of states. These are easier to reason about, and the math is usually cleaner, so it's a good place to start when trying to define and understand a concept. But infinities can be important. It seems quite plausible that our universe has infinitely many states, and it is frequently most convenient for even the simplest models to have continuous parameters. So if the concept of entropy as the number of bits you need to uniquely distinguish a state is going to be useful, we'd better figure out how to apply it to continuous state spaces.[1]
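One way to see why the finite-state notion of entropy needs rework for continuous spaces is to discretize a continuous distribution and watch the bit count diverge (an illustrative sketch of our own, not taken from the post; the function names are ours): a uniform distribution on [0, 1) split into k equal bins has discrete entropy log₂(k), which grows without bound as the bins shrink.

```python
# Illustrative sketch: the discrete entropy of a uniform distribution on
# [0, 1) binned into k equal cells is log2(k) bits, so "number of bits to
# uniquely distinguish a state" diverges as the discretization is refined.
import math

def discrete_entropy(probs):
    """Shannon entropy in bits of a finite probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for k in (2, 16, 1024):
    uniform = [1.0 / k] * k  # uniform distribution over k bins
    print(k, discrete_entropy(uniform))  # equals log2(k): 1, 4, 10 bits
```

This divergence is exactly the issue the continuous treatment has to resolve, e.g. by working with differential entropy or with entropy relative to a reference measure.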
In recent years, there have been several cases of alignment researchers using Conway's Game of Life as a research environment; Conway's Game of Life is by far the most popular and well-known cellular automaton. And for good reason: it's immediately appealing and just begs to be played with. It is a great model context in which to research things like optimization and agency; its properties do a good job of mimicking the real universe while being significantly more tractable. But those who play with Life will notice some odd things about it that are not mimicked in the real world, especially if they're familiar with physics.
Orienting around the ideas and conclusions involved with AI x-risk can be very difficult. The future possibilities can feel extreme and far-mode, even when we whole-heartedly affirm their plausibility. It helps me to remember that everything around me that feels normal and stable is itself the result of an optimization process that was, at the time, an outrageous black swan. If you were teleported into the body of a random human from throughout history, then most likely, your life would look nothing like the present. You would likely be a hunter-gatherer, or perhaps a farmer. You would be poor by any reasonable standard. You would probably die as a child. You would have nothing resembling your current level of comfort, and your modern daily life would be utterly alien to most humans. What currently feels normal is a freeze-frame state of a shriekingly fast feedback loop involving knowledge, industry, and population. It is nowhere near normal, and it is nowhere near equilibrium.
Explanations of entropy tend to be concerned only with the application of the concept in their particular sub-domain. Here, I try to take on the task of synthesizing the abstract concept of entropy, to show what's so deep about it. Entropy is so fundamental because it applies far beyond our own specific universe. It applies in any system with different states.