Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.
In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.
This episode features Principal Researcher Ida Momennejad. Momennejad is applying her expertise in cognitive neuroscience and computer science to better understand—and extend—AI capabilities, particularly when it comes to multistep reasoning and short- and long-term planning. Llorens and Momennejad discuss the notion of general intelligence in both humans and machines; how Momennejad and colleagues leveraged prior research into the cognition of people and rats to create prompts for evaluating large language models; and the case for the development of a “prefrontal cortex” for AI.
Learn more:
- AI and Microsoft Research (Focus Area)
- Evaluating Cognitive Maps and Planning in Large Language Models with CogEval (Publication, October 2023)
- Imitating Human Behaviour with Diffusion Models (Publication, May 2023)
- Navigates Like Me: Understanding How People Evaluate Human-Like AI in Video Games (Publication, April 2023)
- Predictive Representations in Hippocampal and Prefrontal Hierarchies (Publication, January 2022)
- Navigation Turing Test (NTT): Learning to Evaluate Human-Like Navigation (Publication, July 2021)
- The successor representation in human reinforcement learning (Publication, September 2017)
- Encoding of Prospective Tasks in the Human Prefrontal Cortex under Varying Task Loads (Publication, October 2013)
Transcript
[MUSIC PLAYS]
ASHLEY LLORENS: I’m Ashley Llorens with Microsoft Research. In this podcast series, I share conversations with fellow researchers about the latest developments in AI models, the work we’re doing to understand their capabilities and limitations, and ultimately how innovations like these can have the greatest benefit for humanity. Welcome to AI Frontiers.
Today, I’ll speak with Ida Momennejad. Ida works at Microsoft Research in New York City at the intersection of machine learning and human cognition and behavior. Her current work focuses on building and evaluating multi-agent AI architectures, drawing from her background in both computer science and cognitive neuroscience. Over the past decade, she has focused on studying how humans and AI agents build and use models of their environment.
[MUSIC FADES]
Let’s dive right in. We are undergoing a paradigm shift where AI models and systems are starting to exhibit characteristics that I and, of course, many others have described as more general intelligence. When I say general in this context, I think I mean systems with abilities like reasoning and problem-solving that can be applied to many different tasks, even tasks they were not explicitly trained to perform. Despite all of this, I think it’s also important to admit that we—and by we here, I mean humanity—are not very good at measuring general intelligence, especially in machines. So I’m excited to dig further into this topic with you today, especially given your background and insights into both human and machine intelligence. And so I just want to start here: for you, Ida, what is general intelligence?
IDA MOMENNEJAD: Thank you for asking that. We could look at general intelligence from the perspective of history of cognitive science and neuroscience. And in doing so, I’d like to mention its discontents, as well. There was a time where general intelligence was introduced as the idea of a kind of intelligence that was separate from what you knew or the knowledge that you had on a particular topic. It was this general capacity to acquire different types of knowledge and reason over different things. And this was at some point known as g, and it’s still known as g. There have been many different kinds of critiques of this concept because some people said that it’s very much focused on the idea of logic and a particular kind of reasoning. Some people made cultural critiques of it. They said it’s very Western oriented. Others said it’s very individualistic. It doesn’t consider collective or interpersonal intelligence or physical intelligence. There are many critiques of it. But at the core of it, there might be something useful and helpful. And I think the useful part is that there could be some general ability in humans, at least the way that g was intended initially, where they can learn many different things and reason over many different domains, and they can transfer ability to reason over a particular domain to another. And then in the AGI, or artificial general intelligence, notion of it, people took this idea of many different abilities or skills for cognitive and reasoning and logic problem-solving at once. There have been different iterations of what this means in different times. In principle, the concept in itself does not provide the criteria on its own. Different people at different times provide different criteria for what would be the artificial general intelligence notion. Some people say that they have achieved it. Some people say we are on the brink of achieving it. Some people say we will never achieve it. 
However, there is this idea, if you look at it from an evolutionary and neuroscience and cognitive neuroscience lens, that in evolution, intelligence has evolved multiple times in a way that is adaptive to the environment. So there were organisms that needed to be adaptive to the environment where they were, that intelligence has evolved in multiple different species, so there’s not one solution to it, and it depends on the ecological niche that that particular species needed to adapt to and survive in. And it’s very much related to the idea of being adaptive of certain kinds of, different kinds of problem-solving that are specific to that particular ecology. There is also this other idea that there is no free lunch and the no-free-lunch theorem, that you cannot have one particular machine learning solution that can solve everything. So the idea of general artificial intelligence in terms of an approach that can solve everything and there is one end-to-end training that can be useful to solve every possible problem that it has never seen before seems a little bit untenable to me, at least at this point. What does seem tenable to me in terms of general intelligence is if we understand and study, the same way that we can do it in nature, the foundational components of reasoning, of intelligence, of different particular types of intelligence, of different particular skills—whether it has to do with cultural accumulation of written reasoning and intelligence skills, whether it has to do with logic, whether it has to do with planning—and then working on the particular types of artificial agents that are capable of putting these particular foundational building blocks together in order to solve problems they’ve never seen before. A little bit like putting Lego pieces together. 
So to wrap it up, to sum up what I just said, the idea of general intelligence had a more limited meaning in cognitive science, referring to human ability to have multiple different types of skills for problem-solving and reasoning. Later on, it was also, of course, criticized in terms of the specificity of it and ignoring different kinds of intelligence. In AI, this notion has been having many different kinds of meanings. If we just mean it’s a kind of a toolbox of general kinds of intelligence for something that can be akin to an assistant to a human, that could make sense. But if we go too far and use it in the kind of absolute notion of general intelligence, as it has to encompass all kinds of intelligence possible, that might be untenable. And also perhaps we shouldn’t think about it in terms of a lump of one end-to-end system that can get all of it down. Perhaps we can think about it in terms of understanding the different components that we have also seen emerge in evolution in different species. Some of them are robust across many different species. Some of them are more specific to some species with a specific ecological niche or specific problems to solve. But I think perhaps it could be more helpful to find those cognitive and other interpersonal, cultural, different notions of intelligence; break them down into their foundational building blocks; and then see how a particular artificial intelligence agent can bring together different skills from this kind of a library of intelligence skills in order to solve problems it’s never seen before.
LLORENS: There are two concepts that jump out at me based on what you said. One is artificial general intelligence and the other is humanlike intelligence or human-level intelligence. And you’ve referenced the fact that, you know, oftentimes, we equate the two or at least it’s not clear sometimes how the two relate to each other. Certainly, human intelligence has been an important inspiration for what we’ve done—a lot of what we’ve done—in AI and, in many cases, a kind of evaluation target in terms of how we measure progress or performance. But I wonder if we could just back up a minute. Artificial general intelligence and humanlike, human-level intelligence—how do these two concepts relate to you?
MOMENNEJAD: Great question. I like that you asked to me because I think it would be different for different people. I’ve written about this, in fact. I think humanlike intelligence or human-level intelligence would require performance that is similar to humans, at least behaviorally, not just in terms of what the agent gets right, but also in terms of the kinds of mistakes and biases that the agent might have. It should look like human intelligence. For instance, humans show primacy bias, recency bias, variety of biases. And this seems like it’s unhelpful in a lot of situations. But in some situations, it helps to come with fast and frugal solutions on the go. It helps to summarize certain things or make inferences really fast that can help in human intelligence, for instance. There is analogical reasoning. That is, there are different types of intelligence that humans do. Now, if you look at what are tasks that are difficult and what are tasks that are easier for humans and compare that to a, for instance, let’s say just a large language model like GPT-4, you will see whether they find similar things simple and similar things difficult or not. When they don’t find similar things easy or difficult, I think that we should not say that this is humanlike per se, unless we mean for a specific task. Perhaps on specific sets of tasks, an agent can be, can have human-level or humanlike intelligent behavior; however, if we look overall, as long as there are particular skills that are more or less difficult for one or the other, it might be not reasonable to compare them. That being said, there are many things that some AI agent and even a [programming] language would be better [than] humans at. Does that mean that they are generally more intelligent? No, it doesn’t because there are also many things that humans are far better than AI at. The second component of this is the mechanisms by which humans do the intelligent things that we do. We are very energy efficient. 
With very little energy consumption, we can solve very complicated problems. If you put some of us next to each other or at least give a pen and paper to one of us, this can be even a lot more effective; however, the amount of energy that it takes for any machine to solve similar problems is a lot higher. So another difference between humanlike intelligence or biologically inspired intelligence and the kind of intelligence that is in silico is efficiency, energy efficiency in general. And finally, the amount of data that goes into current state-of-[the-art] AI versus perhaps the amount of data that a human might need to learn new tasks or acquire new skills also seems to be different. So it seems like there are a number of different approaches to comparing human and machine intelligence and deriving the criteria for a machine intelligence to be more humanlike. But other than the conceptual aspect of it, it’s not clear that we necessarily want something that’s entirely humanlike. Perhaps we want in some tasks and in some particular use cases for the agent to be humanlike but not in everything.
LLORENS: You mentioned some of the weaknesses of human intelligence, like recency bias. What are some of the weaknesses of artificial intelligence, especially frontier systems today? You’ve recently published some works that have gotten into new paradigms for evaluation, and you’ve explored some of these weaknesses. And so can you tell us more about that work and about your view on this?
MOMENNEJAD: Certainly. So inspired by a very long-standing tradition of evaluating cognitive capacities—those Lego pieces that bring together intelligence that I was mentioning in humans and animals—I have conducted a number of experiments, first in humans, and built reinforcement learning models over the past more than a decade on the idea of multistep reasoning and planning. It is in the general domain of reasoning, planning, and decision making. And I particularly focused on what kind of memory representations allow brains and reinforcement learning models inspired by human brain and behavior to be able to predict the future and plan the future and reason over the past and the future seamlessly using the same representations. Inspired by the same research that goes back in tradition to Edward Tolman’s idea of cognitive maps and latent learning in the early 20th century, culminating in his very influential 1948 paper, “Cognitive maps in rats and men,” I sat down with a couple of colleagues last year—exactly this time, probably—and we worked on figuring out if we can devise similar experiments to that in order to test cognitive maps and planning and multistep reasoning abilities in large language models. So I first turned some of the experiments that I had conducted in humans and some of the experiments that were done by Edward Tolman on the topic in rodents and turned them into prompts for ChatGPT. That’s where I started, with GPT-4. The reason I did that was that I wanted to make sure that I will create some prompts that have not been in the training set. My experiments, although the papers have been published, the stimuli of the experiments were not linguistic. They were visual sequences that the human would see, and they would have to have some reinforcement learning and learn from the sequences to make inferences about relationships between different states and find what is the path that would give them optimal rewards. 
Very simple human reinforcement learning paradigms, however, with different kind of structures. The inspirations that I had drawn from the cognitive maps works by Edward Tolman and others was in this idea that in order for a creature, whether it’s a rodent, a human, or a machine, to be able to reason in [multiple] steps, plan, and have cognitive maps, which is simply a representation of the relational structure of the environment, in order for a creature to have these abilities or these capacities, it means that the creature needs to be sensitive and adaptive to local changes in the environment. So I designed the, sort of, the initial prompts and recruited a number of very smart and generous-with-their-time colleagues, who … we sat together and created these prompts in different domains. For instance, we also created social prompts. We also created the same kind of graph structures but for reasoning over social structures. For instance, I say, Ashley’s friends with Matt. Matt is friends with Michael. If I want to pass a message to Michael, what is the path that I can choose? Which would be, I have to tell Ashley. Ashley will tell Matt. Matt will tell Michael. This is very similar to another paradigm which was more like a maze, which would be similar to saying, there is a castle; it has 16 rooms. You enter Room 1. You open the door. It opens to Room 2. In Room 2, you open the door, and so on and so forth. So you describe, using language, the structure of a social environment or the structure of a spatial environment, and then you ask certain questions that have to do with getting from A to B in this social or spatial environment from the LLM, or you say, oh, you know, Matt and Michael don’t talk to each other anymore. So now in order to pass a message, what should I do? So I need to find a detour. Or, for instance, I say, you know, Ashley has become close to Michael now. 
So now I have a shortcut, so I can directly give the message to Ashley, and Ashley can directly give the message to Michael. My path to Michael is shorter now. So finding things like detours, shortcuts, or if the reward location changes, these are the kinds of changes that, inspired by my own past work and inspired by the work of Tolman and others, we implemented in all of our experiments. This led to 15 different tasks for every single graph, and we have six graphs total of different complexity levels with different graph theoretic features, and [for] each of them, we had three domains. We had a spatial domain that was with rooms that had orders like Room 1, Room 2, Room 3; a spatial domain that there was no number, there was no ordinal order to the rooms; and a social environment where it was the names of different people and so the reasoning was over social, sort of, spaces. So you can see this is a very large number of tasks. It’s 6 times 15 times 3, and each of the prompts we ran 30 times for different temperatures. Three temperatures: 0, 0.5, and 1. And for those who are not familiar with this, a temperature of a large language model determines how random it will be or how much it will stick to the first or the best option that comes to it at the last layer. And so when there are some problems that may be the first obvious answer that it finds are not good, perhaps increasing the temperature could help, or perhaps a problem that needs precision, increasing the temperature would make it worse. So based on these ideas, we also tried it for different temperatures. And we tested eight different language models like this in order to systematically evaluate their ability for this multistep reasoning and planning, and the framework that we use—we call it CogEval—and CogEval is a framework that’s not just for reasoning and multistep planning. Other tasks can be used in this framework in order to be tested, as well. 
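[Editor's illustration, not part of the recorded conversation.] The traversal, detour, and shortcut tasks described above can be pictured as shortest-path queries on a small friendship graph. The sketch below is a minimal example using breadth-first search. The extra person "Dana" and the helper names `make_graph` and `shortest_path` are hypothetical, added only so that a detour exists; they do not come from the CogEval study.

```python
from collections import deque

def make_graph(edges):
    """Build an undirected adjacency map from (person, person) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search: return one shortest list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Friendships from the spoken example, plus hypothetical "Dana".
base = [("Me", "Ashley"), ("Ashley", "Matt"), ("Matt", "Michael"),
        ("Matt", "Dana"), ("Dana", "Michael")]

# Traversal: pass a message to Michael.
baseline = shortest_path(make_graph(base), "Me", "Michael")

# Detour: Matt and Michael no longer talk, so that edge disappears.
detour_edges = [e for e in base if e != ("Matt", "Michael")]
detour = shortest_path(make_graph(detour_edges), "Me", "Michael")

# Shortcut: Ashley has become close to Michael, so a new edge appears.
shortcut = shortest_path(make_graph(base + [("Ashley", "Michael")]), "Me", "Michael")

print(baseline, detour, shortcut, sep="\n")
```

A model with a usable cognitive map should update its answer after each local change in the same way this search does.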
And the first step of it is always to operationalize the cognitive capacity in terms of many different tasks like I just mentioned. And then the second task is designing the specific experiments with different domains like spatial and social; with different structures, like the graphs that I told you; and with different kind of repetitions and with different tasks, like the detour, shortcut, and the reward revaluation, transition revaluation, and just traversal, all the different tasks that I mentioned. And then the third step is to generate many prompts and then test them with many repetitions using different temperatures. Why is that? I think something that Sam Altman had said is relevant here, which is sometimes with some problems, you ask GPT-4 a hundred times, and one out of those hundred, it would give the correct answer. Sometimes 30 out of a hundred, it will give the correct answer. You obviously want it to give hundred out of hundred the correct answer. But we didn’t want to rely on just one try and miss the opportunity to see whether it could give the answer if you probed it again[1]. And in all of the eight large language models, we saw that none of the large language models was robust to the graph structure. Meaning, its performance got really worse as soon as the graph structure, [which] didn’t even have many nodes but just had a tree structure that was six or seven nodes, or a six- or seven-node tree was much more difficult for it to solve than a graph that had 15 nodes but had a simpler structure that was just two lines. We noted that sometimes, counterintuitively, some graph structures that you think should be easy to solve were more difficult for them. On the other hand, they were not robust to the task set. So the specific task that we tried, whether it was detour, shortcut, or it was reward revaluation or traversal, it mattered. For instance, shortcut and detour were very difficult for all of them. 
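[Editor's illustration, not part of the recorded conversation.] The effect of temperature mentioned above is usually described as dividing the model's scores by the temperature before turning them into probabilities. The sketch below uses that common formulation with made-up scores; it is not taken from any particular model's implementation. It also works out the number of distinct prompts from the figures given in the conversation. The total run count assumes 30 repetitions at each of the three temperatures, which the conversation does not state unambiguously.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature == 0:
        # Temperature 0 is treated as greedy decoding: all mass on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate answers
distributions = {t: softmax_with_temperature(logits, t) for t in (0, 0.5, 1)}
for t, probs in distributions.items():
    print(t, [round(p, 3) for p in probs])

# Size of the experiment as described: 6 graphs x 15 tasks x 3 domains.
prompts = 6 * 15 * 3
# Assumption: 30 repetitions at each of 3 temperatures, per model.
runs_per_model_if_30_per_temperature = prompts * 3 * 30
print(prompts, runs_per_model_if_30_per_temperature)
```

Lower temperatures concentrate probability on the top-scoring option, and higher temperatures spread it out. Repeating each prompt many times then gives a success rate rather than a single hit or miss.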
Another thing that we noticed was that all of them, including GPT-4, hallucinated paths that didn’t exist. For instance, there was no door between Room 12 and Room 16. They would hallucinate that there is a door, and they would give a response that includes that door. Another kind of failure mode that we observed was that they would fail to even find a one-step path. Let’s say between Room 7 and 8, there is a direct door. We would say, what is the path from 7 to 8? And they would take a longer path to get there. And a final mode that we observed was that they would sometimes fall into loops. Even though we would directly ask them to find the shortest path, they would sometimes fall into a loop on the way to getting to their destination, which obviously you shouldn’t do if you are trying to find the shortest path. That said, there are two differing notions of accuracy here. You can have satisficing, which means you get there; you just take a longer path. And there is this notion that you cannot get there because you used some imaginary path or you did something that didn’t make sense and you, sort of, gave a nonsensical response. We had both of those kinds of issues, so we had a lot of issues with giving nonsensical answers, repeating the question that we were asking, producing gibberish. So there were numerous kinds of challenges. What we did observe was that GPT-4 was far better than the other LLMs in this regard, at least at the time that we tested it; however, this is obviously on the basis of the particular kinds of tasks that we tried. In another study, we tried Tower of Hanoi, which is also a classic cognitive science approach to [testing] planning abilities and hierarchical planning abilities. And we found that GPT-4 scores between zero and 10 percent on the three-disk problem and zero percent on the four-disk problem. And that is when we started to think about having more brain-inspired solutions to improve that approach. But I’m going to leave that for next.
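[Editor's illustration, not part of the recorded conversation.] The failure modes described above (hallucinated doors, needlessly long paths, loops) and the two notions of accuracy can be told apart mechanically once a model's answer is parsed into a list of rooms. The sketch below is one hypothetical way to label an answer. The function name, the labels, and the door layout are assumptions made for this example, not the scoring used in the study.

```python
def classify_path(doors, path, start, goal, shortest_length):
    """Label a proposed room-to-room path against the known door layout."""
    # Doors work in both directions.
    passable = set(doors) | {(b, a) for a, b in doors}
    if not path or path[0] != start or path[-1] != goal:
        return "failed: wrong start or destination"
    for step in zip(path, path[1:]):
        if step not in passable:
            return "failed: hallucinated door"
    if len(set(path)) < len(path):
        # A room is visited twice: the goal is reached, but via a loop.
        return "satisficing: loop"
    if len(path) - 1 > shortest_length:
        return "satisficing: longer path"
    return "optimal"

# Made-up layout: a direct door between 7 and 8, and no door between 12 and 16.
doors = [(7, 8), (7, 9), (9, 8), (12, 13), (13, 16)]

print(classify_path(doors, [7, 8], 7, 8, 1))
print(classify_path(doors, [7, 9, 8], 7, 8, 1))
print(classify_path(doors, [7, 9, 7, 8], 7, 8, 1))
print(classify_path(doors, [12, 16], 12, 16, 2))
```

Separating "satisficing" from "failed" keeps a valid but inefficient route from being scored the same as an answer that uses a door that does not exist.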
LLORENS: So it sounds like a very extensive set of experiments across many different tasks and with many different leading AI models, and you’ve uncovered a lack of robustness across some of these different tasks. One curiosity that I have here is how would you assess the relative difficulty of these particular tasks for human beings? Would all of these be relatively easy for a person to do or not so much?
MOMENNEJAD: Great question. So I have conducted some of these experiments already and have published them before. Humans do not perform symmetrically on all these tasks, for sure; however, for instance, Tower of Hanoi is a problem that we know humans can solve. People might have seen this. It’s three little rods that are … usually, it’s a wooden structure, so you have a physical version of it, or you can have a virtual version of it, and there are different disks with different colors and sizes. There are some rules. You cannot put certain disks on top of others. So there is a particular order in which you can stack the disks. Usually what happens is that all the disks are on one side—and when I say a three-disk problem, it means you have three total disks. And there is usually a target solution that you are shown, and you’re told to get there in a particular number of moves or in a minimum number of moves without violating the rules. So in this case, the rules would be that you wouldn’t put certain disks on top of others. And based on that, you’re expected to solve the problem. And the performance of GPT-4 on Tower of Hanoi three disks is between 0 and 10 percent and on Tower of Hanoi four disks is zero percent—zero shot. With help, it can get better. With some support, it gets better. So in this regard, it seems like Tower of Hanoi is extremely difficult for GPT-4. It doesn’t seem as difficult for humans as it is for GPT-4. It seems, for some reason, that it couldn’t even improve itself when we explained the problem even further to it and explained to it what it did wrong. Sometimes—if people want to try it out, they should—sometimes, it would argue back and say, “No, you’re wrong. I did this right.” Which was a very interesting moment for us with ChatGPT. That was the experience that we had for trying it out first without giving it, sort of, more support than that, but I can tell you what we did next, but I want to make sure that we cover your other questions. 
But just to wrap this part up, inspired by tasks that have been used for evaluation of cognitive capacities such as multistep reasoning and planning in humans, it is possible to evaluate cognitive capacities and skills such as multistep reasoning and planning also in large language models. And I think that’s the takeaway from this particular study and from this general cognitive science–inspired approach. And I would like to say also it is not just human tasks that are useful. Tolman’s tasks were done in rodents. A lot of people have done experiments in fruit flies, in C. elegans, in worms, in various kinds of other species that are very relevant to testing, as well. So I think there is a general possibility of testing particular intelligence skills, evaluating it, inspired by experiments and evaluation methods for humans and other biological species.
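[Editor's illustration, not part of the recorded conversation.] For readers unfamiliar with Tower of Hanoi, the sketch below implements the textbook version: all disks start on one rod, only the top disk of a rod may move, and a larger disk may never sit on a smaller one. The study's exact start and target configurations may differ, so treat this as the standard puzzle rather than the experimental setup. The peg names and function names are chosen for this example.

```python
def solve_hanoi(n, source="A", spare="B", target="C"):
    """Return the minimum-length list of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, source, target, spare)    # clear the n-1 smaller disks
        + [(source, target)]                         # move the largest disk
        + solve_hanoi(n - 1, spare, source, target)  # restack the smaller disks
    )

def apply_moves(n, moves):
    """Replay moves from the standard start; raise ValueError on an illegal move."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # largest disk at the bottom
    for src, dst in moves:
        if not pegs[src]:
            raise ValueError(f"illegal move {src}->{dst}: source rod is empty")
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            raise ValueError(f"illegal move {src}->{dst}: larger disk on smaller")
        pegs[dst].append(pegs[src].pop())
    return pegs

three = solve_hanoi(3)
four = solve_hanoi(4)
print(len(three), len(four))
print(apply_moves(3, three))
```

The minimum number of moves for n disks in this standard version is 2^n - 1, so seven moves for three disks and fifteen for four.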
LLORENS: Let’s explore the way forward for AI from your perspective. You know, as you’ve described your recent works, it’s clear that you have, that your work is deeply informed by insights from cognitive science, insights from neuroscience, and recent works—your recent works—have called for the development, for example, of a prefrontal cortex for AI, and I understand this to be the part of the brain that facilitates executive function. How does, how does this relate to the, you know, extending the capabilities of AI, a prefrontal cortex for AI?
MOMENNEJAD: Thank you for that question. So let me start by reiterating something I said earlier, which is the brain didn’t evolve in a lump. There were different components of brains and nervous systems and neurons that evolved at different evolutionary scales. There are some parts of the brain that appear in many different species, so they’re robust across many species. And there are some parts of the brain that appear in some species that had some particular needs, some particular problems they were facing, or some ecological niche. What is, however, in common in many of them is that there seems to be some kind of a modular or multicomponent aspect to what we call higher cognitive function or what we call executive function. And so the kinds of animals that we ascribe some form of executive function of sorts to seem to have brains that have parts or modules that do different things. It doesn’t mean that they only do that. It’s not a very extreme Fodorian view of modularity. But it is the view that, broadly speaking, when, for instance, we observe patients that have damage to a particular part of their prefrontal cortex, it could be that they perform the same on an IQ test, but they have problems holding their relationship or their jobs. So there are different parts of the brain that selective damage to those areas, because of accidents or coma or such, it seems to impair specific cognitive capacities. So this is what very much inspired me. I have been investigating the prefrontal cortex for, I guess, 17 years now, [LAUGHS] which is a scary number to say. But been … basically since I started my PhD and even during my master’s thesis, I have been focused on the role of the prefrontal cortex in our ability for long-term reasoning and planning in not just this moment—long-term, open-ended reasoning and planning. 
Inspired by this work, I thought, OK, if I want to improve GPT-4’s performance on, let’s say, Tower of Hanoi, can we get inspired by this kind of multiple roles that different parts of the brain play in executive function, specifically different parts of the neocortex and specifically different parts of the prefrontal cortex, part of the neocortex, in humans? Can we get inspired by some of these main roles that I have studied before and ask GPT-4 to play the role of those different parts and solve different parts of the planning and reasoning problem—the multistep planning and reasoning problem—using these roles and particular rules of how to iterate over them. For instance, there is a part of the brain called anterior cingulate cortex. Among other things, it seems to be involved in monitoring for errors and signaling when there is a need to exercise more control or move from what people like to call a faster way of thinking to a slower way of thinking to solve a particular problem. And there is … so let’s call this the cognitive function of this part. Let’s call it the monitor. This is a part of the brain that monitors for when there is a need for exercising more control or changing something because there is an error maybe. There is another part of the brain and the frontal lobe that is the, for instance, dorsolateral prefrontal cortex; that one is involved in working memory and coming up with, like, simpler plans to execute. Then there is the ventromedial prefrontal cortex that is involved in the value of states and predicting what is the next state and integrating it with information from other parts of the brain to figure out what is the value. So you put all of these things together, you can basically write different algorithms that have these different components talking to each other. And we have in that paper also, written in a pseudocode style, the different algorithms that are basically akin to a tree search, in fact. 
So there is a part of the role … they’re part of the multicomponent or multi-agent realization of a prefrontal cortex-like GPT-4 solution. One part of it would propose a plan. The monitor would say, thanks for that; let me pass it on to the part that is evaluating what is the outcome of this and what’s the value of that, and get back to you. It evaluates there and comes back and says, you know, this is not a good plan; give me another one. And in this iteration, sometimes it takes 10 iterations; sometimes it takes 20 iterations. This kind of council of different types of roles, they come up with a solution that is solving the Tower of Hanoi problem. And we managed to bring the performance from 0 to 10 [percent] in GPT-4 to, I think, about 70—70 percent—in Tower of Hanoi three disks, and OOD, or out-of-distribution generalization, without giving any examples of a four disk, it could generalize to above 20 percent in four-disk problems. Another impressive thing that happened here—and we tested it on the CogEval and the planning tasks from the other experiment, too—was that it brought all of the, sort of, hallucinations from about 20 to 30 percent—in some cases, much higher percentages—to zero percent. So we had slow thinking; we had 30 iterations, so it took a lot longer. And if this is, you know, fast and slow thinking, this is very slow thinking. However, we had no hallucinations anymore. And hallucination in Tower of Hanoi would be making a move that is impossible. For instance, putting in a, kind of, a disk on top of another that you cannot do because you violate a rule or taking out a middle disk that you cannot pull out actually. So those would be the kinds of hallucinations in Tower of Hanoi. All of those also went to zero. And so that is one thing that we have done already, which I have been very excited about.
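[Editor's illustration, not part of the recorded conversation.] The propose, monitor, predict, and evaluate cycle described above can be sketched as a small search loop. In the sketch below, each role is a plain Python function standing in for a separately prompted GPT-4 instance, and the orchestration is a simple best-first search. All names, the scoring rule, and the search strategy are assumptions made for this example; the algorithms in the paper are only described here as "akin to a tree search."

```python
import heapq
from itertools import permutations

def propose(state):
    """Proposer stub: suggests every rod-to-rod move, legal or not."""
    return list(permutations(range(3), 2))

def monitor(state, move):
    """Monitor stub: accept a move only if it obeys the Tower of Hanoi rules."""
    src, dst = move
    if not state[src]:
        return False  # nothing to move from an empty rod
    return not state[dst] or state[dst][-1] > state[src][-1]

def predict(state, move):
    """Predictor stub: return the state that results from a move."""
    src, dst = move
    pegs = [list(p) for p in state]
    pegs[dst].append(pegs[src].pop())
    return tuple(tuple(p) for p in pegs)

def evaluate(state, goal):
    """Evaluator stub: disks on the last rod not yet matching the goal (lower is better)."""
    matched = sum(1 for a, b in zip(state[2], goal[2]) if a == b)
    return len(goal[2]) - matched

def plan(start, goal, max_iterations=1000):
    """Iterate the roles until some legal sequence of moves reaches the goal."""
    frontier = [(evaluate(start, goal), 0, start, [])]
    seen = {start}
    counter = 0  # tie-breaker so the heap never has to compare states
    for _ in range(max_iterations):
        if not frontier:
            return None
        _, _, state, moves = heapq.heappop(frontier)
        if state == goal:
            return moves
        for move in propose(state):
            if not monitor(state, move):
                continue  # rejected proposal, the analogue of asking for another plan
            nxt = predict(state, move)
            if nxt not in seen:
                seen.add(nxt)
                counter += 1
                heapq.heappush(frontier, (evaluate(nxt, goal), counter, nxt, moves + [move]))
    return None

start = ((3, 2, 1), (), ())
goal = ((), (), (3, 2, 1))
solution = plan(start, goal)
print(len(solution), solution)
```

Because every proposal passes through the monitor before it is applied, the final plan cannot contain an illegal move, which mirrors the reported drop in hallucinated moves. The plan found this way is legal but not guaranteed to be the shortest.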
LLORENS: So you painted a pretty interesting—fascinating, really—picture of a multi-agent framework where different instances of an advanced model like GPT-4 would be prompted to play the roles of different parts of the brain and, kind of, work together. And so my question is a pragmatic one. How do you prompt GPT-4 to play the role of a specific part of the human brain? What does that prompt look like?
MOMENNEJAD: Great question. I can actually, well, we have all of that at the end of our paper, so I can even read some of them if that is of interest. But just a quick response to that is you can basically describe the function that you want the LLM, in this case GPT-4, to play. You can write that in simple language. You don’t have to tell it that this is inspired by the brain. It is completely sufficient to just basically provide certain sets of rules in order for it to be able to do that.[2] For instance, after you provide the problem, sort of, description … let me see if I can actually read some part of this for you. For instance, you give it a problem, and you say, consider this problem. Rule 1: you can only move a number if it’s at this and that. You clarify the rules. Here are examples. Here are proposed moves. And then you say, for instance, your role is to find whether this particular number generated as a solution is accurate. In order to do that, you can call on this other function, which is the predictor and evaluator that says, OK, if I do this, what state do I end up in, and what is the value of that state? And you get that information, and then based on that information, you decide whether the proposed move for this problem is a good move or not. If it is, then you pass a message that says, all right, give me the next step of the plan. If it’s not, then you say, OK, this is not a good plan; propose another plan. And then there is the part that plays the role of, hey, here is the problem. Here are the rules. Propose the first [step] towards the subgoal or find the subgoal towards this and propose the next step. And that one receives this feedback from the monitor. And the monitor has asked the predictor and evaluator, hey, what happens if I do these things and what would be the value of that, in order to say, hey, this is not a great idea. So in a way this becomes a very simple prefrontal cortex–inspired multi-agent system. 
All of them are within the same … sort of, different calls to GPT-4 but the same instance. Just, like, because we were calling it in a code, it’s just, you just call, it’s called multiple times and each time with this kind of a very simple in-context learning text that, in text, it describes, hey, here’s the kind of problem you’re going to see. Here’s the role I want you to play. And here is what other kind of rules you need to call in order to play your role here. And then it’s up to the LLM to decide how many times it’s going to call which components in order to solve the problem. We don’t decide. We can only decide, hey, cap it at 10 times, for instance, or cap it at 30 iterations and then see how it performs.
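The message flow described here (one role proposes a step, the monitor consults a predictor and evaluator, and the step is accepted or sent back, all under an iteration cap) can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the paper's algorithm: in the described system each role is a separate GPT-4 call with its own role prompt and the model decides which components to call, whereas here plain Python functions stand in for those calls and the control flow is fixed. The names `run_council`, `toy_actor`, and `toy_evaluator`, and the toy number-line problem, are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

GOAL = 3
MOVES = ["teleport", "-1", "+1"]  # "teleport" plays the part of an impossible move


def toy_actor(state: int, rejected: List[str]) -> str:
    """Stand-in for the proposing role: offer the first move not yet rejected."""
    return next((m for m in MOVES if m not in rejected), MOVES[-1])


def toy_evaluator(state: int, move: str) -> Tuple[Optional[int], float]:
    """Stand-in for the predictor and evaluator: predicted next state and its value."""
    if move == "teleport":
        return None, 0.0  # no valid next state exists for an impossible move
    next_state = state + 1 if move == "+1" else state - 1
    closer = abs(GOAL - next_state) < abs(GOAL - state)
    return next_state, (1.0 if closer else -1.0)


def run_council(
    start: int,
    is_goal: Callable[[int], bool],
    actor: Callable[[int, List[str]], str],
    evaluator: Callable[[int, str], Tuple[Optional[int], float]],
    max_iterations: int = 30,
) -> Tuple[Optional[List[str]], int]:
    """Return (plan, iterations used); plan is None if the cap is hit first."""
    state, plan, rejected = start, [], []
    for iteration in range(1, max_iterations + 1):
        if is_goal(state):
            return plan, iteration - 1
        move = actor(state, rejected)                # "propose the next step"
        next_state, value = evaluator(state, move)   # "what state, what value?"
        # Monitor: reject impossible or low-value moves and ask for another.
        if next_state is None or value <= 0:
            rejected.append(move)
            continue
        plan.append(move)                            # "give me the next step"
        state, rejected = next_state, []
    return (plan if is_goal(state) else None), max_iterations


plan, used = run_council(0, lambda s: s == GOAL, toy_actor, toy_evaluator)
capped = run_council(0, lambda s: s == GOAL, toy_actor, toy_evaluator, max_iterations=5)
```

In this toy run every accepted step costs three iterations because two proposals are rejected first, which mirrors the trade described in the conversation: more iterations, but no impossible move ever enters the plan. Lowering `max_iterations` shows the other side of that trade, since the loop then stops before a full plan is found.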
LLORENS: So, Ida, what’s next for you and your research?
MOMENNEJAD: Thank you for that. I have always been interested in understanding minds and making minds, and this has been something that I’ve wanted to do since I was a teenager. And I think that my approaches in cognitive neuroscience have really helped me to understand minds to the extent that is possible. And my understanding of how to make minds comes from basically the work that I’ve done in AI and computer science since my undergrad. What I would be interested in is—and I have learned over the years that you cannot think about the mind in general when you are trying to isolate some components and building them—is that my interest is very much in reasoning and multistep planning, especially in complex problems and very long-term problems and how they relate to memory, how the past and the future relate to one another. And so something that I would be very interested in is making more efficient types of multi-agent brain-inspired AI but also to train smaller large language models, perhaps using the process of reasoning in order to improve their reasoning abilities. Because it’s one thing to train on outcome and outcome can be inputs and outputs, and that’s the most of the training data that LLMs receive. But it’s an entirely different approach to teach the process and probe them on different parts of the process as opposed to just the input and output. So I wonder whether with that kind of an approach, which would require generating a lot of synthetic data that relates to different types of reasoning skills, whether it’s possible to teach LLMs reasoning skills, and by reasoning skills, I mean very clearly operationalized—similar to the CogEval approach—operationalized, very well-researched, specific cognitive constructs that have construct validity and then operationalizing them in terms of many tasks. 
And something that’s important to me, a very important idea and a part of intelligence that maybe I didn’t highlight enough in the first part, is being able to transfer to tasks that they have never seen before, where they can piece together different intelligence skills or reasoning skills in order to solve them. Another thing that I have done and I will continue to do is collective intelligence. So we talked about multi-agent systems, that they are playing the roles of different parts inside one brain. But I’ve also done experiments with multiple humans and how different structures of human communication lead to better memory or problem-solving. Humans, also, we invent things; we innovate things in cultural accumulation, which requires [building] on a lot of … some people do something, I take that outcome, take another outcome, put them together, make something. Someone takes my approach and adds something to it; makes something else. So this kind of cultural accumulation, we have done some work on that with deep reinforcement learning models that share their replay buffer as a way of sharing skill with each other; however, as humans become a lot more accustomed to using LLMs and other generative AI, basically generative AI would start participating in this kind of cultural accumulation. So the notion of collective cognition, collective intelligence, and collective memory will now have to incorporate the idea of generative AI being a part of it. And so I’m also interested in different approaches to modeling that, understanding that, optimizing that, identifying in what ways it’s better.[3] We have found both in humans and in deep reinforcement learning agents, for instance, that particular structures of communication, which are actually not the most energy-consuming ones (it’s not all-to-all communication but particular partially connected structures), are better for innovation than others. 
And some other structures might be better for memory or collective memory converging with each other.[4] So I think it would be very interesting—the same way that we are looking at what kind of components talk to each other in one brain to solve certain problems—to think about what kind of structures or roles can interact with each other, in what shape and in what frequency of communication, in order to solve larger, sort of, cultural accumulation problems.
[MUSIC PLAYS]
LLORENS: Well, that’s a compelling vision. I really look forward to seeing how far you and the team can take it. And thanks for a fascinating discussion.
MOMENNEJAD: Thank you so much.
[MUSIC FADES]
[1] Momennejad notes that repetitive probing allowed her and her colleagues to report the mean and standard deviation of the accuracy over all the responses with corresponding statistics rather than merely reporting the first or the best response.
[2] Momennejad notes that a “convenient and interesting fact about these modules or components or roles is that they’re very similar to some components in reinforcement learning, like actor and critic and tree search. And people have made prefrontal cortex–inspired models in deep learning in the past. This affinity to RL makes it easier to extend this framework to realize various RL algorithms and the sorts of problems one could solve with them using LLMs. Another feature is that they don’t all solve the big problem. There’s an orchestrator that assigns subgoals and passes them on, then the actor’s input and output or the monitor or evaluator’s input and output are parts of the problem, not all of it. This makes the many calls to GPT-4 efficient and is comparable to the local view or access of heterogeneous agents, echoing the classic features of a multi-agent framework.”
[3] Momennejad notes that one task she and her colleagues have used is similar to the game Little Alchemy: the players need to find elements, combine them, and create new components. There are multiple levels of hierarchy of innovation that are possible in the game; some of them combine components from different trajectories.
[4] Momennejad notes that this relates to some work she and her colleagues have done building and evaluating AI agents in multi-agent Xbox games like Bleeding Edge, as well.