IA Summit 2023 Ali Farhadi: The State of Open-Source Models & Importance of an Open AI Ecosystem

Posted on

October 18, 2023

It was a pleasure hosting the second annual IA Summit in person on October 11, 2023. Over 250 founders, builders, investors, and thought leaders from across the AI community, along with over 40 speakers, dove into the world of intelligent and generative applications: from the open-source and closed models needed to run them, to the emerging architectures and frameworks needed to build them, to the battle emerging between Gen-Native and Gen-Enhanced applications. We’re excited to share the recording and transcript of the keynote AI2 CEO Ali Farhadi gave during the Summit: The State of Open-Source Models & The Path to Foundation Models at the Edge.

TL;DR (generated with AI and edited for clarity)

AI2 CEO Ali Farhadi discusses the future of AI models, emphasizing the importance of open ecosystems for AI development. He mentions the gaps in AI technology and argues that the progress of AI has been made possible through an open AI ecosystem where researchers build upon each other’s work. He also discusses the need for open access to training data, open model configuration, and integrated evaluation systems to advance AI while ensuring transparency and control. He encourages continued collaboration and open access to data and models to drive AI progress. You can watch all the IA Summit session recordings here and read more about Madrona’s AI thesis here.

This transcript was automatically generated.


Matt: Ali Farhadi is somebody I’ve known for almost a decade, and it has been an absolute journey, from being a computer scientist and professor at the University of Washington and at the Allen Institute for AI. Ali and his co-founder, Mohammad, started a company called Xnor about seven years ago. He has expertise in vision models and in how you run those models in very resource-constrained environments, like at the edge. That company was bought by Apple in 2019, and Ali spent the last three and a half years working at Apple before recently joining AI2 as CEO. So he’s going to combine these worlds he’s lived in, from research and nonprofit and for-profit, small and large, into what I think you’ll find is a very engaging and thought-provoking talk. So, without further ado, Ali Farhadi.


Ali Farhadi: Thank you, Matt. Thank you to the Madrona family for this amazing event. Very excited to be here. Don’t worry about the crow in the title; I’ll explain my way out of it. I’m going to spend very little time talking to this crowd about how amazing these models are. We take it for granted. We know the capabilities; we know how well these models behave. Today, I’m going to talk a little bit about what it means to think about the future of these models, and about the places I should think about, or worry about, as I’m building them. And at the end, I’ll talk about some logistical issues that I see coming, how we’re going to face them, and how we’re going to solve them.

So, 10 years ago, I showed this video. I was tasked with setting up a Grand Challenge in AI. My task was: think about a problem that we cannot solve in 10 years. So I used this video as my running example. Look at the video. It’s a crow watching a person digging a hole in an Arctic area. He puts something in, and the crow is watching this whole scene. I want you to think about the crow’s detailed understanding of what’s happening in this scene.

How situational, how contextual, how accurate is her understanding? She knows that she should fly to the hole. She knows exactly what’s happening and what to expect at the other end of the rope. She knows how to run a sequence of actions, one after the other, in a very careful way. Look at her foot. When she uses her beak to pull the rope, she puts her foot on the rope so it doesn’t slip back in. And she just runs a sequence of actions, one after the other, knowing what to expect as the outcome of each action.

And the minute the crow gets what she wants, she also knows what to expect. The minute the crow sees that the fisherman is actually running back, she knows what she did to the fisherman. She knows how the fisherman thinks about the crow’s actions, and she starts acting on it. She flies away. So I was using this as my narrative to explain how important actions and interactions are in the understanding of intelligent systems and intelligent applications, and how we should think about these kinds of behaviors as we move toward building these systems.

Little did I know how fast the progress of AI would be. I think I failed the Grand Challenge task; it took less than 10 years. The problem actually moved significantly. So I grabbed this video and fed it to your favorite large multimodal model of your choice. In this case, this model has, I don’t know, 300 million parameters.

And look at the description that you get out of this video. Some of it is actually pretty good. And in some of it, there are details missing from what you’re actually looking at. For example, it confuses a stick with a knife. You see that some of the big pieces are missing. Early on in this video, the model’s understanding has no reflection of the intention of this person. It’s just an explanation of what’s happening in the video.

What do I have to do these days? Obviously, I’m going to grab the description and plug it into our favorite model, in this case, GPT-4. I’m going to ask GPT-4, “Imagine you’re a crow, and here’s a description of the environment. What would you do?” It’s not bad. So, directionally, it understands that, if I’m a crow, I’m going to fly to the hole; I’m looking for shiny things. At this point you see there’s a little bit of a mix of contextual things, prior knowledge about crows and what they like, and a little bit of actual reaction. But it’s missing some of the most important elements of this scene: the intent, the goal, the set of actions.

All of those are not popping out. And I would argue that, to be able to get to that next level, to close that gap, to go from our 85% accuracy to an actually higher number, we need to be more situational, more contextual, and understand the problem in terms of actions, interactions, and relationships. What I’m trying to get at is that I’m looking to build these LLMs for the crow and for the physical world around it.

And I would argue that a good portion of our knowledge of how we do things is not written in text, because it’s just so obvious to us that we don’t talk about it. We need to actually get it out of actions and interactions, and that’s what I’m trying to build. Now, let me contrast this with how we build language models. Among many crucial elements that help us build language models, I like two factors, and I think these two factors play a key role here.

One was our ability to crawl and index the web, so we get large amounts of relatively clean and useful pieces of text, or images on the web, or videos, if you like. The other piece is that we stumbled upon this magical loss function. The core of these models, at the end of the day, is: I’m going to read a sentence, I’m going to pause and ask, “Predict the next word.” And we would never have predicted that so many amazing properties would emerge out of just predicting the next word.

And, to be honest with you, we still don’t know and can’t explain why this magical loss function behaves this way, but we’ve seen enough evidence that this loss function is magical and that it’s powering those kinds of emergent properties. So I’m going to build the correspondence of this for the physical world. What do I need to do first? Well, I need to find a way to crawl the world. I need to find a way to index the world. And then I need to find a way to experiment with various loss functions to figure out what the corresponding element to this magical loss function is.
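The “read a sentence, pause, predict the next word” loop can be sketched with a toy count-based bigram model and a cross-entropy loss. This is only an illustration of the loss being described, not how a transformer is trained, and the tiny corpus is made up:

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count bigram transitions to form a tiny next-token model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_loss(counts, tokens):
    """Average cross-entropy of predicting each next token."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values())
        p = counts[prev][nxt] / total if total else 0.0
        losses.append(-math.log(p) if p > 0 else float("inf"))
    return sum(losses) / len(losses)

corpus = "the crow pulls the rope and the crow eats".split()
model = train_bigram(corpus)
loss = next_token_loss(model, corpus)
```

Scaling this same objective from bigram counts to billions of parameters is, at heart, the jump the talk is describing.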

That means I need to start thinking about embodiment: what it means to actually be embodied as an intelligent agent in the world and to act in it, and the kind of flexibility and adaptability you need to be able to act in the world. I’m going to start by crawling the world. Well, crawling the world is hard. It requires lots of moving agents, and we’re still waiting for our robotics friends to give us that 10-penny robot we can deploy at scale. But, until that point, there’s a lot that can be done.

And a key here is synthetic environments. For a while, we’ve been pushing the notion that we can start building a lot of capabilities in made-up synthetic environments. These days, it’s common practice. We’re very comfortable with it. Many of these industry people were just talking about using GPT-4, a bigger model, to fake data to tune a smaller model. That is now a common practice. Back in the day, it was actually controversial. There was a lot of push-back: “Oh, my God. You cannot touch synthetic data. The texture is wrong. The shading is wrong. The reflections are wrong. The physics of your environments is not as accurate as the physics of the world. Therefore, what you learn out of these systems is going to be wrong.”

Little by little, this changed, to the state we’re in today. These models, these environments, are extremely popular. They are the common ground when you start thinking about interacting with the real world. At AI2 we built one of those, called THOR. It’s a decade-long project. The latest piece of it is called ProcTHOR, and I’m going to touch upon ProcTHOR mainly because it draws interesting analogies to language models.

When you think about the progress from the first of these contextual language models, ELMo and BERT and the like, all the way to the new ones, one of the things that changed is the amount of data we fed them, and the behavior and the scaling laws that everybody’s talking about today. It’s a powerful tool. So we wanted to do the same with these environments, because we had a small number of them. We had to scale them up. So we found a way to procedurally generate a rather unlimited number of these environments. And these are houses. You can actually interact with them. You can open a fridge, you can break things, you can pour water into a glass.

And we started trying to build things, one after the other. But, remember, my task is not to build you the robot for your dishwasher. That’s a different task. I’m after crawling, so I can actually learn the model. So let’s see what happened. The simplest task you could do in these environments is just learning to move around. You want to move around in these environments without bumping into people and things, and find a way to get from A to B. Around 10 years ago, our success rate was around 10%.

And, little by little, it improved. The biggest bump came from large amounts of data, the same thing that happens with language models. And now we can comfortably move around these environments. But these are still synthetic environments. This does not mean that I can start deploying robots from this into the world and start interacting with the world. So we needed to find a way around this problem.

And here there’s yet another set of interesting analogies to what we talked about. I think Matt touched on a solution to this earlier today when he was talking about how cell phones are great things, and there are certain interactions that come with them. And we all, at this point, are very comfortable talking about prompting a language model, so I want to do the same thing.

I want to prompt these models that I built. But how do I prompt a physical model of the world? When I prompt a language model, what am I doing? I’m providing enough contextual knowledge that my language model can attend to what I care about for that task. So let’s do the same. I’m going to grab my phone, and I’m going to scan the environment as a way of getting my model to attend to what I care about. The other piece that we learned a lot about today, and probably yesterday, is RAG, retrieval-augmented generation. I can do the same thing here. Now, I can grab this phone and get a rough 3D structure of this environment. It doesn’t need to be accurate, and it won’t be accurate anyway.

But I can use this to retrieve. Now that I can retrieve, I can start augmenting. I can start faking many rooms that are structurally similar to what I care about, on the fly, on the spot. And now that I have these, I can start tuning my model, nudging it a little bit toward that environment. And, boom, what happens is that now I can deploy things to environments they’ve never been in before, and they act very, very reliably.
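The scan-retrieve-augment-tune loop being described can be sketched as follows. The feature vectors and house names are hypothetical, and retrieval here is plain nearest-neighbor search over coarse layout features, a deliberate simplification of matching a phone scan against a library of synthetic rooms:

```python
import math

def retrieve_similar(scan_vec, library, k=2):
    """Rank a library of synthetic environments by closeness to the phone scan."""
    ranked = sorted(library.items(), key=lambda kv: math.dist(scan_vec, kv[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical coarse layout features: [n_rooms, n_doors, floor_area_m2]
library = {
    "proc_house_01": [3.0, 4.0, 80.0],
    "proc_house_02": [6.0, 9.0, 210.0],
    "proc_house_03": [3.0, 5.0, 95.0],
}
phone_scan = [3.0, 4.0, 85.0]  # a rough 3D scan summarized the same way
similar = retrieve_similar(phone_scan, library)
# The retrieved rooms would seed procedural generation of many
# look-alike variants, which in turn nudge (fine-tune) the policy.
```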

I’m going to draw the analogies one more time. I’m prompting with my phone. I’m getting the right set of information to retrieve similar things, and I’m going to augment, and I’m going to fine-tune. Now that I can actually start moving around in the real world, the next question is: how do I index this much data? When you think about indexing, we’re comfortable with these giant networks that produce hundreds of thousands of labels; the space of all the words we have is oftentimes our ceiling. Sometimes, when we go to actions, it’s a little bit less or more.

But that’s the size of the leaf nodes of these networks that we care about. When I go to indexing the world (let me skip through this part for now), I need to find a way to index a wide range of things. But before going into that, I wanted to tell you what else I learned by just being in that environment and acting beyond navigation. We could start building agents in a similar environment that could interact with the world. They can cooperate with each other and move around.

We learned a lot about emergent properties. I’m going to draw yet another analogy here. One thing we did is take this large number of environments and have agents play hide-and-seek against each other. There is no supervision. One agent’s role is: given a random object assigned to it, hide it somewhere. The other agent’s role is to find that thing. The hiding agent gets rewarded by how successful it is in hiding, so the finder cannot find the object. The seeker’s reward is whether it can successfully find the object or not.

And interesting things emerged. I want to connect this back to studies on human intelligence from the 1960s and 1970s, and a big discovery that happened back then. Psychologists and computational psychologists discovered that as infants start developing an understanding of the world around them, they develop a sense of what objects are. Objects have persistence. They don’t disappear. They don’t evaporate. If I show you an apple moving around and then block the apple, an airplane won’t come out the other side. I expect the apple to come out.

They measured the surprised reactions of the kids to these kinds of inconsistencies in the environment as a way to measure their cognitive development. And, depending on whom you ask, children anywhere between four and six months old already show these kinds of behaviors. So I wanted to test the same with the agents. We ran the same [inaudible 00:15:24] experiments with these agents and measured their surprise. And the same behavior popped up. Now, agents are surprised if I show them an apple, and then I block it, and a car comes out.

And there’s actually a whole literature on this. So these kinds of properties are emerging out of this. I won’t go into the details, but I’d be happy to talk offline. But let me go back to indexing. Now that I have the whole world, I need to go beyond hundreds of thousands of objects. I need to go to trillions, if not more, because now I’m dealing with instances in the world. This chair and that chair are very, very different chairs in the world. To my classifier, oftentimes, they were the same.

So we started thinking about how we can index the world in a way that becomes manageable. Time does not allow me to go into the details, but it turned out that you can build nested representations if you are very careful about how you represent these interactions, and our favorite binary encoding came to the rescue. This is already running at some of these technology giants. But to give you a glimpse of what’s happening there: you can index one billion images in eight gigabytes of memory. That means you can index a billion images on a device with these kinds of technologies.
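The arithmetic behind “a billion images in eight gigabytes” is 64 bits (8 bytes) per image. One common way to get such codes, used here purely as an illustration since the talk doesn’t name the exact method, is random-hyperplane hashing, where similar vectors get codes with small Hamming distance:

```python
import random

def binary_code(vec, planes):
    """Sign of the projection onto each random hyperplane gives one bit."""
    bits = 0
    for plane in planes:
        dot = sum(x * w for x, w in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits  # fits in a 64-bit int when len(planes) == 64

def hamming(a, b):
    """Search is then cheap bit arithmetic over these compact codes."""
    return bin(a ^ b).count("1")

random.seed(0)
dim, n_bits = 16, 64  # 64 bits * 1e9 images = 8 GB of codes
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

base = [random.gauss(0, 1) for _ in range(dim)]
near = [x + random.gauss(0, 0.05) for x in base]   # slight perturbation
far = [random.gauss(0, 1) for _ in range(dim)]     # unrelated vector

codes = {name: binary_code(v, planes) for name, v in
         [("base", base), ("near", near), ("far", far)]}
```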

So we kind of have an understanding of how to move around, how to collect data, and maybe an understanding of how to index it at scale. We can start toying with these systems and asking: what else do I need to be able to act in the world? When you start acting in the world, there are interesting things that fall apart. We have assumptions about in-distribution and out-of-distribution data. We try to get the training distribution close to the test distribution to get the best results we can. And there’s some nice literature on how to adjust for the transfer between these distributions.

But whatever we do, if you’re moving around in the world, you’re pretty much guaranteed to see things you’re not prepared for, and you need to find a way to act and adjust. I’m going to spend a little bit of time talking about those adjustments. Let me do a quick bit of history here. We all talk about these language models, these big models, large transformer models. And you’ll see that, despite the common trend, I refrain from using the term “foundation model.” But these large language models are pretty powerful. You can build multimodal versions of them, and you can use them.

And in the previous panel, we talked about this notion that there exists a God-given model that does everything. Little by little, we decided that might not be the case. We got a little more relaxed. We talked about ensembles, and I’m going to touch upon this a little later. But these God-given models were good at some tasks. You could build these large models, and if you measure how well they do on unseen data (“unseen” because, when the data is so big, unseen is a questionable measurement by itself), they perform pretty well.

So that was the first shock we saw in multimodal models. These are results from a CLIP model. Then we started testing them on some other datasets, and it turned out that there were gaps. These magical, amazing models that we were so excited about performed worse than chance on some of the datasets we were working on. Very quickly, we realized that, okay, maybe I need to take these models and fine-tune them. There’s a whole literature on fine-tuning. I don’t want to spend a lot of time on it, but we learned a lot. Fine-tuning helps us a lot by getting these base models to where they should be, or where I want them to be.

Compare the orange with the blue and see how much gain you get just by fine-tuning to your task. But fine-tuning came at a cost. Later, we realized that every time I fine-tune my model, I’m losing robustness. Compare where red was before fine-tuning to after fine-tuning: I get a lot of gain on my target task, which is the blue, but I’m losing a lot of robustness. That means the model I have today is not going to perform as well as I expected on other things.

This is the plot I’m going to spend a little time on. This plot shows that I had a star model. The star model is the CLIP zero-shot model. It’s a very good base that I can build upon, and I can fine-tune it. The same phenomenon: if I fine-tune, I move in the right direction on the x-axis, but I go down on the y-axis. X is the accuracy on my desired task, and Y is my control task, which measures my robustness.

I see a drop. And the question is: how can I deal with this drop? Can I build a model that gets the best of both worlds? Because if I’m losing something every time I fine-tune, I need to worry about my future, because my notion of my target task will change over time. After five iterations of fine-tuning, I might get myself into trouble, because I might be in a place I can’t come back from, after spending millions of dollars building that model.

So how should we do this? Let me take a quick detour here. I promise it’s going to be very, very light. So I have a model, theta-0. What I do is get lots of data and optimize my way from theta-0 to theta-1. That’s where I say, “Oh, I’ve trained a great model. Go test it.” What I’m showing you is my loss landscape. That means I’m going down the loss value; that’s how we optimize models. I can do another one of those: I can take another model, fine-tune it, and get another fine-tuned model.

Now I have two fine-tuned models at hand, theta-1 and theta-prime-1. Something magical happens in that vicinity. Little by little, we discovered that the space between these models, and I mean the space in a very high-dimensional weight space, is a space where, if I grab a point randomly, that point corresponds to a model by itself. I’m going to pause for a second. That means I trained a model, or two models, and now I can arrive at multiple models without ever training anything.

Notice, all I’ve done is a summation. These properties actually do exist in practice, and they have characteristics that we cannot completely explain. But something very interesting happens when you do this. So now I can go back to my plot. I had my star; I had my square. Now, let me start interpolating between those two models. At this point, all I’m doing is simple arithmetic. I have a curve that traces out from the star to the square.

Look at the blue point. The blue point is a point I arrived at almost for free; it’s just a summation. That blue point is more accurate than my square on the target task, and way more accurate than the star on the control task. So now we can get a ton of models this way and put them together in a technique we call model soup. And that ended up winning the ImageNet challenge for the most accurate single model out there. Very simple technique: get a bunch of models, average them, and you magically arrive at a model that behaves in an interesting way.
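The interpolation and “model soup” ideas reduce to elementwise arithmetic on weights. A minimal sketch, with models standing in as plain dicts of named weights (real soups average full checkpoints that share an architecture):

```python
def interpolate(theta_a, theta_b, alpha):
    """Linearly interpolate two models that share the same architecture."""
    return {k: (1 - alpha) * theta_a[k] + alpha * theta_b[k] for k in theta_a}

def soup(models):
    """Uniform 'model soup': average any number of fine-tuned models."""
    n = len(models)
    return {k: sum(m[k] for m in models) / n for k in models[0]}

zero_shot = {"w1": 0.0, "w2": 1.0}   # the 'star' base model (toy weights)
fine_tuned = {"w1": 1.0, "w2": 3.0}  # the 'square' fine-tuned model

midpoint = interpolate(zero_shot, fine_tuned, 0.5)  # a point on the curve
averaged = soup([zero_shot, fine_tuned])            # same point, as a soup
```

Sweeping `alpha` from 0 to 1 traces exactly the star-to-square curve in the plot, and every point on it costs only a summation, not a training run.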

Now, this is what happened to my patching problem. This model can behave in an interesting way on unseen tasks and close up some gaps. The most exciting thing to me is this part. We talked about interpolating between two models. Now I’m going to scratch my head and see how I can extend this. I’ve done interpolation. Let’s do extrapolation. Can I take two models and extrapolate them against each other?

While you’re thinking about what it even means to extrapolate two models against each other (I’ll get to this in a little bit), let me define a notion. I’m going to start defining arithmetic in the space of models that are already trained. So I have a bunch of fine-tuned models, and I’m going to define arithmetic in this space. There’s a thing called a task vector. Very simply, a task vector is the vector that gets me from the pre-trained model to a fine-tuned model.

Now that I have two task vectors, I can start negating one from the other. I can start adding them together, and there are many other things you can do with this task arithmetic. What does it mean to negate a model from another one? Well, we’ve all heard about notions of unlearning. You train your GPT model and realize it’s actually a toxic model. There are two things you can do. You can put a bandage on it and scold it every time it’s toxic, and then it will be jailbroken. Or you can actually unlearn that characteristic of the model.

So, I had a base model. I trained a toxic model. And I just move in the direction of less toxicity: I extrapolate one from the other, and I end up with a model that is as accurate as the original model, but 10x less toxic. Matt talks a lot about ensembles. Now, I have a very cheap way of ensembling these models together: I can just add them up. Remember, ensembling models is a fairly expensive practice. If I want to run an ensemble of 10 models, my inference cost is 10x. But this way, I’m just summing all 10 together and running once, and I get the behavior I would get out of the 10 models.
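Task vectors and their negation are, again, elementwise arithmetic. The “toxic” fine-tune below is a one-weight stand-in, fabricated to show only the mechanics being described: subtract a task vector to move away from a behavior, add scaled vectors to merge behaviors.

```python
def task_vector(pretrained, finetuned):
    """A task vector points from the pre-trained weights to the fine-tuned ones."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_vector(model, vector, scale=1.0):
    """Add (scale > 0) or negate (scale < 0) a task vector at a chosen strength."""
    return {k: model[k] + scale * vector[k] for k in model}

base = {"w": 1.0}    # pre-trained model (toy single weight)
toxic = {"w": 3.0}   # hypothetical fine-tune on toxic text

tv = task_vector(base, toxic)
detoxified = apply_vector(base, tv, scale=-1.0)  # extrapolate away from toxicity
merged = apply_vector(base, tv, scale=0.5)       # cheap one-pass 'ensemble'
```

The same `apply_vector` call, summed over many task vectors, is the cheap one-inference-pass alternative to running 10 models and averaging their outputs.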

Quite fascinating. We’re still trying to wrap our heads around this topic, but what it means to me is quite interesting. This is a little dated, and I’m sure the numbers have exploded even further since. This is Hugging Face. On Hugging Face, this is the number of fine-tuned models built off a shared backbone, and you can see how it’s exploding. So we already have a large number of fine-tuned models at hand.

And here’s a speculation I will make. We can start thinking about model training not in the standard, conventional, stochastic-gradient-descent way, which is very expensive and very hard to do, but rather as: let me go over here, grab whatever makes sense, and just sum the models up, or interpolate or extrapolate between them, to arrive at a model.

We talked about Jensen’s quote, that there are going to be so many apps out there. I would argue that, in this way, there are going to be so many models out there. Just managing this large number of models is itself yet another phenomenon we need to start thinking about. But now training a model can be as simple as very, very cheap arithmetic in the space of existing models.

Let me talk very quickly about one more phenomenon. When you go to a meeting, oftentimes you have notes, or somebody preps you for the meeting before you get there. And your experience during that meeting is quite different if you just walk in with no context versus if somebody prepped you. I hope you all remember the night before exams and what we did. We spent a lot of time re-reading the stuff we had already learned. That material is not new; you’ve already learned it before, but you’re reading it again. This phenomenon has a lot of proof points in human cognition.

You’re actually priming your neurons to be ready for the task at hand. Does this resemble anything we talked about today? Isn’t this a form of prompting? So now I can start thinking about neural cognition the same way. Take image classification as a task. You give me an image and you want me to recognize it. What I do is take that image, go to my training set, the large training set these models were already trained on, and fetch a few similar instances. I have no labels, just a few fresh, similar instances. And I just nudge the model a little bit on those instances, on the fly, and see what happens.

Look at this gap in CLIP performance: you would probably need to wait three to five years of progress to see a gap this large. I’m talking about the blue bars, CLIP versus priming. This is priming on top of CLIP. A very, very simple thing: you give me a sample; at inference time, I fetch similar instances and nudge the model a little bit on them. Happy to talk offline about the details.
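A minimal sketch of the priming recipe just described: at inference time, fetch a few unlabeled training instances near the test sample and nudge a model parameter toward them. The “prototype” parameter and the toy 2D data are made up; the real method operates on CLIP-scale networks.

```python
import math

def nearest(query, pool, k):
    """Fetch the k training instances closest to the test sample (no labels)."""
    return sorted(pool, key=lambda x: math.dist(query, x))[:k]

def prime(prototype, neighbors, lr=0.2):
    """Nudge a model parameter (here, a class prototype) slightly toward
    the retrieved neighbors: a few tiny steps at inference time."""
    p = list(prototype)
    for n in neighbors:
        p = [pi + lr * (ni - pi) for pi, ni in zip(p, n)]
    return p

train_pool = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]]
prototype = [2.0, 2.0]          # hypothetical classifier parameter
test_sample = [0.05, 0.05]

neighbors = nearest(test_sample, train_pool, k=2)
primed = prime(prototype, neighbors)  # moved toward the sample's region
```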

The other piece I’d like to talk about is this: now that we’re getting into the world of unknowns, we’re pretty much guaranteed to encounter unknowns. But one thing we do know is that not all given tasks are equally hard, and we should be able to decide at inference time, right when I need to think about the problem, how to make sense of it. It turns out another possibility is to build nested representations. We train them all together. That means that, once you’re done training, you have a whole family of models at your disposal, and depending on the task and the resources and what you want, you can just grab the chunk that makes sense to you.
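The nested-representation idea means the first k coordinates of the full vector are themselves a valid, cheaper representation. The toy embedding function below is fabricated purely to show the slicing mechanics; in the real technique, it is joint training of all prefixes that makes every prefix useful.

```python
def embed(x, dim=8):
    """Hypothetical nested embedding: deterministic toy features whose
    prefixes are meant to stand in for smaller models in the family."""
    return [x * (0.5 ** i) for i in range(dim)]

def truncate(vec, k):
    """A smaller model is just a prefix of the full representation."""
    return vec[:k]

full = embed(3.0)            # the 'large' model's representation
small = truncate(full, 2)    # a cheap chunk for edge deployment
medium = truncate(full, 4)   # a mid-size chunk for a harder task
```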

One thing I’d like to note here is that, gradually, we’re moving the notion of what a model is from a static creature that we train at training time, freeze, and deploy, to a fluid structure that we mess with continuously, forever. Because now a model is not a model anymore; a model is a whole family of models. From these kinds of models, I can get thousands of models at my disposal, almost for free.

And these models perform very well. On your left is a vision model. Compare the red with the blue and see the gap. Each of the red dots is a model trained independently, and the whole family of blue points is just one model. On your right is a language model. The same phenomenon is there. Now I can start thinking about what deployment of these models means in the era of these fluid structures, what I’d call model 2.0.

As we talk about these problems, you see that these technologies are phenomenal. We are very happy about them. We’re talking about the next generation and how we’re going to deploy it. But I want you to also be aware that there are technology gaps here that we need to work on. And we need to understand how to fill these technology gaps in a time-sensitive way, because some of these problems and gaps are quite big and important.

And I would argue… sorry, I messed up with my clicker. I would argue that these technology gaps should be solved the same way we solved the problems of today: the way AI was developed. AI was developed in an open ecosystem. Little by little, people built on top of each other, and now we’re moving toward “Let’s close the whole thing up.” That will only do one thing: it will hinder the progress of AI.

And this will slow us down, if not block us, from addressing the technology gaps we need to address today to have trustworthy AI that can be deployed across industries. So I’m pushing for this open AI ecosystem. And when I say open, open does not mean training a model behind closed doors on unknown data, tossing it over the fence, and slapping a license on it so everybody can use it.

Those are great directions. We love them. We cherish them, but that is simply not enough. We need to know the training data, so we can study the effect of the data on the [inaudible 00:33:34] space of these models. We need to understand how to control these models by messing with the input. We need to have an open configuration of the models. Training these models is expensive today, and we are repeating those very, very expensive experiments, one after the other, because I need to reverse engineer what you did.

I look at your model and say, “Oh, that’s pretty cool. I want one of those.” What do I do? I run hundreds of very, very expensive experiments to discover what you did. And those experiments, A, are expensive; B, are environmentally very, very damaging. We’ve all seen the stats: hundreds of cars’ worth of emissions per experiment, hundreds of gallons of water evaporated per experiment. By opening this whole ecosystem up and making these big model training runs openly available to the public, we’ll stop this reverse-engineering exercise. People should build upon those models rather than redo what was done before.

And another key piece of this is evaluation. We need to hook up evaluation to this system so we can actually start nudging models, the same way I explained to you, toward less toxicity and other things that we want, at the get-go. If you let some level of information be encoded in these models, we're going to have a hard time controlling these models not to spit out bad things. Even with all those advances that we apply after the fact, we still cannot control them. The way to control them is to actually loop in evaluation with open data and open models together.
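The idea of looping evaluation in, rather than filtering after the fact, can be sketched in a few lines. This is only an illustrative toy, not AI2's actual pipeline: `toxicity_score` is a hypothetical stand-in for a real learned evaluator, and `rank_candidates` shows how an evaluator's signal could nudge model outputs at selection time.

```python
def toxicity_score(text: str) -> float:
    """Hypothetical evaluator: fraction of flagged words in the text.
    A real system would use a trained classifier instead."""
    flagged = {"awful", "hateful"}
    words = text.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def rank_candidates(candidates, quality_scores, toxicity_weight=2.0):
    """Combine a base quality score with a toxicity penalty, best first.
    The toxicity_weight parameter is an assumed knob for illustration."""
    scored = [
        (q - toxicity_weight * toxicity_score(c), c)
        for c, q in zip(candidates, quality_scores)
    ]
    return [c for _, c in sorted(scored, reverse=True)]

candidates = ["a helpful answer", "an awful hateful answer"]
# The less toxic candidate wins despite a lower raw quality score.
best = rank_candidates(candidates, quality_scores=[0.8, 0.9])[0]
print(best)
```

The same evaluator signal could instead be folded into a training objective; the point is that the evaluation loop runs during model development, not only after deployment.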

At AI2, we are actually trying to push for this notion, so recently we released Dolma. Dolma is an open dataset, a three-trillion-token dataset released a month ago. It's getting really, really popular. This dataset is actually five terabytes of data, so if you want to download it, you would spend a couple of days downloading it. But look at the [inaudible 00:35:30]. There is actually a really, really big need for open data, so we're going to just start studying this.

And we're releasing OLMo soon as an open model that allows us to start thinking about reproducibility, transparency, and how we're going to safeguard and guardrail these models, with evaluation looped into them. So the way I think about it is that AI was born and raised in the open. Now, putting it behind closed doors will only slow down, if not hinder, its progress, and I want us all to think about what it means to make progress in AI, in the context of how AI got here. With that, I'll stop and take questions.

Matt: We’ll take one burning question right there.

Audience: It’s rumored that GPT-4 is an ensemble model where they’re doing inference multiple times. If ensembling is as simple as interpolating or extrapolating, then why do they need to do that? What are the limits of interpolation?

Ali Farhadi: First of all, I can’t comment on the rumors. I haven’t seen what GPT-4 is doing, other than the amazing output that it produces. So I won’t be able to comment on that. But ensembling has different characteristics when you think about it. If you start ensembling with the same backbone, I would argue that you don’t need to do that. But oftentimes, when people ensemble, they actually break the architecture apart. And the minute you start breaking the architectures, none of these facts would actually follow. As a result, you need to actually ensemble at the output. That means that you need to run multiple models.

But if you look at how the responses of these chat engines change over time, you see an interesting phenomenon. And this is just my speculation: at some point we wanted to ensemble. We figured out it's too expensive, so we went to a smaller model. We figured out that people were unhappy. Now, we're eating more cost to serve a bigger model.

So we're still experimenting with all of those to figure out how these models behave. But if the backbone architecture is the same, I would argue that you don't need to waste your cycles. If the backbone architecture changes and now you have two very different architectures, none of this follows and you actually have to do the expensive ensemble.
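The distinction Ali draws can be sketched concretely: with different architectures you cannot merge weights, so the ensemble runs every model and combines at the output. This is a toy illustration, not a description of any production system; `model_a` and `model_b` are hypothetical stand-ins returning class probability distributions.

```python
def model_a(x):
    # Toy stand-in for one architecture's output distribution over 3 classes.
    return [0.6, 0.3, 0.1]

def model_b(x):
    # Toy stand-in for a second, structurally different architecture.
    return [0.2, 0.6, 0.2]

def ensemble(x, models):
    """Output-level ensembling: run every model on the same input
    and average the per-class probabilities."""
    outputs = [m(x) for m in models]
    n = len(models)
    return [sum(col) / n for col in zip(*outputs)]

probs = ensemble("some input", [model_a, model_b])
prediction = max(range(len(probs)), key=probs.__getitem__)
print(probs, prediction)
```

This is why output-level ensembling is expensive: inference cost scales linearly with the number of member models, since each must be run in full on every input.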

Matt: I think many of you are going to want to chat with Ali. That was an absolutely awesome talk. We’re going to take a 10-minute break. We’ve got some great Intelligent Applications top-40 companies to present [inaudible 00:38:10] presentations and a terrific panel that follows after that. So let’s take a 10-minute break. Thank you, again.

Ali Farhadi: Thank you, Matt.
