

  • So, a little bit about me.

  • Head of AI platform at Ramp.

  • I've been working on LLMs for four years, which is, well, kind of a long time, I guess, in LLM land.

  • Everything started happening really when ChatGPT came out.

  • So, I was trying to build what people would now call an AI agent company.

  • Back then, we were just doing customer support.

  • We were trying to make our chatbot smarter, and trying to figure out what models or what tech to use to get them to respond to customers better.

  • And we were messing with GPT-2 on support, and the models were so frustratingly stupid, the context windows were small, they weren't very good at reasoning, and it was just incredibly annoying.

  • And we just wrote lots of code around these models to get them to work, at least somewhat reliably.

  • And along the way, as models got smarter, we kind of had to delete more of that code.

  • And we ended up seeing a lot of patterns in what code needs to get deleted, and how to build agents in ways that will scale with more intelligence.

  • And clearly, we're going to continue to get a lot more intelligence.

  • And I just wanted to maybe talk about a single idea throughout the talk through various examples.

  • We'll do some setup, but I'll also have a bunch of demos to kind of drive home the point, and maybe I can convince you that there's a certain way of building agents that's slightly better than other ways.

  • I also built a structure extraction library called JSONformer.

  • I think it was the first one.

  • I'm not fully sure, but timing-wise, it was before all the other major ones.

  • And that was also scaffolding around a model.

  • Models were too stupid to output JSON, and we were just really begging and pleading with them, and forcing them to act in the ways we wanted.

  • So as I said earlier, I just have one core agenda item here, which is I want to convey one idea.

  • We'll start off with the essay, The Bitter Lesson, which all of you have probably read, and just quickly go through what it is.

  • We'll go through a production agent we have at Ramp, and how it works, and three different ways of architecting it.

  • And then I have a demo to really push maybe how we all think about how software and backends and things will work in the future.

  • So very simply, the idea is just that systems that scale with compute beat systems that don't.

  • So say there are two systems, and without any effort, one of them can just think more or use more compute in some way.

  • That system tends to beat systems that are rigid and fixed and just deterministic.

  • So from that idea, it's pretty clear, like, if you're building systems, you might as well build systems that improve with more compute.

  • And this seems like a pretty obvious conclusion from The Bitter Lesson.

  • Taking it a step further, why is this true?

  • It's because exponentials are rare.

  • Like, they basically don't exist; most things in the world aren't exponential.

  • So when you find one, you should just hop on, strap in, and take the free ride.

  • And you probably shouldn't try too hard.

  • And there's a lot of examples from history that kind of reflect this.

  • So for chess and Go and computer vision, Atari games, like, people have tried to build lots of systems and written a lot of code.

  • And my way of thinking about rigid systems is, like, spending a lot of weekends writing very clever, well-abstracted software, maybe trying to synthesize human reasoning and thought processes into features, and then using them in clever ways to approximate how a human would think.

  • And if you actually fix the amount of compute, that approach will win.

  • But it turns out that if you end up scaling up how much search you're doing, the general method always ends up winning, in all of these cases: Atari, Go, and computer vision.

  • A little bit about Ramp.

  • So Ramp is a finance platform that helps businesses manage expenses, payments, procurement, travel, and bookkeeping more efficiently.

  • And we have a ton of AI across the product, so we automate a lot of the boring stuff that finance teams and employees do: submitting expense reports, booking flights and hotels, submitting reimbursements, all of that.

  • And so a lot of the work behind the scenes is just interacting with other systems, like legacy systems, and helping employees get their work done faster.

  • So let's actually talk through one of the systems we have today at Ramp, and maybe talk through the different versions of the system and how it developed over time.

  • So we're going to talk about something called a switching report.

  • It's a very simple agent.

  • All it needs to do is take in a CSV in an arbitrary format, so the schema could seriously be anything from the internet.

  • These CSVs come from third-party card providers.

  • So when people onboard to RAMP, we want to give them a nice checklist and say, hey, here are all the transactions you have on other platforms, and we want to help you move them over.

  • And the more transactions that come onto Ramp, the more we can help you, the more you'll use our software, and the more everyone benefits.

  • And so the switching report is just really a checklist.

  • But to read people's CSV transactions, we need to understand those, and other platforms have all these kinds of crazy schemas.

  • And so the problem here is: for an arbitrary CSV, how can we support parsing it into some format that we understand?

  • So let's just start with the simple approach, right?

  • It's like, let's just take the 50 most common third-party card vendors, and let's just manually write code for all of them.

  • Now, obviously, like, this will just work.

  • It is some work, not a lot of work, but you still have to maybe go to 50 different platforms and download their CSVs, see what schemas they have, and then write code.

  • Maybe one day they decide to change their format and your thing will break, but that's okay: you'll get paged, and you can wake up and go fix it.
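A minimal sketch of what this fully deterministic approach looks like. The vendor names, column headings, and target schema here are hypothetical, not Ramp's actual code:

```python
import pandas as pd

def parse_vendor_a(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical vendor A exports "Posted Date", "Merchant", "Amount USD".
    return pd.DataFrame({
        "date": pd.to_datetime(df["Posted Date"]),
        "merchant": df["Merchant"],
        "amount": df["Amount USD"].astype(float),
    })

def parse_vendor_b(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical vendor B uses different headers and stores cents, not dollars.
    return pd.DataFrame({
        "date": pd.to_datetime(df["txn_date"]),
        "merchant": df["description"],
        "amount": df["amount_cents"].astype(float) / 100,
    })

# ...roughly 50 of these, one per card provider; each one breaks the day
# that provider changes its export format.
PARSERS = {"vendor_a": parse_vendor_a, "vendor_b": parse_vendor_b}

def parse_switching_report(vendor: str, csv_path: str) -> pd.DataFrame:
    return PARSERS[vendor](pd.read_csv(csv_path))
```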

  • So let's maybe introduce some LLMs in here.

  • So, going from the over-engineered approach where you ended up writing 100,000 lines of code, maybe we want a more general system.

  • So let's introduce a little bit of LLMs, a little bit of AI in here.

  • And so in the deterministic flow, in classical scripting land, let's add some calls to OpenAI, or maybe you have an embedding model and you want to do semantic similarity or something like that.

  • So then let's just take every column in the CSV that comes in.

  • Let's try to classify what kind of column it is.

  • Is it a date?

  • Is it a transaction?

  • Is it a transaction amount?

  • Is it a merchant name?

  • Or is it the user's name?

  • And then we map it over, and we can probably end up with a schema that we're happy with.

  • Again, most of the compute is running in classical land.

  • Some of it is running in fuzzy LLM land, but this is starting to look like a more general system.
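A rough sketch of this constrained approach. The target schema is hypothetical and the per-column classification is shown as an illustrative OpenAI chat call (model name and prompt are placeholders); everything else stays in classical land:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()
TARGET_COLUMNS = ["date", "merchant", "amount", "employee_name"]  # illustrative schema

def classify_column(header: str, samples: list[str]) -> str:
    """Fuzzy step: ask the model which target field this source column maps to."""
    prompt = (
        "You map columns from arbitrary card-provider CSVs onto one of "
        f"{TARGET_COLUMNS} or 'other'.\n"
        f"Column header: {header!r}\nSample values: {samples}\n"
        "Reply with the label only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def normalize_csv(csv_path: str) -> pd.DataFrame:
    """Classical step: everything except the per-column label is deterministic."""
    df = pd.read_csv(csv_path)
    mapping = {}
    for col in df.columns:
        label = classify_column(col, df[col].astype(str).head(5).tolist())
        if label in TARGET_COLUMNS and label not in mapping.values():
            mapping[col] = label
    return df[list(mapping)].rename(columns=mapping)
```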

  • Let's maybe go with a different approach, where we just go all the way.

  • Let's just say we're just going to literally give the CSV to an LLM and say, you have a code interpreter, so you can write whatever code you want, pandas or all the faster Rust-based ones.

  • You have all these Python packages.

  • You're allowed to look at the head of the CSV, the tail, whichever rows you want.

  • And then I just want you to give me a CSV with this specific format.

  • Here's a unit test.

  • Here's a verifier that you can use to tell if it's working or not.

  • Turns out this approach actually doesn't work.

  • Like, we tried it.

  • At least, not if you only run it once.

  • But instead, if you run it 50 times in parallel, it's actually very likely that it works really well and generalizes across a ton of different formats.
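A sketch of this third approach under stated assumptions: ask_llm_for_parser, run_in_sandbox, and verify are hypothetical stand-ins for the LLM call, the code interpreter, and the unit-test verifier described above. The parallelism is what trades extra compute for reliability:

```python
import concurrent.futures

N_ATTEMPTS = 50

def attempt(csv_path: str, target_schema: str) -> str | None:
    # Fuzzy step: the LLM writes pandas code tailored to this particular CSV.
    code = ask_llm_for_parser(csv_path, target_schema)   # hypothetical helper
    # Classical step: run that code in a sandboxed code interpreter.
    output_csv = run_in_sandbox(code, csv_path)          # hypothetical helper
    # The verifier / unit test decides whether this attempt counts.
    return output_csv if verify(output_csv, target_schema) else None  # hypothetical helper

def convert(csv_path: str, target_schema: str) -> str:
    # Run many attempts in parallel and accept the first one that passes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=N_ATTEMPTS) as pool:
        futures = [pool.submit(attempt, csv_path, target_schema) for _ in range(N_ATTEMPTS)]
        for fut in concurrent.futures.as_completed(futures):
            result = fut.result()
            if result is not None:
                return result
    raise RuntimeError("no attempt passed the verifier")
```

The first attempt that passes the verifier wins; if none of the 50 do, the conversion fails loudly instead of silently producing a bad report.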

  • The amount of compute here is actually probably like, what is that number?

  • 10,000 times more than the first approach we came up with.

  • But again, like, what is truly scarce in the world is engineer time.

  • Maybe not in a while, but at least today.

  • And we'd rather have a system that works really well. Even with 10,000 times more compute, it will probably cost less than a dollar, and every failed CSV, every transaction that doesn't get switched over, will cost Ramp way more money than whatever we spend on this exact architecture.

  • So this is a very specific example.

  • So how does this apply to the agents that we all build, and maybe the systems we're all working on?

  • Turns out, something like this actually generalizes.

  • So if you look at the three approaches, and let's assume, like, the black arrow is just classical compute, and then the blue arrows are fuzzy land.

  • So it goes into a neural net, all sorts of weird matrix multiplications happen, then we're in latent space, it gets all alien-intelligence-y, and then it comes back to classical land.

  • First approach, there was no AI.

  • We just wrote code, and it just worked, mostly.

  • In the constrained agent, the second approach, we broke into fuzzy land from classical land when we decided we wanted similarity scores or something like that.

  • And then the third approach is actually flipped, where the LLM decides it needs to go into classical land.

  • So it writes some code, it writes some pandas or Python code, and it decides to break into classical land when it needs to, but most of the compute is fuzzy.

  • Actually, this is maybe not the most accurate graph, because we proposed running it 50 times.

  • It more so looks like this.

  • But if you look at backends in general, they're all request-response.

  • So some sort of message is going in.

  • It's like a POST request, or GET, or UPDATE, or READ, any sort of CRUD operation.

  • And we're really just asking the backend to take this piece of information, do whatever you must with it, run whatever mutations you want, and return me a response.

  • And almost all the systems we've built so far, as humanity, I guess, look like the first one.

  • But more people are using OpenAI.

  • OpenAI makes billions of dollars, and probably a lot of the systems that use them look like number two, where just regular programming languages are calling into OpenAI servers, and we're running some fuzzy compute.

  • What we're seeing is that more and more parts of the Ramp codebase are moving to the third approach, because it just tends to work well.

  • Because for all the blue arrows, if you did nothing, absolutely nothing, if we all went on vacation for the next year, the big labs would still be working and spending billions of dollars making those models better.

  • So the blue arrows will get better.

  • And so how many blue arrows you're using in your codebase will directly help your company, without much effort from your end.

  • So this is what I was saying: The Bitter Lesson is just so powerful, and exponential trends are so powerful, that you can just hitch a ride.

  • Let's take this idea, like, further.

  • Let's actually go all the way and do something crazy.

  • On the left, you'll see a traditional web app.

  • So usually the way it works is you open gmail.com, and some static file server at Google is sending you a bunch of JavaScript and HTML and CSS.

  • The browser renders that and shows you some nice UI, nice HTML that's user-friendly.

  • Maybe you see some emails, maybe you click on one of them.

  • The frontend makes a request to the backend; the frontend asks the backend, give me the content for the email with whatever ID it is, and then the backend, which has a database, gives you the result.

  • And maybe they used codegen, maybe they used all the codegen tools available, to make Gmail.

  • So there, the LLM only worked while the software engineer was writing the code, but once the code is written and pushed to production, it's just classical compute.

  • And on the right, I'm actually proposing a different model, which is that the backend is the LLM.

  • It's not codegen; the LLM is doing the execution, it is the backend.

  • So the LLM has access to tools like a code interpreter, and through that it can potentially make network requests, and it also has access to a DB.
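A minimal sketch of what "the backend is the LLM" could look like. The system prompt, the markdown-rendering convention, and the single chat-completion call are assumptions for illustration, not the demo's actual implementation:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are simulating a Gmail client. You have access to the user's Gmail "
    "token and a code interpreter for making API requests. On every turn, "
    "render the next page as markdown."
)

# The whole app is one long chat session; the conversation *is* the backend state.
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

def handle_ui_event(event: dict) -> str:
    """The frontend forwards raw UI events (e.g. {'click': 'email_123'});
    the LLM decides what to fetch and re-renders the page, much like a
    web framework's request/response cycle."""
    messages.append({"role": "user", "content": f"UI event: {event}"})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)  # illustrative model
    page_markdown = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": page_markdown})
    return page_markdown  # the browser renders this markdown as the new page
```

Every interaction just appends to the chat and asks the model to re-render; there is no hand-written route handler per page.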

  • So I have a mail client actually that works with this principle, and this is my test email.

  • So if you all want to see any emails you send to me in a minute or so, you can send me an email.

  • But please be nice.

  • All right, I think it's probably enough time.

  • So I'm going to go over here.

  • So we have this email client.

  • I mean, we still have some regular JavaScript to hook the LLM into the browser, but when I do log in, I'm going to use my... Okay, we're good, we're good.

  • All right, we're saved, I think.

  • Thankfully, I have a room full of engineers.

  • So there's a dot, but the reason it's so slow is that when I open this page and log into Gmail, the Gmail token is actually being sent to an LLM.

  • We're literally saying, this is an LLM chat session.

  • What we're saying is like, hey, LLM, you're actually simulating a Gmail client.

  • You have access to all the emails.

  • You have access to Rahul's Gmail token and a code interpreter, so just render some UI based on what you think is reasonable for the homepage of a Gmail client.

  • And so it looks like it decided to render as markdown.

  • I think we actually tell it to render as markdown.

  • And it's rendering all the emails that a bunch of people sent me from here.

  • So it looks like it says, hello from California.

  • So I'm going to click on that.

  • When I click on that, we're actually not running any like back-end calls or anything like that.

  • We're just telling the LLM that the user clicked on that piece of text.

  • In this case, it was hello from California and the ID number.

  • The LLM now has the information on what the user clicked on, and it has the chance to re-render the page much like a web framework would.

  • So again, it goes back.

  • It probably hits a GET request for that specific email and pulls the body.

  • What is this agent going to do?

  • I'm watching you live.

  • So the LLM just decided this is the appropriate UI for a Gmail client.

  • Also, there are other features the LLM thought were reasonable.

  • So it looks like I can mark it as unread or delete the email if I want to.

  • Maybe I'll delete it because it's not that good of an email.

  • I'm sorry.

  • It is very slow because we're doing a lot.

  • But I wanted to push you in this direction because this kind of software barely works.

  • Dang.

  • I guess not.

  • Also, I clicked on it, and now the LLM is trying to do something with me clicking on it.

  • But anyway, this kind of software barely works today, and it doesn't mean it won't work in the future.

  • But with exponential trends, things like this might just take off.

  • So I just wanted to push you all to think in this direction.

  • Yeah.

  • Will more software look like this?

  • I don't know.

  • We'll see.

  • Thank you. Thank you.

