
  • Since ChatGPT launched in 2022, large language models have progressed at a rapid pace, often developing unpredictable abilities.

  • When GPT-4 came out, it clearly felt like the chatbot had some level of understanding.

  • But do these abilities reflect actual understanding?

  • Or are the models simply repeating their training data, like so-called stochastic parrots?

  • Recently, researchers from Princeton and Google DeepMind created a mathematically provable argument for how language models develop so many skills.

  • And designed a method for testing them.

  • The results suggest that the largest models develop new skills in a way that hints at understanding.

  • Language models are basically trained to solve next word prediction tasks.

  • So they are given a lot of text, and at every step the model has some idea of what the next word is.

  • And that idea is expressed in terms of a probability.

  • And if the next word didn't get a high enough probability, a slight adjustment is made.

  • And after many, many, many trillions of such small adjustments, it learns to predict the next word.
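The loop described above can be sketched as a toy next-word predictor: one softmax distribution over a four-word vocabulary, nudged by the cross-entropy gradient whenever the correct next word is under-predicted. The vocabulary size, learning rate, and step count are all illustrative, not taken from any real model.

```python
import math

def softmax(logits):
    # Convert raw scores into a probability distribution over next words.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(logits, target, lr=0.5):
    # One "slight adjustment": nudge the scores so the correct next word
    # gets a bit more probability (softmax cross-entropy gradient step).
    probs = softmax(logits)
    return [x - lr * (p - (1.0 if i == target else 0.0))
            for i, (x, p) in enumerate(zip(logits, probs))]

# Four candidate next words; index 2 is the word that actually comes next.
logits = [0.0, 0.0, 0.0, 0.0]
for _ in range(100):
    logits = train_step(logits, target=2)

probs = softmax(logits)
print(probs[2])  # close to 1 after many small adjustments
```

After many such adjustments the model concentrates almost all of its probability on the correct next word, which is the whole training signal a language model gets.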

  • Over time, researchers have observed neural scaling laws, an empirical relationship between the performance of language models and the data used to train them.

  • As models improve, they minimize training loss, that is, they make fewer errors.

  • At certain scales, performance increases suddenly, and this jump produces new behaviors, a phenomenon called emergence.
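A neural scaling law of the kind mentioned here is an empirical power law relating loss to training scale. The exponent and constant below are made up purely to show the smooth, predictable decline the laws describe.

```python
# Hypothetical power-law scaling: loss falls smoothly as a power of
# training-set size. The constants (0.095, 2e10) are illustrative only.
def training_loss(num_tokens, alpha=0.095, scale=2e10):
    return (scale / num_tokens) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} tokens -> loss {training_loss(n):.3f}")
```

The curve itself is smooth; what emergence refers to is specific abilities appearing abruptly even while this aggregate loss declines gradually.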

  • There's no scientific explanation as to why that's happening.

  • So this phenomenon is not well understood.

  • The researchers wondered if GPT-4's sudden improvements could be explained by emergence.

  • Perhaps the model had learned compositional generalization, the ability to combine language skills.

  • This was some kind of meta-capability.

  • There was no mathematical framework to think about that.

  • And so we had to come up with a mathematical framework.

  • The researchers found their first hint by considering neural scaling laws.

  • So those scaling laws already suggest that there's some statistical phenomenon going on.

  • So random graphs have a long history in terms of thinking about statistical phenomena.

  • Random graphs are made of nodes which are connected by randomly generated edges.

  • The researchers built their mathematical model with bipartite graphs, which contain two types of nodes, one representing chunks of text and the other language skills.

  • The edges of the graph, the connections, correspond to which skills are needed to understand a given piece of text.
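The bipartite setup might be sketched as follows, with randomly generated edges standing in for which skills each text chunk requires. All node names and edge counts here are illustrative.

```python
import random

random.seed(0)

# Two kinds of nodes: language skills on one side, chunks of text on the other.
skills = [f"skill_{i}" for i in range(100)]
texts = [f"text_{j}" for j in range(500)]

# Randomly generated edges: each text chunk needs a handful of skills.
edges = {t: random.sample(skills, k=random.randint(1, 4)) for t in texts}

# Invert the map to ask: which text chunks exercise a given skill?
texts_per_skill = {}
for t, needed in edges.items():
    for s in needed:
        texts_per_skill.setdefault(s, []).append(t)

print(edges["text_0"])
```

Random graph theory then lets you reason statistically about how many skills a model must have mastered for it to handle the texts it predicts well.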

  • Now, the researchers needed to connect these bipartite graphs to actual language models.

  • But there was a problem.

  • We don't have access to the training data.

  • So if I'm evaluating that language model on my evaluation set, how do I know that the language model hasn't seen that data in its training corpus?

  • There was one crucial piece of information that the researchers could access.

  • Using that scaling law, we made a prediction: as models get better at predicting the next word, they will be able to combine more of the underlying skills.

  • According to random graph theory, every combination arises from a random sampling of possible skills.

  • If there are 100 skill nodes in the graph and you want to combine four skills, then there are about 100 to the fourth power or 100 million ways to combine them.
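The counting works out as stated: with ordered slots and repetition allowed there are 100^4 = 100 million combinations, and even the stricter count of distinct four-skill sets is still in the millions, far too many for all of them to have appeared in training.

```python
import math

num_skills = 100
k = 4

ordered = num_skills ** k             # ordered choices with repetition
unordered = math.comb(num_skills, k)  # distinct 4-skill sets

print(ordered)    # 100000000, the "about 100 million" in the transcript
print(unordered)  # 3921225 distinct combinations, still in the millions
```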

  • The researchers developed a test called SkillMix to evaluate if large language models can generalize to combinations of skills they likely hadn't seen before.

  • So the model is given a list of skills and a topic, and then it's supposed to create a piece of text on that topic using that list of skills.

  • For example, the researchers asked GPT-4 to generate a short text about sewing that exhibits spatial reasoning, self-serving bias and metaphor.

  • Here's what it answered.

  • In the labyrinth of sewing, I am the needle navigating between the intricate weaves.

  • Any errors are due to the faulty compass of low quality thread, not my skill.
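A SkillMix-style query such as the sewing example might be assembled like this; the exact prompt wording used by the researchers is an assumption here, not taken from the paper.

```python
# Hypothetical sketch of a SkillMix prompt builder (wording is assumed).
def skillmix_prompt(topic, skill_list):
    skills = ", ".join(skill_list)
    return (f"Write a short text about {topic} that naturally exhibits "
            f"the following language skills: {skills}.")

prompt = skillmix_prompt(
    "sewing", ["spatial reasoning", "self-serving bias", "metaphor"])
print(prompt)
```

The key property of the test is that the skills and topic are sampled at random, so the model almost certainly never saw that exact combination during training.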

  • We showed in our mathematical framework that as we scale up, the model is able to learn these skills.

  • You would see this increase in compositional capability as you scale up the models.

  • When given the SkillMix test, small language models struggled to combine just a couple of skills.

  • Medium-sized models could combine two skills more comfortably, but the largest models, like GPT-4, could combine five or six skills.

  • Because these models couldn't have seen all possible combinations of skills, the researchers argue that they must have developed compositional generalization through emergence.

  • Once a model has learned these language skills, it can generalize to random, unseen compositions of them.

  • What they showed was that their mathematical model had this property of compositionality, and that by itself gives this ability to extrapolate and compose new combinations from existing pieces.

  • And that is really the hallmark of novelty and the hallmark of creativity.

  • And so the argument is that large language models can move beyond being stochastic parrots.

  • The researchers are already working to extend the SkillMix evaluation to other domains as part of a larger effort to understand the capabilities of large language models.

  • Can we create an ecosystem of SkillMix, which is not just valid for language skills, but mathematical skills as well as coding skills?

  • So SkillMix was one example where we made a prediction by just mathematical thinking, and that was correct.

  • But there are all kinds of other phenomena that we probably are not aware of, and we need some understanding of that.

  • Quantum systems are some of the most complex structures in nature.

  • To model them, you need to compute the system's Hamiltonian, an equation that describes how particles interact locally to produce its physical properties.

  • But entanglement spreads information across the system, correlating particles that are far apart.

  • This makes computing Hamiltonians exceptionally difficult.

  • You have a giant system of atoms.

  • It's a very big problem to learn all those parameters.

  • You could never hope to write down the Hamiltonian.

  • If you ever even tried to write it down, the game would be over and you wouldn't have an efficient algorithm.

  • People were actually trying to prove that efficient algorithms were impossible in this regime.

  • But a team of computer scientists from MIT and UC Berkeley cracked the problem.

  • They created an algorithm that can produce the Hamiltonian of a quantum system at any constant temperature.

  • The results could have big implications for the future of quantum computing and understanding exotic quantum behavior.

  • So when we have systems that behave and do interesting things like superfluidity and superconductivity, you want to understand the building blocks and how they fit together to create those properties that you want to harness for technological reasons.

  • So we're trying to learn this object, which is the Hamiltonian.

  • It's defined by a small set of parameters.

  • And what we're trying to do is learn these parameters.

  • What we have access to is these experimental measurements of the quantum system.

  • So the question then becomes, can you learn a description of the system through experiments?

  • Previous efforts in Hamiltonian learning produced algorithms that worked only at high temperatures.

  • But at high temperatures these systems are largely classical, so there's no entanglement between the particles.

  • The MIT and Berkeley team set their sights on the low temperature quantum regimes.

  • I wanted to understand what kinds of strategies worked algorithmically on the classical side and what could be manifestations of those strategies on the quantum side.

  • Once you look at the problem in the right way and you bring to bear these tools, it turns out that you can really make progress on these problems.

  • First, the team ported over a tool from classical machine learning called polynomial optimization.

  • This allowed them to approximate the measurements of their system as a family of polynomial equations.

  • We were like, maybe we can write Hamiltonian learning as a polynomial optimization problem.

  • And if we manage to do this, maybe we can try to optimize this polynomial system efficiently.

  • So all of a sudden, it's in a domain that's more familiar and you have a bunch of algorithmic tools at your disposal.
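As a toy version of that reformulation, imagine a few measured quantities that depend on the unknown couplings through known polynomial formulas; recovering the couplings then amounts to solving a small polynomial system. The observables and their formulas below are invented for illustration, and the brute-force grid search stands in for a real structured solver.

```python
# Invented observables: each depends on the unknown couplings (J, h)
# through a known polynomial expression.
def measurements(J, h):
    m1 = 2 * J + h      # an energy-like quantity, say
    m2 = J * h          # a correlation-like quantity, say
    m3 = 3 * J - h      # a third quantity to pin the solution down
    return m1, m2, m3

true_J, true_h = 1.0, 0.5
m1, m2, m3 = measurements(true_J, true_h)

# Solve the polynomial system by brute-force grid search; a real solver
# would exploit the structure of the equations instead.
def residual(J, h):
    return ((2 * J + h - m1) ** 2 + (J * h - m2) ** 2
            + (3 * J - h - m3) ** 2)

best = min(((J / 100, h / 100) for J in range(301) for h in range(301)),
           key=lambda p: residual(*p))
print(best)  # recovers the true couplings (1.0, 0.5)
```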

  • You can't solve general polynomial systems efficiently, but what you can do is solve a relaxation of them.

  • We use something called the sum of squares relaxation to actually solve this polynomial system.

  • Starting with a challenging polynomial optimization problem, the team used the sum of squares method to relax its constraints.

  • This expanded the equations to a larger allowable set of solutions, effectively converting it from a hard problem to an easier one.

  • The real trick is to argue that when you've expanded the set of solutions, you can still find a good solution inside it.

  • You need a procedure to take that approximate relaxed solution and round it back into an actual solution to the problem you really cared about.
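The relax-and-round pattern described here can be illustrated on a much simpler problem than sum of squares: minimizing a quadratic over the corners {-1, +1}^3 by relaxing to the solid cube, optimizing there, and rounding back. This is an analogy for the strategy, not the team's actual algorithm.

```python
import itertools

# A hard discrete problem: minimize f(x) = sum_ij Q[i][j] x_i x_j
# with each x_i restricted to {-1, +1}.
Q = [[0, 1, -2],
     [1, 0, 1],
     [-2, 1, 0]]

def f(x):
    return sum(Q[i][j] * x[i] * x[j] for i in range(3) for j in range(3))

# Step 1 (relax): let each x_i range over the interval [-1, 1] and run
# projected gradient descent on the now-continuous problem.
x = [0.1, -0.2, 0.3]
for _ in range(200):
    grad = [2 * sum(Q[i][j] * x[j] for j in range(3)) for i in range(3)]
    x = [max(-1.0, min(1.0, xi - 0.05 * gi)) for xi, gi in zip(x, grad)]

# Step 2 (round): snap the relaxed solution back to a genuine +/-1 point.
rounded = [1 if xi >= 0 else -1 for xi in x]

# Sanity check against brute force over all 8 corners.
best = min(itertools.product([-1, 1], repeat=3), key=f)
print(f(rounded), f(best))
```

The hard part in the actual proof, as the researchers note, is guaranteeing that this kind of rounding still lands on a good solution; in this toy case it happens to recover an optimal corner.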

  • So that's really where the coolest parts of the proof happen.

  • The researchers proved that the sum of squares relaxation could solve their learning problem, resulting in the first efficient Hamiltonian learning algorithm in a low-temperature regime.

  • So we first make some set of measurements of the macroscopic properties of the system, and then we use these measurements to set up a system of polynomial equations.

  • And then we solve the system of polynomial equations.

  • So the output is a description of the local interactions in the system.

  • There are actually some very interesting learning problems that are at the heart of understanding quantum systems.

  • And to me, the most exciting part was really the connection between two different worlds.

  • This combination of tools is really interesting and something I haven't seen before.

  • I'm hoping it's like a useful perspective with which to tackle other questions as well.

  • I think we find ourselves at the start of this new bridge between theoretical computer science and quantum mechanics.
