Inside Elon Musk's Colossus Supercomputer!

  • Elon Musk and his AI startup xAI have built the largest and most powerful artificial intelligence training supercomputer in the world.

  • Elon has named this beast Colossus.

  • It is equipped with the latest Nvidia GPU hardware, it's liquid-cooled with vast amounts of water, and is powered by giant Tesla Megapack batteries.

  • Elon believes that all of this combined will create the world's most powerful artificial intelligence, one that will literally solve the mysteries of the universe.

  • And what we see today is only the beginning.

  • This is what's inside Colossus.

  • The location is Memphis, Tennessee, in an industrial park southwest of the city center, on the bank of the mighty Mississippi River.

  • The building itself wasn't constructed by xAI; it was previously home to Electrolux, a Swedish appliance manufacturer.

  • So, if you've been wondering why Elon chose Memphis and not Austin, it basically just comes down to finding the right building in the right location to get this thing up and running as fast as possible.

  • Now, as unassuming as the exterior of Colossus might be, it's what's inside that counts.

  • And inside is the largest AI training cluster in the world.

  • Currently, that's over 100,000 Nvidia HGX H100 GPUs, connected with exabytes of data storage over a super-fast network.

  • Nvidia CEO Jensen Huang has said himself that Colossus is, quote, "easily the fastest supercomputer on the planet."

  • And it was all built to power Grok, an AI model that Elon Musk and xAI will evolve into something far more capable than a simple chatbot.

  • This is the breeding ground for artificial super intelligence.

  • The entire facility as we see it was built in just 122 days.

  • That is insane.

  • A more traditional supercomputer cluster would have just a quarter to a half as many GPUs as Colossus, but the construction of those traditional systems would take years from start to finish.

  • The training work happens in an area called the data hall.

  • xAI uses a configuration known as the raised-floor data hall, which splits the system into three levels.

  • Above is the power, below is the cooling, and in the middle is the GPU cluster.

  • There are four data halls inside Colossus, each with 25,000 GPUs plus storage and the fiber optic network that ties it all together.

  • Colossus uses water for liquid cooling.

  • Below the GPU cluster is a network of giant pipes that move vast amounts of water in and out of the facility.

  • Hot water from the servers is sent outside to a chiller, which lowers the temperature of the water by a few degrees before pumping it back in.

  • This doesn't necessarily need to be cold water though.

  • Without getting too deep into thermodynamics, just remember that heat always flows from hot to cold.

  • So as long as the temperature of the water is lower than the working GPUs, which get pretty hot, then the excess heat energy will be drawn into the water as it flows past and heat will be removed from the system.
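
  • To make that hot-to-cold idea concrete, here's a rough back-of-the-envelope sketch. The GPU count comes from the video; the per-GPU wattage and the water temperature rise are illustrative assumptions, not figures from xAI.

```python
# Back-of-the-envelope: how much water does it take to carry away the
# heat from 100,000 GPUs? Uses the heat-transfer relation Q = m_dot * c_p * dT.
GPU_COUNT = 100_000     # from the video
GPU_POWER_W = 700       # assumed heat output per H100-class GPU, watts
CP_WATER = 4186         # specific heat of water, J/(kg*K)
DELTA_T = 10            # assumed temperature rise of the water, K

heat_w = GPU_COUNT * GPU_POWER_W               # ~70 MW of heat to remove
flow_kg_per_s = heat_w / (CP_WATER * DELTA_T)  # required mass flow of water
flow_l_per_min = flow_kg_per_s * 60            # 1 kg of water is ~1 liter

print(f"Total GPU heat: {heat_w / 1e6:.0f} MW")
print(f"Water flow needed: {flow_l_per_min:,.0f} L/min (~{flow_kg_per_s:.0f} kg/s)")
```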

  • Here is what those GPU racks look like.

  • Each tray is loaded with 8 Nvidia H100 GPUs, the current state-of-the-art chip for AI training.

  • That will change in a relatively short amount of time, and Elon already has plans to upgrade Colossus to the Nvidia B200 chip when that becomes widely available, but for right now, there's no time to waste.

  • There are 8 of these racks built into one cabinet with a total of 64 GPU chips and 16 CPU chips in every vertical stack.
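
  • Those counts multiply out neatly; here's the quick arithmetic, using only the numbers stated in the video. The cabinet total is a derived estimate, not an official figure.

```python
# Quick arithmetic on the cluster layout described above, using only
# the counts stated in the video; the cabinet total is derived.
GPUS_PER_TRAY = 8
TRAYS_PER_CABINET = 8
CPUS_PER_TRAY = 2        # "two CPUs for every eight GPUs"
TOTAL_GPUS = 100_000
DATA_HALLS = 4

gpus_per_cabinet = GPUS_PER_TRAY * TRAYS_PER_CABINET  # 64
cpus_per_cabinet = CPUS_PER_TRAY * TRAYS_PER_CABINET  # 16
cabinets = -(-TOTAL_GPUS // gpus_per_cabinet)         # ceiling division: 1,563
gpus_per_hall = TOTAL_GPUS // DATA_HALLS              # 25,000

print(f"{gpus_per_cabinet} GPUs and {cpus_per_cabinet} CPUs per cabinet")
print(f"~{cabinets:,} cabinets across {DATA_HALLS} halls, "
      f"{gpus_per_hall:,} GPUs per hall")
```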

  • Each of the racks has its own independent water cooling system, with these small tubes that lead directly into the GPU housing, blue tubes for cold water delivery and red tubes for hot water extraction.

  • The beauty of these GPU racks, built for xAI by Supermicro, is that each one can be pulled individually for maintenance and is serviceable on the tray.

  • That means the entire cabinet doesn't need to be shut down and disassembled just to replace one chip.

  • The technician can simply pull the rack, perform the service right there on the tray and then slide it back in and get back to training.

  • This is unique in the AI industry.

  • Only xAI has a setup like this, and it will allow them to keep their downtime to an absolute minimum.

  • The same is true for the water system.

  • Each cabinet has its own cooling management unit at the base that's responsible for monitoring flow rate and temperature with an individual water pump that can easily be removed and serviced.

  • Now, the thing to keep in mind about gigantic computer systems like this is that things will break.

  • There's no way to avoid that, but having a plan to keep failures localized and problems solved as fast as possible, that is going to make an incredible difference in the overall productivity of the cluster.

  • On the back of each cabinet is a rear door heat exchanger that's basically just a really big fan that pulls air through the rack and facilitates the heat transfer from the hot chips to the cool water.

  • This replaces giant air conditioning units that are found in typical data centers and again keeps each of the racks self-contained.

  • Every fan is glowing with a colored light.

  • That's not for aesthetics.

  • It's a way for technicians to quickly identify failures.

  • A healthy fan glows with a blue light, while a bad fan switches to a red light, and technicians just replace those individual units as they go down.
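
  • As a minimal sketch of that blue/red convention, something like the following logic could drive the lights; the telemetry fields and RPM threshold are hypothetical, since the video only describes the colors.

```python
# Hypothetical sketch of the blue/red fan status lights: the video only
# says healthy fans glow blue and failed fans switch to red, so the
# telemetry fields and RPM threshold below are invented for illustration.
from dataclasses import dataclass

@dataclass
class FanTelemetry:
    rack_id: str
    rpm: int
    min_healthy_rpm: int = 3000  # assumed failure threshold

def status_light(fan: FanTelemetry) -> str:
    """Return the LED color a technician would see on the rack door."""
    return "blue" if fan.rpm >= fan.min_healthy_rpm else "red"

fans = [FanTelemetry("rack-0042", 5200), FanTelemetry("rack-0117", 800)]
for fan in fans:
    print(f"{fan.rack_id}: {status_light(fan)}")  # blue = healthy, red = replace
```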

  • While GPU chips do the heavy lifting for AI training, CPU chips are used for preparing the data and running the operating system.

  • There are two CPUs for every eight GPUs.

  • All of the data used to train Grok is held in a hard drive storage system.

  • Exabytes of text, images, and video that are fed into the training cluster.

  • One exabyte is a billion gigabytes and all of that data is handled by a super high-speed network system.

  • Data is moved around Colossus by Ethernet, but this is not anything like your home network.

  • The xAI network is powered by Nvidia BlueField-3 DPUs.

  • That's a data processing unit, and these chips can handle 400 gigabits per second through a network of fiber optic cables.

  • That's around 400 times faster than a very fast home internet connection.
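
  • Here's a quick sanity check on those storage and network figures; the 1 gigabit-per-second "fast home connection" baseline is an assumption for the comparison.

```python
# Sanity check on the storage and network figures: one exabyte is a
# billion gigabytes, and a BlueField-3 link moves 400 gigabits per second.
# The 1 Gb/s "fast home connection" baseline is an assumption.
EXABYTE_GB = 1_000_000_000  # gigabytes in one exabyte (decimal units)
LINK_GBPS = 400             # per-link throughput, gigabits per second
HOME_GBPS = 1               # assumed fast home connection, gigabits per second

exabyte_gbits = EXABYTE_GB * 8                 # bytes -> bits
seconds_per_link = exabyte_gbits / LINK_GBPS
days_per_link = seconds_per_link / 86_400      # ~231 days for a single link

print(f"One 400 Gb/s link is {LINK_GBPS / HOME_GBPS:.0f}x a fast home connection")
print(f"Moving one exabyte over a single link: ~{days_per_link:.0f} days, "
      "which is why the fabric runs thousands of links in parallel")
```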

  • Ethernet is necessary for scaling beyond the size of a traditional supercomputer system.

  • See, AI training requires a massive amount of storage that needs to be accessible by every server in the data center.

  • Now, this massive amount of equipment requires an equally massive amount of power.

  • And again, xAI has done something totally unique with their energy delivery.

  • They are using Tesla Energy.

  • Colossus doesn't use solar energy.

  • It's drawing power from the grid and traditional generators.

  • But there was a problem that xAI encountered when they started to bring their 100,000 GPU system online.

  • The tiny millisecond variations in power coming from the grid would create inconsistencies in the training process.

  • We are talking very small fluctuations, but at this giant scale, those will add up quickly.

  • So the solution was to bring in Tesla Megapack battery units.

  • So what they do now is pipe input power from the grid into the Megapacks, and then the batteries discharge directly into the training cluster.

  • This provides the super consistent direct energy required for the entire network to have the most efficient training session that is physically possible.
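
  • Here's a toy model of that buffering idea, assuming a steady 100-megawatt load and small random grid fluctuations; none of these numbers come from xAI, it just shows how a battery in the middle flattens the supply.

```python
# Toy model of the Megapack buffer: the grid feed is noisy, the battery
# absorbs or supplies the difference, and the cluster sees a flat load.
# The 100 MW demand and +/-2% fluctuation are illustrative assumptions.
import random

LOAD_MW = 100.0       # assumed steady demand from the training cluster
battery_mwh = 50.0    # assumed energy currently stored in the Megapacks
random.seed(42)       # deterministic demo output

for second in range(5):
    grid_mw = LOAD_MW * random.uniform(0.98, 1.02)  # noisy grid input
    surplus_mw = grid_mw - LOAD_MW                  # + charges, - discharges
    battery_mwh += surplus_mw / 3600                # megawatt-seconds -> MWh
    print(f"t={second}s grid={grid_mw:6.2f} MW -> cluster sees {LOAD_MW:.2f} MW "
          f"(battery absorbs {surplus_mw:+.2f} MW)")

print(f"Battery level after 5 s: {battery_mwh:.4f} MWh")
```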

  • This unique energy upgrade will become even more critical when xAI doubles the size of Colossus to over 200,000 H100 GPUs, something that Elon claims will happen within the next two months.

  • That is an insane rate of growth, and it's got the established AI giants scared.

  • There have been reports that OpenAI CEO Sam Altman has already told Microsoft executives that he's concerned Elon will soon overtake them in access to computing power.

  • Of course, this stuff ain't cheap.

  • It was just a few months ago that xAI raised $6 billion in venture capital funding, bringing the one-year-old company to a valuation of $24 billion.

  • That's a lot of money for a young company that only had one basic product on the market at the time.

  • But they did have the richest man in the world at the controls, so obviously that counts for a lot.

  • Now, we've just seen reports from the Wall Street Journal that Elon is already looking for a lot more money, enough to bring the value of xAI to $40 billion.

  • For a sense of scale, the industry giant OpenAI is currently valued at $157 billion.

  • Meanwhile, a smaller-scale operation like Perplexity, which makes a highly regarded AI search tool, is expected to soon hit a valuation of $8 billion.

  • As for Grok, the AI chatbot is continuing to rapidly evolve thanks to the new power provided by Colossus.

  • Just recently, Grok was upgraded to include vision capabilities, meaning that the AI can analyze and comprehend input from images alongside its existing text functions.

  • This new feature is integrated into the X social media platform for premium users.

  • Now, when you see an image in a post, you can click a button to send that image to Grok, where you can ask the AI any question you want about the content of that image.

  • Grok can analyze the image or provide additional context.

  • This is an important step for xAI on their path towards achieving artificial general intelligence.

  • That's a big buzz term right now; it basically just means an AI that can do pretty much anything.

  • Essentially, an artificial reproduction of the human mind and its incredible versatility.

  • We can write words, we can make music, we can solve complex problems, invent new things.

  • In theory, an artificial general intelligence would have all of the knowledge of the entire human race all concentrated into one super powerful computer brain, making it infinitely smarter than any human being.

  • Then the AGI can use that knowledge to learn even more, to discover the undiscoverable, solve the unsolvable, invent the uninventable.

  • According to Elon Musk, this is how we unlock the mysteries of the universe and the very nature of our own existence.

  • Or the AI will go rogue and kill us all.

  • But that's where Neuralink comes in, which is a whole other video that we've already made, so make sure you check one of those out next.
