  • How many calculations do you think your graphics card performs every second while running video games with incredibly realistic graphics?

  • Maybe a hundred million?

  • Well, a hundred million calculations a second is what's required to run Mario 64 from 1996.

  • We need more power.

  • Maybe a hundred billion calculations a second?

  • Well, then you would have a computer that could run Minecraft back in 2011.

  • In order to run the most realistic video games, such as Cyberpunk 2077, you need a graphics card that can perform around 36 trillion calculations a second.

  • This is an unimaginably large number, so let's take a second to try to conceptualize it.

  • Imagine doing a long multiplication problem once every second.

  • Now let's say everyone on the planet does a similar type of calculation, but with different numbers.

  • To reach the equivalent computational power of this graphics card and its 36 trillion calculations a second, we would need about 4,400 Earths filled with people.

  • All working together and completing one calculation each every second.
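
A quick back-of-the-envelope check of that figure, as a minimal Python sketch. The figure of roughly 8.1 billion people per Earth is my assumption, not a number stated in the video.

```python
# Rough check of the "about 4,400 Earths" claim.
gpu_rate = 36e12           # calculations per second (36 trillion)
people_per_earth = 8.1e9   # assumed world population, one calculation per person per second

print(round(gpu_rate / people_per_earth))   # ~4444 Earths
```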

  • It's rather mind-boggling to think that a device can manage all these calculations, so in this video we'll see how graphics cards work in two parts.

  • First, we'll open up this graphics card and explore the different components inside, as well as the physical design and architecture of the GPU, or graphics processing unit.

  • Second, we'll explore the computational architecture and see how GPUs process mountains of data and why they're ideal for running video game graphics, Bitcoin mining, neural networks, and AI.

  • So, stick around and let's jump right in.

  • This video is sponsored by Micron, which manufactures the graphics memory inside this graphics card.

  • Before we dive into all the parts of the GPU, let's first understand the differences between GPUs and CPUs.

  • Inside this graphics card, the graphics processing unit, or GPU, has over 10,000 cores.

  • However, when we look at the CPU, or central processing unit that's mounted to the motherboard, we find an integrated circuit or chip with only 24 cores.

  • So, which one is more powerful?

  • 10,000 is a lot more than 24, so you would think the GPU is more powerful.

  • However, it's more complicated than that.

  • A useful analogy is to think of a GPU as a massive cargo ship and a CPU as a jumbo jet airplane.

  • The cargo capacity represents the number of calculations and the amount of data that can be processed, and the speed of the ship or airplane represents the rate at which those calculations and data are processed.

  • Essentially, it's a tradeoff between a massive number of calculations that are executed at a slower rate versus a few calculations that can be performed at a much faster rate.

  • Another key difference is that airplanes are a lot more flexible since they can carry passengers, packages, or containers and can take off and land at any one of tens of thousands of airports.

  • Likewise, CPUs are flexible in that they can run a variety of programs and instructions.

  • However, giant cargo ships carry only containers with bulk contents inside and are limited to traveling between ports.

  • Similarly, GPUs are a lot less flexible than CPUs and can only run simple instructions like basic arithmetic.

  • Additionally, GPUs can't run operating systems or interface with input devices or networks.

  • This analogy isn't perfect, but it helps to answer the question of which is faster, a CPU or a GPU?

  • Essentially, if you want to perform a set of calculations across mountains of data, then a GPU will be faster at completing the task.

  • However, if you have a lot less data that needs to be evaluated quickly, then a CPU will be faster.

  • Furthermore, if you need to run an operating system or support network connections in a wide range of different applications and hardware, then you'll want a CPU.

  • We're planning a separate video on CPU architecture, so make sure to subscribe so you don't miss it.

  • But let's now dive into this graphics card and see how it works.

  • In the center of this graphics card is the printed circuit board, or PCB, with all the various components mounted on it.

  • And we'll start by exploring the brains, which is the graphics processing unit, or GPU.

  • When we open it up, we find a large chip, or die, named GA102, built from 28.3 billion transistors.

  • The majority of the area of the chip is taken up by the processing cores, which have a hierarchical organization.

  • Specifically, the chip is divided into seven graphics processing clusters, or GPCs, and within each processing cluster are 12 streaming multiprocessors, or SMs.

  • Next, inside each of these streaming multiprocessors are four warps and one ray tracing core, and then inside each warp are 32 CUDA cores, also called shading cores, and one tensor core.

  • Across the entire GPU are 10,752 CUDA cores, 336 tensor cores, and 84 ray tracing cores.
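
Those totals follow directly from the hierarchy just described; here is a small sketch that recomputes them from the per-cluster numbers above.

```python
# Recomputing the GA102 totals from the hierarchy described above.
gpcs = 7                       # graphics processing clusters
sms_per_gpc = 12               # streaming multiprocessors per GPC
warps_per_sm = 4
cuda_per_warp, tensor_per_warp = 32, 1
rt_per_sm = 1                  # one ray tracing core per SM

sms = gpcs * sms_per_gpc                         # 84
print(sms * warps_per_sm * cuda_per_warp)        # 10752 CUDA cores
print(sms * warps_per_sm * tensor_per_warp)      # 336 tensor cores
print(sms * rt_per_sm)                           # 84 ray tracing cores
```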

  • These three types of cores execute all the calculations of the GPU, and each has a different function.

  • CUDA cores can be thought of as simple binary calculators, with an addition button, a multiply button, and a few others, and are used the most when running video games.

  • Tensor cores are matrix multiplication and addition calculators, and are used for geometric transformations and working with neural networks and AI.

  • And ray tracing cores are the largest, but the fewest, and are used to execute ray tracing algorithms.

  • Now that we understand the computational resources inside this chip, one rather interesting fact is that the 3080, 3090, 3080 Ti, and 3090 Ti graphics cards all use the same GA102 chip design for their GPU.

  • This might be counterintuitive, because they have different prices and were released in different years, but it's true.

  • So why is this?

  • Well, during the manufacturing process, sometimes patterning errors, dust particles, or other manufacturing issues cause damage and create defective areas of the circuit.

  • Instead of throwing out the entire chip because of a small defect, engineers find the defective region and permanently isolate and deactivate the nearby circuitry.

  • By having a GPU with a highly repetitive design, a small defect in one core only damages that particular streaming multiprocessor circuit, and doesn't affect the other areas of the chip.

  • As a result, these chips are tested and categorized, or binned, according to the number of defects.

  • The 3090 Ti graphics cards have flawless GA102 chips, with all 10,752 CUDA cores working properly.

  • The 3090 has 10,496 cores working.

  • The 3080 Ti has 10,240, and the 3080 has 8,704 CUDA cores working, which is equivalent to having 16 damaged and deactivated streaming multiprocessors.
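
The "16 deactivated streaming multiprocessors" figure follows from the core counts above, since each SM holds 4 warps × 32 = 128 CUDA cores; a minimal check:

```python
# How many SMs are disabled on each card, given 128 CUDA cores per SM?
cores_per_sm = 4 * 32
full_ga102 = 10_752

for card, cores in {"3090 Ti": 10_752, "3090": 10_496,
                    "3080 Ti": 10_240, "3080": 8_704}.items():
    print(card, (full_ga102 - cores) // cores_per_sm, "SMs disabled")
# 3090 Ti 0, 3090 2, 3080 Ti 4, 3080 16
```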

  • Additionally, different graphics cards differ by their maximum clock speed, and the quantity and generation of graphics memory that supports the GPU, which we'll explore in a little bit.

  • Because we've been focusing on the physical architecture of this GA102 GPU chip, let's zoom into one of these CUDA cores and see what it looks like.

  • Inside this simple calculator is a layout of approximately 410,000 transistors.

  • This section of 50,000 transistors performs the operation of A times B plus C, which is called fused multiply and add, or FMA, and is the most common operation performed by graphics cards.
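
Fused multiply-add is simply A × B + C. The sketch below only shows the arithmetic in Python; on the GPU the multiply and add are fused into a single hardware instruction, and NumPy is used here just to hint at the same operation being applied to many values at once.

```python
import numpy as np

def fma(a, b, c):
    # One multiply and one add; a CUDA core performs this as a single fused instruction.
    return a * b + c

print(fma(2.0, 3.0, 4.0))             # 10.0  (a single scalar FMA)

a = np.array([0.0, 1.0, 2.0, 3.0])
print(fma(a, 2.0, 1.0))               # [1. 3. 5. 7.]  (same operation across many values)
```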

  • Half of the CUDA cores execute FMA using 32-bit floating-point numbers, which is essentially scientific notation, and the other half of the cores use either 32-bit integers or 32-bit floating-point numbers.

  • Other sections of this core accommodate negative numbers and perform other simple functions like bit-shifting and bit-masking, as well as collecting and queuing the incoming instructions and operands, and then accumulating and outputting the results.

  • As a result, this single core is just a simple calculator with a limited number of functions.

  • This calculator completes one multiply and one add operation each clock cycle, and therefore with this 3090 graphics card and its 10,496 cores and 1.7 GHz clock, we get 35.6 trillion calculations a second.
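
That 35.6 trillion figure can be reproduced from the numbers in that sentence; with the clock rounded to 1.7 GHz the result comes out marginally higher, and the small gap is just that rounding.

```python
# Each CUDA core finishes one multiply and one add per clock cycle (2 operations).
cuda_cores = 10_496
clock_hz = 1.7e9        # ~1.7 GHz
ops_per_cycle = 2

print(cuda_cores * clock_hz * ops_per_cycle)   # ~3.57e13, roughly the quoted 35.6 trillion per second
```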

  • However, if you're wondering how the GPU handles more complicated operations like division, square root, and trigonometric functions, well, these calculator operations are performed by the special function units, which are far fewer, as only four of them can be found in each streaming multiprocessor.

  • Now that we have an understanding of what's inside a single core, let's zoom out and take a look at the other sections of the GA102 chip.

  • Around the edge, we find 12 graphics memory controllers, the NVLink controllers, and the PCIe interface.

  • On the bottom is a 6MB Level 2 SRAM memory cache, and here's the Gigathread engine, which manages all the graphics processing clusters and streaming multiprocessors inside.

  • Now that we've explored this GA102 GPU's physical architecture, let's zoom out and take a look at the other parts inside the graphics card.

  • On this side are the various ports for the displays to be plugged into.

  • On the other side is the incoming 12V power connector, and then here are the PCIe pins that plug into the motherboard.

  • On the PCB, the majority of the smaller components constitute the voltage regulator module, which takes the incoming 12V and converts it to 1.1V and supplies hundreds of watts of power to the GPU.

  • Because all this power heats up the GPU, most of the weight of the graphics card is in the form of a heat sink, with four heat pipes that carry heat from the GPU and memory chips to the radiator fins where fans then help to remove the heat.

  • Perhaps some of the most important components, aside from the GPU, are the 24GB of graphics memory chips, which are technically called GDDR6X SDRAM, and were manufactured by Micron, which is the sponsor of this video.

  • Whenever you start up a video game or wait for a loading screen, the time it takes to load is mostly spent moving all the 3D models of a particular scene or environment from the solid-state drive into these graphics memory chips.

  • As mentioned earlier, the GPU has a small amount of data storage in its 6MB shared Level 2 cache, which can hold the equivalent of about this much of the video game's environment.

  • Therefore, in order to render a video game, different chunks of scene are continuously being transferred between the graphics memory and the GPU.

  • Because the cores are constantly performing tens of trillions of calculations a second, GPUs are data-hungry machines and need to be continuously fed terabytes upon terabytes of data.

  • And thus, these graphics memory chips are designed kind of like multiple cranes loading a cargo ship at the same time.

  • Specifically, these 24 chips transfer a combined 384 bits at a time, which is called the bus width, and the total data that can be transferred, or the bandwidth, is about 1.15 terabytes a second.

  • In contrast, the sticks of DRAM that support the CPU only have a 64-bit bus width and a maximum bandwidth closer to 64 gigabytes a second.
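
Bandwidth is just bus width times the per-pin data rate. Working backwards from the two bandwidth figures quoted above gives the per-pin rates they imply; this is derived arithmetic, not a specification stated in the video.

```python
# bandwidth (bytes/s) = bus_width_bits / 8 * per_pin_rate (bits/s)
def implied_pin_rate_gbps(bandwidth_bytes_per_s, bus_width_bits):
    return bandwidth_bytes_per_s * 8 / bus_width_bits / 1e9

print(implied_pin_rate_gbps(1.15e12, 384))   # ~24 Gbit/s per pin for the graphics memory
print(implied_pin_rate_gbps(64e9, 64))       # ~8 Gbit/s per pin for the CPU's DRAM
```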

  • One rather interesting thing is that you may think that computers only work using binary 1s and 0s.

  • However, in order to increase data transfer rates, GDDR6X and the latest graphics memory, GDDR7, send and receive data across the bus wires using multiple voltage levels beyond just 0 and 1.

  • For example, GDDR7 uses three different encoding schemes to combine binary bits into ternary digits, or PAM3 symbols with voltages of 0, 1, and negative 1.

  • Here's the encoding scheme on how three binary bits are encoded into two ternary digits, and this scheme is combined with an 11-bit to 7-ternary digit encoding scheme, resulting in sending 276 binary bits using only 176 ternary digits.
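
A quick check that those encoding figures are self-consistent. The 24 + 4 split used below to reach 276 bits in 176 ternary digits is my own decomposition for illustration; the video only states the totals.

```python
# 3 bits must fit into 2 ternary digits, and 11 bits into 7.
assert 2**3  <= 3**2     # 8 values into 9 symbol combinations
assert 2**11 <= 3**7     # 2048 values into 2187 combinations

# One decomposition that reproduces the quoted totals (assumed, not stated):
print(24 * 11 + 4 * 3)   # 276 binary bits
print(24 * 7  + 4 * 2)   # 176 ternary digits
```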

  • The previous generation GDDR6X, which is the memory in this 3090 graphics card, used a different encoding scheme called PAM4 to send two bits of data using four different voltage levels.

  • However, engineers in the graphics memory industry agreed to switch to PAM3 for future generations of graphics chips in order to reduce encoder complexity, improve the signal-to-noise ratio, and improve power efficiency.

  • Micron delivers consistent innovation to push the boundaries on how much data can be transferred every second and to design cutting-edge memory chips.

  • Another advancement by Micron is the development of HBM, or the high-bandwidth memory that surrounds AI chips.

  • HBM is built from stacks of DRAM memory chips and uses TSVs, or through-silicon vias, to connect the stack into a single chip, essentially forming a cube of AI memory.

  • For the latest generation of high-bandwidth memory, which is HBM3E, a single cube can have up to 24 to 36 gigabytes of memory, thus yielding 192 gigabytes of high-speed memory around the AI chip.

  • Next time you buy an AI accelerator system, make sure it uses Micron's HBM3E, which uses 30% less power than competing products.

  • However, unless you're building an AI data center, you're likely not in the market to buy one of these systems, which cost between $25,000 to $40,000 and are on backorder for a few years.

  • If you're curious about high-bandwidth memory or Micron's next generation of graphics memory, take a look at one of these links in the description.

  • Alternatively, if designing the next generation of memory chips interests you, Micron is always looking for talented scientists and engineers to help innovate on cutting-edge chips, and you can find out more about working for Micron using this link.

  • Now that we've explored many of the physical components inside this graphics card and GPU, let's next explore the computational architecture and see how applications like video game graphics and Bitcoin mining run what's called embarrassingly parallel operations.

  • Although it may sound like a silly name, embarrassingly parallel is actually a technical classification of computer problems where little or no effort is needed to divide the problem into parallel tasks, and video game rendering and Bitcoin mining easily fall into this category.

  • Essentially, GPUs solve embarrassingly parallel problems using a principle called SIMD, which stands for Single Instruction Multiple Data, where the same instructions or steps are repeated across thousands to millions of different numbers.

  • Let's see an example of how SIMD, or Single Instruction Multiple Data, is used to create this 3D video game environment.

  • As you may know already, this cowboy hat on the table is composed of approximately 28,000 triangles built by connecting together around 14,000 vertices, each with X, Y, and Z coordinates.

  • These vertex coordinates are built using a coordinate system called model space, with the origin of 0, 0, 0 being at the center of the hat.

  • To build a 3D world, we place hundreds of objects, each with their own model space, into the world environment, and, in order for the camera to be able to tell where each object is relative to other objects, we have to convert or transform all the vertices from each separate model space into the shared world coordinate system, or world space.

  • So, as an example, how do we convert the 14,000 vertices of the cowboy hat from model space into world space?

  • Well, we use a single instruction which adds the position of the origin of the hat in world space to the corresponding X, Y, and Z coordinate of a single vertex in model space.

  • Next, we copy this instruction to multiple data, which is all the remaining X, Y, and Z coordinates of the other thousands of vertices that are used to build the hat.

  • Next, we do the same for the table and the rest of the hundreds of other objects in the scene, each time using the same instructions but with the different objects' coordinates in world space, and each object's thousands of vertices in model space.

  • As a result, all the vertices and triangles of all the objects are converted to a common world space coordinate system, and the camera can now determine which objects are in front and which are behind.

  • This example illustrates the power of SIMD, or Single Instruction Multiple Data, and how a single instruction is applied to 5,629 different objects with a total of 8.3 million vertices within the scene, resulting in 25 million addition calculations.
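
The same step as a minimal NumPy sketch: one instruction, adding the object's world-space origin, applied to every vertex at once. The vertex data and the hat's position are made-up values, and rotation and scale are skipped here just as they are in the walkthrough above.

```python
import numpy as np

rng = np.random.default_rng(0)
hat_model = rng.uniform(-1.0, 1.0, size=(14_000, 3))   # X, Y, Z per vertex (made-up data)
hat_origin_world = np.array([12.0, 0.9, -3.0])          # made-up world-space position

# Single instruction ("add the origin"), multiple data (all 14,000 vertices).
hat_world = hat_model + hat_origin_world

# Scaled up to the whole scene quoted above:
print(int(8.3e6 * 3))    # ~25 million additions, one per coordinate
```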

  • The key to SIMD and embarrassingly parallel programs is that every one of these millions of calculations has no dependency on any other calculation, and thus all these calculations can be distributed to the thousands of cores of the GPU and completed in parallel with one another.

  • It's important to note that vertex transformation from model space to world space is just one of the first steps of a rather complicated video game graphics rendering pipeline, and we have a separate video that delves deeper into each of these other steps.

  • Also, we skipped over the transformations for the rotation and scale of each object, but factoring in these values is a similar process that requires additional SIMD calculations.

  • Now that we have a simple understanding of SIMD, let's discuss how this computational architecture matches up with the physical architecture.

  • Essentially, each instruction is completed by a thread, and this thread is matched to a single CUDA core.

  • Threads are bundled into groups of 32 called warps, and the same sequence of instructions is issued to all the threads in a warp.

  • Next, warps are grouped into thread blocks, which are handled by the streaming multiprocessor.

  • And then finally, thread blocks are grouped into grids, which are computed across the overall GPU.

  • All these computations are managed or scheduled by the Gigathread engine, which efficiently maps thread blocks to the available streaming multiprocessors.
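
To make the mapping concrete, here is how a job the size of the earlier vertex-transform example breaks down into warps and thread blocks. The 256-threads-per-block figure is an arbitrary choice of mine for illustration; block size is picked by the programmer, not fixed by the hardware.

```python
import math

threads = 8_300_000        # one thread per vertex, as in the earlier example
threads_per_warp = 32
threads_per_block = 256    # assumed block size, chosen by the programmer

print(math.ceil(threads / threads_per_warp))    # 259375 warps
print(math.ceil(threads / threads_per_block))   # 32422 thread blocks in the grid
```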

  • One important distinction is that within SIMD architecture, all 32 threads in a warp follow the same instructions and are in lockstep with each other, kind of like a phalanx of soldiers moving together.

  • This lockstep execution was how GPUs operated up until around 2016.

  • However, newer GPUs follow a SIMT architecture, or single instruction multiple threads.

  • The difference between SIMD and SIMT is that while both send the same set of instructions to each thread, with SIMT, the individual threads don't need to be in lockstep with each other and can progress at different rates.

  • In technical jargon, each thread is given its own program counter.

  • Additionally, with SIMT, all the threads within a streaming multiprocessor use a shared 128 kilobyte L1 cache, and thus data that's output by one thread can be subsequently used by a separate thread.

  • This improvement from SIMD to SIMT allows for more flexibility when warps diverge due to data-dependent conditional branching, and makes it easier for the threads to reconverge and reach barrier synchronization.
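
A conceptual sketch of why divergence matters under lockstep execution: when threads in a warp take different sides of a data-dependent branch, effectively both paths get evaluated and a mask selects each thread's result. This NumPy version is only an analogy for that behaviour, not how any particular GPU implements it.

```python
import numpy as np

x = np.array([-3.0, 1.5, -0.2, 4.0])

mask = x < 0          # which "threads" take the first branch
path_a = -x           # work for the x < 0 threads
path_b = x * 10.0     # work for the others
# Both paths were computed for every element; the mask keeps one result per lane.
print(np.where(mask, path_a, path_b))   # [ 3.  15.   0.2 40. ]
```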

  • Essentially, newer architectures of GPUs are more flexible and efficient, especially when encountering branches in code.

  • One additional note is that although you may think that the term warp is derived from warp drives, it actually comes from weaving, and specifically the Jacquard loom.

  • This loom from 1804 used programmable punch cards to select specific threads out of a set to weave together intricate patterns.

  • As fascinating as looms are, let's move on.

  • The final topics we'll explore are Bitcoin mining, tensor cores, and neural networks.

  • But first, we'd like to ask you to like this video, write a quick comment below, share it with a colleague, friend, or on social media, and subscribe if you haven't already.

  • The dream of Branch Education is to make free and accessible, visually engaging educational videos that dive deeply into a variety of topics on science, engineering, and how technology works, and then to combine multiple videos into an entirely free engineering curriculum for high school and college students.

  • Taking a few seconds to like, subscribe, and comment below helps us a ton.

  • Additionally, we have a Patreon page with AMAs and behind-the-scenes footage, and if you find what we do useful, we would appreciate any support.

  • Thank you.

  • So, now that we've explored how single instruction multiple threads is used in video games, let's briefly discuss why GPUs were initially used for mining Bitcoin.

  • We're not going to get too far into the algorithm behind the blockchain, and we'll save it for a separate episode.

  • But essentially, to create a block on the blockchain, the SHA-256 hashing algorithm is run on a set of data that includes transactions, a timestamp, additional data, and a random number called a nonce.

  • When these values are fed through the SHA-256 hashing algorithm, a seemingly random 256-bit value is output.

  • You can kind of think of this algorithm as a lottery ticket generator, where you can't pick the lottery number, but based on the input data, the SHA-256 algorithm generates a random lottery ticket number.

  • Therefore, if you change the nonce value and keep the rest of the transaction data the same, you'll generate a new random lottery ticket number.

  • The winner of this Bitcoin mining lottery is the first randomly generated lottery number to have its first 80 bits all zeros, while the remaining 176 bits don't matter.
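
A toy version of that lottery using Python's hashlib. The block contents are placeholders, and the difficulty here is only 16 leading zero bits so the loop finishes in a moment; the video's figure for Bitcoin itself is 80.

```python
import hashlib

block_data = b"transactions|timestamp|other-data|"   # stand-in for the real block contents
difficulty_bits = 16                                 # Bitcoin needs ~80; 16 keeps this demo fast

nonce = 0
while True:
    digest = hashlib.sha256(block_data + str(nonce).encode()).digest()
    if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
        print("winning nonce:", nonce, "hash:", digest.hex())
        break
    nonce += 1   # same data, new nonce, new "lottery ticket"
```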

  • And once a winning Bitcoin lottery ticket is found, the reward is 3 Bitcoin, and the lottery resets with a new set of transactions and input values.

  • So, why were graphics cards used?

  • Well, GPUs ran thousands of iterations of the SHA-256 algorithm, with the same transactions, timestamp, and other data, but with different nonce values.

  • As a result, a graphics card like this one could generate around 95 million SHA-256 hashes, or 95 million randomly numbered lottery tickets every second.

  • And hopefully, one of those lottery numbers would have the first 80 digits as all zeros.

  • However, nowadays computers filled with ASICs, or Application Specific Integrated Circuits, perform 250 trillion hashes a second, the equivalent of around 2.6 million of these graphics cards, which makes a graphics card mining Bitcoin look like a spoon next to an excavator, the excavator being an ASIC mining computer.

  • Let's next discuss the design of the tensor cores.

  • It'll take multiple full-length videos to cover generative AI and neural networks, so we'll focus on the exact matrix math that tensor cores solve.

  • Essentially, tensor cores take three matrices and multiply the first two, add in the third, and then output the result.

  • Let's look at one value of the output.

  • This value is equal to the sum of the products of the values in the first row of the first matrix and the corresponding values in the first column of the second matrix, with the corresponding value of the third matrix then added in.

  • Because all the values of the three input matrices are ready at the same time, the tensor cores complete all of the matrix multiplication and addition calculations concurrently.

  • Neural networks and generative AI require trillions to quadrillions of matrix multiplication and addition operations, and typically use much larger matrices.
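
The operation a tensor core performs, written out with tiny 4 × 4 matrices; the values are made up, and real tensor cores work on small fixed-size tiles in hardware. The last lines rebuild one output element the way it is described above: a row of the first matrix times a column of the second, summed, plus the matching element of the third.

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.integers(0, 5, size=(4, 4)).astype(float) for _ in range(3))

D = A @ B + C                                   # matrix multiply, then add the third matrix

d00 = np.sum(A[0, :] * B[:, 0]) + C[0, 0]       # first row x first column, plus C's element
print(d00 == D[0, 0])                           # True
```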

  • Finally, there are ray-tracing cores, which we explored in a separate video that's already been released.

  • That's pretty much it for graphics cards.

  • We're thankful to all our Patreon and YouTube membership sponsors for supporting our videos.

  • If you want to financially support our work, you can find the links in the description below.

  • This is Branch Education, and we create 3D animations that dive deeply into the technology that drives our modern world.

  • Watch another Branch video by clicking one of these cards, or click here to subscribe.

  • Thanks for watching to the end.
