  • Matmul stands for matrix multiplication. Matmul is a fundamental operation in neural networks that combines two matrices to produce another matrix.

  • Think of Matmul like a special kind of multiplication that helps neural networks learn and represent complex relationships between data.

  • Imagine you have two sets of numbers or matrices that represent features or patterns in your data.

  • Now Matrix Multiplication or Matmul combines these sets by multiplying and adding corresponding elements, creating a totally new matrix that represents the relationship between the original matrices or features.

  • This process helps neural networks transform input into meaningful output like predictions or classification.

  • Imagine you have a matrix of images and a matrix of filters that detect edges.

  • Matmul combines these matrices to produce a new matrix that represents the edges in each image.
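
As a concrete illustration of that multiply-and-add pattern, here is a minimal NumPy sketch; the matrices below are made-up toy values I chose for the example, not anything from the paper.

```python
import numpy as np

# Toy "image features": 2 images, each described by 3 feature values.
images = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# Toy "edge filters": 3 input features mapped to 2 edge detectors.
edge_filters = np.array([[ 1.0, -1.0],
                         [ 0.0,  1.0],
                         [-1.0,  0.0]])

# Matmul: each output entry is a sum of element-wise products of one
# row of `images` with one column of `edge_filters`.
edges = images @ edge_filters  # shape (2, 2)

# The same entry computed by hand, showing the multiply-and-add pattern.
manual_00 = sum(images[0, k] * edge_filters[k, 0] for k in range(3))
assert np.isclose(edges[0, 0], manual_00)
print(edges)
```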

  • This paper, which you can see on your screen, has taken AI social media by storm by giving it a new twist.

  • Before I tell you in detail what exactly this paper has done, let me give a bit more technical detail on Matmul, because it's a pivotal concept for understanding this whole scenario.

  • Matrix multiplication, or Matmul, as I said, is quite a dominant operation in most models these days, where dense layers involve vector-matrix multiplication.

  • Convolutions can be implemented as block-sparse VMMs with shared weights, and self-attention relies on matrix-matrix multiplication.
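
To make the dense-layer and self-attention cases concrete, here is a hedged NumPy sketch; the shapes, names, and the simplified softmax below are illustrative assumptions of mine, not taken from any particular model.

```python
import numpy as np

d, seq_len = 8, 4
x = np.random.randn(d)        # one token's activation vector
W = np.random.randn(d, d)     # dense-layer weight matrix

# Dense layer: a vector-matrix multiplication (VMM).
dense_out = x @ W

# Self-attention: matrix-matrix multiplications over a whole sequence.
X = np.random.randn(seq_len, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                                   # MatMul
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
attn_out = weights @ V                                          # MatMul again
```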

  • The prevalence of Matmul is primarily due to graphics processing units or GPUs being optimized for Matmul operations.

  • By leveraging Compute Unified Device Architecture, or CUDA, and highly optimized linear algebra libraries such as cuBLAS, the Matmul operation can be efficiently parallelized and accelerated.

  • This optimization, by the way, was a key factor in AlexNet's victory in the famous ImageNet competition.

  • Given this prevalence in deep learning, Matmul operations account for the dominant portion of computational expense, often consuming the majority of the execution time and memory access during both training and inference.

  • A lot of work has already been done where Matmul has been replaced with simpler operations, through two main strategies.

  • The first strategy involves substituting Matmul with elementary operations, and the second employs binary or ternary quantization, simplifying Matmul to operations where values are either flipped or zeroed out before accumulation (as sketched below).
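
A minimal sketch of that second strategy, assuming weights quantized to {-1, 0, +1}: every multiplication collapses into keeping, flipping, or dropping an activation before accumulation. The function name and toy values are mine, for illustration only.

```python
import numpy as np

def ternary_matvec(x, W_ternary):
    """Multiply vector x by a ternary matrix using only add/subtract.

    W_ternary contains only -1, 0, or +1, so each "multiplication"
    either keeps x[k], flips its sign, or zeroes it out.
    """
    out = np.zeros(W_ternary.shape[1])
    for j in range(W_ternary.shape[1]):
        for k in range(W_ternary.shape[0]):
            w = W_ternary[k, j]
            if w == 1:
                out[j] += x[k]      # keep
            elif w == -1:
                out[j] -= x[k]      # flip
            # w == 0: zeroed out, nothing accumulated
    return out

x = np.array([0.5, -1.0, 2.0])
W = np.array([[ 1, 0],
              [-1, 1],
              [ 0, 1]])
assert np.allclose(ternary_matvec(x, W), x @ W)  # same result, no multiplies
```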

  • In this paper, these researchers have developed the first scalable Matmul-free language model, or Matmul-free LM, by using additive operations in dense layers and element-wise Hadamard products for self-attention-like functions.

  • Specifically, ternary weights eliminate Matmul in dense layers, similar to BNNs.

  • To remove Matmul from self-attention, they have optimized the Gated Recurrent Unit (GRU) to rely solely on element-wise products and show that this model competes with state-of-the-art Transformers while eliminating all Matmul operations.
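
To give a feel for what "element-wise products only" means, here is a rough Python sketch of a gated recurrence built purely from Hadamard products. The gating arrangement and names below are my own simplification; the paper's actual MLGRU equations differ in detail.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elementwise_token_mixer(f_gates, candidates):
    """Mix a sequence of token vectors using only element-wise operations.

    f_gates, candidates: arrays of shape (seq_len, d), assumed to be
    produced upstream by ternary (Matmul-free) dense layers.
    """
    seq_len, d = candidates.shape
    h = np.zeros(d)
    outputs = np.empty_like(candidates)
    for t in range(seq_len):
        f = sigmoid(f_gates[t])                  # forget gate, element-wise
        h = f * h + (1.0 - f) * candidates[t]    # Hadamard products only
        outputs[t] = h
    return outputs

mixed = elementwise_token_mixer(np.random.randn(4, 8), np.random.randn(4, 8))
print(mixed.shape)  # (4, 8)
```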

  • In this diagram, you can see an overview of the Matmul-free LM, where the sequence of operations is shown for vanilla self-attention, the Matmul-free token mixer (on the top right), and ternary accumulation.

  • The Matmul-free LM employs a Matmul-free token mixer and a Matmul-free channel mixer to maintain performance while reducing compute cost.
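
For the channel mixer, here is a loose sketch in the spirit of a gated linear unit whose dense layers use ternary weights; the exact gate arrangement and names are assumptions of mine, not the paper's precise formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ternary_channel_mixer(x, W_gate, W_up, W_down):
    """GLU-style channel mixer with ternary weight matrices.

    All three weight matrices hold only {-1, 0, +1}, so each x @ W reduces
    to sign flips and additions; element-wise gating does the rest.
    """
    gate = sigmoid(x @ W_gate)     # ternary VMM + element-wise sigmoid
    up = x @ W_up                  # ternary VMM
    return (gate * up) @ W_down    # Hadamard product, then ternary VMM

d, hidden = 8, 16
rng = np.random.default_rng(0)
W_gate, W_up = (rng.integers(-1, 2, size=(d, hidden)) for _ in range(2))
W_down = rng.integers(-1, 2, size=(hidden, d))
y = ternary_channel_mixer(rng.standard_normal(d), W_gate, W_up, W_down)
print(y.shape)  # (8,)
```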

  • Similarly, if you look at this diagram, it primarily shows the comparison with other models, along with further performance analysis across different models.

  • I will also drop the link to this paper in the video's description, and you can read it at your leisure because it's quite an interesting read, in my humble opinion.

  • Now, look at this diagram.

  • This actually shows you, in a more in-depth yet accessible way, what is happening here.

  • So, to test the power usage and effectiveness of the Matmul-free LM on custom hardware that can better exploit ternary operations, these researchers have created an FPGA accelerator in SystemVerilog, and this is the whole overview of it.

  • There are four functional units in this design: row-wise operation, root mean square, load-store, and ternary matrix multiplication, and they each allow for simple out-of-order execution.

  • They also wrote a custom assembler for their custom instruction set, which was used to convert assembly files into an instruction ROM, and there is a lot of detail around that; they also have this register router, as you can see in the middle, which delegates incoming instructions to available registers.

  • The register file consists of eight registers, each storing one vector in a separate SRAM array.

  • Each register SRAM array has a read and write port that are delegated to at most one instruction at a time.

  • If an instruction requests access to a functional unit or a register that is busy, the program counter will stall until the functional unit or register has been freed.
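
To make this resource-conflict rule concrete, here is a tiny Python model I wrote of the scheduling idea: an instruction issues only when its functional unit and registers are free, otherwise it stalls, and independent instructions can run concurrently. This is purely illustrative and not the actual SystemVerilog design.

```python
# Toy model of the accelerator's issue logic: an instruction issues only
# when its functional unit and all of its registers are free; otherwise
# the program counter stalls. Independent instructions run concurrently.

def try_issue(instr, busy_units, busy_regs):
    """instr: dict with 'unit' (str) and 'regs' (set of register names)."""
    if instr["unit"] in busy_units or instr["regs"] & busy_regs:
        return False                  # stall: resource conflict
    busy_units.add(instr["unit"])     # claim the functional unit
    busy_regs |= instr["regs"]        # claim each register's port
    return True

busy_units, busy_regs = set(), set()
a = {"unit": "tmatmul", "regs": {"r0", "r1"}}
b = {"unit": "rms",     "regs": {"r2"}}       # independent instruction
c = {"unit": "tmatmul", "regs": {"r3"}}       # same unit as `a`
print(try_issue(a, busy_units, busy_regs))    # True
print(try_issue(b, busy_units, busy_regs))    # True  (runs concurrently)
print(try_issue(c, busy_units, busy_regs))    # False (stalls until freed)
```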

  • If two instructions do not block each other, they execute simultaneously.

  • We also see there is a root mean square functional unit that uses a specialized hardware algorithm to preserve precision over a few stages, and then there is this ternary matrix multiplication functional unit that takes in a DRAM address for a ternary matrix and performs a T-MATMUL on a specified vector.

  • This architecture places the ternary matrices entirely in DRAM; while a T-MATMUL instruction is running, an SRAM FIFO is simultaneously filled with sequential DRAM fetch results.

  • So the results are simply amazing, and the reported performance and area impact are for the simplest case, where the core only receives 8 bits at a time from memory.

  • All in all, an amazing demonstration of the feasibility and effectiveness of the first scalable Matmul-free language model.

  • This work challenges the paradigm that Matmul operations are indispensable for building high-performance language models and paves the way for the development of more efficient and hardware-friendly architectures.

  • They have also achieved performance on par with state-of-the-art Transformers while eliminating the need for Matmul operations, with an optimized implementation that significantly enhances both training and inference efficiency, reducing both memory usage and latency.

  • As the demand for deploying language models on various platforms grows, Matmul-free LMs present a promising direction for creating models that are both effective and resource-efficient.

  • So amazing, amazing stuff. By prioritizing the development and deployment of Matmul-free architectures such as this one, the future of LMs will only become more accessible, efficient, and sustainable, and that is probably why this paper has really taken AI social media by storm and everyone is talking about it, because this seems real, and I don't think it will just remain on paper.

  • I think we are going to see a lot of implementations of it.

  • We will see a lot of memory reduction, efficiency gains and hardware optimization.

  • Let me know what you think.

  • If you like the content, please consider subscribing to the channel and if you are already subscribed then please share it among your network as it helps.

  • Thanks for watching.
