                     Overview of gemmlowp design
                     ***************************


Primer on GEMM, kernels, and cache friendliness
===============================================

gemmlowp, like most GEMMs, implements the straightforward matrix multiplication
algorithm, which takes n^3 multiply-accumulate operations for n*n-sized
matrices. Because the arithmetic complexity grows faster than the memory
complexity (n^3 vs. n^2), memory accesses are redundant (each matrix entry
is accessed n times). A large part of a GEMM's performance and design goes
toward minimizing the inefficiency resulting from these redundant memory
accesses.
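
For concreteness, here is a minimal scalar sketch of this straightforward
algorithm (not gemmlowp code; the function name is invented), assuming
row-major storage and using gemmlowp-style types, uint8 inputs and int32
accumulators:

#include <cstdint>

// Sketch only: the straightforward O(n^3) algorithm. Each lhs/rhs entry is
// reloaded n times, which is the redundancy that blocking and packing
// (described below) aim to make cheap.
void NaiveGemm(const std::uint8_t* lhs, const std::uint8_t* rhs,
               std::int32_t* result, int n) {
  for (int r = 0; r < n; r++) {
    for (int c = 0; c < n; c++) {
      std::int32_t accum = 0;
      for (int d = 0; d < n; d++) {
        // One multiply-accumulate; n^3 of these in total.
        accum += std::int32_t(lhs[r * n + d]) * std::int32_t(rhs[d * n + c]);
      }
      result[r * n + c] = accum;
    }
  }
}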

Ultimately, once values are loaded into CPU registers, they cost nothing to
access, so as long as we can work within registers, this problem doesn't exist.
Thus, in order to be efficient, a GEMM's inner loops must make wise use of the
available registers to do as much arithmetic work as possible before loading
more data from memory into registers. This means that a GEMM implementation
needs architecture-specific inner loops, tailored to details such as the number
of registers, and typically written in assembly. This architecture-specific
'inner loops' component is referred to as the GEMM kernel. (More details about
kernels are in doc/kernels.txt.)
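
To make this concrete, here is a hedged scalar sketch of what a kernel does,
not an actual gemmlowp kernel (the name and the 4x4 size are invented, and the
panel layout is explained by the "packing" discussion further below): a 4x4
block of int32 accumulators is held in local variables, which the compiler can
keep in registers, so each value loaded from memory is reused 4 times:

#include <cstdint>

// Sketch only; real kernels use SIMD registers and assembly.
void MicroKernel4x4(const std::uint8_t* lhs_panel,  // 4 rows, depth-major
                    const std::uint8_t* rhs_panel,  // 4 cols, depth-major
                    std::int32_t* accum,            // 4x4, row-major
                    int depth) {
  std::int32_t acc[4][4] = {};
  for (int d = 0; d < depth; d++) {
    // Load 8 values once; they feed 16 multiply-accumulates (4x reuse each).
    std::int32_t lhs_val[4], rhs_val[4];
    for (int k = 0; k < 4; k++) {
      lhs_val[k] = lhs_panel[d * 4 + k];
      rhs_val[k] = rhs_panel[d * 4 + k];
    }
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
        acc[i][j] += lhs_val[i] * rhs_val[j];
  }
  // Write the finished 4x4 block of accumulators back to memory once.
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      accum[i * 4 + j] += acc[i][j];
}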

However, only small blocks can fit in registers at a given time, so at larger
scales one needs to repeatedly load blocks of matrices from memory, and
these accesses are redundant for the reason outlined above. The way that
one minimizes the resulting inefficiency is by organizing for cache locality,
so that most of these accesses hit the L1 cache, most of the remaining
ones hit the L2 cache, etc.

This is achieved by subdividing the matrices into blocks sized to fit in L2
cache, subdividing these blocks into sub-blocks sized to fit in L1 cache,
and performing the matrix multiplication one such block at a time.

In practice, it tends to pay off to "pack" input blocks for optimally
efficient traversal by the kernel, since they will be traversed multiple times.
"Packing" means at least reordering the data layout to ensure 1) simple access
patterns that fit the CPU's cache behavior (in particular, the cache line size),
and 2) simple loading into SIMD vector registers by the kernel.
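
As an illustration only (this is not gemmlowp's actual packed format, which
internal/pack.h defines; the function name is invented), here is how one might
pack a 4-row slice of a row-major lhs block into the depth-major layout that
the kernel sketch above reads linearly:

#include <cstdint>

// Sketch only: transpose a 4-row, row-major slice into a depth-major panel.
void PackLhsPanel(const std::uint8_t* lhs, int stride,  // row-major source
                  std::uint8_t* panel, int depth) {     // packed destination
  for (int d = 0; d < depth; d++) {
    for (int i = 0; i < 4; i++) {
      // After packing, the kernel's loads at a given depth are contiguous.
      panel[d * 4 + i] = lhs[i * stride + d];
    }
  }
}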

So a typical GEMM, in pseudo-code, tends to look like this:

allocate(some_lhs_L2_block);
allocate(some_rhs_L2_block);
for (some_lhs_L2_block) {
  pack(some_lhs_L2_block);
  for (some_rhs_L2_block) {
    pack(some_rhs_L2_block);
    for (some_lhs_sub_block in some_lhs_L2_block) {
      for (some_rhs_sub_block in some_rhs_L2_block) {
        kernel(some_lhs_sub_block, some_rhs_sub_block);
      }
    }
  }
}
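
The following is a runnable C++ sketch of this pseudo-code under simplifying
assumptions (the function name and block sizes are invented, the "kernel" is a
trivial scalar loop, and the L1-level sub-blocking is elided); it shows the
structure, not gemmlowp's implementation:

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch only: row-major uint8 matrices, int32 result. "Packing" here is just
// copying a block into a contiguous buffer.
void BlockedGemm(const std::uint8_t* lhs, const std::uint8_t* rhs,
                 std::int32_t* result, int rows, int depth, int cols) {
  const int kBlockRows = 64;  // placeholder sizes standing in for whatever
  const int kBlockCols = 64;  // fits the target's L2 cache
  std::vector<std::uint8_t> lhs_block(kBlockRows * depth);
  std::vector<std::uint8_t> rhs_block(depth * kBlockCols);
  for (int r = 0; r < rows; r += kBlockRows) {
    const int rs = std::min(kBlockRows, rows - r);
    // "Pack" the lhs block: copy rows [r, r+rs) into a contiguous buffer.
    for (int i = 0; i < rs; i++)
      std::copy_n(lhs + (r + i) * depth, depth, &lhs_block[i * depth]);
    for (int c = 0; c < cols; c += kBlockCols) {
      const int cs = std::min(kBlockCols, cols - c);
      // "Pack" the rhs block: copy columns [c, c+cs) into a contiguous,
      // column-major buffer so the inner loop walks it linearly.
      for (int j = 0; j < cs; j++)
        for (int d = 0; d < depth; d++)
          rhs_block[j * depth + d] = rhs[d * cols + c + j];
      // "Kernel": all operand accesses now stay within the two packed buffers.
      for (int i = 0; i < rs; i++) {
        for (int j = 0; j < cs; j++) {
          std::int32_t accum = 0;
          for (int d = 0; d < depth; d++)
            accum += std::int32_t(lhs_block[i * depth + d]) *
                     std::int32_t(rhs_block[j * depth + d]);
          result[(r + i) * cols + c + j] = accum;
        }
      }
    }
  }
}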


Impact of low-precision computation on gemmlowp design
======================================================

Refer to doc/low-precision.txt for specifics of the low-precision-computation
paradigm and how it's implemented in gemmlowp.

Inputs and outputs are matrices of uint8 values, but internally we are
accumulating int32 values, only converting them back to uint8 at the end. This
means that we need to store a block of int32 accumulators at a time. We compute
a block of the result in int32 accumulators and then we "unpack" it into the
destination matrix all at once. In this way, we minimize the amount of memory
used to store int32 values at any given time.

Because of that, besides the "pack" and "kernel" stages outlined above, a third
stage is needed in gemmlowp, which we call "unpack". Thus we arrive at the
3-stage computation scheme that gemmlowp uses:

  1. Pack lhs/rhs blocks from the input matrices.
  2. Compute the product of the packed blocks, using the kernel.
  3. Unpack the result block into the output matrix (sketched just below).
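
As a sketch of what stage 3 involves, here is one plausible form of the
offset/multiplier/shift requantization that maps an int32 accumulator to a
uint8 destination value. The parameter names mirror those visible in the
single_thread_gemm.h excerpt further below, but the exact formula (in
particular the rounding) is an assumption here; doc/low-precision.txt is the
authority:

#include <algorithm>
#include <cstdint>

// Sketch only; the function name is invented and the rounding is assumed.
std::uint8_t Requantize(std::int32_t accum, std::int32_t result_offset,
                        std::int32_t result_mult_int,
                        std::int32_t result_shift) {
  const std::int32_t rounding =
      (result_shift < 1) ? 0 : (1 << (result_shift - 1));
  const std::int32_t value =
      ((accum + result_offset) * result_mult_int + rounding) >> result_shift;
  // Saturate to the uint8 range of the destination matrix.
  return std::uint8_t(std::min(255, std::max(0, value)));
}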

The pseudo-code overview of gemmlowp now looks like:

allocate(some_lhs_L2_block);
allocate(some_rhs_L2_block);
// new: temp storage for int32 accums
allocate(some_int32_accumulators_block);
for (some_lhs_L2_block) {
  pack(some_lhs_L2_block);
  for (some_rhs_L2_block) {
    pack(some_rhs_L2_block);
    for (some_lhs_sub_block in some_lhs_L2_block) {
      for (some_rhs_sub_block in some_rhs_L2_block) {
        // new: pass int32 accums to kernel
        kernel(&some_int32_accumulators_block,
               some_lhs_sub_block,
               some_rhs_sub_block);
      }
    }
    // new: unpack int32 accums into destination matrix
    unpack(some_int32_accumulators_block);
  }
}


Exploring gemmlowp code
=======================

The design outlined above can be readily matched to gemmlowp source code,
in particular in this file, which gives a simple GEMM implementation fitting in
one rather small function:

  internal/single_thread_gemm.h

The reader can compare the above pseudo-code to the actual code in this file:
  for (int r = 0; r < rows; r += block_params.l2_rows) {
    int rs = std::min(block_params.l2_rows, rows - r);

    // Stage 1: pack an lhs block.
    PackLhs(&packed_lhs, lhs.block(r, 0, rs, depth));

    for (int c = 0; c < cols; c += block_params.l2_cols) {
      int cs = std::min(block_params.l2_cols, cols - c);

      // Stage 1: pack an rhs block (unless a single packing, done outside
      // this loop, already covers all columns).
      if (!pack_rhs_once) {
        PackRhs(&packed_rhs, rhs.block(0, c, depth, cs));
      }

      // Stage 2: run the kernel on the packed blocks.
      Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs);

      // Stage 3: unpack int32 accumulators into the destination block.
      auto result_block = result->block(r, c, rs, cs);
      UnpackResult(&result_block, packed_result, packed_lhs, packed_rhs, depth,
                   result_offset, result_mult_int, result_shift);
    }
  }

The files in internal/ fall into a few categories:

There are two top-level GEMM implementations:
  single_thread_gemm.h
  multi_thread_gemm.h

They both call into pack/compute/unpack stages implemented in the following files:
  pack.h
  compute.h
  unpack.h

The pack.h and unpack.h files contain generic templated code that can be overridden
by optimized code in template specializations; see the NEON-optimized code in:
  pack_neon.h
  unpack_neon.h
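
The following hypothetical sketch illustrates that mechanism; the names are
invented and do not match gemmlowp's actual templates. Generic code provides a
portable fallback, and a template specialization shadows it for one particular
packed format:

// Sketch only: the generic, portable path.
template <typename PackedBlock>
struct PackImpl {
  static void Run(PackedBlock* block) {
    // Generic packing loops would go here.
  }
};

// A header like pack_neon.h can then supply an optimized specialization
// that takes precedence for one specific packed block format.
struct SomeNeonPackedBlock {};  // hypothetical format tag
template <>
struct PackImpl<SomeNeonPackedBlock> {
  static void Run(SomeNeonPackedBlock* block) {
    // NEON-optimized packing would go here.
  }
};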

The compute stage contains generic code in compute.h that only calls into
optimized code through the Kernel::Run() entry point. Each kernel is basically
just a struct offering a Run() implementation; see the NEON kernels in:
  kernel_neon.h
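
Sketched abstractly, a kernel has the following shape; the name and exact
signature here are invented for illustration (kernel_neon.h has the real
ones):

#include <cstdint>

// Sketch only: Run() is where the architecture-specific inner loops,
// typically assembly, live.
struct SomeKernel {
  void Run(std::int32_t* int32_accumulators,
           const std::uint8_t* packed_lhs_sub_block,
           const std::uint8_t* packed_rhs_sub_block,
           int depth) const {
    // Architecture-specific multiply-accumulate loops go here.
  }
};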

More details about the interplay between these components can be found in this
file:
  doc/kernels.txt