Diffstat (limited to 'doc/low-precision.txt')
-rw-r--r--  doc/low-precision.txt  159
1 files changed, 0 insertions, 159 deletions
diff --git a/doc/low-precision.txt b/doc/low-precision.txt
deleted file mode 100644
index 893961f..0000000
--- a/doc/low-precision.txt
+++ /dev/null
@@ -1,159 +0,0 @@
- The low-precision paradigm in gemmlowp, and how it's implemented
- ****************************************************************
-
-
-Introduction
-============
-
-"Low-precision" means that the input and output matrix entries are integers
-on at most 8 bits. The scalar type is uint8_t.
-
-This isn't the same as just doing plain matrix arithmetic over uint8_t,
-because that would overflow. To avoid overflow, we internally accumulate
-results on more than 8 bits, and at the end we keep only some significant
-8 bits. This relies on the caller providing suitable offset/multiplier/shift
-parameters, which effectively govern how we extract some significant 8 bits
-from our more-than-8-bit temporary accumulators.
-
-Gemmlowp supports further reducing precision below 8 bits. That is not
-the subject of this document; for that, refer to doc/less-than-8-bit.txt.
-
-
-The low-precision paradigm
-==========================
-
-gemmlowp is an implementation of the EightBitIntGemm paradigm, where quantized
-matrix multiplication takes the following input parameters:
- - the lhs matrix of uint8 quantized values
- - the rhs matrix of uint8 quantized values
- - the following int32 "quantization parameters", which control how the
- uint8 quantized values in the matrices are to be interpreted during the
- matrix computation:
- - lhs_offset
- - rhs_offset
- - result_offset
- - result_mult_int
- - result_shift
-
-The mathematical expression to be computed is the result of the following steps:
- 1. Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
- 2. Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
- 3. Compute the int32 matrix product of the resulting lhs times rhs.
- 4. Add result_offset to each entry of the result.
- 5. Multiply each entry of the result by the following fraction, and round
- to the nearest integer:
-
- result_mult_int
- --------------- (1)
- 2^result_shift
-
- 6. Clamp the resulting int32 values to the [0..255] range and cast to uint8.
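
Steps 1 through 6 can be sketched as a scalar reference computation. This is a minimal illustration, not gemmlowp's actual kernel; the function name ReferenceEightBitIntGemm and the row-major layout are assumptions made here for the sake of the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical reference (not the gemmlowp kernel): computes the
// EightBitIntGemm result for row-major lhs (M x K) and rhs (K x N).
std::vector<uint8_t> ReferenceEightBitIntGemm(
    const std::vector<uint8_t>& lhs, const std::vector<uint8_t>& rhs,
    int M, int K, int N,
    int32_t lhs_offset, int32_t rhs_offset,
    int32_t result_offset, int32_t result_mult_int, int result_shift) {
  std::vector<uint8_t> result(M * N);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      int32_t acc = 0;
      for (int k = 0; k < K; ++k) {
        // Steps 1-3: cast to int32, add the offsets, accumulate in int32.
        acc += (int32_t(lhs[m * K + k]) + lhs_offset) *
               (int32_t(rhs[k * N + n]) + rhs_offset);
      }
      // Step 4: add result_offset.
      acc += result_offset;
      // Step 5: multiply by result_mult_int / 2^result_shift, rounding
      // to nearest by adding half of the divisor before shifting.
      const int32_t rounding = (result_shift > 0) ? (1 << (result_shift - 1)) : 0;
      acc = (acc * result_mult_int + rounding) >> result_shift;
      // Step 6: clamp to [0, 255] and cast to uint8.
      acc = std::min<int32_t>(255, std::max<int32_t>(0, acc));
      result[m * N + n] = uint8_t(acc);
    }
  }
  return result;
}
```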
-
-Thus the caller of this interface is expected to have precomputed suitable
-quantization parameters.
-
-The rationale for these parameters is as follows:
- - The three offsets may improve quantization accuracy in cases where the
- range of values is limited, and they also conveniently allow reducing all
- eight combinations of signedness to just the unsigned*unsigned->unsigned
- case. One may at first glance worry that these offsets would incur
- substantial overhead to the GEMM computation, but that is actually not the
- case thanks to a trick described below (see "Efficient handling of
- offsets").
- - The result_mult_int and result_shift parameters allow approximating
- arbitrarily closely any real multiplier, as a fraction of the form given
- in (1) above, without using floating-point arithmetic and without using
- a division instruction (only a right shift).
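
As a concrete illustration of (1), a real multiplier can be quantized by rounding it at a chosen shift. The function name QuantizeMultiplier and the example multiplier 0.345 are assumptions made for this sketch; a larger result_shift yields a tighter approximation.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: pick result_mult_int so that
// result_mult_int / 2^result_shift approximates real_multiplier.
// The approximation error is at most 2^-(result_shift + 1).
int32_t QuantizeMultiplier(double real_multiplier, int result_shift) {
  return int32_t(std::round(real_multiplier * (1 << result_shift)));
}
```

For example, with result_shift = 8, the multiplier 0.345 is approximated as 88/256 = 0.34375, computed without floating-point arithmetic or a division at GEMM time, only a right shift.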
-
-
-Efficient handling of offsets
-=============================
-
-At first glance it may seem like the above-described quantized computation
-scheme requires adding the lhs_offset and rhs_offset to each of the lhs and
-rhs matrix entries.
-
-Doing that in the GEMM kernel would incur substantial overhead:
- - It would mean extra arithmetic work in the GEMM kernel;
- - It would require storing the lhs_offset and rhs_offset in registers,
- which would eat into the register space available for the rest of the
- GEMM kernel.
-
-One may then consider adding the lhs_offset and rhs_offset once and for all
-to lhs and rhs blocks, in a GEMM implementation operating on one lhs block
-and one rhs block at a time. However, doing so would require storing lhs and
-rhs blocks in 32 bit (or at least in 16 bit in real-world cases), which would
-partially negate the memory bandwidth benefits of low-precision computation.
-
-Fortunately, there is another way to handle these offsets that has none of
-the costs of the approaches described above. The idea is as follows.
-
-Let P denote the matrix shaped like lhs, but filled with 1's.
-Let Q denote the matrix shaped like rhs, but filled with 1's.
-
-Adding lhs_offset to each entry of lhs means adding lhs_offset * P to lhs.
-Adding rhs_offset to each entry of rhs means adding rhs_offset * Q to rhs.
-
-Thus, as far as handling lhs_offset and rhs_offset goes, the matrix product to be
-computed is:
-
- (lhs + lhs_offset * P) * (rhs + rhs_offset * Q)
-
-Expanding this (using distributivity of matrix multiplication over addition),
-we see that the above product is equal to the following sum of 4 terms:
-
- lhs * rhs (2)
- + lhs_offset * P * rhs
- + lhs * rhs_offset * Q
- + lhs_offset * rhs_offset * P * Q
-
-The first term, lhs * rhs, is just the matrix multiplication ignoring the
-offsets, i.e. as if lhs_offset==rhs_offset==0. Our claim here is that this
-is all we have to compute in the GEMM kernel.
-
-In the second term, lhs_offset * P * rhs, notice that since P is filled
-with 1's, P * rhs has all its rows equal to each other, and equal to the
-row-vector of sums of all the entries in each column of rhs.
-
-Thus, we can compute the second term, lhs_offset * P * rhs, by summing
-each column of rhs. This produces a single row-vector, and in order to add the
-second term, we simply need to add this row-vector (multiplied by lhs_offset)
-to each row of the result. This is just a rank one update of the result
-(equivalently, the second term is a rank one matrix), and we can efficiently
-store it as a single vector.
-
-The third term, lhs * rhs_offset * Q, is entirely similar to the second one,
-and can be similarly computed by summing each row of lhs, storing this in a
-single column-vector, and later multiplying these sums by rhs_offset.
-
-The fourth term is a single constant, repeated into all the entries of the
-matrix. The matrix P * Q is filled with the single constant value 'depth'
-(the depth of the matrix product, i.e. the number of columns of the lhs).
-Thus the fourth term is simply the rank zero update adding this constant
-to each matrix entry:
-
- lhs_offset * rhs_offset * depth
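
The four-term expansion in (2) can be checked numerically for a single result entry. This is a toy sketch, not gemmlowp code; the function names DirectEntry and ExpandedEntry are made up for the illustration.

```cpp
#include <cstdint>
#include <vector>

// Direct: add the offsets to every lhs/rhs entry, then multiply-accumulate.
// lhs is row-major M x K, rhs is row-major K x N; (m, n) selects the entry.
int32_t DirectEntry(const std::vector<int32_t>& lhs,
                    const std::vector<int32_t>& rhs, int K, int N,
                    int m, int n, int32_t lhs_offset, int32_t rhs_offset) {
  int32_t acc = 0;
  for (int k = 0; k < K; ++k)
    acc += (lhs[m * K + k] + lhs_offset) * (rhs[k * N + n] + rhs_offset);
  return acc;
}

// Expanded: the plain product, plus the rank-one terms built from the sum
// of column n of rhs and the sum of row m of lhs, plus the rank-zero term
// lhs_offset * rhs_offset * depth.
int32_t ExpandedEntry(const std::vector<int32_t>& lhs,
                      const std::vector<int32_t>& rhs, int K, int N,
                      int m, int n, int32_t lhs_offset, int32_t rhs_offset) {
  int32_t plain = 0, col_sum_rhs = 0, row_sum_lhs = 0;
  for (int k = 0; k < K; ++k) {
    plain += lhs[m * K + k] * rhs[k * N + n];
    col_sum_rhs += rhs[k * N + n];  // contributes lhs_offset * P * rhs
    row_sum_lhs += lhs[m * K + k];  // contributes lhs * rhs_offset * Q
  }
  return plain + lhs_offset * col_sum_rhs + rhs_offset * row_sum_lhs +
         lhs_offset * rhs_offset * K;  // depth == K
}
```

Both functions agree on every entry, which is why the kernel only ever needs to compute the plain product.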
-
-
-Implementation of this technique in gemmlowp
-============================================
-
-In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs
-to prepare them for efficient repeated traversal by the kernel), we compute
-the sum of each row of the lhs block and the sum of each column of the rhs
-block.
-
-See in internal/pack.h, in the PackedSideBlock class, the following member:
-
- // Handle on the additional buffer backing the vector of sums of slices
- // associated with this block. Owned.
- Allocator::Handle sums_of_each_slice_handle_;
-
-sums_of_each_slice_handle_ is the handle to the buffer allocated to store
-the vector containing sums of rows of lhs, or of sums of columns of rhs.
-
-After these rank one updates have been computed at the packing stage, they are
-ignored at the compute kernel stage, since that stage is only concerned
-with the first of the four terms in (2); they are only used at the unpacking
-stage. See the default/reference implementation, UnpackResultImpl, in
-internal/unpack.h.