Diffstat (limited to 'doc/low-precision.txt')
-rw-r--r--   doc/low-precision.txt | 159
1 file changed, 0 insertions, 159 deletions
diff --git a/doc/low-precision.txt b/doc/low-precision.txt
deleted file mode 100644
index 893961f..0000000
--- a/doc/low-precision.txt
+++ /dev/null
@@ -1,159 +0,0 @@

The low-precision paradigm in gemmlowp, and how it's implemented
****************************************************************


Introduction
============

"Low-precision" means that the input and output matrix entries are integers
on at most 8 bits. The scalar type is uint8_t.

This isn't the same as just doing plain matrix arithmetic over uint8_t,
because that would overflow. To avoid overflow, we internally accumulate
results on more than 8 bits, and at the end we keep only some significant
8 bits. This relies on the caller providing suitable offset/multiplier/shift
parameters, which effectively govern how we extract some significant 8 bits
from our more-than-8-bit temporary accumulators.

Gemmlowp supports further reducing precision below 8 bits. That is not
the subject of this document; for that, refer to doc/less-than-8-bit.txt.


The low-precision paradigm
==========================

gemmlowp is an implementation of the EightBitIntGemm paradigm, where quantized
matrix multiplication takes the following input parameters:
  - the lhs matrix of uint8 quantized values
  - the rhs matrix of uint8 quantized values
  - the following int32 "quantization parameters", which control how the
    uint8 quantized values in the matrices are to be interpreted during the
    matrix computation:
      - lhs_offset
      - rhs_offset
      - result_offset
      - result_mult_int
      - result_shift

The mathematical expression to be computed is the result of the following
steps:
  1. Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
  2. Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
  3. Compute the int32 matrix product of the resulting lhs times rhs.
  4. Add result_offset to each entry of the result.
  5. Multiply each entry of the result by the following fraction, and round
     to the nearest integer:

         result_mult_int
         ---------------                                        (1)
         2^result_shift

  6. Clamp the resulting int32 values to the [0..255] range and cast to
     uint8.

Thus the caller of this interface is expected to have precomputed suitable
quantization parameters.

The rationale for these parameters is as follows:
  - The three offsets may improve quantization accuracy in cases where the
    range of values is limited, and they also conveniently allow reducing
    all eight combinations of signedness to just the
    unsigned*unsigned->unsigned case. One may at first glance worry that
    these offsets would incur substantial overhead in the GEMM computation,
    but that is actually not the case, thanks to a trick described below
    (see "Efficient handling of offsets").
  - The result_mult_int and result_shift parameters allow approximating
    arbitrarily closely any real multiplier, as a fraction of the form given
    in (1) above, without using floating-point arithmetic and without using
    a division instruction (only a right shift).


Efficient handling of offsets
=============================

At first glance it may seem like the above-described quantized computation
scheme requires adding the lhs_offset and rhs_offset to each of the lhs and
rhs matrix entries.

Doing that in the GEMM kernel would incur substantial overhead:
  - It would mean extra arithmetic work in the GEMM kernel;
  - It would require storing the lhs_offset and rhs_offset in registers,
    which would eat into the register space available for the rest of the
    GEMM kernel.

One may then consider adding the lhs_offset and rhs_offset once and for all
to lhs and rhs blocks, in a GEMM implementation operating on one lhs block
and one rhs block at a time.
However, doing so would require storing lhs
and rhs blocks in 32 bit (or at least in 16 bit in real-world cases), which
would partially negate the memory bandwidth benefits of low-precision
computation.

Fortunately, there is another way to handle these offsets that has none of
the costs of the approaches described above. The idea is as follows.

Let P denote the matrix shaped like lhs, but filled with 1's.
Let Q denote the matrix shaped like rhs, but filled with 1's.

Adding lhs_offset to each entry of lhs means adding lhs_offset * P to lhs.
Adding rhs_offset to each entry of rhs means adding rhs_offset * Q to rhs.

Thus, as far as handling lhs_offset and rhs_offset goes, the matrix product
to be computed is:

  (lhs + lhs_offset * P) * (rhs + rhs_offset * Q)

Expanding this (using distributivity of matrix multiplication over addition),
we see that the above product is equal to the following sum of 4 terms:

    lhs * rhs                                                   (2)
  + lhs_offset * P * rhs
  + lhs * rhs_offset * Q
  + lhs_offset * rhs_offset * P * Q

The first term, lhs * rhs, is just the matrix multiplication ignoring the
offsets, i.e. as if lhs_offset==rhs_offset==0. Our claim here is that this
is all we have to compute in the GEMM kernel.

In the second term, lhs_offset * P * rhs, notice that since P is filled
with 1's, P * rhs has all its rows equal to each other, and equal to the
row-vector of sums of all the entries in each column of rhs.

Thus, we can compute the second term, lhs_offset * P * rhs, by summing
each column of rhs. This produces a single row-vector, and in order to add
the second term, we simply need to add this row-vector (multiplied by
lhs_offset) to each row of the result. This is just a rank one update of
the result (equivalently, the second term is a rank one matrix), and we can
efficiently store it as a single vector.
The third term, lhs * rhs_offset * Q, is entirely similar to the second one,
and can be similarly computed by summing each row of lhs, storing this in a
single column-vector, and later multiplying these sums by rhs_offset.

The fourth term is a single constant, repeated into all the entries of the
matrix. The matrix P * Q is filled with the single constant value 'depth'
(the depth of the matrix product, i.e. the number of columns of the lhs).
Thus the fourth term is simply the rank zero update adding this constant
to each matrix entry:

  lhs_offset * rhs_offset * depth


Implementation of this technique in gemmlowp
============================================

In gemmlowp, at the packing stage (where we traverse blocks of the lhs and
rhs to prepare them for efficient repeated traversal by the kernel), we
compute the sum of each row of the lhs block and the sum of each column of
the rhs block.

See in internal/pack.h, in the PackedSideBlock class, the following member:

  // Handle on the additional buffer backing the vector of sums of slices
  // associated with this block. Owned.
  Allocator::Handle sums_of_each_slice_handle_;

sums_of_each_slice_handle_ is the handle to the buffer allocated to store
the vector containing the sums of rows of lhs, or the sums of columns of rhs.

After these sums have been computed at the packing stage, they are ignored
at the compute kernel stage, since that stage is only concerned with the
first of the four terms in (2); they are only used at the unpacking stage.
See the default/reference implementation, UnpackResultImpl, in
internal/unpack.h.