Diffstat (limited to 'doc/low-precision.txt')
-rw-r--r--  doc/low-precision.txt  159
1 files changed, 0 insertions, 159 deletions
diff --git a/doc/low-precision.txt b/doc/low-precision.txt
deleted file mode 100644
index 893961f..0000000
--- a/doc/low-precision.txt
+++ /dev/null
@@ -1,159 +0,0 @@
- The low-precision paradigm in gemmlowp, and how it's implemented
- ****************************************************************
-
-
-Introduction
-============
-
-"Low-precision" means that the input and output matrix entries are integers
-on at most 8 bits. The scalar type is uint8_t.
-
-This isn't the same as just doing plain matrix arithmetic over uint8_t,
-because that would overflow. To avoid overflow, we internally accumulate
-results on more than 8 bits, and at the end we keep only some significant
-8 bits. This relies on the caller providing suitable offset/multiplier/shift
-parameters, which effectively govern how we extract some significant 8 bits
-from our more-than-8-bit temporary accumulators.
-
-Gemmlowp supports further reducing precision below 8 bits. That is not
-the subject of this document; for that, refer to doc/less-than-8-bit.txt.
-
-
-The low-precision paradigm
-==========================
-
-gemmlowp is an implementation of the EightBitIntGemm paradigm, where quantized
-matrix multiplication takes the following input parameters:
- - the lhs matrix of uint8 quantized values
- - the rhs matrix of uint8 quantized values
- - the following int32 "quantization parameters", which control how the
- uint8 quantized values in the matrices are to be interpreted during the
- matrix computation:
- - lhs_offset
- - rhs_offset
- - result_offset
- - result_mult_int
- - result_shift
-
-The mathematical expression to be computed is the result of the following steps:
- 1. Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
- 2. Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
- 3. Compute the int32 matrix product of the resulting lhs times rhs.
- 4. Add result_offset to each entry of the result.
- 5. Multiply each entry of the result by the following fraction, and round
- to the nearest integer:
-
- result_mult_int
- --------------- (1)
- 2^result_shift
-
- 6. Clamp the resulting int32 values to the [0..255] range and cast to uint8.
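
Steps 1 through 6 can be sketched as a scalar reference computation. This is a minimal illustration, not gemmlowp's actual kernel; the function name ReferenceEightBitIntGemm and the row-major layout are assumptions made here for the sake of the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical reference (not the gemmlowp kernel): computes the
// EightBitIntGemm result for row-major lhs (M x K) and rhs (K x N).
std::vector<uint8_t> ReferenceEightBitIntGemm(
    const std::vector<uint8_t>& lhs, const std::vector<uint8_t>& rhs,
    int M, int K, int N,
    int32_t lhs_offset, int32_t rhs_offset,
    int32_t result_offset, int32_t result_mult_int, int result_shift) {
  std::vector<uint8_t> result(M * N);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      int32_t acc = 0;
      for (int k = 0; k < K; ++k) {
        // Steps 1-3: cast to int32, add the offsets, accumulate in int32.
        acc += (int32_t(lhs[m * K + k]) + lhs_offset) *
               (int32_t(rhs[k * N + n]) + rhs_offset);
      }
      // Step 4: add result_offset.
      acc += result_offset;
      // Step 5: multiply by result_mult_int / 2^result_shift, rounding
      // to nearest by adding half of the divisor before shifting.
      const int32_t rounding = (result_shift > 0) ? (1 << (result_shift - 1)) : 0;
      acc = (acc * result_mult_int + rounding) >> result_shift;
      // Step 6: clamp to [0, 255] and cast to uint8.
      acc = std::min<int32_t>(255, std::max<int32_t>(0, acc));
      result[m * N + n] = uint8_t(acc);
    }
  }
  return result;
}
```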
-
-Thus the caller of this interface is expected to have precomputed suitable
-quantization parameters.
-
-The rationale for these parameters is as follows:
- - The three offsets may improve quantization accuracy in cases where the
- range of values is limited, and they also conveniently allow reducing all
- eight combinations of signedness to just the unsigned*unsigned->unsigned
- case. One may at first glance worry that these offsets would incur
- substantial overhead to the GEMM computation, but that is actually not the
- case thanks to a trick described below (see "Efficient handling of
- offsets").
- - The result_mult_int and result_shift parameters allow approximating
- arbitrarily closely any real multiplier, as a fraction of the form given
- in (1) above, without using floating-point arithmetic and without using
- a division instruction (only a right shift).
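
As a concrete illustration of (1), a real multiplier can be quantized by rounding it at a chosen shift. The function name QuantizeMultiplier and the example multiplier 0.345 are assumptions made for this sketch; a larger result_shift yields a tighter approximation.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: pick result_mult_int so that
// result_mult_int / 2^result_shift approximates real_multiplier.
// The approximation error is at most 2^-(result_shift + 1).
int32_t QuantizeMultiplier(double real_multiplier, int result_shift) {
  return int32_t(std::round(real_multiplier * (1 << result_shift)));
}
```

For example, with result_shift = 8, the multiplier 0.345 is approximated as 88/256 = 0.34375, computed without floating-point arithmetic or a division at GEMM time, only a right shift.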
-
-
-Efficient handling of offsets
-=============================
-
-At first glance it may seem like the above-described quantized computation
-scheme requires adding the lhs_offset and rhs_offset to each of the lhs and
-rhs matrix entries.
-
-Doing that in the GEMM kernel would incur substantial overhead:
- - It would mean extra arithmetic work in the GEMM kernel;
- - It would require storing the lhs_offset and rhs_offset in registers,
- which would eat into the register space available for the rest of the
- GEMM kernel.
-
-One may then consider adding the lhs_offset and rhs_offset once and for all
-to lhs and rhs blocks, in a GEMM implementation operating on one lhs block
-and one rhs block at a time. However, doing so would require storing lhs and
-rhs blocks in 32 bit (or at least in 16 bit in real-world cases), which would
-partially negate the memory bandwidth benefits of low-precision computation.
-
-Fortunately, there is another way to handle these offsets that has none of
-the costs of the approaches described above. The idea is as follows.
-
-Let P denote the matrix shaped like lhs, but filled with 1's.
-Let Q denote the matrix shaped like rhs, but filled with 1's.
-
-Adding lhs_offset to each entry of lhs means adding lhs_offset * P to lhs.
-Adding rhs_offset to each entry of rhs means adding rhs_offset * Q to rhs.
-
-Thus, as far as handling lhs_offset and rhs_offset goes, the matrix product to be
-computed is:
-
- (lhs + lhs_offset * P) * (rhs + rhs_offset * Q)
-
-Expanding this (using distributivity of matrix multiplication over addition),
-we see that the above product is equal to the following sum of 4 terms:
-
- lhs * rhs (2)
- + lhs_offset * P * rhs
- + lhs * rhs_offset * Q
- + lhs_offset * rhs_offset * P * Q
-
-The first term, lhs * rhs, is just the matrix multiplication ignoring the
-offsets, i.e. as if lhs_offset==rhs_offset==0. Our claim here is that this
-is all we have to compute in the GEMM kernel.
-
-In the second term, lhs_offset * P * rhs, notice that since P is filled
-with 1's, P * rhs has all its rows equal to each other, and equal to the
-row-vector of sums of all the entries in each column of rhs.
-
-Thus, we can compute the second term, lhs_offset * P * rhs, by summing
-each column of rhs. This produces a single row-vector, and in order to add the
-second term, we simply need to add this row-vector (multiplied by lhs_offset)
-to each row of the result. This is just a rank one update of the result
-(equivalently, the second term is a rank one matrix), and we can efficiently
-store it as a single vector.
-
-The third term, lhs * rhs_offset * Q, is entirely similar to the second one,
-and can be similarly computed by summing each row of lhs, storing this in a
-single column-vector, and later multiplying these sums by rhs_offset.
-
-The fourth term is a single constant, repeated into all the entries of the
-matrix. The matrix P * Q is filled with the single constant value 'depth'
-(the depth of the matrix product, i.e. the number of columns of the lhs).
-Thus the fourth term is simply the rank zero update adding this constant
-to each matrix entry:
-
- lhs_offset * rhs_offset * depth
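
The four-term expansion in (2) can be checked numerically for a single result entry. This is a toy sketch, not gemmlowp code; the function names DirectEntry and ExpandedEntry are made up for the illustration.

```cpp
#include <cstdint>
#include <vector>

// Direct: add the offsets to every lhs/rhs entry, then multiply-accumulate.
// lhs is row-major M x K, rhs is row-major K x N; (m, n) selects the entry.
int32_t DirectEntry(const std::vector<int32_t>& lhs,
                    const std::vector<int32_t>& rhs, int K, int N,
                    int m, int n, int32_t lhs_offset, int32_t rhs_offset) {
  int32_t acc = 0;
  for (int k = 0; k < K; ++k)
    acc += (lhs[m * K + k] + lhs_offset) * (rhs[k * N + n] + rhs_offset);
  return acc;
}

// Expanded: the plain product, plus the rank-one terms built from the sum
// of column n of rhs and the sum of row m of lhs, plus the rank-zero term
// lhs_offset * rhs_offset * depth.
int32_t ExpandedEntry(const std::vector<int32_t>& lhs,
                      const std::vector<int32_t>& rhs, int K, int N,
                      int m, int n, int32_t lhs_offset, int32_t rhs_offset) {
  int32_t plain = 0, col_sum_rhs = 0, row_sum_lhs = 0;
  for (int k = 0; k < K; ++k) {
    plain += lhs[m * K + k] * rhs[k * N + n];
    col_sum_rhs += rhs[k * N + n];  // contributes lhs_offset * P * rhs
    row_sum_lhs += lhs[m * K + k];  // contributes lhs * rhs_offset * Q
  }
  return plain + lhs_offset * col_sum_rhs + rhs_offset * row_sum_lhs +
         lhs_offset * rhs_offset * K;  // depth == K
}
```

Both functions agree on every entry, which is why the kernel only ever needs to compute the plain product.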
-
-
-Implementation of this technique in gemmlowp
-============================================
-
-In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs
-to prepare them for efficient repeated traversal by the kernel), we compute
-the sum of each row of the lhs block and the sum of each column of the rhs
-block.
-
-See in internal/pack.h, in the PackedSideBlock class, the following member:
-
- // Handle on the additional buffer backing the vector of sums of slices
- // associated with this block. Owned.
- Allocator::Handle sums_of_each_slice_handle_;
-
-sums_of_each_slice_handle_ is the handle to the buffer allocated to store
-the vector containing sums of rows of lhs, or of sums of columns of rhs.
-
-After these rank one updates have been computed at the packing stage, they are
-ignored at the compute kernel stage, since that stage is only concerned
-with the first of the four terms in (2); they are only used at the unpacking
-stage. See the default/reference implementation, UnpackResultImpl, in
-internal/unpack.h.