author | Miao Wang <miaowang@google.com> | 2017-07-06 11:06:31 -0700
---|---|---
committer | Miao Wang <miaowang@google.com> | 2017-07-06 20:20:07 +0000
commit | a9fd919a0080e2c3c7ed1ce451c85a4d86f2f8c1 (patch) | |
tree | c6c4113609be686ad91ff6710922d8416ca3be1c /doc/quantization.md | |
parent | b76ecb153a784e70e07bf7e19bcf3dfd6caec815 (diff) | |
download | gemmlowp-a9fd919a0080e2c3c7ed1ce451c85a4d86f2f8c1.tar.gz | |
Rebase gemmlowp to 36ffd29
Test: mm
Test: build system image for sailfish
Test: BLAS CTS tests pass
Change-Id: I4cc9dbfd586f6653fc2d04e8e7ad78ada5d7dbe9
Diffstat (limited to 'doc/quantization.md')
-rw-r--r-- | doc/quantization.md | 346 |
1 file changed, 346 insertions, 0 deletions
diff --git a/doc/quantization.md b/doc/quantization.md
new file mode 100644
index 0000000..3e0df16
--- /dev/null
+++ b/doc/quantization.md
@@ -0,0 +1,346 @@

# Building a quantization paradigm from first principles

**TLDR:** If you prefer example code over theory, look at
[doc/quantization_example.cc](quantization_example.cc).

## Overview

gemmlowp allows performing calculations on matrices of uint8 values, but these
matrices are only useful insofar as they somehow approximate matrices of real
numbers. By a _quantization paradigm_ we mean a correspondence between matrices
of quantized 8bit values and matrices of real numbers. The choice of a
quantization paradigm affects the calculations that gemmlowp itself needs to
perform; specifically, it affects how one goes from internal 32bit accumulators
to final 8bit outputs.

The part of gemmlowp transforming internal 32bit accumulators to final 8bit
outputs is the "output pipeline" described in [output.md](output.md).

gemmlowp's `GemmWithOutputPipeline` entry point allows specifying an arbitrary
output pipeline, allowing the user to implement their own preferred quantized
arithmetic paradigm.

In the present document, our purpose is to show how, reasoning from first
principles and some domain-specific knowledge of neural networks, we can arrive
naturally at some specific quantization paradigm, and how that can be
implemented using a specific output pipeline.

We also aim to show how that differs from the older, legacy quantization
paradigm implemented by gemmlowp's legacy interfaces, and why the change to the
newer quantization paradigm described in this document was useful as far as
some applications of gemmlowp were concerned.

## Quantization as an affine map

In order for arithmetic on real values to map directly to arithmetic on
quantized uint8 values, the mapping between real and quantized uint8 values
must be affine, which means it must be of the form

```
real_value = A * quantized_value + B (1)
```

for some constants A, B, or equivalently, of the form

```
real_value = C * (quantized_value + D) (2)
```

for some constants C, D. Indeed, anything other than such an affine map would
mean that the results of the quantized calculations no longer readily provide
an approximation to the results of the real-numbers calculation.

## Domain-specific constraint: the real value 0 must be exactly representable

Here a domain-specific constraint from neural networks appears: for some
neural network layers, it is very useful for optimized implementations that
the real value 0 be exactly representable.

For instance, in a Convolutional or Pooling layer with padding, it is useful
to be able to implement the padding by zero-padding the input array, so that
optimized loops do not need to become more complex to avoid overrunning the
array bounds.

In order for such zero-padding to be feasible in a quantized implementation of
such layers, it is important that the real value 0 be exactly representable in
quantized form, i.e. that it correspond exactly to some quantized value, which
we call the _zero-point_.

Indeed, if 0 were not exactly representable, then we would have to use some
quantized value for padding that does not exactly correspond to the real value
0. That would typically introduce inaccuracy in the result. In fact, always
using the same such value would be worse: it would introduce _bias_ in the
result.
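
To make this concrete, here is a minimal sketch of such zero-padding in the
quantized domain. The function name and signature are hypothetical, not a
gemmlowp API: the point is only that padded entries are set to the zero-point,
the quantized value that represents the real value 0 exactly.

```
#include <cstdint>
#include <vector>

// Hypothetical illustration (not a gemmlowp API): zero-pad a quantized 2D
// buffer. Padded entries are filled with the zero-point, so that they
// represent the real value 0 exactly.
std::vector<std::uint8_t> ZeroPad2D(const std::vector<std::uint8_t>& src,
                                    int rows, int cols, int pad,
                                    std::uint8_t zero_point) {
  const int padded_cols = cols + 2 * pad;
  std::vector<std::uint8_t> dst((rows + 2 * pad) * padded_cols, zero_point);
  // Copy the original entries into the interior; the border keeps zero_point.
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      dst[(r + pad) * padded_cols + (c + pad)] = src[r * cols + c];
    }
  }
  return dst;
}
```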

## The final form of the quantization equation

Now let us phrase what this constraint — that the real value 0 be exactly
representable — means in either quantization equation, (1) or (2).

In equation (1), plugging `real_value = 0` and `quantized_value = zero_point`,
we get:

```
0 = A * zero_point + B
```

equivalently:

```
zero_point = -B / A
```

We are thus left with a rather awkward constraint: the real number `-B / A`
must somehow be guaranteed to be exactly integral, so that the special uint8
value `zero_point` can be exactly equal to it. Quite awkward!

Now let us look at equation (2). Plugging `real_value = 0` and
`quantized_value = zero_point`, we get:

```
0 = C * (zero_point + D)
```

Conveniently, the constant `C` plays no role anymore, so this equation
simplifies to:

```
0 = zero_point + D
```

In other words, `D = -zero_point`. This suggests rewriting the quantization
equation (2) into the following form (3), which will be the final form that we
will consistently use:

```
real_value = scale * (quantized_value - zero_point) (3)
```

To go from (2) to (3), we merely renamed `C` to `scale` and `D` to
`-zero_point`.

With this quantization equation (3), the condition that 0 be exactly
representable is vacuously satisfied: `zero_point` is by definition one of the
possible `quantized_value`'s, and equation (3) maps it to a `real_value` of
exactly 0.

Note that the final quantization equation (3) depends on two constants, one
integral, the other an arbitrary positive real number:

*   `zero_point` is integral, more specifically is one of the possible
    quantized values (i.e. typically is a uint8 value).
*   `scale` is a positive real number. Thus at this stage we have not yet
    shown how to eliminate all usage of floating-point arithmetic. That will
    come below.
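
As an illustration, here is a minimal sketch of quantizing and dequantizing
under equation (3), in the spirit of
[doc/quantization_example.cc](quantization_example.cc). The names and the
rounding and clamping choices here are ours:

```
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical helpers (ours, not a gemmlowp API) for equation (3):
//   real_value = scale * (quantized_value - zero_point)
struct QuantizationParams {
  float scale;
  std::uint8_t zero_point;
};

float Dequantize(std::uint8_t quantized_value, const QuantizationParams& q) {
  // Direct application of equation (3).
  return q.scale * (static_cast<int>(quantized_value) - q.zero_point);
}

std::uint8_t Quantize(float real_value, const QuantizationParams& q) {
  // Invert equation (3): round to the nearest quantized value, then clamp
  // to the representable uint8 range.
  const float v = q.zero_point + std::round(real_value / q.scale);
  return static_cast<std::uint8_t>(std::min(255.f, std::max(0.f, v)));
}
```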

## Quantizing a matrix multiplication

Now that we know — equation (3) — how real numbers are to correspond
to quantized values (typically uint8), we turn to applying this knowledge to
rewriting a multiplication of matrices of real numbers, by the equivalent
multiplication of matrices of quantized values.

Say that we have two matrices of real values `lhs_real_matrix`,
`rhs_real_matrix`. Each entry of their product is the sum (accumulation) of
many products of individual matrix entries, say
`lhs_real_value * rhs_real_value`.

Now suppose that we have already quantized these two matrices according to the
above equation (3), with some already-known quantization parameters
`lhs_scale`, `rhs_scale`, `lhs_zero_point`, `rhs_zero_point`, so that their
matrix entries are quantized as

```
lhs_real_value[i] = lhs_scale * (lhs_quantized_value[i] - lhs_zero_point)
rhs_real_value[i] = rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
```

We then rewrite the matrix product accumulator accordingly:

```
result_real_value
  = Sum_over_i(lhs_real_value[i] * rhs_real_value[i])
  = Sum_over_i(
        lhs_scale * (lhs_quantized_value[i] - lhs_zero_point) *
        rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
    )
  = lhs_scale * rhs_scale * Sum_over_i(
        (lhs_quantized_value[i] - lhs_zero_point) *
        (rhs_quantized_value[i] - rhs_zero_point)
    )                                                   (4)
```

Now our goal is to represent this result itself as a quantized matrix, i.e.
still according to equation (3), for some pre-established quantization
parameters `result_scale` and `result_zero_point`, as

```
result_real_value = result_scale *
    (result_quantized_value - result_zero_point)
```

Here we need to keep in mind that our goal is to specify what the quantized
matrix multiplication should do, i.e. how to compute
`result_quantized_value`. The last equation above is equivalent to

```
result_quantized_value = result_zero_point +
    result_real_value / result_scale
```

Now we can use equation (4) above to plug into this the expression of
`result_real_value` in terms of the quantized operands, and we obtain:

```
result_quantized_value = result_zero_point +
    (lhs_scale * rhs_scale / result_scale) *
        Sum_over_i(
            (lhs_quantized_value[i] - lhs_zero_point) *
            (rhs_quantized_value[i] - rhs_zero_point)
        )                                               (5)
```

Equation (5) is the conclusion of this general discussion of how to specify
what "quantized matrix multiplication" should actually compute, in order to be
able to replace real matrix multiplications.

## Implementation of quantized matrix multiplication

Having obtained the mathematical form (5) of quantized matrix multiplication,
we now turn to its actual implementation.

The inner-most part of (5),

```
int32_accumulator =
    Sum_over_i(
        (lhs_quantized_value[i] - lhs_zero_point) *
        (rhs_quantized_value[i] - rhs_zero_point)
    )
```

is the "kernel" accumulation loop. It is where the bulk of the computational
cost goes. Luckily, it only involves integers: the quantized operand matrix
entries, and their `zero_point` quantization parameters. Typically, all of
these values are uint8. Typically, the above differences of uint8 values would
be represented as signed int16; their products as signed int32.

It is out of scope of the present doc to discuss how to avoid the overhead of
having to subtract these `zero_point` constants in this inner loop; refer to
[this section of
low-precision.md](low-precision.md#efficient-handling-of-offsets) for that.
The gist of it is that a mathematical trick allows us to take the handling of
these `zero_point` constants out of this accumulation loop, so that it
simplifies to

```
int32_accumulator =
    Sum_over_i(
        lhs_quantized_value[i] *
        rhs_quantized_value[i]
    )                                                   (6)
```

Anyway, the result is an `int32_accumulator` that we now plug back into the
rest of (5):

```
result_quantized_value = result_zero_point +
    (lhs_scale * rhs_scale / result_scale) * int32_accumulator (7)
```

The difficulty here is of course that `(lhs_scale * rhs_scale /
result_scale)` is a positive real number, not an integer in general. It is a
constant, though. So what we have to implement here is the (approximate)
scaling of an int32 value by some arbitrary positive constant multiplier.

Moreover, it is safe to assume that this positive constant multiplier is
smaller than one — each of the `scale` values here is typically smaller than
one, as we are typically mapping the `[0..255]` quantized uint8 value range to
an interval of real values that is much narrower than that, typically within
`[-10,10]` in most neural networks. For example, a neural network using Relu6
activation functions will typically have real activation values in the
interval `[0,6]`.
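
To summarize the computation so far in code form, here is a naive reference
sketch of equations (5) and (7) for a single result entry, deliberately using
a floating-point multiplier. All names are hypothetical; the integer-only
treatment of the multiplier is the subject of the rest of this section:

```
#include <algorithm>
#include <cmath>
#include <cstdint>

// Reference (non-optimized) evaluation of equation (5) for one result entry;
// the accumulation runs along the depth dimension shared by lhs and rhs.
std::uint8_t QuantizedDotProductReference(
    const std::uint8_t* lhs_quantized, const std::uint8_t* rhs_quantized,
    int depth, int lhs_zero_point, int rhs_zero_point,
    float lhs_scale, float rhs_scale,
    int result_zero_point, float result_scale) {
  std::int32_t int32_accumulator = 0;
  for (int i = 0; i < depth; i++) {
    // The differences fit in int16, their products in int32, as noted above.
    int32_accumulator += (lhs_quantized[i] - lhs_zero_point) *
                         (rhs_quantized[i] - rhs_zero_point);
  }
  // The real multiplier of equation (7); the text below shows how to avoid
  // this floating-point arithmetic.
  const float real_multiplier = lhs_scale * rhs_scale / result_scale;
  const float result =
      result_zero_point + real_multiplier * int32_accumulator;
  // Round and clamp to the uint8 range.
  return static_cast<std::uint8_t>(
      std::min(255.f, std::max(0.f, std::round(result))));
}
```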

So how do we implement the multiplication of an int32 value by a positive real
constant that is smaller than one? Typically, by multiplying by a fixed-point
constant multiplier in the normalized interval `[1/2,1)`, and right-shifting
the result to achieve the correct multiplier.
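
Concretely, here is a sketch of this decomposition and of the resulting
integer-only scaling, in the spirit of
[doc/quantization_example.cc](quantization_example.cc). It uses the
fixed-point helpers `SaturatingRoundingDoublingHighMul` and
`RoundingDivideByPOT` from gemmlowp's `fixedpoint/fixedpoint.h`; the two
wrapper functions themselves are our own illustration:

```
#include <cassert>
#include <cmath>
#include <cstdint>

#include "fixedpoint/fixedpoint.h"  // gemmlowp's fixed-point helpers

// Illustrative helper: decompose a real multiplier in (0, 1) as
//   real_multiplier ~= quantized_multiplier * 2^(-right_shift)
// where quantized_multiplier is an int32 in [2^30, 2^31), i.e. a fixed-point
// number in [1/2, 1) with 31 fractional bits.
void QuantizeMultiplierSmallerThanOne(float real_multiplier,
                                      std::int32_t* quantized_multiplier,
                                      int* right_shift) {
  assert(real_multiplier > 0.f && real_multiplier < 1.f);
  int shift = 0;
  // Bring the multiplier into [1/2, 1) by doubling it, counting the doublings.
  while (real_multiplier < 0.5f) {
    real_multiplier *= 2.f;
    shift++;
  }
  std::int64_t q =
      static_cast<std::int64_t>(std::round(real_multiplier * (1ll << 31)));
  // Handle the edge case where rounding lands exactly on 2^31.
  if (q == (1ll << 31)) {
    q /= 2;
    shift--;
  }
  assert(shift >= 0);
  *quantized_multiplier = static_cast<std::int32_t>(q);
  *right_shift = shift;
}

// Scaling of an int32 accumulator by the decomposed multiplier: a fixed-point
// multiplication keeping the high 32 bits of the 64-bit product (with
// rounding and saturation), followed by a rounding right shift.
std::int32_t MultiplyByQuantizedMultiplier(std::int32_t x,
                                           std::int32_t quantized_multiplier,
                                           int right_shift) {
  return gemmlowp::RoundingDivideByPOT(
      gemmlowp::SaturatingRoundingDoublingHighMul(x, quantized_multiplier),
      right_shift);
}
```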

At this point we have obtained the int32 value of the product

```
(lhs_scale * rhs_scale / result_scale) * int32_accumulator
```

Looking at (7), it only remains to add to it the integral value
`result_zero_point`, and we are done.

## How this is implemented in gemmlowp

The different parts of gemmlowp implementing aspects of the above discussion
are:

*   The packing stage (see [packing.md](packing.md)) implements the special
    mathematical trick to handle `lhs_offset`, `rhs_offset` that we alluded to
    above, see [this section of
    low-precision.md](low-precision.md#efficient-handling-of-offsets) for
    details. Thanks to it, the rest of the calculation can proceed as if
    `lhs_offset`, `rhs_offset` were 0.

*   The compute/kernel stage (see [kernel.md](kernel.md)) performs the core
    accumulation loop producing the `int32_accumulator`, see equation (6)
    above.

*   The unpacking stage feeds into the output pipeline (see
    [output.md](output.md)), which implements the rest of the evaluation of
    the above equation (5), that we discussed in the previous section.

Now, the point of gemmlowp's flexible output-pipelines mechanism (see
[output.md](output.md)) is to support different quantization paradigms, so we
now have to specify which particular flavor of output pipeline corresponds to
the particular quantization paradigm that we detailed above in this document.

The specific output pipeline stage implementing the present quantization
paradigm, i.e. implementing the precise computation detailed in the previous
section (equation (5)), is
`OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`.

Please refer to the comment explaining it in
[public/output_stages.h](../public/output_stages.h).

## How this differs from the older legacy gemmlowp quantization paradigm

The difference between the older legacy quantization paradigm described in
[low-precision.md](low-precision.md) and the newer one described in this
document boils down to the difference between the legacy output stage
implementing it, `OutputStageQuantizeDownInt32ToUint8Scale`, and the new
output stage implementing the new paradigm,
`OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`.

Please refer to the comments in
[public/output_stages.h](../public/output_stages.h) for details about these
two output stages and how they differ.

Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale`
are:

1.  The int32 accumulators (inputs to the output stage) undergo a plain int32
    multiplication with an int32 multiplier, which may overflow. By contrast,
    in the newer `OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`, this
    integer multiplication becomes a fixed-point multiplication and cannot
    overflow.

    *   In practice, to limit the risk of overflow, this pushes users to
        choose smaller values for this integer multiplier, which means limited
        multiplicative accuracy, which may cause multiplicative bias depending
        on how it is used.

2.  Note how the order of multiplying by the multiplier and adding the
    `result_offset` is swapped. This reflects a quantization equation of the
    form (1) above, as opposed to the form (2)/(3) that the new quantization
    paradigm uses. As a result, it is essentially impossible to guarantee that
    0 is an exactly-representable value, which as discussed above is an issue
    at least in some convolutional neural network applications.

## Example code illustrating the new quantization paradigm

Example code showing how to perform a quantized matrix multiplication in the
quantization paradigm discussed here is in
[doc/quantization_example.cc](quantization_example.cc).