The packing stage in gemmlowp
*****************************


Introduction
============

We assume familiarity with doc/design.txt and with the overall
three stages of computation described there: packing, kernel, unpacking.

This page goes into more detail about the first stage: packing.

We also assume familiarity with doc/kernel.txt, as it describes the
packed format requirements that the kernels expect; that is essentially
the contract that the packing stage must honor.

Some parts below also assume familiarity with doc/low-precision.txt,
as the packing stage also has to compute the vectors of sums of entries
described there.


The storage order of packed blocks, partly hidden behind sequential access
===========================================================================

As explained in doc/design.txt, the primary purpose of packing is to
ensure that when the kernel traverses Lhs/Rhs matrix data, it can do
so efficiently, thanks to having the data stored in an order that is
as similar as possible to the order in which the compute stage has to
traverse this data.

This traversal order is nontrivial for the reasons outlined in
doc/design.txt: at the innermost level, one tries to work within registers
as much as possible; at the next level, one tries to stay within L1 cache
as much as possible. The packed blocks that we handle are supposed
to fit entirely in L2 cache.

Thus it has become standard in GEMM design to describe complicated
"Z-order" or "fractal order" storage for packed blocks.

However, we should keep in mind that the whole point of the packed storage
order is to be as similar as possible to the order of traversal during
the compute stage. The storage order doesn't matter in itself; the only
thing that matters is simple access patterns during the compute stage.

This suggests the following approach to implementing packing: take the
exact same hierarchy of nested loops as the compute stage, drop the loops
that are not relevant to the side (Lhs or Rhs) being packed, and try to use
mostly sequential access to the destination packed data.

This hierarchy of nested loops can be seen in PackSideBlockImpl
(PackL2, PackL1, PackRun); compare it to the similar hierarchy of loops
in internal/compute.h.
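
To make that structure concrete, here is a purely illustrative sketch of
the loop nesting; it is not the actual gemmlowp code, and the parameters
below stand in for the BlockParams and KernelSideFormat values that the
real PackSideBlockImpl obtains from its template arguments:

  // Hypothetical sketch of the packing loop hierarchy (PackL2/PackL1/PackRun).
  void PackL2Sketch(int l2_width, int l2_depth, int l1_width, int l1_depth,
                    int kernel_width) {
    // PackL2: split the whole L2-sized block into L1-sized sub-blocks.
    for (int d = 0; d < l2_depth; d += l1_depth) {
      for (int w = 0; w < l2_width; w += l1_width) {
        // PackL1: split the L1-sized sub-block into kernel-width runs.
        for (int rw = w; rw < w + l1_width; rw += kernel_width) {
          // seek_run(rw, d) would be called here, and PackRun would then
          // fill one (kernel_width x l1_depth) run with sequential writes.
        }
      }
    }
  }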

In this way, the more intricate small-scale details of the packed data
format never need to be made explicit (which would be complicated). We
still use some "seeking", but only at larger scales, where the storage
order is less complicated to describe.


Sequential access to PackedSideBlock data
-----------------------------------------

See PackedSideBlock in internal/pack.h, specifically the following data
members:

  // Handle on the buffer backing this packed block. Owned.
  Allocator::Handle data_handle_;

and:

  // pos_ is the current position in the buffer, which we access
  // sequentially, like a file.
  // The idea is that we pack data in the same order as it is
  // going to be traversed during the computation, which for
  // cache-friendliness reasons is complicated to random-access,
  // as the offsets calculations would be intricate. So we
  // give up random-access addressing, and instead content ourselves
  // with sequential access.
  //
  // pos_ is mutable because during the computation we will want to
  // be able to iterate on the data in a const PackedSideBlock.
  mutable int pos_;

The methods exposing sequential access are:

  std::uint8_t* current_data() {
    return allocator_->GetPointer<std::uint8_t>(data_handle_) + pos_;
  }

and:

  void seek_next_cell() const { pos_ += KernelSideFormat::Cell::kSize; }

  void seek_forward_n_cells(int n) const {
    pos_ += n * KernelSideFormat::Cell::kSize;
  }
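
Packing code is thus expected to write one cell's worth of data at the
current position and then advance. A purely illustrative usage fragment
(the names packed and cell_buf are made up for this example; packed is a
PackedSideBlock and cell_buf holds one already-formatted cell):

  // Write one cell (KernelSideFormat::Cell::kSize bytes) and advance pos_.
  std::memcpy(packed.current_data(), cell_buf, KernelSideFormat::Cell::kSize);
  packed.seek_next_cell();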


Random access to PackedSideBlock data at larger scales
------------------------------------------------------

We still need some random access at larger scales (with coarse granularity),
which is unavoidable since GEMM is O(n^3) and has to traverse each of the
O(n^2) inputs O(n) times.

The watershed between sequential access and random access is at the level
of a 'Run'. Throughout gemmlowp we consistently use the term 'Run' to refer
to the innermost GEMM loop in the depth dimension. That is the critical
inner loop that must be as fast as possible, and for which we therefore
want sequential access to packed data, so that the storage order is optimal
by construction. At larger scales, i.e. between runs, we accept that the
storage order is less optimal; since it is also less intricate, it is not
too hard to implement random access there.

This is done by the seek_run method:

  void seek_run(int start_width, int start_depth) const {
    int kernel_run_depth =
        std::min<int>(params_.l1_depth, params_.l2_depth - start_depth);
    pos_ = params_.l2_width * start_depth + start_width * kernel_run_depth;
  }

We see that the formula involves the l1_depth parameter, which is how the
packed storage order depends on L1 cache size. Again, the whole packed
block is supposed to fit in L2 cache.
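
As a purely illustrative worked example (these numbers are made up, not
taken from any real configuration): with params_.l2_width = 12,
params_.l1_depth = 16 and params_.l2_depth = 64, the call seek_run(4, 32)
computes kernel_run_depth = min(16, 64 - 32) = 16 and then
pos_ = 12 * 32 + 4 * 16 = 448.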


The innermost loop of the packing stage, PackRun, and PackingRegisterBlock
==========================================================================

Keeping with our consistent usage of the term 'Run' throughout gemmlowp,
the innermost loop is called PackRun().

Here we recall a very important principle that was explained in
doc/kernels.txt: the kernel is free to dictate the precise data format
that it expects; the packing code has to honor it. So there's an asymmetry
here: the kernel is the master, the packing is the slave. That's why the
packing code is templatized on the KernelSideFormat. At larger scales,
the packing is independent of kernel format details, but PackRun is
where we take care of the small-scale details that do depend on the kernel
format. That's why it's a good thing that we only need sequential access
here, as it would be very complicated to spell out random access at this
scale.
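
For a sense of what such a format parameter looks like, here is a rough,
illustrative sketch of the shape of these types (the real definitions live
in internal/kernel.h and may differ in details; the concrete numbers below
are made up for illustration):

  // Sketch only: a cell of Width x Depth entries, and a kernel side format
  // made of several such cells stacked along the width dimension.
  template <int tWidth, int tDepth>
  struct CellFormatSketch {
    static const int kWidth = tWidth;
    static const int kDepth = tDepth;
    static const int kSize = kWidth * kDepth;
  };

  template <typename tCell, int tCells>
  struct KernelSideFormatSketch {
    typedef tCell Cell;
    static const int kCells = tCells;
    static const int kWidth = kCells * Cell::kWidth;  // e.g. 3 cells of width 4 -> 12
    static const int kDepth = Cell::kDepth;
  };

  // Example: KernelSideFormatSketch<CellFormatSketch<4, 2>, 3> would describe
  // a kernel side of width 12 handling 2 levels of depth at a time.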

Anyway, PackRun.

Since it is the critical inner loop, it is what we want to allow specializing
for particular CPU architectures. To allow that, we handle blocks of fixed
dimensions one at a time, which is intended to be friendly enough to
optimization. These blocks are PackingRegisterBlocks, and their dimensions
are:
  width = KernelWidth
  depth = kRegisterSize
See doc/kernels.txt and internal/kernel.h for the former, and internal/common.h
for the latter.

See the comments around PackingRegisterBlock in internal/pack.h:

// A PackingRegisterBlock is a small fixed-size block of a matrix being
// packed. This class is the generic non-optimized implementation;
// it is inherited by the generic implementation of PackingRegisterBlock,
// which may be overridden by template specialization. Overriding it is how
// one may provide optimized packing code paths.
//
// The packing of a block proceeds in two steps:
// 1. Ensuring that we have a complete block of source data, i.e. a block of
//    the compile-time prescribed size. This is where we handle unaligned
//    boundaries: if we don't have a complete block of source data, then
//    we copy and zero-extend it into a local temporary (complete_src_),
//    see MakeCompleteSrc. In the generic case, we do have a complete block,
//    so we just use it in-place, see UseCompleteSrcInPlace.
// 2. Packing a complete block into the destination, see Pack. This is the
//    most critical part, so it's convenient that unaligned boundaries have
//    already been handled in step 1.
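
Step 1 is just a copy with zero-padding. As a purely illustrative sketch of
that idea (this is not the actual MakeCompleteSrc; the function name, the
signature and the assumption of a contiguous depth-major source are made up
for this example):

  #include <cstdint>
  #include <cstring>

  // Copy a possibly incomplete (src_width x src_depth) corner of the source
  // into a fixed-size (block_width x block_depth) buffer, zero-extending
  // whatever is missing so that step 2 can assume a complete block.
  void MakeCompleteSrcSketch(const std::uint8_t* src, int src_stride,
                             int src_width, int src_depth,
                             int block_width, int block_depth,
                             std::uint8_t* complete_src) {
    std::memset(complete_src, 0, block_width * block_depth);
    for (int w = 0; w < src_width; w++) {
      std::memcpy(complete_src + w * block_depth, src + w * src_stride,
                  src_depth);
    }
  }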


Other things that the packing stage has to do
=============================================

Besides storing matrix entries in a suitable order, the packing stage also
has two other things to do.

First, packing has to compute the vectors of sums of entries along the depth
dimension. If this sounds mysterious, read doc/low-precision.txt. These will
only be used at the unpacking stage.
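
A purely illustrative sketch of what computing such sums means (the function
and the flat layout, with consecutive depth entries contiguous, are made up
for this example; the real code accumulates the sums as it packs cells):

  #include <cstdint>

  // For a (width x depth) block stored as entries[w * depth + d], compute
  // one sum per width index, summing along the depth dimension.
  void ComputeSumsSketch(const std::uint8_t* entries, int width, int depth,
                         std::int32_t* sums) {
    for (int w = 0; w < width; w++) {
      std::int32_t sum = 0;
      for (int d = 0; d < depth; d++) {
        sum += entries[w * depth + d];
      }
      sums[w] = sum;
    }
  }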

Second, if the BitDepthSetting requires less than 8 bits of precision, then at
the packing stage we have to requantize inputs accordingly.
See doc/less-than-8-bit.txt for details. This is the Requantize() function.
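
The gist of such a requantization, as a purely illustrative sketch (this is
not the actual Requantize() implementation and the rounding details may
differ; see doc/less-than-8-bit.txt):

  #include <cstdint>

  // Rescale an 8-bit value to an N-bit range, rounding to nearest.
  template <int N>
  std::uint8_t RequantizeSketch(std::uint8_t value) {
    const std::int32_t max_val = (1 << N) - 1;
    return static_cast<std::uint8_t>((value * max_val + 127) / 255);
  }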


Specialized packing paths for specific formats on specific CPU architectures
=============================================================================

Please refer to internal/pack_neon.h for examples of doing that. The piece
of code to be specialized is PackingRegisterBlock. However, inside of it,
only the Pack() method typically needs to be specialized; the rest is
unlikely to be performance-critical. So one typically specializes
PackingRegisterBlock while still inheriting from PackingRegisterBlockBase
to keep the generic parts, and overrides only the Pack() method.

Template specialization for the right template parameters is how one
specifies in which cases a given path is to be used in place of the generic
packing code.
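
The resulting pattern looks roughly like the following skeleton (illustrative
only: MySrcMapType and MyOptimizedFormat are made-up stand-ins, and the exact
template parameters of the real classes are defined in internal/pack.h):

  // Skeleton of a specialized packing path; only Pack() is overridden,
  // everything else is inherited from the generic base.
  template <>
  class PackingRegisterBlock<MySrcMapType, PackedSideBlock<MyOptimizedFormat>>
      : public PackingRegisterBlockBase<MySrcMapType,
                                        PackedSideBlock<MyOptimizedFormat>> {
   public:
    void Pack(PackedSideBlock<MyOptimizedFormat>* dst, int start_width) {
      // architecture-specific (e.g. NEON) packing of one register block
    }
  };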

It is entirely possible to set the value of kRegisterSize differently based on
the CPU architecture (for example, 32 on x86 with AVX) as long as all the
specialized packing paths used on that CPU architecture are consistent with it.