The packing stage in gemmlowp
*****************************


Introduction
============

We assume familiarity with doc/design.txt and with the overall
three stages of computation described there: packing, kernel, unpacking.

This page goes into more detail about the first stage: packing.

We also assume familiarity with doc/kernel.txt, as it describes the
packed format requirements that the kernels expect; that forms,
essentially, the contract that the packing stage must honor.

Some parts below also assume familiarity with doc/low-precision.txt,
as the packing stage also has to compute the vectors of sums of columns
described there.


The storage order of packed blocks, partly hidden behind sequential access
==========================================================================

As explained in doc/design.txt, the primary purpose of packing is to
ensure that when the kernel traverses Lhs/Rhs matrix data, it can do
so efficiently, thanks to having the data stored in an order that is
as similar as possible to the order in which the compute stage has to
traverse this data.

This traversal order is nontrivial for the reasons outlined in
doc/design.txt: at the innermost level, one tries to work within registers
as much as possible; at the next level, one tries to stay within L1 cache
as much as possible. The packed blocks that we handle are supposed
to fit entirely in L2 cache.

Thus it has become standard in GEMM design to describe complicated
"Z-order" or "fractal order" storage for packed blocks.

However, we should keep in mind that the whole point of the packed storage
order is to be as similar as possible to the order of traversal during
the compute stage.
The storage order doesn't matter in itself; the only
thing that matters is simple access patterns during the compute stage.

This suggests the following approach to implementing packing: take the
exact same hierarchy of nested loops as the compute stage, drop the loops
that are not relevant to the side (Lhs or Rhs) being packed, and try to use
mostly sequential access to the destination packed data.

This hierarchy of nested loops can be seen in PackSideBlockImpl
(PackL2, PackL1, PackRun); compare it to the similar hierarchy of loops
in internal/compute.h.

In this way, the more intricate small-scale details of the packed data
format never need to be made explicit (which would be complicated). We
still use some "seeking", but only at larger scales, where the storage
order is less complicated to describe.


Sequential access to PackedSideBlock data
-----------------------------------------

See PackedSideBlock in internal/pack.h, specifically the following data
members:

  // Handle on the buffer backing this packed block. Owned.
  Allocator::Handle data_handle_;

and:

  // pos_ is the current position in the buffer, which we access
  // sequentially, like a file.
  // The idea is that we pack data in the same order as it is
  // going to be traversed during the computation, which for
  // cache-friendliness reasons is complicated to random-access,
  // as the offset calculations would be intricate. So we
  // give up random-access addressing, and instead content ourselves
  // with sequential access.
  //
  // pos_ is mutable because during the computation we will want to
  // be able to iterate on the data in a const PackedSideBlock.
  mutable int pos_;

The methods exposing sequential access are:

  std::uint8_t* current_data() {
    return allocator_->GetPointer<std::uint8_t>(data_handle_) + pos_;
  }

and:

  void seek_next_cell() const { pos_ += KernelSideFormat::Cell::kSize; }

  void seek_forward_n_cells(int n) const {
    pos_ += n * KernelSideFormat::Cell::kSize;
  }


Random access to PackedSideBlock data at larger scales
------------------------------------------------------

We still need some random access at larger scales (at coarse granularity),
which is unavoidable since GEMM is O(n^3) and has to traverse each of the
O(n^2) inputs O(n) times.

The watershed between sequential access and random access is at the level
of a 'Run'. Throughout gemmlowp we consistently use the term 'Run' to refer
to the innermost GEMM loop in the depth dimension. That is the critical
inner loop that must be as fast as possible, and for which we absolutely
want sequential access to packed data, so that the storage order is optimal
by construction. At larger scales, i.e. between runs, we accept that the
storage order is less optimal; since it is also less intricate, it is not
too hard to implement random access there.

This is done by the seek_run method:

  void seek_run(int start_width, int start_depth) const {
    int kernel_run_depth =
        std::min<int>(params_.l1_depth, params_.l2_depth - start_depth);
    pos_ = params_.l2_width * start_depth + start_width * kernel_run_depth;
  }

We see that the formula involves the l1_depth parameter; this is how the
packed storage order depends on the L1 cache size. Again, the whole packed
block is supposed to fit in L2 cache.


The innermost loop of the packing stage, PackRun, and PackingRegisterBlock
==========================================================================

Keeping with our consistent usage of the term 'Run' throughout gemmlowp,
the innermost loop is called PackRun().
Here we recall a very important principle that was explained in
doc/kernel.txt: the kernel is free to dictate the precise data format
that it expects; the packing code has to honor it. So there is an asymmetry
here: the kernel is the master, the packing is the slave. That is why the
packing code is templatized on the KernelSideFormat. At larger scales,
the packing is independent of kernel format details, but PackRun is where
we take care of the small-scale details that do depend on the kernel
format. That is why it is a good thing that we only need sequential
access here, as it would be very complicated to spell out random access
at this scale.

Anyway, PackRun.

Since it is the critical inner loop, it is what we want to allow
specializing for particular CPU architectures. To allow that, we handle
blocks of fixed dimensions at a time, which are intended to be friendly
enough to optimization. These blocks are PackingRegisterBlock's, and their
dimensions are:
  width = KernelWidth
  depth = kRegisterSize
See doc/kernel.txt and internal/kernel.h for the former, and
internal/common.h for the latter.

See the comments around PackingRegisterBlock in internal/pack.h:

  // A PackingRegisterBlock is a small fixed-size block of a matrix being
  // packed. This class is the generic non-optimized implementation;
  // it is inherited by the generic implementation of PackingRegisterBlock,
  // which may be overridden by template specialization. Overriding it is
  // how one may provide optimized packing code paths.
  //
  // The packing of a block proceeds in two steps:
  // 1. Ensuring that we have a complete block of source data, i.e. a block
  //    of the compile-time prescribed size. This is where we handle
  //    unaligned boundaries: if we don't have a complete block of source
  //    data, then we copy and zero-extend it into a local temporary
  //    (complete_src_), see MakeCompleteSrc.
  //    In the generic case, we do have a complete block,
  //    so we just use it in-place, see UseCompleteSrcInPlace.
  // 2. Packing a complete block into the destination, see Pack. This is
  //    the most critical part, so it's convenient that unaligned
  //    boundaries have already been handled in step 1.


Other things that the packing stage has to do
=============================================

Besides storing matrix entries in a suitable order, the packing stage also
has two other things to do.

First, packing has to compute the vectors of sums of entries along the
depth dimension. If this seems mysterious, read doc/low-precision.txt.
These sums are only used at the unpacking stage.

Second, if the BitDepthSetting requires fewer than 8 bits of precision,
then at the packing stage we have to requantize inputs accordingly.
See doc/less-than-8-bit.txt for details. This is the Requantize() function.


Specialized packing paths for specific formats on specific CPU architectures
============================================================================

Please refer to internal/pack_neon.h for examples of doing that. The piece
of code to be specialized is PackingRegisterBlock. However, inside of it,
only the Pack() method typically needs to be specialized (the rest is
unlikely to be critical). So one typically specializes PackingRegisterBlock,
still inheriting from PackingRegisterBlockBase to keep the generic parts,
and then overrides the Pack() method.

Template specialization for the right template parameters is how one
specifies in which cases a given path is to be used in place of the generic
packing code.

It is entirely possible to set the value of kRegisterSize differently based
on the CPU architecture (for example, 32 on x86 with AVX) as long as all
the specialized packing paths used on that CPU architecture are consistent
with it.