diff options
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 276 |
1 files changed, 276 insertions, 0 deletions
diff --git a/README.md b/README.md new file mode 100644 index 0000000..22fabac --- /dev/null +++ b/README.md @@ -0,0 +1,276 @@ +# gemmlowp: a small self-contained low-precision GEMM library + +[![Build Status](https://secure.travis-ci.org/google/gemmlowp.png)](http://travis-ci.org/google/gemmlowp) + +This is not a full linear algebra library, only a GEMM library: it only does +general matrix multiplication ("GEMM"). + +The meaning of "low precision" is detailed in this document: +[doc/low-precision.md](doc/low-precision.md) + +Some of the general design is explained in [doc/design.md](doc/design.md). + +**Warning:** This library goes very slow if compiled incorrectly; see below. + +## Disclaimer + +This is not an official Google product (experimental or otherwise), it is just +code that happens to be owned by Google. + +## Mailing list + +gemmlowp-related discussion, about either development or usage, is welcome on +this Google Group (mailing list / forum): + +https://groups.google.com/forum/#!forum/gemmlowp + +## Portability, target platforms/architectures + +Should be portable to any platform with some C++11 and POSIX support, while we +have optional optimized code paths for specific architectures. + +Required: + +* C++11 (a small conservative subset of it) + +Required for some features: + +* Some POSIX interfaces: + * pthreads (for multi-threaded operation and for profiling). + * sysconf (for multi-threaded operation to detect number of cores; may be + bypassed). + +Optional: + +* Architecture-specific code paths use intrinsics or inline assembly. See + "Architecture-specific optimized code paths" below. + +## Architecture-specific optimized code paths + +We have some optimized code paths for specific instruction sets. Some are +written in inline assembly, some are written in C++ using intrinsics. Both GCC +and Clang are supported. + +Current optimized code paths: + +* ARM with NEON (both 32bit and 64bit). +* Intel x86 with SSE 4.1 (both 32bit and 64bit). + +When building for x86, it's very important to pass `-msse4.1` to the compiler, +otherwise gemmlowp will use slow reference code. Bazel users can compile by +running `bazel build --copt=-msse4.1 //gemmlowp:all`. The compiled binary should +work on all Intel CPUs since 2008 (including low power microarchitectures) as +well as AMD CPUs since 2011. + +Please note when compiling binaries that don't need to be distributed, it's +generally a better idea to pass `-march=native` to the compiler. That flag +implies `-msse4.1` flag, along with others that might be helpful. This of course +assumes the host machine supports those instructions. Bazel users should prefer +to run `bazel build --config=opt //gemmlowp:all` instead. + +Details of what it takes to make an efficient port of gemmlowp, namely writing a +suitable GEMM kernel and accompanying packing code, are explained in this file: +[doc/kernel.md](doc/kernel.md). + +## Public interfaces + +### The gemmlowp public interface + +gemmlowp's main public interface is in the `public/` subdirectory. + +This is a headers-only library, so there is nothing to link to. + +Usage documentation, and comments on the deprecation status of each public entry +point, may be found in [doc/public.md](doc/public.md) . + +A full, self-contained usage example, showing how to quantize float matrices and +perform a quantized matrix multiplication approximating a float matrix +multiplication, is given in +[doc/quantization_example.cc](doc/quantization_example.cc). + +### Old EightBitIntGemm legacy deprecated interface + +The `eight_bit_int_gemm/` subdirectory contains an alternate interface that +should be considered purely legacy, deprecated, and going to be removed at some +point in the future. + +## Building + +### Building by manually invoking your compiler + +Because gemmlowp is so simple, working with it involves only single-command-line +compiler invocations. Therefore we expect that most people working with gemmlowp +will either manually invoke their compiler, or write their own rules for their +own preferred build system. + +Keep in mind (previous section) that gemmlowp itself is a pure-headers-only +library so there is nothing to build. + +For a Android gemmlowp development workflow, the `scripts/` directory contains a +script to build and run a program on an Android device: + +``` +scripts/test-android.sh +``` + +### Building using Bazel + +That being said, we also maintain a Bazel BUILD system as part of gemmlowp. Its +usage is not mandatory at all and is only one possible way that gemmlowp +libraries and tests may be built. If you are interested, Bazel's home page is +http://bazel.build/ And you can get started with using Bazel to build gemmlowp +targets by first creating an empty WORKSPACE file in a parent directory, for +instance: + +``` +$ cd gemmlowp/.. # change to parent directory containing gemmlowp/ +$ touch WORKSPACE # declare that to be our workspace root +$ bazel build gemmlowp:all +``` + +## Testing + +### Testing by manually building and running tests + +The test/ directory contains unit tests. The primary unit test is + +``` +test/test.cc +``` + +Since it covers also the EightBitIntGemm interface, it needs to be linked +against + +``` +eight_bit_int_gemm/eight_bit_int_gemm.cc +``` + +It also uses realistic data captured from a neural network run in + +``` +test/test_data.cc +``` + +Thus you'll want to pass the following list of source files to your +compiler/linker: + +``` +test/test.cc +eight_bit_int_gemm/eight_bit_int_gemm.cc +test/test_data.cc +``` + +The `scripts/` directory contains a script to build and run a program on an +Android device: + +``` +scripts/test-android.sh +``` + +It expects the `CXX` environment variable to point to an Android toolchain's C++ +compiler, and expects source files (and optionally, cflags) as command-line +parameters. To build and run the above-mentioned main unit test, first set `CXX` +e.g.: + +``` +$ export CXX=/some/toolchains/arm-linux-androideabi-4.8/bin/arm-linux-androideabi-g++ +``` + +Then run: + +``` +$ ./scripts/test-android.sh \ +test/test.cc \ +eight_bit_int_gemm/eight_bit_int_gemm.cc \ +test/test_data.cc +``` + +### Testing using Bazel + +Alternatively, you can use Bazel to build and run tests. See the Bazel +instruction in the above section on building. Once your Bazel workspace is set +up, you can for instance do: + +``` +$ bazel test gemmlowp:all +``` + +## Troubleshooting Compilation + +If you're having trouble finding the compiler, follow these instructions to +build a standalone toolchain: +https://developer.android.com/ndk/guides/standalone_toolchain.html + +Here's an example of setting up Clang 3.5: + +``` +$ export INSTALL_DIR=~/toolchains/clang-21-stl-gnu +$ $NDK/build/tools/make-standalone-toolchain.sh \ +--toolchain=arm-linux-androideabi-clang3.5 --platform=android-21 \ +--install-dir=$INSTALL_DIR +$ export CXX="$INSTALL_DIR/bin/arm-linux-androideabi-g++ \ +--sysroot=$INSTALL_DIR/sysroot" +``` + +Some compilers (e.g. the default clang++ in the same bin directory) don't +support NEON assembly. The benchmark build process will issue a warning if +support isn't detected, and you should make sure you're using a compiler like +arm-linux-androideabi-g++ that does include NEON. + +## Benchmarking + +The main benchmark is + +``` +test/benchmark.cc +``` + +It doesn't need to be linked to any other source file. We recommend building +with assertions disabled (`-DNDEBUG`). + +For example, the benchmark can be built and run on an Android device by doing: + +``` +$ ./scripts/test-android.sh test/benchmark.cc -DNDEBUG +``` + +If `GEMMLOWP_TEST_PROFILE` is defined then the benchmark will be built with +profiling instrumentation (which makes it slower) and will dump profiles. See +next section on profiling. + +## Profiling + +The `profiling/` subdirectory offers a very simple, naive, inaccurate, +non-interrupting sampling profiler that only requires pthreads (no signals). + +It relies on source code being instrumented with pseudo-stack labels. See +`profiling/instrumentation.h`. A full example of using this profiler is given in +the top comment of `profiling/profiler.h`. + +## Contributing + +Contribution-related discussion is always welcome on the gemmlowp mailing list +(see above). + +We try to keep a current list of TODO items in the `todo/` directory. +Prospective contributors are welcome to pick one to work on, and communicate +about it on the gemmlowp mailing list. + +Details of the contributing process, including legalese, are in CONTRIBUTING. + +## Performance goals + +Our performance goals differ from typical GEMM performance goals in the +following ways: + +1. We care not only about speed, but also about minimizing power usage. We + specifically care about charge usage in mobile/embedded devices. This + implies that we care doubly about minimizing memory bandwidth usage: we care + about it, like any GEMM, because of the impact on speed, and we also care + about it because it is a key factor of power usage. + +2. Most GEMMs are optimized primarily for large dense matrix sizes (>= 1000). + We do care about large sizes, but we also care specifically about the + typically smaller matrix sizes encountered in various mobile applications. + This means that we have to optimize for all sizes, not just for large enough + sizes. |