Diffstat (limited to 'doc/lz4_Block_format.md')
-rw-r--r--  doc/lz4_Block_format.md | 236
1 file changed, 162 insertions(+), 74 deletions(-)
diff --git a/doc/lz4_Block_format.md b/doc/lz4_Block_format.md
index 4344e9b8..9e802274 100644
--- a/doc/lz4_Block_format.md
+++ b/doc/lz4_Block_format.md
@@ -1,24 +1,22 @@
 LZ4 Block Format Description
 ============================
-Last revised: 2019-03-30.
+Last revised: 2022-07-31 .
 Author : Yann Collet
 
-This specification is intended for developers
-willing to produce LZ4-compatible compressed data blocks
-using any programming language.
+This specification is intended for developers willing to
+produce or read LZ4 compressed data blocks
+using any programming language of their choice.
 
-LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
+LZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format.
 There is no entropy encoder back-end nor framing layer.
 The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]).
 This design is assumed to favor simplicity and speed.
-It helps later on for optimizations, compactness, and features.
 
-This document describes only the block format,
+This document describes only the Block Format,
 not how the compressor nor decompressor actually work.
-The correctness of the decompressor should not depend
-on implementation details of the compressor, and vice versa.
+For more details on such topics, see later section "Implementation Notes".
 
 [LZ4 Frame format]: lz4_Frame_format.md
 
@@ -28,7 +26,7 @@ Compressed block format
 -----------------------
 An LZ4 compressed block is composed of sequences.
 A sequence is a suite of literals (not-compressed bytes),
-followed by a match copy.
+followed by a match copy operation.
 
 Each sequence starts with a `token`.
 The `token` is a one byte value, separated into two 4-bits fields.
@@ -38,13 +36,20 @@ Therefore each field ranges from 0 to 15.
 
 The first field uses the 4 high-bits of the token.
 It provides the length of literals to follow.
-If the field value is 0, then there is no literal.
-If it is 15, then we need to add some more bytes to indicate the full length.
-Each additional byte then represent a value from 0 to 255,
+If the field value is smaller than 15,
+then it represents the total nb of literals present in the sequence,
+including 0, in which case there is no literal.
+
+The value 15 is a special case: more bytes are required to indicate the full length.
+Each additional byte then represents a value from 0 to 255,
 which is added to the previous value to produce a total length.
-When the byte value is 255, another byte is output.
-There can be any number of bytes following `token`. There is no "size limit".
-(Side note : this is why a not-compressible input block is expanded by 0.4%).
+When the byte value is 255, another byte must be read and added, and so on.
+There can be any number of bytes of value `255` following `token`.
+The Block Format does not define any "size limit",
+though real implementations may feature some practical limits
+(see more details in later chapter "Implementation Notes").
+
+Note : this format explains why a non-compressible input block is expanded by 0.4%.
 
 Example 1 : A literal length of 48 will be represented as :
 
@@ -55,7 +60,7 @@ Example 2 : A literal length of 280 will be represented as :
 
   - 15  : value for the 4-bits High field
   - 255 : following byte is maxed, since 280-15 >= 255
-  - 10  : (=280 - 15 - 255) ) remaining length to reach 280
+  - 10  : (=280 - 15 - 255) remaining length to reach 280
 
 Example 3 : A literal length of 15 will be represented as :
@@ -63,94 +68,177 @@ Example 3 : A literal length of 15 will be represented as :
 
   - 15 : value for the 4-bits High field
   - 0  : (=15-15) yes, the zero must be output
 
 Following `token` and optional length bytes, are the literals themselves.
-They are exactly as numerous as previously decoded (length of literals).
-It's possible that there are zero literal.
+They are exactly as numerous as just decoded (length of literals).
+Reminder: it's possible that there are zero literals.
 
 Following the literals is the match copy operation.
-It starts by the `offset`.
+It starts by the `offset` value.
 This is a 2 bytes value, in little endian format
 (the 1st byte is the "low" byte, the 2nd one is the "high" byte).
 
-The `offset` represents the position of the match to be copied from.
-1 means "current position - 1 byte".
-The maximum `offset` value is 65535, 65536 cannot be coded.
-Note that 0 is an invalid value, not used.
+The `offset` represents the position of the match to be copied from the past.
+For example, 1 means "current position - 1 byte".
+The maximum `offset` value is 65535. 65536 and beyond cannot be coded.
+Note that 0 is an invalid `offset` value.
+The presence of a 0 `offset` value denotes an invalid (corrupted) block.
 
-Then we need to extract the `matchlength`.
-For this, we use the second token field, the low 4-bits.
-Value, obviously, ranges from 0 to 15.
-However here, 0 means that the copy operation will be minimal.
+Then the `matchlength` can be extracted.
+For this, we use the second `token` field, the low 4-bits.
+Such a value, obviously, ranges from 0 to 15.
+However here, 0 means that the copy operation is minimal.
 The minimum length of a match, called `minmatch`, is 4.
-As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
-Similar to literal length, on reaching the highest possible value (15),
-we output additional bytes, one at a time, with values ranging from 0 to 255.
+As a consequence, a 0 value means 4 bytes.
+Similarly to literal length, any value smaller than 15 represents a length,
+to which 4 (`minmatch`) must be added, thus ranging from 4 to 18.
+A value of 15 is special, meaning 19+ bytes,
+to which one must read additional bytes, one at a time,
+with each byte value ranging from 0 to 255.
 They are added to total to provide the final match length.
 A 255 value means there is another byte to read and add.
-There is no limit to the number of optional bytes that can be output this way.
-(This points towards a maximum achievable compression ratio of about 250).
+There is no limit to the number of optional `255` bytes that can be present,
+and therefore no limit to representable match length,
+though real-life implementations are likely going to enforce limits for practical reasons
+(see more details in "Implementation Notes" section below).
+
+Note: this format has a maximum achievable compression ratio of about ~250.
 
 Decoding the `matchlength` reaches the end of current sequence.
-Next byte will be the start of another sequence.
-But before moving to next sequence,
-it's time to use the decoded match position and length.
-The decoder copies `matchlength` bytes from match position to current position.
-
-In some cases, `matchlength` is larger than `offset`.
-Therefore, `match_pos + matchlength > current_pos`,
-which means that later bytes to copy are not yet decoded.
-This is called an "overlap match", and must be handled with special care.
-A common case is an offset of 1,
-meaning the last byte is repeated `matchlength` times.
+Next byte will be the start of another sequence, and therefore a new `token`.
 
 
-End of block restrictions
------------------------
-There are specific rules required to terminate a block.
+End of block conditions
+-------------------------
+There are specific restrictions required to terminate an LZ4 block.
 
 1. The last sequence contains only literals.
-   The block ends right after them.
+   The block ends right after the literals (no `offset` field).
 
 2. The last 5 bytes of input are always literals.
    Therefore, the last sequence contains at least 5 bytes.
    - Special : if input is smaller than 5 bytes,
     there is only one sequence, it contains the whole input as literals.
-     Empty input can be represented with a zero byte,
+     Even empty input can be represented, using a zero byte,
      interpreted as a final token without literal and without a match.
 
 3. The last match must start at least 12 bytes before the end of block.
-   The last match is part of the penultimate sequence.
-   It is followed by the last sequence, which contains only literals.
+   The last match is part of the _penultimate_ sequence.
+   It is followed by the last sequence, which contains _only_ literals.
 
    Note that, as a consequence,
-   an independent block < 13 bytes cannot be compressed,
-   because the match must copy "something",
-   so it needs at least one prior byte.
-
-   When a block can reference data from another block,
-   it can start immediately with a match and no literal,
-   so a block of 12 bytes can be compressed.
+   blocks < 12 bytes cannot be compressed.
+   And as an extension, _independent_ blocks < 13 bytes cannot be compressed,
+   because they must start by at least one literal,
+   that the match can then copy afterwards.
 
 When a block does not respect these end conditions,
 a conformant decoder is allowed to reject the block as incorrect.
 
-These rules are in place to ensure that a conformant decoder
-can be designed for speed, issuing speculatively instructions,
-while never reading nor writing beyond provided I/O buffers.
+These rules are in place to ensure compatibility with
+a wide range of historical decoders
+which rely on these conditions for their speed-oriented design.
 
-
-Additional notes
+Implementation notes
 -----------------------
-If the decoder will decompress data from an external source,
-it is recommended to ensure that the decoder will not be vulnerable to
-buffer overflow manipulations.
+The LZ4 Block Format only defines the compressed format,
+it does not tell how to create a decoder or an encoder,
+which design is left free to the imagination of the implementer.
+
+However, thanks to experience, there are a number of typical topics that
+most implementations will have to consider.
+This section tries to provide a few guidelines.
+
+#### Metadata
+
+An LZ4-compressed Block requires additional metadata for proper decoding.
+Typically, a decoder will require the compressed block's size,
+and an upper bound of decompressed size.
+Other variants exist, such as knowing the decompressed size,
+and having an upper bound of the input size.
+The Block Format does not specify how to transmit such information,
+which is considered an out-of-band information channel.
+That's because in many cases, the information is present in the environment.
+For example, databases must store the size of their compressed block for indexing,
+and know that their decompressed block can't be larger than a certain threshold.
+
+If you need a format which is "self-contained",
+and also transports the necessary metadata for proper decoding on any platform,
+consider employing the [LZ4 Frame format] instead.
+
+#### Large lengths
+
+While the Block Format does not define any maximum value for length fields,
+in practice, most implementations will feature some form of limit,
+since it's expected for such values to be stored into registers of fixed bit width.
+
+If length fields use 64-bit registers,
+then it can be assumed that there is no practical limit,
+as it would require a single continuous block of multiple petabytes to reach it,
+which is unreasonable by today's standard.
+
+If length fields use 32-bit registers, then it can be overflowed,
+but requires a compressed block of size > 16 MB.
+Therefore, implementations that do not deal with compressed blocks > 16 MB are safe.
+However, if such a case is allowed,
+then it's recommended to check that no large length overflows the register.
+
+If length fields use 16-bit registers,
+then it's definitely possible to overflow such register,
+with less than < 300 bytes of compressed data.
+
+A conformant decoder should be able to detect length overflows when it's possible,
+and simply error out when that happens.
+The input block might not be invalid,
+it's just not decodable by the local decoder implementation.
+
+Note that, in order to be compatible with the larger LZ4 ecosystem,
+it's recommended to be able to read and represent lengths of up to 4 MB,
+and to accept blocks of size up to 4 MB.
+Such limits are compatible with 32-bit length registers,
+and prevent overflow of 32-bit registers.
+
+#### Safe decoding
+
+If a decoder receives compressed data from any external source,
+it is recommended to ensure that the decoder is resilient to corrupted input,
+and made safe from buffer overflow manipulations.
 Always ensure that read and write operations
 remain within the limits of provided buffers.
-Test the decoder with fuzzers
-to ensure it's resilient to improbable combinations.
+Of particular importance, ensure that the nb of bytes instructed to copy
+does not overflow neither the input nor the output buffers.
+Ensure also, when reading an offset value, that the resulting position to copy
+does not reach beyond the beginning of the buffer.
+Such a situation can happen during the first 64 KB of decoded data.
+
+For more safety, test the decoder with fuzzers
+to ensure it's resilient to improbable sequences of conditions.
+Combine them with sanitizers, in order to catch overflows (asan)
+or initialization issues (msan).
+
+Pay some attention to offset 0 scenario, which is invalid,
+and therefore must not be blindly decoded:
+a naive implementation could preserve destination buffer content,
+which could then result in information disclosure
+if such buffer was uninitialized and still containing private data.
+For reference, in such a scenario, the reference LZ4 decoder
+clears the match segment with `0` bytes,
+though other solutions are certainly possible.
+
+Finally, pay attention to the "overlap match" scenario,
+when `matchlength` is larger than `offset`.
+In which case, since `match_pos + matchlength > current_pos`,
+some of the later bytes to copy do not exist yet,
+and will be generated during the early stage of match copy operation.
+Such scenario must be handled with special care.
+A common case is an offset of 1,
+meaning the last byte is repeated `matchlength` times.
+
+#### Compression techniques
+
+The core of a LZ4 compressor is to detect duplicated data across past 64 KB.
+The format makes no assumption nor limits to the way a compressor
 searches and selects matches within the source data block.
-Multiple techniques can be considered,
-featuring distinct time / performance trade offs.
-As long as the format is respected,
-the result will be compatible and decodable by any compliant decoder.
-An upper compression limit can be reached,
-using a technique called "full optimal parsing", at high cpu cost.
+For example, an upper compression limit can be reached,
+using a technique called "full optimal parsing", at high cpu and memory cost.
+But multiple other techniques can be considered,
+featuring distinct time / performance trade-offs.
+As long as the specified format is respected,
+the result will be compatible with and decodable by any compliant decoder.
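The literal-length scheme described in the patch (a 4-bit token field, where 15 signals extension bytes of value 0-255, and a 255 byte means "read another one") can be sketched as follows. This is an illustrative Python sketch by the editor, not code from the LZ4 project; the function names are invented for the example.

```python
def encode_length(length):
    """Encode a literal length as (4-bit token field, list of extra bytes)."""
    if length < 15:
        return length, []          # fits entirely in the 4-bit field
    extra = []
    remaining = length - 15        # 15 in the field signals extension bytes
    while remaining >= 255:
        extra.append(255)          # a 255 byte means another byte follows
        remaining -= 255
    extra.append(remaining)        # final byte, 0..254 (a 0 must still be emitted)
    return 15, extra

def decode_length(field, stream):
    """Decode a length; `stream` iterates over the bytes following the token."""
    length = field
    if field == 15:
        while True:
            b = next(stream)
            length += b
            if b != 255:           # any byte below 255 terminates the length
                break
    return length
```

This reproduces the document's three worked examples: 48 encodes as field 15 plus byte 33, 280 as field 15 plus bytes 255 and 10, and 15 as field 15 plus a mandatory 0 byte.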
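The sequence layout described in the patch (token, literals, 2-byte little-endian offset, match length with `minmatch` 4, and a byte-by-byte copy that handles overlap matches naturally) can be exercised with a toy decoder. Again a hedged sketch, not the reference implementation: it rejects offset 0 and out-of-range offsets as the patch requires, but does not enforce the end-of-block restrictions or any practical length limits, and the test blocks are hand-built purely for illustration.

```python
def lz4_decompress_block(src):
    """Decode one raw LZ4 block (no frame), returning the decompressed bytes."""
    dst = bytearray()
    i = 0
    while i < len(src):
        token = src[i]; i += 1
        lit_len = token >> 4                 # high 4 bits: literal length
        if lit_len == 15:                    # 15 => extension bytes follow
            while True:
                b = src[i]; i += 1
                lit_len += b
                if b != 255:
                    break
        dst += src[i:i + lit_len]            # copy the literals
        i += lit_len
        if i >= len(src):                    # last sequence: literals only
            break
        offset = src[i] | (src[i + 1] << 8)  # 2-byte little-endian offset
        i += 2
        if offset == 0:
            raise ValueError("offset 0 denotes a corrupted block")
        match_len = (token & 0x0F) + 4       # low 4 bits + minmatch (4)
        if (token & 0x0F) == 15:             # 15 => 19+ bytes, read extensions
            while True:
                b = src[i]; i += 1
                match_len += b
                if b != 255:
                    break
        pos = len(dst) - offset
        if pos < 0:
            raise ValueError("offset reaches beyond the beginning of the buffer")
        for _ in range(match_len):           # byte-by-byte copy is overlap-safe
            dst.append(dst[pos]); pos += 1
    return bytes(dst)
```

With an offset of 1 and a match longer than the offset, the copy reads bytes it has just written, which is exactly the "last byte repeated `matchlength` times" case the patch calls out.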
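The "Large lengths" notes recommend detecting length overflows and erroring out rather than wrapping a fixed-width register. One way to sketch that in Python (which has no native overflow, so an explicit cap stands in for the register limit; the 4 MB figure is the ecosystem-compatibility recommendation from the patch, and the function name is invented):

```python
MAX_DECODED_LENGTH = 4 << 20   # example cap: 4 MB, per the patch's recommendation

def read_extended_length(base, src, i, limit=MAX_DECODED_LENGTH):
    """Accumulate 0-255 extension bytes after a field value of 15,
    erroring out instead of exceeding `limit`. Returns (length, new_index)."""
    length = base
    while True:
        if i >= len(src):
            raise ValueError("truncated input while reading a length")
        b = src[i]; i += 1
        length += b
        if length > limit:
            raise ValueError("length exceeds this implementation's limit")
        if b != 255:               # any byte below 255 terminates the length
            return length, i
```

As the patch says, a block rejected this way is not necessarily invalid; it is merely not decodable by this particular implementation.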
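The "~250 maximum achievable compression ratio" note can be checked with back-of-the-envelope arithmetic: in the best case, a sequence with no literals costs 1 token byte, 2 offset bytes, and k match-length extension bytes, while decoding to roughly 19 + 255*k bytes (19 from the maxed token field plus `minmatch`, 255 per extension byte). The ratio therefore approaches 255 asymptotically and sits near 250 for realistic sizes. A small editor-added check, not part of the specification:

```python
def best_case_ratio(k):
    """Approximate ratio for one literal-free sequence with k extension bytes."""
    compressed = 1 + 2 + k          # token + 2-byte offset + extension bytes
    decompressed = 19 + 255 * k     # 19 from the maxed field, 255 per extra byte
    return decompressed / compressed
```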