aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--doc/lz4_Block_format.md211
1 files changed, 144 insertions, 67 deletions
diff --git a/doc/lz4_Block_format.md b/doc/lz4_Block_format.md
index a0017b90..8a4b0831 100644
--- a/doc/lz4_Block_format.md
+++ b/doc/lz4_Block_format.md
@@ -1,24 +1,22 @@
LZ4 Block Format Description
============================
-Last revised: 2022-02-02.
+Last revised: 2022-07-31 .
Author : Yann Collet
-This specification is intended for developers
-willing to produce LZ4-compatible compressed data blocks
-using any programming language.
+This specification is intended for developers willing to
+produce or read LZ4 compressed data blocks
+using any programming language of their choice.
-LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
+LZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format.
There is no entropy encoder back-end nor framing layer.
The latter is assumed to be handled by other parts of the system
(see [LZ4 Frame format]).
This design is assumed to favor simplicity and speed.
-It helps later on for optimizations, compactness, and features.
-This document describes only the block format,
+This document describes only the Block Format,
not how the compressor nor decompressor actually work.
-The correctness of the decompressor should not depend
-on implementation details of the compressor, and vice versa.
+For more details on such topics, see later section "Implementation Notes".
[LZ4 Frame format]: lz4_Frame_format.md
@@ -28,7 +26,7 @@ Compressed block format
-----------------------
An LZ4 compressed block is composed of sequences.
A sequence is a suite of literals (not-compressed bytes),
-followed by a match copy.
+followed by a match copy operation.
Each sequence starts with a `token`.
The `token` is a one byte value, separated into two 4-bits fields.
@@ -38,14 +36,20 @@ Therefore each field ranges from 0 to 15.
The first field uses the 4 high-bits of the token.
It provides the length of literals to follow.
-If the field value is 0, then there is no literal.
-If it is 15, then we need to add some more bytes to indicate the full length.
-Each additional byte then represent a value from 0 to 255,
+If the field value is smaller than 15,
+then it represents the total nb of literals present in the sequence,
+including 0, in which case there is no literal.
+
+The value 15 is a special case: more bytes are required to indicate the full length.
+Each additional byte then represents a value from 0 to 255,
which is added to the previous value to produce a total length.
-When the byte value is 255, another byte must read and added, and so on.
-There can be any number of bytes of value "255" following `token`.
-There is no "size limit".
-(Side note : this is why a not-compressible input block is expanded by 0.4%).
+When the byte value is 255, another byte must be read and added, and so on.
+There can be any number of bytes of value `255` following `token`.
+The Block Format does not define any "size limit",
+though real implementations may feature some practical limits
+(see more details in later chapter "Implementation Notes").
+
+Note : this format explains why a non-compressible input block is expanded by 0.4%.
Example 1 : A literal length of 48 will be represented as :
@@ -64,104 +68,177 @@ Example 3 : A literal length of 15 will be represented as :
- 0 : (=15-15) yes, the zero must be output
Following `token` and optional length bytes, are the literals themselves.
-They are exactly as numerous as previously decoded (length of literals).
-It's possible that there are zero literals.
+They are exactly as numerous as just decoded (length of literals).
+Reminder: it's possible that there are zero literals.
Following the literals is the match copy operation.
-It starts by the `offset`.
+It starts by the `offset` value.
This is a 2 bytes value, in little endian format
(the 1st byte is the "low" byte, the 2nd one is the "high" byte).
-The `offset` represents the position of the match to be copied from.
+The `offset` represents the position of the match to be copied from the past.
For example, 1 means "current position - 1 byte".
-The maximum `offset` value is 65535, 65536 and beyond cannot be coded.
-Note that 0 is an invalid offset value.
-The presence of such a value denotes an invalid (corrupted) block.
+The maximum `offset` value is 65535. 65536 and beyond cannot be coded.
+Note that 0 is an invalid `offset` value.
+The presence of a 0 `offset` value denotes an invalid (corrupted) block.
Then the `matchlength` can be extracted.
-For this, we use the second token field, the low 4-bits.
+For this, we use the second `token` field, the low 4-bits.
Such a value, obviously, ranges from 0 to 15.
-However here, 0 means that the copy operation will be minimal.
+However here, 0 means that the copy operation is minimal.
The minimum length of a match, called `minmatch`, is 4.
-As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
-Similar to literal length, on reaching the highest possible value (15),
-one must read additional bytes, one at a time, with values ranging from 0 to 255.
+As a consequence, a 0 value means 4 bytes.
+Similarly to literal length, any value smaller than 15 represents a length,
+to which 4 (`minmatch`) must be added, thus ranging from 4 to 18.
+A value of 15 is special, meaning 19+ bytes,
+to which one must read additional bytes, one at a time,
+with each byte value ranging from 0 to 255.
They are added to total to provide the final match length.
A 255 value means there is another byte to read and add.
-There is no limit to the number of optional "255" bytes that can be present.
-(Note: this points towards a maximum achievable compression ratio of about 250).
+There is no limit to the number of optional `255` bytes that can be present,
+and therefore no limit to representable match length,
+though real-life implementations are likely going to enforce limits for practical reasons (see more details in "Implementation Notes" section below).
+
+Note: this format has a maximum achievable compression ratio of about ~250.
Decoding the `matchlength` reaches the end of current sequence.
-Next byte will be the start of another sequence.
-But before moving to next sequence,
-it's time to use the decoded match position and length.
-The decoder copies `matchlength` bytes from match position to current position.
-
-In some cases, `matchlength` can be larger than `offset`.
-Therefore, since `match_pos + matchlength > current_pos`,
-later bytes to copy are not decoded yet.
-This is called an "overlap match", and must be handled with special care.
-A common case is an offset of 1,
-meaning the last byte is repeated `matchlength` times.
+Next byte will be the start of another sequence, and therefore a new `token`.
-End of block restrictions
------------------------
+End of block conditions
+-------------------------
There are specific restrictions required to terminate an LZ4 block.
1. The last sequence contains only literals.
- The block ends right after them.
+ The block ends right after the literals (no `offset` field).
2. The last 5 bytes of input are always literals.
Therefore, the last sequence contains at least 5 bytes.
- Special : if input is smaller than 5 bytes,
there is only one sequence, it contains the whole input as literals.
- Empty input can be represented with a zero byte,
+ Even empty input can be represented, using a zero byte,
interpreted as a final token without literal and without a match.
3. The last match must start at least 12 bytes before the end of block.
- The last match is part of the penultimate sequence.
- It is followed by the last sequence, which contains only literals.
+ The last match is part of the _penultimate_ sequence.
+ It is followed by the last sequence, which contains _only_ literals.
- Note that, as a consequence,
- an independent block < 13 bytes cannot be compressed,
- because the match must copy "something",
- so it needs at least one prior byte.
- - However, when a block can reference data from another block,
- it can start immediately with a match and no literal,
- therefore a block of exactly 12 bytes can be compressed.
+ blocks < 12 bytes cannot be compressed.
+ And as an extension, _independent_ blocks < 13 bytes cannot be compressed,
+ because they must start by at least one literal,
+ that the match can then copy afterwards.
When a block does not respect these end conditions,
a conformant decoder is allowed to reject the block as incorrect.
These rules are in place to ensure compatibility with
a wide range of historical decoders
-which rely on these conditions in their speed-oriented design.
+which rely on these conditions for their speed-oriented design.
-Additional notes
+Implementation notes
-----------------------
-If the decoder will decompress data from any external source,
-it is recommended to ensure that the decoder is resilient to corrupted data,
-and typically not vulnerable to buffer overflow manipulations.
+The LZ4 Block Format only defines the compressed format,
+it does not tell how to create a decoder or an encoder,
+which design is left free to the imagination of the implementer.
+
+However, thanks to experience, there are a number of typical topics that
+most implementations will have to consider.
+This section tries to provide a few guidelines.
+
+#### Metadata
+
+An LZ4-compressed Block requires additional metadata for proper decoding.
+Typically, a decoder will require the compressed block's size,
+and an upper bound of decompressed size.
+Other variants exist, such as knowing the decompressed size,
+and having an upper bound of the input size.
+The Block Format does not specify how to transmit such information,
+which is considered an out-of-band information channel.
+That's because in many cases, the information is present in the environment.
+For example, databases must store the size of their compressed block for indexing,
+and know that their decompressed block can't be larger than a certain threshold.
+
+If you need a format which is "self-contained",
+and also transports the necessary metadata for proper decoding on any platform,
+consider employing the [LZ4 Frame format] instead.
+
+#### Large lengths
+
+While the Block Format does not define any maximum value for length fields,
+in practice, most implementations will feature some form of limit,
+since it's expected for such values to be stored into registers of fixed bit width.
+
+If length fields use 64-bit registers,
+then it can be assumed that there is no practical limit,
+as it would require a single continuous block of multiple petabytes to reach it,
+which is unreasonable by today's standard.
+
+If length fields use 32-bit registers, then it can be overflowed,
+but requires a compressed block of size > 16 MB.
+Therefore, implementations that do not deal with compressed blocks > 16 MB are safe.
+However, if such a case is allowed,
+then it's recommended to check that no large length overflows the register.
+
+If length fields use 16-bit registers,
+then it's definitely possible to overflow such register,
+with less than < 300 bytes of compressed data.
+
+A conformant decoder should be able to detect length overflows when it's possible,
+and simply error out when that happens.
+The input block might not be invalid,
+it's just not decodable by the local decoder implementation.
+
+Note that, in order to be compatible with the larger LZ4 ecosystem,
+it's recommended to be able to read and represent lengths of up to 4 MB,
+and to accept blocks of size up to 4 MB.
+Such limits are compatible with 32-bit length registers,
+and prevent overflow of 32-bit registers.
+
+#### Safe decoding
+
+If a decoder receives compressed data from any external source,
+it is recommended to ensure that the decoder is resilient to corrupted input,
+and made safe from buffer overflow manipulations.
Always ensure that read and write operations
remain within the limits of provided buffers.
-Test the decoder with fuzzers
+
+Of particular importance, ensure that the nb of bytes instructed to copy
+does not overflow neither the input nor the output buffers.
+Ensure also, when reading an offset value, that the resulting position to copy
+does not reach beyond the beginning of the buffer.
+Such a situation can happen during the first 64 KB of decoded data.
+
+For more safety, test the decoder with fuzzers
to ensure it's resilient to improbable sequences of conditions.
Combine them with sanitizers, in order to catch overflows (asan)
or initialization issues (msan).
+
Pay some attention to offset 0 scenario, which is invalid,
-and therefore must not be blindly decoded
-(a naive implementation could preserve destination buffer content,
+and therefore must not be blindly decoded:
+a naive implementation could preserve destination buffer content,
which could then result in information disclosure
-if such buffer was uninitialized and still containing private data).
+if such buffer was uninitialized and still containing private data.
For reference, in such a scenario, the reference LZ4 decoder
clears the match segment with `0` bytes,
though other solutions are certainly possible.
+Finally, pay attention to the "overlap match" scenario,
+when `matchlength` is larger than `offset`.
+In which case, since `match_pos + matchlength > current_pos`,
+some of the later bytes to copy do not exist yet,
+and will be generated during the early stage of match copy operation.
+Such scenario must be handled with special care.
+A common case is an offset of 1,
+meaning the last byte is repeated `matchlength` times.
+
+#### Compression techniques
+
+The core of a LZ4 compressor is to detect duplicated data across past 64 KB.
The format makes no assumption nor limits to the way a compressor
searches and selects matches within the source data block.
-Multiple techniques can be considered,
-featuring distinct time / performance trade offs.
For example, an upper compression limit can be reached,
-using a technique called "full optimal parsing", at very high cpu cost.
+using a technique called "full optimal parsing", at high cpu and memory cost.
+But multiple other techniques can be considered,
+featuring distinct time / performance trade offs.
As long as the specified format is respected,
-the result will be compatible and decodable by any compliant decoder.
+the result will be compatible with and decodable by any compliant decoder.