aboutsummaryrefslogtreecommitdiff
path: root/doc/lz4_Block_format.md
diff options
context:
space:
mode:
Diffstat (limited to 'doc/lz4_Block_format.md')
-rw-r--r--doc/lz4_Block_format.md236
1 files changed, 162 insertions, 74 deletions
diff --git a/doc/lz4_Block_format.md b/doc/lz4_Block_format.md
index 4344e9b8..9e802274 100644
--- a/doc/lz4_Block_format.md
+++ b/doc/lz4_Block_format.md
@@ -1,24 +1,22 @@
LZ4 Block Format Description
============================
-Last revised: 2019-03-30.
+Last revised: 2022-07-31 .
Author : Yann Collet
-This specification is intended for developers
-willing to produce LZ4-compatible compressed data blocks
-using any programming language.
+This specification is intended for developers willing to
+produce or read LZ4 compressed data blocks
+using any programming language of their choice.
-LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding.
+LZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format.
There is no entropy encoder back-end nor framing layer.
The latter is assumed to be handled by other parts of the system
(see [LZ4 Frame format]).
This design is assumed to favor simplicity and speed.
-It helps later on for optimizations, compactness, and features.
-This document describes only the block format,
+This document describes only the Block Format,
not how the compressor nor decompressor actually work.
-The correctness of the decompressor should not depend
-on implementation details of the compressor, and vice versa.
+For more details on such topics, see later section "Implementation Notes".
[LZ4 Frame format]: lz4_Frame_format.md
@@ -28,7 +26,7 @@ Compressed block format
-----------------------
An LZ4 compressed block is composed of sequences.
A sequence is a suite of literals (not-compressed bytes),
-followed by a match copy.
+followed by a match copy operation.
Each sequence starts with a `token`.
The `token` is a one byte value, separated into two 4-bits fields.
@@ -38,13 +36,20 @@ Therefore each field ranges from 0 to 15.
The first field uses the 4 high-bits of the token.
It provides the length of literals to follow.
-If the field value is 0, then there is no literal.
-If it is 15, then we need to add some more bytes to indicate the full length.
-Each additional byte then represent a value from 0 to 255,
+If the field value is smaller than 15,
+then it represents the total nb of literals present in the sequence,
+including 0, in which case there is no literal.
+
+The value 15 is a special case: more bytes are required to indicate the full length.
+Each additional byte then represents a value from 0 to 255,
which is added to the previous value to produce a total length.
-When the byte value is 255, another byte is output.
-There can be any number of bytes following `token`. There is no "size limit".
-(Side note : this is why a not-compressible input block is expanded by 0.4%).
+When the byte value is 255, another byte must be read and added, and so on.
+There can be any number of bytes of value `255` following `token`.
+The Block Format does not define any "size limit",
+though real implementations may feature some practical limits
+(see more details in later chapter "Implementation Notes").
+
+Note : this format explains why a non-compressible input block is expanded by 0.4%.
Example 1 : A literal length of 48 will be represented as :
@@ -55,7 +60,7 @@ Example 2 : A literal length of 280 will be represented as :
- 15 : value for the 4-bits High field
- 255 : following byte is maxed, since 280-15 >= 255
- - 10 : (=280 - 15 - 255) ) remaining length to reach 280
+ - 10 : (=280 - 15 - 255) remaining length to reach 280
Example 3 : A literal length of 15 will be represented as :
@@ -63,94 +68,177 @@ Example 3 : A literal length of 15 will be represented as :
- 0 : (=15-15) yes, the zero must be output
Following `token` and optional length bytes, are the literals themselves.
-They are exactly as numerous as previously decoded (length of literals).
-It's possible that there are zero literal.
+They are exactly as numerous as just decoded (length of literals).
+Reminder: it's possible that there are zero literals.
Following the literals is the match copy operation.
-It starts by the `offset`.
+It starts by the `offset` value.
This is a 2 bytes value, in little endian format
(the 1st byte is the "low" byte, the 2nd one is the "high" byte).
-The `offset` represents the position of the match to be copied from.
-1 means "current position - 1 byte".
-The maximum `offset` value is 65535, 65536 cannot be coded.
-Note that 0 is an invalid value, not used.
+The `offset` represents the position of the match to be copied from the past.
+For example, 1 means "current position - 1 byte".
+The maximum `offset` value is 65535. 65536 and beyond cannot be coded.
+Note that 0 is an invalid `offset` value.
+The presence of a 0 `offset` value denotes an invalid (corrupted) block.
-Then we need to extract the `matchlength`.
-For this, we use the second token field, the low 4-bits.
-Value, obviously, ranges from 0 to 15.
-However here, 0 means that the copy operation will be minimal.
+Then the `matchlength` can be extracted.
+For this, we use the second `token` field, the low 4-bits.
+Such a value, obviously, ranges from 0 to 15.
+However here, 0 means that the copy operation is minimal.
The minimum length of a match, called `minmatch`, is 4.
-As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes.
-Similar to literal length, on reaching the highest possible value (15),
-we output additional bytes, one at a time, with values ranging from 0 to 255.
+As a consequence, a 0 value means 4 bytes.
+Similarly to literal length, any value smaller than 15 represents a length,
+to which 4 (`minmatch`) must be added, thus ranging from 4 to 18.
+A value of 15 is special, meaning 19+ bytes,
+to which one must read additional bytes, one at a time,
+with each byte value ranging from 0 to 255.
They are added to total to provide the final match length.
A 255 value means there is another byte to read and add.
-There is no limit to the number of optional bytes that can be output this way.
-(This points towards a maximum achievable compression ratio of about 250).
+There is no limit to the number of optional `255` bytes that can be present,
+and therefore no limit to representable match length,
+though real-life implementations are likely going to enforce limits for practical reasons (see more details in "Implementation Notes" section below).
+
+Note: this format has a maximum achievable compression ratio of about ~250.
Decoding the `matchlength` reaches the end of current sequence.
-Next byte will be the start of another sequence.
-But before moving to next sequence,
-it's time to use the decoded match position and length.
-The decoder copies `matchlength` bytes from match position to current position.
-
-In some cases, `matchlength` is larger than `offset`.
-Therefore, `match_pos + matchlength > current_pos`,
-which means that later bytes to copy are not yet decoded.
-This is called an "overlap match", and must be handled with special care.
-A common case is an offset of 1,
-meaning the last byte is repeated `matchlength` times.
+Next byte will be the start of another sequence, and therefore a new `token`.
-End of block restrictions
------------------------
-There are specific rules required to terminate a block.
+End of block conditions
+-------------------------
+There are specific restrictions required to terminate an LZ4 block.
1. The last sequence contains only literals.
- The block ends right after them.
+ The block ends right after the literals (no `offset` field).
2. The last 5 bytes of input are always literals.
Therefore, the last sequence contains at least 5 bytes.
- Special : if input is smaller than 5 bytes,
there is only one sequence, it contains the whole input as literals.
- Empty input can be represented with a zero byte,
+ Even empty input can be represented, using a zero byte,
interpreted as a final token without literal and without a match.
3. The last match must start at least 12 bytes before the end of block.
- The last match is part of the penultimate sequence.
- It is followed by the last sequence, which contains only literals.
+ The last match is part of the _penultimate_ sequence.
+ It is followed by the last sequence, which contains _only_ literals.
- Note that, as a consequence,
- an independent block < 13 bytes cannot be compressed,
- because the match must copy "something",
- so it needs at least one prior byte.
- - When a block can reference data from another block,
- it can start immediately with a match and no literal,
- so a block of 12 bytes can be compressed.
+ blocks < 12 bytes cannot be compressed.
+ And as an extension, _independent_ blocks < 13 bytes cannot be compressed,
+ because they must start by at least one literal,
+ that the match can then copy afterwards.
When a block does not respect these end conditions,
a conformant decoder is allowed to reject the block as incorrect.
-These rules are in place to ensure that a conformant decoder
-can be designed for speed, issuing speculatively instructions,
-while never reading nor writing beyond provided I/O buffers.
+These rules are in place to ensure compatibility with
+a wide range of historical decoders
+which rely on these conditions for their speed-oriented design.
-
-Additional notes
+Implementation notes
-----------------------
-If the decoder will decompress data from an external source,
-it is recommended to ensure that the decoder will not be vulnerable to
-buffer overflow manipulations.
+The LZ4 Block Format only defines the compressed format,
+it does not tell how to create a decoder or an encoder,
+which design is left free to the imagination of the implementer.
+
+However, thanks to experience, there are a number of typical topics that
+most implementations will have to consider.
+This section tries to provide a few guidelines.
+
+#### Metadata
+
+An LZ4-compressed Block requires additional metadata for proper decoding.
+Typically, a decoder will require the compressed block's size,
+and an upper bound of decompressed size.
+Other variants exist, such as knowing the decompressed size,
+and having an upper bound of the input size.
+The Block Format does not specify how to transmit such information,
+which is considered an out-of-band information channel.
+That's because in many cases, the information is present in the environment.
+For example, databases must store the size of their compressed block for indexing,
+and know that their decompressed block can't be larger than a certain threshold.
+
+If you need a format which is "self-contained",
+and also transports the necessary metadata for proper decoding on any platform,
+consider employing the [LZ4 Frame format] instead.
+
+#### Large lengths
+
+While the Block Format does not define any maximum value for length fields,
+in practice, most implementations will feature some form of limit,
+since it's expected for such values to be stored into registers of fixed bit width.
+
+If length fields use 64-bit registers,
+then it can be assumed that there is no practical limit,
+as it would require a single continuous block of multiple petabytes to reach it,
+which is unreasonable by today's standard.
+
+If length fields use 32-bit registers, then it can be overflowed,
+but requires a compressed block of size > 16 MB.
+Therefore, implementations that do not deal with compressed blocks > 16 MB are safe.
+However, if such a case is allowed,
+then it's recommended to check that no large length overflows the register.
+
+If length fields use 16-bit registers,
+then it's definitely possible to overflow such register,
+with less than < 300 bytes of compressed data.
+
+A conformant decoder should be able to detect length overflows when it's possible,
+and simply error out when that happens.
+The input block might not be invalid,
+it's just not decodable by the local decoder implementation.
+
+Note that, in order to be compatible with the larger LZ4 ecosystem,
+it's recommended to be able to read and represent lengths of up to 4 MB,
+and to accept blocks of size up to 4 MB.
+Such limits are compatible with 32-bit length registers,
+and prevent overflow of 32-bit registers.
+
+#### Safe decoding
+
+If a decoder receives compressed data from any external source,
+it is recommended to ensure that the decoder is resilient to corrupted input,
+and made safe from buffer overflow manipulations.
Always ensure that read and write operations
remain within the limits of provided buffers.
-Test the decoder with fuzzers
-to ensure it's resilient to improbable combinations.
-The format makes no assumption nor limits to the way the compressor
+Of particular importance, ensure that the nb of bytes instructed to copy
+does not overflow neither the input nor the output buffers.
+Ensure also, when reading an offset value, that the resulting position to copy
+does not reach beyond the beginning of the buffer.
+Such a situation can happen during the first 64 KB of decoded data.
+
+For more safety, test the decoder with fuzzers
+to ensure it's resilient to improbable sequences of conditions.
+Combine them with sanitizers, in order to catch overflows (asan)
+or initialization issues (msan).
+
+Pay some attention to offset 0 scenario, which is invalid,
+and therefore must not be blindly decoded:
+a naive implementation could preserve destination buffer content,
+which could then result in information disclosure
+if such buffer was uninitialized and still containing private data.
+For reference, in such a scenario, the reference LZ4 decoder
+clears the match segment with `0` bytes,
+though other solutions are certainly possible.
+
+Finally, pay attention to the "overlap match" scenario,
+when `matchlength` is larger than `offset`.
+In which case, since `match_pos + matchlength > current_pos`,
+some of the later bytes to copy do not exist yet,
+and will be generated during the early stage of match copy operation.
+Such scenario must be handled with special care.
+A common case is an offset of 1,
+meaning the last byte is repeated `matchlength` times.
+
+#### Compression techniques
+
+The core of a LZ4 compressor is to detect duplicated data across past 64 KB.
+The format makes no assumption nor limits to the way a compressor
searches and selects matches within the source data block.
-Multiple techniques can be considered,
-featuring distinct time / performance trade offs.
-As long as the format is respected,
-the result will be compatible and decodable by any compliant decoder.
-An upper compression limit can be reached,
-using a technique called "full optimal parsing", at high cpu cost.
+For example, an upper compression limit can be reached,
+using a technique called "full optimal parsing", at high cpu and memory cost.
+But multiple other techniques can be considered,
+featuring distinct time / performance trade-offs.
+As long as the specified format is respected,
+the result will be compatible with and decodable by any compliant decoder.