diff options
author | Alex Deymo <deymo@google.com> | 2017-09-13 22:17:57 +0200 |
---|---|---|
committer | Alex Deymo <deymo@google.com> | 2017-10-20 16:37:51 +0200 |
commit | 48ad5ab76ef0ce76a942d7077a2a8494212f0861 (patch) | |
tree | 05181a15e9fc91f3a504767bfd17529a10b4e401 /suffix_array_index_unittest.cc | |
parent | 52b6bae35a96841119be5691658b2496835e814f (diff) | |
download | bsdiff-48ad5ab76ef0ce76a942d7077a2a8494212f0861.tar.gz |
Select SuffixArray size at runtime.
The suffix array is used to search for strings from the new file in the
old file. The size of each element of the suffix array must be big
enough to hold the length of the old file. In most of the cases we use,
if not all of them, the old file is smaller than 2 GiB, which allows to
use 32-bit integers for the suffix array.
The previous code decided whether to use 64-bit or 32-bit integers
based on the large-file surpport, even in 32-bit computers architectures
which would not work in those cases; and would also use twice as much
memory when constructing the suffix array of inputs smaller than 2 GiB.
This patch selects the suffix array size based on the input size,
effectively using half the memory in the most common case. This also
improves the runtime performance due to lower cache pressure.
As a side effect, this patch hides the suffix array implementation
details from the public interface. This removes the limitation where
the caller of libbsdiff and libbsdiff needed to be compiled with the
same large-file support flags; and also hides the dependency on
libdivsufsort from the public interface.
While doing so, instead of re-implementing the suffix array search
algorithm (which had a history of bugs and corner cases) we re-use the
one from libdivsufsort, adapted to the bsdiff algorithm use-case.
The result is about 40% faster when diffing 10MiB files, and the patch
size is within 0.1%, due to differences in the position selected
whenever more than one match is available.
Bug: 34220646
Test: Added unittests for the suffix array.
Change-Id: Ib9dc85a83bbf10157a3937c5687e5663ab0d20d5
Diffstat (limited to 'suffix_array_index_unittest.cc')
-rw-r--r-- | suffix_array_index_unittest.cc | 79 |
1 files changed, 79 insertions, 0 deletions
diff --git a/suffix_array_index_unittest.cc b/suffix_array_index_unittest.cc new file mode 100644 index 0000000..50563d4 --- /dev/null +++ b/suffix_array_index_unittest.cc @@ -0,0 +1,79 @@ +// Copyright 2017 The Chromium OS Authors. All rights reserved. +// Use of this source code is governed by a BSD-style license that can be +// found in the LICENSE file. + +#include "bsdiff/suffix_array_index.h" + +#include <stdlib.h> + +#include <algorithm> +#include <string> +#include <utility> +#include <vector> + +#include <gmock/gmock.h> +#include <gtest/gtest.h> + +namespace bsdiff { + +class SuffixArrayIndexTest : public ::testing::Test { + protected: + void InitSA(const std::string& text) { + text_.resize(text.size()); + std::copy(text.begin(), text.end(), text_.data()); + sai_ = CreateSuffixArrayIndex(text_.data(), text_.size()); + } + + void SearchPrefix(const std::string& pattern, + size_t* out_length, + uint64_t* out_pos) { + sai_->SearchPrefix(reinterpret_cast<const uint8_t*>(pattern.data()), + pattern.size(), out_length, out_pos); + // Check that the returned values are indeed a valid prefix. + EXPECT_LE(*out_length, pattern.size()); + ASSERT_LT(*out_pos, text_.size()); + ASSERT_LE(*out_pos + *out_length, text_.size()); + EXPECT_EQ(0, memcmp(text_.data() + *out_pos, pattern.data(), *out_length)); + } + + std::vector<uint8_t> text_; + std::unique_ptr<SuffixArrayIndexInterface> sai_; +}; + +// Test that the suffix array index can be initialized with an example text. +TEST_F(SuffixArrayIndexTest, MississippiTest) { + InitSA("mississippi"); +} + +TEST_F(SuffixArrayIndexTest, EmptySuffixArrayTest) { + InitSA(""); +} + +// Test various corner cases while searching for prefixes. +TEST_F(SuffixArrayIndexTest, SearchPrefixTest) { + InitSA("abc1_abc2_abc3_ab_abcd"); + + uint64_t pos; + size_t length; + // The string is not found at all. + SearchPrefix("zzz", &length, &pos); // lexicographically highest. + EXPECT_EQ(0U, length); + + SearchPrefix(" ", &length, &pos); // lexicographically lowest. + EXPECT_EQ(0U, length); + + // The exact pattern is found multiple times. + SearchPrefix("abc", &length, &pos); + EXPECT_EQ(3U, length); + + // The exact pattern is found only one time, at the end of the string. + SearchPrefix("abcd", &length, &pos); + EXPECT_EQ(4U, length); + EXPECT_EQ(18U, pos); + + // The string is not found, but a prefix is found. + SearchPrefix("abcW", &length, &pos); + EXPECT_EQ(3U, length); +} + +} // namespace bsdiff |