Select SuffixArray size at runtime.

The suffix array is used to search for strings from the new file in the old file. The size of each element of the suffix array must be big enough to hold the length of the old file. In most of the cases we use, if not all of them, the old file is smaller than 2 GiB, which allows to use 32-bit integers for the suffix array. The previous code decided whether to use 64-bit or 32-bit integers based on the large-file surpport, even in 32-bit computers architectures which would not work in those cases; and would also use twice as much memory when constructing the suffix array of inputs smaller than 2 GiB. This patch selects the suffix array size based on the input size, effectively using half the memory in the most common case. This also improves the runtime performance due to lower cache pressure. As a side effect, this patch hides the suffix array implementation details from the public interface. This removes the limitation where the caller of libbsdiff and libbsdiff needed to be compiled with the same large-file support flags; and also hides the dependency on libdivsufsort from the public interface. While doing so, instead of re-implementing the suffix array search algorithm (which had a history of bugs and corner cases) we re-use the one from libdivsufsort, adapted to the bsdiff algorithm use-case. The result is about 40% faster when diffing 10MiB files, and the patch size is within 0.1%, due to differences in the position selected whenever more than one match is available. Bug: 34220646 Test: Added unittests for the suffix array. Change-Id: Ib9dc85a83bbf10157a3937c5687e5663ab0d20d5
author: Alex Deymo <deymo@google.com> 2017-09-13 22:17:57 +0200
committer: Alex Deymo <deymo@google.com> 2017-10-20 16:37:51 +0200
commit: 48ad5ab76ef0ce76a942d7077a2a8494212f0861 (patch)
tree: 05181a15e9fc91f3a504767bfd17529a10b4e401 /suffix_array_index_unittest.cc
parent: 52b6bae35a96841119be5691658b2496835e814f (diff)
download: bsdiff-48ad5ab76ef0ce76a942d7077a2a8494212f0861.tar.gz
1 files changed, 79 insertions, 0 deletions
diff --git a/suffix_array_index_unittest.cc b/suffix_array_index_unittest.cc
new file mode 100644
index 0000000..50563d4
--- /dev/null
+++ b/suffix_array_index_unittest.cc
@@ -0,0 +1,79 @@
+// Copyright 2017 The Chromium OS Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+#include "bsdiff/suffix_array_index.h"
+
+#include <stdlib.h>
+
+#include <algorithm>
+#include <string>
+#include <utility>
+#include <vector>
+
+#include <gmock/gmock.h>
+#include <gtest/gtest.h>
+
+namespace bsdiff {
+
+class SuffixArrayIndexTest : public ::testing::Test {
+ protected:
+  void InitSA(const std::string& text) {
+    text_.resize(text.size());
+    std::copy(text.begin(), text.end(), text_.data());
+    sai_ = CreateSuffixArrayIndex(text_.data(), text_.size());
+  }
+
+  void SearchPrefix(const std::string& pattern,
+                    size_t* out_length,
+                    uint64_t* out_pos) {
+    sai_->SearchPrefix(reinterpret_cast<const uint8_t*>(pattern.data()),
+                       pattern.size(), out_length, out_pos);
+    // Check that the returned values are indeed a valid prefix.
+    EXPECT_LE(*out_length, pattern.size());
+    ASSERT_LT(*out_pos, text_.size());
+    ASSERT_LE(*out_pos + *out_length, text_.size());
+    EXPECT_EQ(0, memcmp(text_.data() + *out_pos, pattern.data(), *out_length));
+  }
+
+  std::vector<uint8_t> text_;
+  std::unique_ptr<SuffixArrayIndexInterface> sai_;
+};
+
+// Test that the suffix array index can be initialized with an example text.
+TEST_F(SuffixArrayIndexTest, MississippiTest) {
+  InitSA("mississippi");
+}
+
+TEST_F(SuffixArrayIndexTest, EmptySuffixArrayTest) {
+  InitSA("");
+}
+
+// Test various corner cases while searching for prefixes.
+TEST_F(SuffixArrayIndexTest, SearchPrefixTest) {
+  InitSA("abc1_abc2_abc3_ab_abcd");
+
+  uint64_t pos;
+  size_t length;
+  // The string is not found at all.
+  SearchPrefix("zzz", &length, &pos);  // lexicographically highest.
+  EXPECT_EQ(0U, length);
+
+  SearchPrefix("   ", &length, &pos);  // lexicographically lowest.
+  EXPECT_EQ(0U, length);
+
+  // The exact pattern is found multiple times.
+  SearchPrefix("abc", &length, &pos);
+  EXPECT_EQ(3U, length);
+
+  // The exact pattern is found only one time, at the end of the string.
+  SearchPrefix("abcd", &length, &pos);
+  EXPECT_EQ(4U, length);
+  EXPECT_EQ(18U, pos);
+
+  // The string is not found, but a prefix is found.
+  SearchPrefix("abcW", &length, &pos);
+  EXPECT_EQ(3U, length);
+}
+
+}  // namespace bsdiff
author	Alex Deymo <deymo@google.com>	2017-09-13 22:17:57 +0200
committer	Alex Deymo <deymo@google.com>	2017-10-20 16:37:51 +0200
commit	48ad5ab76ef0ce76a942d7077a2a8494212f0861 (patch)
tree	05181a15e9fc91f3a504767bfd17529a10b4e401 /suffix_array_index_unittest.cc
parent	52b6bae35a96841119be5691658b2496835e814f (diff)
download	bsdiff-48ad5ab76ef0ce76a942d7077a2a8494212f0861.tar.gz