aboutsummaryrefslogtreecommitdiff
path: root/docs/readme.en.html
blob: cf1955f5d2a2c635b26a7361c18a7b75560efba1 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>MARISA: Matching Algorithm with Recursively Implemented StorAge</title>
  <link rel="stylesheet" type="text/css" href="./style.css">
 </head>
 <body>
  <div id="header">
   <div class="left">MARISA: Matching Algorithm with Recursively Implemented StorAge</div>
   <div class="right">Last modified: 20 May 2018</div>
   <div class="end"></div>
  </div><!-- header -->
  <div id="body" style="text-align: justify">
   <h1>MARISA: Matching Algorithm with Recursively Implemented StorAge</h1>
   <p id="abstract">
    <span id="heading">Abstract: </span>
     Matching Algorithm with Recursively Implemented StorAge (MARISA) is a space-efficient trie data structure. libmarisa is a C++ library for an implementation of MARISA. Users can build dictionaries and search keys from the dictionaries. The package also provides command line tools to test basic operations of libmarisa, and the tools are useful to test the performance.
   </p><!-- abstract -->

   <div class="section">
    <h2><a name="introduction">Introduction</a></h2>
    <div class="subsection">
     <h3><a name="overview">Overview</a></h3>
     <p>
      Matching Algorithm with Recursively Implemented StorAge (MARISA) is a space-efficient, fairly fast, and static trie data structure. MARISA serves as a dictionary structure, and by definition, it supports exact match lookup, which is the basic operation of dictionary. In addition, MARISA supports reverse lookup, common prefix search, and predictive search.
     </p>
     <p>
      In most cases, MARISA is much more compact than a plain text which consists of the registered keys. This means that the traditional dictionary implementations, a binary tree (<code>std::map&lt;std::string, T&gt;</code>) and a hash table (<code>std::unordered_map&lt;std::string, T&gt;</code>), require more and more and more spaces than MARISA. Bloom Filter, a probabilistic data structure, is more space-efficient than MARISA but causes false positives and does not support reverse lookup, common prefix search, and predictive search.
     </p>
     <p>
      libmarisa is a C++ library for an implementation of MARISA. Users can build dictionaries and search keys from the dictionaries. The package also provides command line tools to test basic operations of libmarisa, and the tools are useful to test the performance.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="ability">Functionality</a></h3>
     <p>
      libmarisa associates string keys with unique IDs, from <var>0</var> to <var>(n - 1)</var>, where <var>n</var> is the number of keys. Note that users cannot specify the IDs because the mapping is automatically generated by MARISA. Every search function takes a string or an ID and returns the search result which is represented by a pair of the key and its ID.
     </p>
     <ul>
      <li>Lookup
       <ul>
        <li>checks whether or not a query string is registered.</li>
       </ul>
      </li>
      <li>Reverse lookup
       <ul>
        <li>restores a key from its ID.</li>
       </ul>
      </li>
      <li>Common prefix search
       <ul>
        <li>searches keys from the possible prefixes of a query string.</li>
       </ul>
      </li>
      <li>Predictive search
       <ul>
        <li>searches keys starting with a query string.</li>
       </ul>
      </li>
     </ul>
    </div><!-- subsection -->
   </div><!-- section -->
   <div class="section">
    <h2><a name="source">Source</a></h2>
    <div class="subsection">
     <h3><a name="license">License</a></h3>
     <p>
      libmarisa and its command line tools are dual-licensed under the BSD 2-clause license and the LGPL.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="download">Download</a></h3>
     <p>
      The project is hosted on <a href="https://github.com/">GitHub</a>.
     </p>
     <ul>
      <li>Project
       <ul>
        <li><a href="https://github.com/s-yata/marisa-trie">https://github.com/s-yata/marisa-trie</a></li>
       </ul>
      </li>
      <li>Source
       <ul>
        <li><a href="https://github.com/s-yata/marisa-trie/releases/download/v0.2.5/marisa-0.2.5.tar.gz">marisa-0.2.5.tar.gz</a></li>
       </ul>
      </li>
     </ul>
    </div><!-- subsection -->
   </div><!-- section -->

   <div class="section">
    <h2><a name="install">Installation</a></h2>
    <div class="subsection">
     <h3><a name="gcc">GCC &amp; Clang</a></h3>
     <div class="float">
      <pre class="console">$ tar zxf marisa-0.2.5.tar.gz
$ cd marisa-0.2.5
$ ./configure
$ make
$ make check
$ make install</pre>
     </div><!-- float -->
     <p>
      Users can install libmarisa by using <kbd>configure</kbd> and <kbd>make</kbd>. <kbd>make install</kbd> might require <kbd>sudo</kbd> to install libmarisa as the root user. Additionally, <kbd>ldconfig</kbd> might be required because libmarisa is installed as a shared library in default settings.
     </p>
     <p>
      If a POPCNT instruction is available on your environment, you can specify <kbd>--enable-popcnt</kbd>, when you run <kbd>configure</kbd>, to improve the performance of libmarisa. Likewise, <kbd>--enable-sse2</kbd>, <kbd>--enable-sse3</kbd>, <kbd>--enable-ssse3</kbd>, <kbd>--enable-sse4.1</kbd>, <kbd>--enable-sse4.2</kbd>, <kbd>--enable-sse4</kbd>, <kbd>--enable-sse4a</kbd> are available. Also, if you need a static library, specify <kbd>--enable-static</kbd> to <kbd>configure</kbd>. For other options, see <kbd>./configure --help</kbd>.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="vc">Visual C++ 2008</a></h3>
     <p>
      There are project files for Visual C++ 2008 in <kbd>vs2008/</kbd>. Users can build a static library <kbd>libmarisa.lib</kbd> and the command line tools by using <kbd>vs2008/vs2008.sln</kbd>. If your Visual C++ is older than 2008. New projects are required to build libmarisa.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="vc">Perl Bindings</a></h3>
     <div class="float">
      <pre class="console">$ cd bindings/perl
$ perl Makefile.PL
$ make
$ make install</pre>
     </div><!-- float -->
     <p>
      Users can find a Perl bindings in <kbd>bindings/perl/</kbd>, in which the wrapper was generated by <a href="http://www.swig.org/">SWIG</a>. To install the Perl bindings, run <kbd>perl Makefile.PL</kbd> and then <kbd>make install</kbd>. See also <kbd>bindings/perl/sample.pl</kbd>.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="vc">Python Bindings</a></h3>
     <div class="float">
      <pre class="console">$ cd bindings/python
$ python setup.py build
$ python setup.py install</pre>
     </div><!-- float -->
     <p>
      Users can find a Python bindings in <kbd>bindings/python/</kbd>, in which the wrapper was generated by <a href="http://www.swig.org/">SWIG</a>. To install the Python bindings, run <kbd>python setup.py install</kbd>. See also <kbd>bindings/python/sample.py</kbd>.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="vc">Ruby Bindings</a></h3>
     <div class="float">
      <pre class="console">$ cd bindings/ruby
$ ruby extconf.rb
$ make
$ make install</pre>
     </div><!-- float -->
     <p>
      Users can find a Ruby bindings in <kbd>bindings/ruby/</kbd>, in which the wrapper was generated by <a href="http://www.swig.org/">SWIG</a>. To install the Ruby bindings, run <kbd>ruby extconf.rb</kbd> and then <kbd>make install</kbd>. See also <kbd>bindings/ruby/sample.rb</kbd>.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="vc">Others</a></h3>
     <p>
      There are some other bindings.
     </p>
     <ul>
      <li>Python
       <ul>
        <li><a href="https://github.com/kmike/marisa-trie/">https://github.com/kmike/marisa-trie/</a> an alternative Cython-based pip-installable Python bindings which is faster than the above Python bindings.</li>
       </ul>
      </li>
      <li>Node.js</li>
       <ul>
        <li><a href="https://github.com/jakwings/iojs-marisa-trie">https://github.com/jakwings/iojs-marisa-trie</a> a wrapper of marisa-trie</li>
       </ul>
      </li>
     </ul>
    </div><!-- subsection -->
   </div><!-- section -->

   <div class="section">
    <h2><a name="tools">Command Line Tools</a></h2>
    <div class="subsection">
     <h3><a name="marisa-build">marisa-build</a></h3>
     <div class="float">
      <pre class="console">$ marisa-build &lt; keyset.txt &gt; keyset.dic
#keys: 9864459
#nodes: 13473881
size: 51044424</pre>
     </div><!-- float -->
     <p>
      <kbd>marisa-build</kbd> is a tool to build a dictionary from a set of keys. This tool takes as input newline-delimited keys and writes the dictionary to the standard output.
     </p>
     <p>
      Users can specify parameters through command line options. See <kbd>marisa-build -h</kbd> for the list of options.
     </p>
     <p>
      If an input line contains horizontal tabs, the last one serves as the delimiter between a key and its weight which is used to optimize the order of nodes. Estimated frequency of each key, given as the weight, may improve the search performance.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="marisa-lookup">marisa-lookup</a></h3>
     <div class="float">
      <pre class="console">$ marisa-lookup keyset.dic
Marisa
915465	Marisa
What's_uuup
-1	What's_uuup</pre>
     </div><!-- float -->
     <p>
      <kbd>marisa-lookup</kbd> is a tool to test exact match lookup. If a query string is registered, this tool prints the key and its ID. Otherwise, this tool prints the query with <var>-1</var>.
     </p>
     <p>
      See <kbd>marisa-lookup -h</kbd> for the list of options.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="marisa-reverse-lookup">marisa-reverse-lookup</a></h3>
     <div class="float">
      <pre class="console">$ marisa-reverse-lookup keyset.dic
1234567
1234567	Goma_International_Airport</pre>
     </div><!-- float -->
     <p>
      <kbd>marisa-reverse-lookup</kbd> is a tool to test reverse lookup. If a given ID is not out-of-range, this tool restores the associated key and prints it. The available ID is <var>0</var> to <var>(n - 1)</var>, where <var>n</var> is the number of keys. Note that an out-of-range ID causes an error.
     </p>
     <p>
      See <kbd>marisa-reverse-lookup -h</kbd> for the list of options.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="marisa-common-prefix-search">marisa-common-prefix-search</a></h3>
     <div class="float">
      <pre class="console">$ marisa-common-prefix-search keyset.dic
USA
3 found
20	U	USA
1526	US	USA
37471	USA	USA</pre>
     </div><!-- float -->
     <p>
      <kbd>marisa-common-prefix-search</kbd> is a tool to test common prefix search.  This tool searches keys from the possible prefixes of a query string and then prints the first <var>m</var> keys, where <var>m</var> is one of the parameters.
     </p>
     <p>
      See <kbd>marisa-common-prefix-search -h</kbd> for the list of options.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="marisa-predictive-search">marisa-predictive-search</a></h3>
     <div class="float">
      <pre class="console">$ marisa-predictive-search keyset.dic -n 2
Touhou
15 found
975378	Touhou	Touhou
5508004	Touhou_Hisotensoku	Touhou</pre>
     </div><!-- float -->
     <p>
      <kbd>marisa-predictive-search</kbd> is a tool to test predictive search. This tool searches keys starting with a query string and then prints the first <var>m</var> keys, where <var>m</var> is one of the parameters.
     </p>
     <p>
      See <kbd>marisa-predictive-search -h</kbd> for the list of options.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="marisa-benchmark">marisa-benchmark</a></h3>
     <div class="float">
      <pre class="console">$ marisa-benchmark keyset.txt
Number of tries: 1 - 5
TAIL mode: Text mode
Node order: Descending weight order
Cache level: Normal cache
Number of keys: 9864459
Total length: 191858227
------+----------+--------+--------+--------+--------+--------
#tries       size    build   lookup  reverse   prefix  predict
                                      lookup   search   search
          [bytes]    [K/s]    [K/s]    [K/s]    [K/s]    [K/s]
------+----------+--------+--------+--------+--------+--------
     1   69905816   334.84  1368.16  1304.82  1080.44   605.92
     2   53635744   284.03   762.91   773.68   662.04   244.35
     3   51044424   278.89   688.86   703.60   604.44   212.00
     4   50309000   277.01   669.23   680.78   588.57   204.23
     5   50042232   275.93   636.83   674.26   562.08   199.48
------+----------+--------+--------+--------+--------+--------</pre>
     </div><!-- float -->
     <p>
      <kbd>marisa-benchmark</kbd> is a tool to benchmark libmarisa. This tool takes the same input as <kbd>marisa-build</kbd> and measures the performance of libmarisa for the given set of keys. This tool is useful to fix dictionary settings.
     </p>
     <p>
      For the search performance, <kbd>marisa-benchmark</kbd> measures the time to lookup or search keys in input order. When the keys are given in lexicographic order, few cache misses will occur in the benchmark. In contrast, when the keys are given in random order, many cache misses will occur in the benchmark.
     </p>
     <p>
      See <kbd>marisa-benchmark -h</kbd> for the list of options.
     </p>
    </div><!-- subsection -->
    <div class="subsection">
     <h3><a name="marisa-dump">marisa-dump</a></h3>
     <div class="float">
      <pre class="console">$ marisa-dump keyset.dic | head -3
input: keyset.dic
S
St
Sta</pre>
     </div><!-- float -->
     <p>
      <kbd>marisa-build</kbd> is a tool to dump a dictionary. This tool prints all the keys in a given dictionary.
     </p>
     <p>
      Users can specify the delimiter through command line options. See <kbd>marisa-dump -h</kbd> for the list of options.
     </p>
    </div><!-- subsection -->
   </div><!-- section -->

   <div class="section">
    <h2><a name="library">Library</a></h2>
    <div class="subsection">
     <h3><a name="howto">How to Use</a></h3>
     <div class="float">
      <pre class="code">// sample.cc
#include &lt;iostream&gt;
#include &lt;marisa.h&gt;

int main() {
  marisa::Keyset keyset;
  keyset.push_back("a");
  keyset.push_back("app");
  keyset.push_back("apple");

  marisa::Trie trie;
  trie.build(keyset);

  marisa::Agent agent;
  agent.set_query("apple");
  while (trie.common_prefix_search(agent)) {
    std::cout.write(agent.key().ptr(), agent.key().length());
    std::cout &lt;&lt; ": " &lt;&lt; agent.key().id() &lt;&lt; std::endl;
  }
  return 0;
}</pre>
     </div><!-- float -->
     <div class="float">
      <pre class="console">$ g++ sample.cc -lmarisa
$ ./a.out
a: 0
app: 1
apple: 2</pre>
     </div><!-- float -->
     <p>
      libmarisa provides <kbd>marisa.h</kbd> in which all the headers are <code>#include</code>d. Also, libmarisa uses <code>namespace marisa</code>. All the classes and functions except enumeration types are given as members of this namespace. Note that <code>using namespace marisa</code> may cause a critical error. Finally, <kbd>gcc</kbd> and <kbd>clang</kbd> require an option, <kbd>-lmarisa</kbd>, to link libmarisa with an application.
     </p>
     <p>
      The core components of libmarisa are <a href="#keyset">Keyset</a>, <a href="#agent">Agent</a>, and <a href="#trie">Trie</a>. In addition, libmarisa provides an exception class, <a href="#exception">Exception</a>, and two more classes, <a href="#key">Key</a> and <a href="#query">Query</a>, as members of <code>Keyset</code> and <code>Agent</code>.
     </p>
     <ul>
      <li><code>Keyset</code>: A class to store a set of keys. This class is used to build a set of keys for building a dictionary. Also, this class is useful to store search results.</li>
      <li><code>Agent</code>: A class to store a query and a result of search operations. Every search function takes a reference to this class.</li>
      <li><code>Trie</code>: A dictionary class.</li>
     </ul>
     <p>
      For more examples, you can find the source code of the command line tools in <kbd>tools/</kbd>. The source code is useful as an example of error handling, predicive search, etc.
     </p>
    </div><!-- subsection -->

    <div class="subsection">
     <h3><a name="enum">Enumeration Constants</a></h3>
     <div class="subsubsection">
      <h4>Error Codes</h4>
      <div class="float">
       <pre class="code">typedef enum marisa_error_code_ {
  MARISA_OK           = 0,
  MARISA_STATE_ERROR  = 1,
  MARISA_NULL_ERROR   = 2,
  MARISA_BOUND_ERROR  = 3,
  MARISA_RANGE_ERROR  = 4,
  MARISA_CODE_ERROR   = 5,
  MARISA_RESET_ERROR  = 6,
  MARISA_SIZE_ERROR   = 7,
  MARISA_MEMORY_ERROR = 8,
  MARISA_IO_ERROR     = 9,
  MARISA_FORMAT_ERROR = 10,
} marisa_error_code;</pre>
      </div><!-- float -->
      <p>
       libmarisa throws an instance of <code>Exception</code> when an error occurs, such as a file I/O error (<var>MARISA_IO_ERROR</var>), a size limitation error (<var>MARISA_SIZE_ERROR</var>), etc. For details, see <kbd>marisa/base.h</kbd>.
      </p>
     </div><!-- subsubsection -->
     <div class="subsubsection">
      <h4>Number of Tries</h4>
      <div class="float">
       <pre class="code">
typedef enum marisa_num_tries_ {
  MARISA_MIN_NUM_TRIES     = 0x00001,
  MARISA_MAX_NUM_TRIES     = 0x0007F,
  MARISA_DEFAULT_NUM_TRIES = 0x00003,
} marisa_num_tries;</pre>
      </div><!-- float -->
      <p>
       MARISA is a recursive data structure in which a patricia trie is used to represent another patricia trie. A deeper recursion makes a dictionary more compact but degrades the search performance. For this time-space tradeoff, libmarisa provides a parameter to limit the recursion depth, which is equivalent to the number of tries. <code>marisa_num_tries</code> gives the range and the default setting of this parameter.
      </p>
      <p>
       The best setting depends on the set of keys and the applications. In most cases, libmarisa works well with the default setting, <var>MARISA_DEFAULT_NUM_TRIES</var>, but if the application requires better search performance, <var>MARISA_MIN_NUM_TRIES</var> may be a better choice. Also, if the application uses long and complicated keys, a deeper recursion may achieve much higher spece-efficiency. <kbd>marisa-benchmark</kbd> is useful to find the best setting.
      </p>
     </div><!-- subsubsection -->
     <div class="subsubsection">
      <h4>Cache Size</h4>
      <div class="float">
       <pre class="code">typedef enum marisa_cache_level_ {
  MARISA_HUGE_CACHE    = 0x00080,
  MARISA_LARGE_CACHE   = 0x00100,
  MARISA_NORMAL_CACHE  = 0x00200,
  MARISA_SMALL_CACHE   = 0x00400,
  MARISA_TINY_CACHE    = 0x00800,
  MARISA_DEFAULT_CACHE = MARISA_NORMAL_CACHE
} marisa_cache_level;</pre>
      </div><!-- float -->
      <p>
       libmarisa embeds a precomputed table to a dictionary. The table serves as transition cache which improves the search performance but increases the dictionary size. Cache size is the parameter of this time-space tradeoff.
      </p>
      <p>
       <code>marisa_cache_level</code> gives a list of available cache size. Compared with <var>MARISA_NORMAL_CACHE</var>, <var>MARISA_LARGE_CACHE</var> is 2 times larger, <var>MARISA_HUGE_CACHE</var> is 4 times larger, <var>MARISA_SMALL_CACHE</var> is 2 times smaller, and <var>MARISA_TINY_CACHE</var> is 4 times smaller.
      </p>
     </div><!-- subsubsection -->
     <div class="subsubsection">
      <h4>TAIL Mode</h4>
      <div class="float">
       <pre class="code">typedef enum marisa_tail_mode_ {
  MARISA_TEXT_TAIL    = 0x01000,
  MARISA_BINARY_TAIL  = 0x02000,
  MARISA_DEFAULT_TAIL = MARISA_TEXT_TAIL,
} marisa_tail_mode;</pre>
      </div><!-- float -->
      <p>
       The last patricia trie of MARISA stores its multi-byte labels as strings and <code>marisa_tail_mode</code> gives a list of TAIL implementations.
      </p>
      <p>
       <var>MARISA_TEXT_TAIL</var> stores labels as zero-terminated strings. If the labels contain <var>'\0'</var>, the TAIL mode is automatically switched to <var>MARISA_BINARY_TAIL</var>.
      </p>
      <p>
       On the other hand, <var>MARISA_BINARY_TAIL</var> uses a bit vector, instead of <var>'\0'</var>, to detect the end of labels. This means that <var>MARISA_TEXT_TAIL</var> is more space-efficient than <var>MARISA_BINARY_TAIL</var> when the average length of multi-byte labels is longer than <var>8 bytes</var>.
      </p>
     </div><!-- subsubsection -->
     <div class="subsubsection">
      <h4>Node Order</h4>
      <div class="float">
       <pre class="code">typedef enum marisa_node_order_ {
  MARISA_LABEL_ORDER   = 0x10000,
  MARISA_WEIGHT_ORDER  = 0x20000,
  MARISA_DEFAULT_ORDER = MARISA_WEIGHT_ORDER,
} marisa_node_order;</pre>
      </div><!-- float -->
      <p>
       A dictionary has one more parameter, which is the order of nodes. There are two choices, <var>MARISA_LABEL_ORDER</var> and <var>MARISA_WEIGHT_ORDER</var>. The former arranges nodes in ascending order of the label and the latter arranges nodes in descending order of the weight. Many trie implementations arrange nodes in the label order but libmarisa uses <var>MARISA_WEIGHT_ORDER</var> as the default setting.
      </p>
      <p>
       <var>MARISA_WEIGHT_ORDER</var> optimizes the node order for linear search performed in exact match lookup, common prefix search, and predictive search. In practice, experiments for English words/phrases showed that <var>MARISA_WEIGHT_ORDER</var> halved the average search time. On the other hand, <var>MARISA_LABEL_ORDER</var> enables predictive search to restore keys in lexicographic order.
      </p>
     </div><!-- subsubsection -->
     <div class="subsubsection">
      <h4>Aliases</h4>
      <div class="float">
       <pre class="code">namespace marisa {
  typedef ::marisa_error_code ErrorCode;
  typedef ::marisa_cache_level CacheLevel;
  typedef ::marisa_tail_mode TailMode;
  typedef ::marisa_node_order NodeOrder;
}  // namespace marisa</pre>
      </div><!-- float -->
      <p>
       The above enumeration types are defined in the global namespace to avoid collisions of the enumeration constants with macros provided by other modules. libmarisa provides type aliases and users can choose the familiar one.
      </p>
     </div><!-- subsubsection -->
    </div><!-- subsection -->

    <div class="subsection">
     <h3><a name="exception">class Exception</a></h3>
     <div class="float">
      <pre class="code">class Exception {
 public:
  const char *filename() const;
  int line() const;
  ErrorCode error_code() const;
  const char *error_message() const;

  const char *what() const;
};</pre>
     </div><!-- float -->
     <p>
      <code>Exception</code> is an exception class. libmarisa throws an instance of <code>Exception</code> with the file name (<code>__FILE__</code>), the line number (<code>__LINE__</code>), and an error code (<code>ErrorCode</code>) when an error is detected. The instance also has an error message formatted <var>__FILE__:__LINE__: error_code: error_message</var>.
     </p>
    </div><!-- subsection -->

    <div class="subsection">
     <h3><a name="key">class Key</a></h3>
     <div class="float">
      <pre class="code">class Key {
 public:
  char operator[](std::size_t i) const;
  const char *ptr() const;
  std::size_t length() const;
  std::size_t id() const;
};</pre>
     </div><!-- float -->
     <p>
      <code>Key</code> is a member of <a href="#keyset">Keyset</a> and <a href="#agent">Agent</a>. Each key of <code>Keyset</code> is represented by this class. Also, the search result of <code>Agent</code> is represented by this class.
     </p>
    </div><!-- subsection -->

    <div class="subsection">
     <h3><a name="query">class Query</a></h3>
     <div class="float">
      <pre class="code">class Query {
 public:
  char operator[](std::size_t i) const;
  const char *ptr() const;
  std::size_t length() const;
  std::size_t id() const;
};</pre>
     </div><!-- float -->
     <p>
      <code>Query</code> is a member of <a href="#agent">Agent</a>. This class stores a query string and an ID as input for search functions. Users cannot make changes directly to <code>Query</code> because <code>Agent</code> provides a special interface to update its query.
     </p>
    </div><!-- subsection -->

    <div class="subsection">
     <h3><a name="keyset">class Keyset</a></h3>
     <div class="float">
      <pre class="code">class Keyset {
 public:
  Keyset();

  void push_back(const Key &amp;key);
  void push_back(const Key &amp;key, char end_marker);

  void push_back(const char *str);
  void push_back(const char *ptr,
                 std::size_t length,
                 float weight = 1.0);

  const Key &amp;operator[](std::size_t i) const;
  Key &amp;operator[](std::size_t i);

  std::size_t num_keys();

  bool empty() const;
  std::size_t size() const;
  std::size_t total_length() const;

  void reset();

  void clear();
  void swap(Keyset &amp;rhs);
};</pre>
     </div><!-- float -->
     <div class="subsubsection">
      <h4>Overview</h4>
      <p>
       <code>Keyset</code> is used to store a set of keys for dictionary construction or to save the results of search functions.
      </p>
     </div><!-- subsubsection -->
     <div class="subsubsection">
      <h4>Dictionary Source</h4>
      <p>
       For dictionary construction, users append keys to <code>Keyset</code> by using <code>push_back()</code> and then pass the keyset to <code>build()</code> of <a href="#trie">Trie</a>. <var>weight</var> is an argument to receive the frequency or possibility of each key. If there are same keys, the weights are accumulated in dictionary construction.
      </p>
      <p>
       After dictionary construction, users can read the associated IDs through <code>operator[]()</code>. Instead, the weights are overwritten by the IDs because <code>Key</code> uses a <code>union</code> to store a weight or an ID.
      </p>
     </div><!-- subsubsection -->
     <div class="subsubsection">
      <h4>Search Result</h4>
      <p>
       Users can save a search result to <code>Keyset</code> by using <code>push_back()</code>. When <code>key()</code> of <a href="#agent">Agent</a> is given, a copy of the search result is stored in <code>Keyset</code>. If you want to append an end marker, such as <code>'\0'</code>, use <var>end_marker</var> of <code>push_back()</code>.
      </p>
      <p>
       If you want to reuse an instance of <code>Keyset</code>, <code>reset()</code> may be a better choice than <code>clear()</code> because <code>reset()</code> keeps allocated memory in order to reduce memory allocation overhead.
      </p>
      <p>
       <code>num_keys()</code> and <code>size()</code> return the number of keys. <code>empty()</code> checks whether the number of keys is <var>0</var> or not. <code>total_length()</code> returns the total length in byte.
      </p>
     </div><!-- subsubsection -->
    </div><!-- subsection -->

    <div class="subsection">
     <h3><a name="agent">class Agent</a></h3>
     <div class="float">
      <pre class="code">class Agent {
 public:
  Agent();

  const Query &amp;query() const;
  const Key &amp;key() const;

  void set_query(const char *str);
  void set_query(const char *ptr,
                 std::size_t length);
  void set_query(std::size_t key_id);
};</pre>
     </div><!-- float -->
     <p>
      <code>Agent</code> is actually a tuple of <code>Query</code>, <code>Key</code>, and <code>State</code>. This class is used as I/O of search functions. Also, <code>State</code> is an incomplete type to keep the internal state of search operation.
     </p>
     <p>
      A lookup operation requires 3 steps as follows: 1. sets a query string by <code>set_query()</code> of <code>Agent</code>, 2. passes the agent to <code>lookup()</code> of <code>Trie</code>, and 3. gets the search result by <code>key()</code> of <code>Agent</code>. The other operations proceed in the same way.
     </p>
    </div><!-- subsection -->

    <div class="subsection">
     <h3><a name="trie">class Trie</a></h3>
     <div class="float">
      <pre class="code">class Trie {
 public:
  Trie();

  void build(Keyset &amp;keyset,
             int config_flags = 0);

  void mmap(const char *filename);
  void map(const void *ptr,
           std::size_t size);

  void load(const char *filename);
  void read(int fd);

  void save(const char *filename) const;
  void write(int fd) const;

  bool lookup(Agent &amp;agent) const;
  void reverse_lookup(Agent &amp;agent) const;
  bool common_prefix_search(Agent &amp;agent) const;
  bool predictive_search(Agent &amp;agent) const;

  std::size_t num_tries() const;
  std::size_t num_keys() const;
  std::size_t num_nodes() const;

  TailMode tail_mode() const;
  NodeOrder node_order() const;

  bool empty() const;
  std::size_t size() const;
  std::size_t io_size() const;

  void clear();
  void swap(Trie &amp;rhs);
};</pre>
     </div><!-- float -->
     <div class="subsubsection">
      <h4>Overview</h4>
      <p>
       <code>Trie</code> is a dictionary class, which is the most important component of libmarisa. All the operations are performed through this class.
      </p>
      <p>
       In fact, <code>Trie</code> is a dictionary handle, and if the handle is invalid, functions other than <code>build()</code>, <code>mmap()</code>, <code>map()</code>, <code>load()</code>, <code>read()</code>, <code>clear()</code>, <code>swap()</code> throw an exception.
      </p>
     </div><!-- subsubsection -->
     <div class="subsubsection">
      <h4>Construction</h4>
      <p>
       You can build a dictionary by using <code>build()</code>. The arguments are the above mentioned <a href="#keyset">Keyset</a> and a dictionary setting, <var>config_flags</var>, which is represented by a combination of flags. For example, <var>2 | MARISA_BINARY_TAIL</var> specifies the maximum number of tries (<var>2</var>) and a TAIL mode (<var>MARISA_BINARY_TAIL</var>). Also, in this case, the default settings, <var>MARISA_DEFAULT_ORDER</var> and <var>MARISA_DEFAULT_CACHE</var>, are used for the node order and the cache size.
      </p>
      <p>
       The IDs associated with the keys are available through <code>operator[]()</code> of <var>keyset</var>, and the IDs are useful to associate the keys with any data types.
      </p>
     </div><!-- subsubsection -->
     <div class="subsubsection">
      <h4>File I/O</h4>
      <p>
       <code>mmap()</code> is an interface for memory mapped I/O. If an application performs a few search operations, it is unnecessary to read the whole dictionary, and in such a case, <code>mmap()</code> is useful. Also, memory mapped I/O is an easy way to share dictionary data among processes. On the other hand, if an application performs a lot of search operations, a memory mapped dictionary might cause a lot of random disk accesses which considerably increase the search time.
      </p>
      <p>
       <code>map()</code> restores an instance of <code>Trie</code> from dictionary data on memory. <code>load()</code> and <code>read()</code> read a dictionary from a file or a file descriptor. <code>save()</code> and <code>write()</code> write a dictionary to a file or a file descriptor.
      </p>
     </div><!-- subsubsection -->
     <div class="subsubsection">
      <h4>Search</h4>
      <p>
       <code>Trie</code> provides 4 search functions <code>lookup()</code>, <code>reverse_lookup()</code>, <code>common_prefix_search()</code>, and <code>predictive_search()</code> as follows:
      </p>
      <ul>
       <li>
        <code>lookup()</code> checks whether a query string is registered or not, and if it is registered, <code>lookup()</code> returns <var>true</var>. In this case, the search result is available through <code>agent.key()</code>. Note that <code>lookup()</code> does not restore a key and <code>agent.key().ptr()</code> points to the query string because the two strings are the same.
       </li>
       <li>
        <code>reverse_lookup()</code> restores a key from its ID. This function has no return value and the key is available through <var>agent.key()</var>. The key is actually stored in <var>agent</var> and it is lost when <var>agent</var> is reset or used for another search operation. If a given ID is out-of-range, <code>reverse_lookup()</code> throws an exception.
       </li>
       <li>
        <code>common_prefix_search()</code> searches keys from the possible prefixes of a query string. If there are matching keys, this function returns <var>true</var>. In this case, the first key is available through <code>agent.key()</code>, and if there are more than one matching keys, the next key will be available after the next <code>common_prefix_search()</code> which returns <var>true</var> until there are no more matching keys. Note that <code>agent.key().ptr() == agent.query().ptr()</code> is always <var>true</var> when <code>common_prefix_search()</code> has returned <var>true</var>.
       </li>
       <li>
        <code>predictive_search()</code> searches keys starting with a query string, and similar to <code>common_prefix_search()</code>, this function returns <var>true</var> until there are no more matching keys.
       </li>
      </ul>
      <p>
       Note that <code>agent</code> keeps the internal state of <code>common_prefix_search()</code> and <code>predictive_search()</code> until <code>agent</code> is passed to another search function or <code>agent.set_query()</code> is called.
      </p>
      <p>
       <code>num_keys()</code> and <code>size()</code> return the number of keys. <code>empty()</code> checks whether the number of keys is <var>0</var> or not. <code>io_size()</code> returns the dictionary size in byte.
      </p>
     </div><!-- subsubsection -->
    </div><!-- subsection -->

    <div class="subsection">
     <h3><a name="stdio">stdio</a></h3>
     <div class="float">
      <pre class="code">void fread(std::FILE *file, Trie *trie);
void fwrite(std::FILE *file, const Trie &amp;trie);</pre>
     </div><!-- float -->
     <p>
      The functions for I/O using <code>std::FILE</code> are declared in <kbd>marisa/stdio.h</kbd>. If you don't want to <code>#include &lt;cstdio&gt;</code>, use <kbd>marisa/trie.h</kbd> instead of <kbd>marisa.h</kbd>.
     </p>
    </div><!-- subsection -->

    <div class="subsection">
     <h3><a name="iostream">iostream</a></h3>
     <div class="float">
      <pre class="code">std::istream &amp;read(std::istream &amp;stream, Trie *trie);
std::ostream &amp;write(std::ostream &amp;stream, const Trie &amp;trie);

std::istream &amp;operator>>(std::istream &amp;stream, Trie &amp;trie);
std::ostream &amp;operator<<(std::ostream &amp;stream, const Trie &amp;trie);</pre>
     </div><!-- float -->
     <p>
      The functions for I/O using <code>std::iostream</code> are declared in <kbd>marisa/iostream.h</kbd>. If you don't want to <code>#include &lt;iosfwd&gt;</code>, use <kbd>marisa/trie.h</kbd> instead of <kbd>marisa.h</kbd>.
     </p>
    </div><!-- subsection -->
   </div><!-- section -->

   <div class="section">
    <h2><a name="compatibility">Cross-architecture compatibility</a></h2>
    <p>
     The dictionary format of libmarisa depends on the architecture. Dictionaries built on a little endian architecture don't work on a big endian architecture. Also, on a big endian architecture, dictionaries built on a 32-bit machine don't work on a 64-bit machine and vise versa. On a little endian architecture, dictionaries are compatible on 32/64-bit machines.
    </p>
   </div><!-- section -->

   <div class="section">
    <h2><a name="references">References</a></h2>
    <ul>
     <li><a href="http://www.slideshare.net/s5yata/x86opti-05-s5yata">Remove Branches in BitVector Select Operations - marisa 0.2.2 -</a>
      <ul>
       <li>This PowerPoint presentation describes improvements in marisa 0.2.2.</li>
      </ul>
     </li>
    </ul>
   </div><!-- section -->

   <div class="section">
    <h2><a name="conclusion">Last Spell</a></h2>
    <p>
     Feel free to contact me for any questions.
    </p>
   </div><!-- section -->
  </div><!-- body -->
  <div id="footer">
   <div class="left">MARISA: Matching Algorithm with Recursively Implemented StorAge</div>
   <div class="right">
  ‮moc.liamg@atay.umusus‭
   </div>
   <div class="end"></div>
  </div><!-- footer -->
 </body>
</html>