diff options
Diffstat (limited to 'v2/README.md')
-rw-r--r-- | v2/README.md | 56 |
1 files changed, 56 insertions, 0 deletions
diff --git a/v2/README.md b/v2/README.md new file mode 100644 index 0000000..a5b848a --- /dev/null +++ b/v2/README.md @@ -0,0 +1,56 @@ +# License Classifier v2 + +This is a substantial revision of the license classifier with a focus on improved accuracy and performance. + +## Glossary + +- corpus dictionary - contains all the unique tokens stored in the corpus of +documents to match. Any tokens in the target document that aren't in the corpus +dictionary are mapped to an invalid value. + +- document - an internal-only data type that contains sequenced token information +for a source or target content for matching. + +- source content - a body of text that can be matched by the scanner. + +- target content - the argument to Match that is scanned for matches with source +content. + +- indexed document - an internal-only data type that maps a document to the +corpus dictionary, resulting in a compressed representation suitable for fast +text searching and mapping operations. an indexed document is necessarily +tightly coupled to its corpus. + +- frequency table - a lookup table holding per-token counts of the number of +times a token appears in content. used for fast filtering of target content +against different source contents. + +- q-gram - a substring of content of length q tokens used to efficiently match +ranges of text. For background on the q-gram algorithms used, please see +[Indexing Methods for Approximate String Matching](https://users.dcc.uchile.cl/~gnavarro/ps/deb01.pdf) + +- searchset - a data structure that uses q-grams to identify ranges of text in +the target that correspond to a range of text in the source. The searchset +algorithms compensate for the allowable error in matching text exactly, dealing +with additional or missing tokens. + + +## Migrating from v1 + +The API for the classifier versions is quite similar, but there are two key +distinctions to be aware of while migrating usages. + +The confidence value for the v2 classifier is applied uniformly to results; it +will never return a match that is lower confidence than the threshold. In v1, +MultipleMatch behaved this way, but NearestMatch would return a value +regardless of the confidence match. Users often verified that the confidence +was above the threshold, but this is no longer necessary. + +The second change is that the classifier now returns all matches against the +supplied corpus. The v1 classifier allowed filtering on header matches via a +boolean field. This can be emulated by creating a license classifier with a +reduced corpus if matching against headers is not desired. Alternatively, the +user can use the MatchType field in the Match struct to filter out unwanted +matches. + + |