author: Bill Neubauer <wcn@google.com> 2020-09-21 08:56:22 -0700
committer: Bill Neubauer <bill.neubauer@gmail.com> 2020-11-13 09:54:34 -0800
commit: 3838607e076a72c4edea1d960e7a3b9e2f320899
tree: 9f501663af907375c6dc8ded20ebe9f2b7a1eec3 /v2/tokenizer_test.go
parent: f8c0fb63a3177b0de8bd614699113899e2f7fbc2
Scope use of phrase induction to fix some bugs.
To handle vague reformulations of license headers, the classifier must not "hallucinate" a license by introducing critical words that do not exist in the input text. This was originally enforced with a global list, which had two problems.

First, blocking 'bsd' as a keyword inadvertently suppressed Beerware detection, since phk has used two different email addresses with that header, one of which contains the string 'bsd'.

Second, some common phrases, particularly company names, are significant in some licenses but not in others (Silicon Graphics is critical in SGI-B licenses, but not in libtiff).

The solution is to scope the list of prohibited phrases to subsets of licenses. This commit introduces that change and fixes up the reference tests to match the corrected detections. It also resolves a tokenization issue in which trailing periods were preserved on numbers, causing version detection errors.
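The two ideas described above can be sketched in Go as follows. This is a minimal illustration and not the classifier's actual code: `trimTrailingPeriod`, `normalize`, `scopedPhrases`, and `allowed` are hypothetical names, the license identifiers are used only as examples, and the real scoping rules may well differ.

```go
package main

import (
	"fmt"
	"strings"
)

// trimTrailingPeriod (hypothetical helper) drops a sentence-ending period
// from a token that starts with a digit, so "1.1." becomes "1.1".
func trimTrailingPeriod(tok string) string {
	if len(tok) > 1 && tok[0] >= '0' && tok[0] <= '9' && strings.HasSuffix(tok, ".") {
		return strings.TrimSuffix(tok, ".")
	}
	return tok
}

// normalize lowercases the input, splits it on whitespace, and cleans up
// numeric tokens. A real tokenizer does considerably more than this.
func normalize(s string) string {
	fields := strings.Fields(strings.ToLower(s))
	for i, f := range fields {
		fields[i] = trimTrailingPeriod(f)
	}
	return strings.Join(fields, " ")
}

// scopedPhrases (assumed structure) maps a license-name prefix to the phrases
// that must be present in the input before a license in that family may be
// reported. The entry below is illustrative only.
var scopedPhrases = map[string][]string{
	"SGI-B": {"silicon graphics"},
}

// allowed reports whether a candidate license may be reported for the input.
// Phrases only constrain the license families they are scoped to.
func allowed(license, input string) bool {
	text := strings.ToLower(input)
	for prefix, phrases := range scopedPhrases {
		if !strings.HasPrefix(license, prefix) {
			continue
		}
		for _, p := range phrases {
			if !strings.Contains(text, p) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(normalize("This is version 1.1."))
	fmt.Println(allowed("SGI-B-2.0", "Copyright Silicon Graphics, Inc."))
	fmt.Println(allowed("SGI-B-2.0", "Permission is hereby granted"))
	fmt.Println(allowed("libtiff", "Permission is hereby granted"))
}
```

Because the phrase list is keyed by license family in this sketch, a phrase such as "silicon graphics" constrains only SGI-B candidates and has no effect on libtiff, which is the behavior a single global list could not express.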
Diffstat (limited to 'v2/tokenizer_test.go')
-rw-r--r-- v2/tokenizer_test.go 2
1 file changed, 1 insertion, 1 deletion
diff --git a/v2/tokenizer_test.go b/v2/tokenizer_test.go
index 4a4639b..7ea66cc 100644
--- a/v2/tokenizer_test.go
+++ b/v2/tokenizer_test.go
@@ -202,7 +202,7 @@ func TestTokenizer(t *testing.T) {
 		{
 			name:   "preserve version number (not a header, but header-looking) not at beginning of sentence",
 			input:  "This is version 1.1.",
-			output: "this is version 1.1.",
+			output: "this is version 1.1",
 		},
 		{
 			name:   "copyright inside a comment",