path: root/v2/tokenizer_test.go
2022-09-16  Bill Neubauer
Make the public-facing API be implemented in terms of io.Reader rather than []byte. This allows larger files or other inputs to be handled without requiring the full contents to be stored in memory at one time.
PiperOrigin-RevId: 474376181
2022-09-16  Bill Neubauer
Rewrite the tokenization process to work on streams rather than requiring the entire text under analysis to be present in memory. Some of the changes here improved the accuracy of classification, requiring updates to the expected tests.
PiperOrigin-RevId: 468114923
2022-09-16  Bill Neubauer
Removing the Index field from the token structures. It's completely redundant, since it's just the position of the token in the slice. I thought there would be a use for it, but it never materialized.
PiperOrigin-RevId: 457764131
2022-09-16  Bill Neubauer
Adds Copyright detection to the report generated by the classifier. Previously, copyright lines were silently dropped; now the classifier returns a match indicating that the line contained an identified copyright statement. The matched text is still pruned from the normalized output; this only changes the report the classifier produces.
PiperOrigin-RevId: 444592120
2020-11-13  Bill Neubauer
Change the public API to use []byte instead of string. The original thinking was that string was the better choice because it clearly indicated that the classifier wouldn't modify the input data, but it comes with some runtime cost: if the caller has a byte slice and wants to scan only part of it, they'd need to make a copy to get a string. Having the API take []byte and documenting that it doesn't modify the data is the better choice. The current implementation doesn't yet realize this efficiency because it immediately makes a string for tokenization purposes, but that can be fixed in the future; changing the API signature is not so easily accomplished.
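The runtime cost described above comes from Go's string conversion semantics: re-slicing a []byte shares the backing array, while converting to string forces an allocation and copy. A small illustration (the buffer contents are made up for the example):

```go
package main

import "fmt"

func main() {
	buf := []byte("header: MIT License body text")

	// Scanning a sub-range of a []byte is a cheap re-slice:
	// sub shares buf's backing array, no allocation or copy.
	sub := buf[8:19]

	// A string-based API forces an allocation and a copy of
	// those bytes here, which is the cost the commit avoids.
	s := string(buf[8:19])

	fmt.Println(string(sub), s) // MIT License MIT License
}
```

This is why a []byte signature is friendlier to callers who already hold a byte slice, even though string would have advertised immutability in the type system.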
2020-11-13  Bill Neubauer
Scope use of phrase induction to fix some bugs. To handle vague reformulations of license headers, it's important to make sure the classifier doesn't "hallucinate" a license by introducing critical words that don't exist in the input text. This was originally done with a global list, but that had two problems. First, it turns out blocking 'bsd' as a keyword inadvertently suppressed Beerware detection, since phk has used two different email addresses with that header, one of which contained the string 'bsd'. Second, some common phrases, particularly company names, are significant in some licenses but not others ('Silicon Graphics' is critical in SGI-B licenses, but not in libtiff). The solution is to scope the list of prohibited phrases to subsets of licenses. This change introduces that scoping and fixes up the reference tests to match the corrected detections. An issue with tokenization preserving trailing periods on numbers, which caused version detection errors, was also resolved.
2020-11-13  Bill Neubauer
Needed to make a change to number tokenization to resolve an issue that cropped up in third_party_py_gevent_LICENSE, where a cross-reference identifier without spaces was being tokenized poorly, causing version mismatch issues. Analyzed 350K number token instances in the license corpus and determined that the only useful characters to retain when encountering a leading number are other numbers, periods, and dashes. All other characters can be discarded without affecting license matching. Doing so minimizes version-matching risk by not introducing spurious deltas.
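The rule described above can be sketched as a small filter. This is a hypothetical helper written from the commit's description, not the classifier's actual tokenizer code: for a token beginning with a digit, keep only digits, periods, and dashes.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// cleanNumberToken sketches the rule from the commit message
// (illustrative name, not the repo's function): when a token
// starts with a digit, retain only digits, '.', and '-', and
// drop every other character.
func cleanNumberToken(tok string) string {
	if tok == "" || !unicode.IsDigit(rune(tok[0])) {
		return tok // rule applies only to tokens with a leading number
	}
	var b strings.Builder
	for _, r := range tok {
		if unicode.IsDigit(r) || r == '.' || r == '-' {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(cleanNumberToken("2.0,"))  // "2.0"
	fmt.Println(cleanNumberToken("1-2a"))  // "1-2"
	fmt.Println(cleanNumberToken("v2.0"))  // "v2.0" — leading char is not a digit
}
```

Keeping digits, periods, and dashes preserves version strings like "2.0" or "1-2" while stripping punctuation that would otherwise create spurious token deltas during matching.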
2020-11-13  Bill Neubauer
The tokenizer for the new version of the licenseclassifier.