path: root/v2/tokenizer_test.go
2022-09-16  Bill Neubauer
Make the public-facing API be implemented in terms of io.Reader rather than []byte. This allows larger files or other inputs to be handled without requiring the full contents to be stored in memory at one time.
PiperOrigin-RevId: 474376181
2022-09-16  Bill Neubauer
Rewrite the tokenization process to work on streams rather than requiring the entire text under analysis to be present in memory. Some of the changes here improved the accuracy of classification, requiring updates to the expected tests.
PiperOrigin-RevId: 468114923
2022-09-16  Bill Neubauer
Removing the Index field from the token structures. It's completely redundant, since it's just the position of the token in the slice. I thought there would be a use for it, but it never materialized.
PiperOrigin-RevId: 457764131
2022-09-16  Bill Neubauer
Adds Copyright detection to the report generated by the classifier. Previously, copyright lines were silently dropped; now the classifier returns a match indicating that the line contained an identified copyright statement. The matched text is still pruned from the normalized output; this only changes the report the classifier produces.
PiperOrigin-RevId: 444592120
2020-11-13  Bill Neubauer
Change the public API to use []byte instead of string. The original thinking was that string was the better choice because it clearly indicated that the classifier wouldn't modify the input data, but it comes with some runtime cost: if the caller has a byte slice and wants to scan only part of it, they'd need to make a copy to get a string. Having the API take []byte and documenting that it doesn't modify the data is the better choice. The current implementation doesn't yet realize this efficiency because it immediately makes a string for tokenization purposes, but that can be fixed in the future; changing the API signature is not so easily accomplished.
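The runtime cost described above comes from Go's string conversion semantics: re-slicing a []byte shares the backing array, while converting to string forces an allocation and copy. A small illustration (the buffer contents are made up for the example):

```go
package main

import "fmt"

func main() {
	buf := []byte("header: MIT License body text")

	// Scanning a sub-range of a []byte is a cheap re-slice:
	// sub shares buf's backing array, no allocation or copy.
	sub := buf[8:19]

	// A string-based API forces an allocation and a copy of
	// those bytes here, which is the cost the commit avoids.
	s := string(buf[8:19])

	fmt.Println(string(sub), s) // MIT License MIT License
}
```

This is why a []byte signature is friendlier to callers who already hold a byte slice, even though string would have advertised immutability in the type system.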
2020-11-13  Bill Neubauer
Scope use of phrase induction to fix some bugs. To handle vague reformulations of license headers, it's important to make sure the classifier doesn't "hallucinate" a license by introducing critical words that don't exist in the input text. This was originally done with a global list, but that had two problems. First, it turns out blocking 'bsd' as a keyword inadvertently suppressed Beerware detection, since phk has used two different email addresses with that header, one of which contained the string 'bsd'. Second, some common phrases, particularly company names, are significant in some licenses but not others ('Silicon Graphics' is critical in SGI-B licenses, but not in libtiff). The solution is to scope the list of prohibited phrases to subsets of licenses. This change introduces that scoping and fixes up the reference tests to match the corrected detections. An issue with tokenization preserving trailing periods on numbers, which caused version detection errors, was also resolved.
2020-11-13  Bill Neubauer
Needed to make a change to number tokenization to resolve an issue that cropped up in third_party_py_gevent_LICENSE, where a cross-reference identifier without spaces was being tokenized poorly, causing version mismatch issues. Analyzed 350K number token instances in the license corpus and determined that the only useful characters to retain when encountering a leading number are other numbers, periods, and dashes. All other characters can be discarded without affecting license matching. Doing so minimizes version-matching risk by not introducing spurious deltas.
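The rule described above can be sketched as a small filter. This is a hypothetical helper written from the commit's description, not the classifier's actual tokenizer code: for a token beginning with a digit, keep only digits, periods, and dashes.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// cleanNumberToken sketches the rule from the commit message
// (illustrative name, not the repo's function): when a token
// starts with a digit, retain only digits, '.', and '-', and
// drop every other character.
func cleanNumberToken(tok string) string {
	if tok == "" || !unicode.IsDigit(rune(tok[0])) {
		return tok // rule applies only to tokens with a leading number
	}
	var b strings.Builder
	for _, r := range tok {
		if unicode.IsDigit(r) || r == '.' || r == '-' {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(cleanNumberToken("2.0,"))  // "2.0"
	fmt.Println(cleanNumberToken("1-2a"))  // "1-2"
	fmt.Println(cleanNumberToken("v2.0"))  // "v2.0" — leading char is not a digit
}
```

Keeping digits, periods, and dashes preserves version strings like "2.0" or "1-2" while stripping punctuation that would otherwise create spurious token deltas during matching.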
2020-11-13  Bill Neubauer
The tokenizer for the new version of the licenseclassifier.