Age | Commit message (Collapse) | Author |
|
[]byte. This allows for larger files or other inputs to be handled without
requiring the full contents to be stored in memory at one time.
PiperOrigin-RevId: 474376181
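
A minimal sketch of the streaming idea, assuming the new input type is an io.Reader (the subject line of this commit is truncated, so that type is an assumption). countTokens and countTokensInString are hypothetical helpers, not part of the classifier's API; they only illustrate consuming tokens incrementally instead of from a fully loaded buffer.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countTokens is a hypothetical helper that reads whitespace-separated
// tokens from r one at a time, so the full input never has to be held
// in memory at once.
func countTokens(r io.Reader) (int, error) {
	sc := bufio.NewScanner(r)
	sc.Split(bufio.ScanWords) // yield one word per Scan call
	n := 0
	for sc.Scan() {
		n++
	}
	return n, sc.Err()
}

// countTokensInString is a convenience wrapper for callers that
// already hold the text as a string.
func countTokensInString(s string) (int, error) {
	return countTokens(strings.NewReader(s))
}

func main() {
	n, err := countTokensInString("Permission is hereby granted, free of charge")
	if err != nil {
		panic(err)
	}
	fmt.Println(n)
}
```

Any io.Reader works here, including an *os.File, so a large file can be scanned without reading it into a []byte first.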
|
|
entire text for analysis be present in memory.
Some of the changes here improved the accuracy of classification, requiring
updates to the expected test results.
PiperOrigin-RevId: 468114923
|
|
It's completely redundant since it's the position of the token in the slice. I
thought there would be a use for it, but it never materialized.
PiperOrigin-RevId: 457764131
|
|
Previously copyright lines were just silently dropped, but now the classifier
returns a match to indicate that the line contained an identified copyright
statement. The matched text is still pruned from the normalized output; this
only changes the output report from the classifier.
PiperOrigin-RevId: 444592120
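
The behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the classifier's implementation: the match type, the pruneCopyright function, and the regular expression are all hypothetical, and real copyright detection is likely more involved than a line-prefix test.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// copyrightRE is an assumed, deliberately simple pattern: a line that
// starts with the word "copyright" (case-insensitive).
var copyrightRE = regexp.MustCompile(`(?i)^\s*copyright\b`)

// match is a hypothetical report entry for a detected copyright line.
type match struct {
	Name string
	Line int // 1-based line number in the input
}

// pruneCopyright removes copyright lines from the text, as before, but
// also returns a match for each removed line instead of dropping it
// silently.
func pruneCopyright(text string) (string, []match) {
	var kept []string
	var matches []match
	for i, line := range strings.Split(text, "\n") {
		if copyrightRE.MatchString(line) {
			matches = append(matches, match{Name: "Copyright", Line: i + 1})
			continue // still pruned from the normalized output
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n"), matches
}

func main() {
	out, ms := pruneCopyright("Copyright 2020 Example Authors\nPermission is hereby granted")
	fmt.Printf("%q %+v\n", out, ms)
}
```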
|
|
The original thinking was that string was a better choice because it clearly
indicated that the classifier wouldn't modify the input data, but it comes with
some runtime cost in that if the caller has a byte slice and just wanted to
scan part of it, they'd need to make a copy to get a string.
Having the API be []byte and documenting that it doesn't modify the data is the
better choice. The current implementation doesn't yet realize this
efficiency because it immediately makes a string for tokenization purposes,
but this can be fixed in the future. Changing the API signature is not so
easily accomplished.
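
To illustrate the cost argument, here is a small sketch (countSpaces is a hypothetical stand-in for any read-only scan, not the classifier's API). Sub-slicing a []byte shares the caller's backing array, so scanning part of a buffer needs no copy, whereas a string-based signature would force a string(data[a:b]) conversion that allocates and copies.

```go
package main

import "fmt"

// countSpaces is a hypothetical read-only scan over a byte slice.
// It never writes to in, which is the documented contract a []byte
// API would rely on.
func countSpaces(in []byte) int {
	n := 0
	for _, b := range in {
		if b == ' ' {
			n++
		}
	}
	return n
}

func main() {
	data := []byte("header\nCopyright 2020 Example Authors\ntrailer")

	// Scan only the middle line. The sub-slice is a view onto data;
	// no bytes are copied.
	part := data[7:37]
	fmt.Println(countSpaces(part))

	// The sub-slice and the original share the same memory.
	fmt.Println(&part[0] == &data[7])
}
```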
|
|
In order to handle vague reformulations of license headers, it's important to
make sure the classifier doesn't "hallucinate" a license by introducing
critical words that don't exist in the input text. This was originally done as
a global list, but it had two problems.
The first was that blocking 'bsd' as a keyword inadvertently suppressed Beerware
detection, since phk has used two different email addresses with that header,
one of which contained the string 'bsd'.
The second problem was that some common phrases, particularly company names,
were significant in some licenses but not others (Silicon Graphics is critical
in SGI-B licenses, but not in libtiff).
The solution is to scope the list of prohibited phrases to subsets of licenses.
This introduces that change and fixes up the reference tests to match the
corrected detections.
Also resolved an issue where tokenization preserved trailing periods on
numbers, which caused version detection errors.
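
A rough sketch of the scoping idea, under stated assumptions: the scopedPhrases table, the prefix-based scoping, and the matchAllowed function are hypothetical names chosen for illustration, and the entries are examples rather than the classifier's actual data. A candidate match is rejected only when the reference license would introduce a prohibited phrase that the input lacks, and only for licenses inside that phrase's scope.

```go
package main

import (
	"fmt"
	"strings"
)

// scopedPhrases maps a license-name prefix (the assumed notion of a
// "subset of licenses") to phrases that a match must not introduce.
var scopedPhrases = map[string][]string{
	"BSD":   {"bsd"},
	"SGI-B": {"silicon graphics"},
}

// matchAllowed reports whether matching input against the named
// reference license is acceptable. It returns false if the reference
// contains an in-scope phrase that is absent from the input.
func matchAllowed(license, reference, input string) bool {
	ref := strings.ToLower(reference)
	in := strings.ToLower(input)
	for prefix, phrases := range scopedPhrases {
		if !strings.HasPrefix(license, prefix) {
			continue // phrase list does not apply to this license
		}
		for _, p := range phrases {
			if strings.Contains(ref, p) && !strings.Contains(in, p) {
				return false
			}
		}
	}
	return true
}

func main() {
	// 'bsd' in a reference email address no longer blocks Beerware.
	fmt.Println(matchAllowed("Beerware", "mail user@bsd.example", "mail user@other.example"))
	// The company name is still critical for SGI-B.
	fmt.Println(matchAllowed("SGI-B-2.0", "Copyright Silicon Graphics, Inc.", "Copyright Example Corp."))
}
```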
|
|
that cropped up in third_party_py_gevent_LICENSE where a xreference identifier
without spaces was getting tokenized poorly, causing version mismatch issues.
Analyzed 350K instances of number tokens in the license corpus and
determined the only useful characters to retain when encountering a leading
number are other numbers, periods, and dashes. All other characters can be
discarded without affecting license matching. Doing so reduces the risk of
version mismatches by not introducing spurious deltas.
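
The rule described above can be sketched as follows. cleanNumberToken is a hypothetical name, and the sketch assumes that disallowed characters are dropped wherever they occur in the token; the commit message does not say whether the real tokenizer instead stops at the first such character.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanNumberToken keeps only digits, periods, and dashes in a token
// that starts with a digit. Tokens that do not start with a digit are
// returned unchanged.
func cleanNumberToken(tok string) string {
	if tok == "" || tok[0] < '0' || tok[0] > '9' {
		return tok
	}
	var b strings.Builder
	for _, r := range tok {
		if (r >= '0' && r <= '9') || r == '.' || r == '-' {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	for _, tok := range []string{"2.0,", "1996-2004a", "v2", "3rd"} {
		fmt.Printf("%q -> %q\n", tok, cleanNumberToken(tok))
	}
}
```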
|
|
|