Age | Commit message | Author |
|
in debugging, just the final set helps to understand why a document didn't
survive for further matching rounds.
PiperOrigin-RevId: 464832890
|
|
source offsets being considered. This revisits the previous fix to avoid using
a map and its higher memory cost.
PiperOrigin-RevId: 399774203
|
|
wants to do some form of random access based on SrcStart / SrcEnd, so this might be a better fit.
<<<
b/196234339 was the problem
There's an off-by-one that I don't entirely understand, so I'm trying to split the difference here.
The filter variable is currently a slice and we're seeing an off-by-one in production with the amd_vulkan/LICENSE:
```
panic: runtime error: index out of range [3] with length 3

goroutine 1646 [running]:
```
Given that SrcStart / SrcEnd appear to be positions in the text file and the `i` variable seems to move within that range, it seemed natural to replace filter with a map of indexes instead of a slice. This way we can preserve the somewhat random-access pattern that appears to be happening but avoid any range errors.
>>>
PiperOrigin-RevId: 390415408
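A minimal sketch of the idea described in this message, not the classifier's actual code: the type `offsetFilter` and the helpers `newOffsetFilter` and `keep` are hypothetical names invented for illustration. It shows why a map keyed by source offset tolerates an off-by-one lookup that would panic on a slice of the same length.

```go
package main

import "fmt"

// offsetFilter records which source offsets are still under consideration.
// Hypothetical type for illustration; not the classifier's real data structure.
type offsetFilter map[int]bool

// newOffsetFilter marks every offset in [srcStart, srcEnd) as a candidate.
func newOffsetFilter(srcStart, srcEnd int) offsetFilter {
	size := srcEnd - srcStart
	if size < 0 {
		size = 0 // an empty or inverted range yields an empty filter
	}
	f := make(offsetFilter, size)
	for i := srcStart; i < srcEnd; i++ {
		f[i] = true
	}
	return f
}

// keep reports whether offset i is a candidate. A missing key returns
// false, where indexing a slice of length 3 at [3] would panic.
func (f offsetFilter) keep(i int) bool {
	return f[i]
}

func main() {
	f := newOffsetFilter(0, 3)
	fmt.Println(f.keep(2), f.keep(3)) // prints "true false"
}
```

The trade-off is memory: a map entry per offset costs considerably more than a slice element, which is a plausible reason to revisit the choice later.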
|
|
Travis CI was failing on this build because it ran the simplification
step.
|
|
In order to handle vague reformulations of license headers, it's important to
make sure the classifier doesn't "hallucinate" a license by introducing
critical words that don't exist in the input text. This was originally done as
a global list, but it had two problems.
First, blocking 'bsd' as a keyword inadvertently suppressed Beerware
detection, since phk has used two different email addresses with that header,
one of which contained the string 'bsd'.
Second, some common phrases, particularly company names,
were significant in some licenses but not in others (Silicon Graphics is critical in
SGI-B licenses, but not in libtiff).
The solution is to scope the list of prohibited phrases to subsets of licenses.
This introduces that change and fixes up the reference tests to match the
corrected detections.
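One way to picture the scoping, as a rough sketch only: the table `scopedPhrases`, the function `introducesProhibited`, and the prefix-matching rule are all assumptions made for this example, and the single entry is taken from the Silicon Graphics case mentioned above. The function receives the text a match would have to add to the input and rejects it only when the phrase is scoped to the candidate license.

```go
package main

import (
	"fmt"
	"strings"
)

// scopedPhrases maps a license-name prefix to phrases that a match must not
// introduce. Hypothetical table; the real list and its keys may differ.
var scopedPhrases = map[string][]string{
	"SGI-B": {"silicon graphics"},
}

// introducesProhibited reports whether insertedText (words present in the
// reference license but absent from the input) contains a phrase that is
// prohibited for the given candidate license.
func introducesProhibited(license, insertedText string) bool {
	lower := strings.ToLower(insertedText)
	for scope, phrases := range scopedPhrases {
		if !strings.HasPrefix(license, scope) {
			continue // phrase list does not apply to this license
		}
		for _, p := range phrases {
			if strings.Contains(lower, p) {
				return true
			}
		}
	}
	return false
}

func main() {
	inserted := "Copyright Silicon Graphics, Inc."
	fmt.Println(introducesProhibited("SGI-B-2.0", inserted)) // prints "true"
	fmt.Println(introducesProhibited("libtiff", inserted))   // prints "false"
}
```

With a global list, the second call would also have been rejected, which is the kind of false suppression the scoped list is meant to avoid.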
An issue where tokenization preserved trailing periods on numbers, causing
version detection errors, was also resolved.
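The tokenization issue can be illustrated with a small sketch. This is an assumption about the shape of the problem, not the actual fix: `normalizeToken` and `tokenize` are hypothetical helpers, and whitespace splitting stands in for the real tokenizer. A sentence-final "2." would otherwise fail to compare equal to the version number "2".

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeToken drops trailing periods from tokens that start with an
// ASCII digit, so "2." becomes "2" while "2.0" is left unchanged.
func normalizeToken(tok string) string {
	if tok == "" || tok[0] < '0' || tok[0] > '9' {
		return tok
	}
	return strings.TrimRight(tok, ".")
}

// tokenize lowercases the text, splits it on whitespace, and normalizes
// each token. A deliberately simplified stand-in for a real tokenizer.
func tokenize(text string) []string {
	fields := strings.Fields(strings.ToLower(text))
	for i, f := range fields {
		fields[i] = normalizeToken(f)
	}
	return fields
}

func main() {
	fmt.Println(tokenize("GNU GPL version 2. See version 2.0"))
	// prints "[gnu gpl version 2 see version 2.0]"
}
```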
|
|
(q-grams were formerly discussed as token runs).
The test coverage here focuses primarily on the boundary conditions of
detection, but it also provides good coverage of the code's branches.
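For readers unfamiliar with the term, a q-gram here is a run of q consecutive tokens. The sketch below is a generic illustration with a hypothetical `qgrams` helper, not the classifier's implementation. It includes the boundary case where the input has fewer than q tokens.

```go
package main

import (
	"fmt"
	"strings"
)

// qgrams returns every run of q consecutive tokens, each joined by a space.
// It returns nil when q is not positive or there are fewer than q tokens.
func qgrams(tokens []string, q int) []string {
	if q <= 0 || len(tokens) < q {
		return nil
	}
	out := make([]string, 0, len(tokens)-q+1)
	for i := 0; i+q <= len(tokens); i++ {
		out = append(out, strings.Join(tokens[i:i+q], " "))
	}
	return out
}

func main() {
	tokens := strings.Fields("permission is hereby granted free of charge")
	for _, g := range qgrams(tokens, 3) {
		fmt.Println(g)
	}
}
```

A sequence of n tokens yields n-q+1 q-grams, so inputs of exactly q tokens and of q-1 tokens are natural boundary cases to test.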
|