path: root/v2/searchset.go
Age | Commit message | Author
2022-09-16 | Reduce noisy logging in searchset. | Bill Neubauer
The intermediate results are rarely useful in debugging; just the final set helps to understand why a document didn't survive for further matching rounds.
PiperOrigin-RevId: 464832890
2022-03-16 | Fixed an OOB scenario where the target document could be shorter than the source offsets being considered. | Bill Neubauer
This revisits the previous fix to avoid using a map and its more expensive memory hit.
PiperOrigin-RevId: 399774203
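The out-of-bounds scenario described in this commit can be avoided on a plain slice by clamping the range before indexing. The following is a minimal sketch of that idea, not the code from this commit: the function name `countHitsClamped`, its parameters, and the `[]bool` filter type are assumptions made for illustration.

```go
package main

import "fmt"

// countHitsClamped counts the set entries of filter in [srcStart, srcEnd).
// Hypothetical helper: it clamps the range to the slice bounds first, so a
// filter shorter than the source offsets cannot cause an index panic.
func countHitsClamped(filter []bool, srcStart, srcEnd int) int {
	if srcStart < 0 {
		srcStart = 0
	}
	if srcEnd > len(filter) {
		srcEnd = len(filter)
	}
	n := 0
	for i := srcStart; i < srcEnd; i++ {
		if filter[i] {
			n++
		}
	}
	return n
}

func main() {
	filter := []bool{true, false, true} // length 3
	// The requested range ends at 4, one past the end of the slice.
	fmt.Println(countHitsClamped(filter, 0, 4))
}
```

Clamping keeps the dense slice representation, which is consistent with the commit's stated goal of avoiding the memory cost of a map.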
2022-03-16 | use a map instead of a slice for the fuseRange filter. | Tyler Pirtle
It looks like it wants to do some form of random access based on SrcStart / SrcEnd, so this might be a better fit.

<<<
b/196234339 was the problem.

There's an off-by-one that I don't entirely understand, so I'm trying to split the difference here. The filter variable is currently a slice and we're seeing an off-by-one in production with the amd_vulkan/LICENSE:

```
panic: runtime error: index out of range [3] with length 3
goroutine 1646 [running]:
```

Given that SrcStart / SrcEnd appear to be positions in the text file and the `i` variable seems to move between that range, it seemed natural to replace filter with a map of indexes instead of a slice. This way we can preserve the somewhat random-access pattern that appears to be happening but avoid any range errors.
>>>

PiperOrigin-RevId: 390415408
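The difference between the two containers can be sketched as follows. This is an illustration only, not the code from the commit: `matchRange`, `countHitsSlice`, and `countHitsMap` are hypothetical names, and only the `SrcStart` / `SrcEnd` fields come from the commit message. Indexing a Go slice past its length panics, while reading a missing map key returns the zero value.

```go
package main

import "fmt"

// matchRange is a hypothetical stand-in for the ranges the commit mentions.
type matchRange struct {
	SrcStart, SrcEnd int
}

// countHitsSlice indexes a slice directly. It panics with
// "index out of range" if r.SrcEnd exceeds len(filter).
func countHitsSlice(filter []bool, r matchRange) int {
	n := 0
	for i := r.SrcStart; i < r.SrcEnd; i++ {
		if filter[i] {
			n++
		}
	}
	return n
}

// countHitsMap performs the same walk over a map. A missing key reads as
// false, so an index past the populated positions is harmless.
func countHitsMap(filter map[int]bool, r matchRange) int {
	n := 0
	for i := r.SrcStart; i < r.SrcEnd; i++ {
		if filter[i] {
			n++
		}
	}
	return n
}

func main() {
	filter := map[int]bool{0: true, 2: true}
	// Index 3 is absent from the map; the lookup simply yields false.
	fmt.Println(countHitsMap(filter, matchRange{SrcStart: 0, SrcEnd: 4}))
}
```

The trade-off is memory and lookup cost: a map carries more overhead per entry than a slice of the same logical size.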
2020-11-13 | Applied gofmt -s to simplify code. | Bill Neubauer
Travis CI was failing on this build because it ran the simplification step.
2020-11-13 | Scope use of phrase induction to fix some bugs. | Bill Neubauer
In order to handle vague reformulations of license headers, it's important to make sure the classifier doesn't "hallucinate" a license by introducing critical words that don't exist in the input text. This was originally done as a global list, but it had two problems.

First, blocking 'bsd' as a keyword inadvertently suppressed Beerware detection, since phk has used two different email addresses with that header, one of which contained the string 'bsd'.

Second, some common phrases, particularly company names, were significant in some licenses but not in others (Silicon Graphics is critical in SGI-B licenses, but not in libtiff).

The solution is to scope the list of prohibited phrases to subsets of licenses. This introduces that change and fixes up the reference tests to match the corrected detections.

An issue with tokenization preserving trailing periods on numbers, which caused version detection errors, was also resolved.
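One way to picture scoping the phrase list is a table keyed by license-name prefix. The sketch below is an assumption about the general shape of such a check, not the classifier's actual implementation: `scopedPhrases`, `acceptMatch`, the prefix convention, and the license names used as examples are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// scopedPhrases maps a license-name prefix to phrases that must appear in
// the input before a match against that license is accepted. The entry is
// illustrative only, based on the SGI-B example in the commit message.
var scopedPhrases = map[string][]string{
	"SGI-B": {"silicon graphics"},
}

// acceptMatch reports whether a candidate license match should be kept.
// A phrase is enforced only for licenses inside its scope, so a phrase
// critical to one license family does not affect unrelated licenses.
func acceptMatch(license, input string) bool {
	lower := strings.ToLower(input)
	for prefix, phrases := range scopedPhrases {
		if !strings.HasPrefix(license, prefix) {
			continue // phrase list is out of scope for this license
		}
		for _, p := range phrases {
			if !strings.Contains(lower, p) {
				return false // critical phrase missing from the input
			}
		}
	}
	return true
}

func main() {
	input := "Permission is hereby granted to use this software."
	fmt.Println(acceptMatch("SGI-B-2.0", input)) // phrase is in scope and absent
	fmt.Println(acceptMatch("libtiff", input))   // phrase is not in scope
}
```

With a single global list, the second call would be rejected as well, which is the kind of false suppression the commit describes.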
2020-11-13 | Adds the v2 searchset module for handling q-grams (q-grams were formerly discussed as token runs). | Bill Neubauer
The test coverage here primarily focuses on the boundary conditions of detection, but does provide good coverage testing of the branch trees of the code.
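For readers unfamiliar with the term, a q-gram here is a run of q consecutive tokens. The sketch below shows the general technique only and is not taken from searchset.go: the function name `qgrams` and the whitespace tokenization are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// qgrams returns every run of q consecutive tokens, in order.
// It returns nil when q is not positive or the input has fewer than q tokens.
func qgrams(tokens []string, q int) [][]string {
	if q <= 0 || len(tokens) < q {
		return nil
	}
	out := make([][]string, 0, len(tokens)-q+1)
	for i := 0; i+q <= len(tokens); i++ {
		out = append(out, tokens[i:i+q])
	}
	return out
}

func main() {
	// Whitespace splitting stands in for a real tokenizer.
	tokens := strings.Fields("permission is hereby granted free")
	for _, g := range qgrams(tokens, 3) {
		fmt.Println(strings.Join(g, " "))
	}
}
```

A document with n tokens yields n-q+1 q-grams, so inputs shorter than q produce none, which is one of the boundary conditions a detector built on q-grams has to handle.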