path: root/v2/tokenizer.go
author    Bill Neubauer <wcn@google.com> 2020-11-05 10:23:10 -0800
committer Bill Neubauer <bill.neubauer@gmail.com> 2020-11-13 09:54:34 -0800
commit    1f2120e0d6d439fe79ff152127d3c9e3a95db3e2 (patch)
tree      9ea2baadb0a46f5add4560d1771283562af75095 /v2/tokenizer.go
parent    0bd13fa9fe5d76472d9ed156753e3663146439d3 (diff)
download  licenseclassifier-1f2120e0d6d439fe79ff152127d3c9e3a95db3e2.tar.gz
Change the public API to use []byte instead of string.
The original thinking was that string was the better choice because it clearly indicated that the classifier wouldn't modify the input data, but it comes with a runtime cost: if the caller has a byte slice and wants to scan only part of it, they'd have to make a copy to get a string. Having the API take []byte and documenting that it doesn't modify the data is the better choice. The current implementation doesn't yet realize this efficiency because it immediately makes a string for tokenization purposes, but that can be fixed in the future; changing the API signature is not so easily accomplished.
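The tradeoff described above can be sketched as follows. This is not the classifier's code, just a minimal illustration with a hypothetical scan function: converting a sub-slice of a []byte to string allocates and copies, while passing the sub-slice directly does not.

```go
package main

import "fmt"

// scanBytes stands in for an API that takes []byte and promises
// not to modify its input; here it just counts the bytes it sees.
func scanBytes(in []byte) int {
	return len(in)
}

// scanString is the old-style API shape taking a string.
func scanString(in string) int {
	return len(in)
}

func main() {
	buf := []byte("MIT License text goes here")

	// With a []byte API, scanning part of the buffer is a no-copy
	// re-slice: the sub-slice shares buf's backing array.
	fmt.Println(scanBytes(buf[:11]))

	// With a string API, the same call forces an allocation,
	// because string(b) copies the bytes to new immutable storage.
	fmt.Println(scanString(string(buf[:11])))
}
```

Both calls see the same 11 bytes; only the string version pays for a copy on every invocation.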
Diffstat (limited to 'v2/tokenizer.go')
-rw-r--r-- v2/tokenizer.go | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/v2/tokenizer.go b/v2/tokenizer.go
index beb7961..d20c410 100644
--- a/v2/tokenizer.go
+++ b/v2/tokenizer.go
@@ -67,10 +67,10 @@ func cleanupToken(in string) string {
}
// tokenize produces a document from the input content.
-func tokenize(in string) *document {
+func tokenize(in []byte) *document {
// Apply the global transforms described in SPDX
- norm := strings.ToLower(in)
+ norm := strings.ToLower(string(in))
norm = html.UnescapeString(norm)
norm = normalizePunctuation(norm)
norm = normalizeEquivalentWords(norm)