author | Bill Neubauer <wcn@google.com> | 2020-11-05 10:23:10 -0800 |
---|---|---|
committer | Bill Neubauer <bill.neubauer@gmail.com> | 2020-11-13 09:54:34 -0800 |
commit | 1f2120e0d6d439fe79ff152127d3c9e3a95db3e2 (patch) | |
tree | 9ea2baadb0a46f5add4560d1771283562af75095 /v2/tokenizer.go | |
parent | 0bd13fa9fe5d76472d9ed156753e3663146439d3 (diff) | |
download | licenseclassifier-1f2120e0d6d439fe79ff152127d3c9e3a95db3e2.tar.gz | |
Change the public API to use []byte instead of string.
The original thinking was that string was the better choice because it clearly
indicated that the classifier wouldn't modify the input data. However, it comes
with a runtime cost: if the caller has a byte slice and only wants to scan part
of it, they'd need to make a copy to get a string.
Having the API be []byte and documenting that it doesn't modify the data is the
better choice. The current implementation doesn't yet realize this efficiency
because it immediately converts the input to a string for tokenization, but
that can be fixed in the future. Changing the API signature is not so easily
accomplished.
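For illustration only, here is a minimal, self-contained Go sketch of the caller-side cost the message describes; `scanString` and `scanBytes` are hypothetical stand-ins, not the classifier's actual public API:

```go
package main

import "fmt"

// scanString stands in for a string-typed entry point: a caller holding a
// []byte must convert, which copies the bytes being scanned.
func scanString(in string) int { return len(in) }

// scanBytes stands in for the []byte-typed entry point adopted here; it is
// documented not to modify its input, and a sub-slice can be passed directly.
func scanBytes(in []byte) int { return len(in) }

func main() {
	buf := []byte("Permission is hereby granted, free of charge, to any person...")

	// String API: scanning only part of buf requires string(buf[:10]),
	// which allocates a new string and copies those bytes.
	fmt.Println(scanString(string(buf[:10])))

	// []byte API: the sub-slice shares buf's backing array; no copy is made.
	fmt.Println(scanBytes(buf[:10]))
}
```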
Diffstat (limited to 'v2/tokenizer.go')
-rw-r--r-- | v2/tokenizer.go | 4 |
1 file changed, 2 insertions, 2 deletions
diff --git a/v2/tokenizer.go b/v2/tokenizer.go
index beb7961..d20c410 100644
--- a/v2/tokenizer.go
+++ b/v2/tokenizer.go
@@ -67,10 +67,10 @@ func cleanupToken(in string) string {
 }
 
 // tokenize produces a document from the input content.
-func tokenize(in string) *document {
+func tokenize(in []byte) *document {
 	// Apply the global transforms described in SPDX
-	norm := strings.ToLower(in)
+	norm := strings.ToLower(string(in))
 	norm = html.UnescapeString(norm)
 	norm = normalizePunctuation(norm)
 	norm = normalizeEquivalentWords(norm)
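As the commit message concedes, the patched body still converts with `string(in)` before lowercasing, so the []byte signature doesn't yet avoid the copy internally. A minimal standalone Go sketch of that conversion cost (not the classifier's code; the sample input is made up):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	in := []byte("MIT License\nPermission is hereby granted...")

	// string(in) copies the byte slice into a new string, and
	// strings.ToLower allocates again if any character changes case.
	// This mirrors the first transform in the patched tokenize and is
	// why the new signature doesn't yet save an allocation.
	norm := strings.ToLower(string(in))
	fmt.Println(norm)
}
```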