author | Bill Neubauer <wcn@google.com> | 2020-11-05 10:23:10 -0800 |
---|---|---|
committer | Bill Neubauer <bill.neubauer@gmail.com> | 2020-11-13 09:54:34 -0800 |
commit | 1f2120e0d6d439fe79ff152127d3c9e3a95db3e2 (patch) | |
tree | 9ea2baadb0a46f5add4560d1771283562af75095 /v2/tokenizer.go | |
parent | 0bd13fa9fe5d76472d9ed156753e3663146439d3 (diff) | |
download | licenseclassifier-1f2120e0d6d439fe79ff152127d3c9e3a95db3e2.tar.gz | |
Change the public API to use []byte instead of string.
The original thinking was that string was the better choice because it clearly
indicated that the classifier wouldn't modify the input data. However, it comes
with a runtime cost: if the caller has a byte slice and only wants to scan part
of it, they'd need to make a copy to get a string.
Having the API be []byte and documenting that it doesn't modify the data is the
better choice. The current implementation doesn't yet realize this efficiency
because it immediately converts the input to a string for tokenization, but
that can be fixed in the future. Changing the API signature is not so easily
accomplished.
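For illustration only, here is a minimal, self-contained Go sketch of the caller-side cost the message describes; `scanString` and `scanBytes` are hypothetical stand-ins, not the classifier's actual public API:

```go
package main

import "fmt"

// scanString stands in for a string-typed entry point: a caller holding a
// []byte must convert, which copies the bytes being scanned.
func scanString(in string) int { return len(in) }

// scanBytes stands in for the []byte-typed entry point adopted here; it is
// documented not to modify its input, and a sub-slice can be passed directly.
func scanBytes(in []byte) int { return len(in) }

func main() {
	buf := []byte("Permission is hereby granted, free of charge, to any person...")

	// String API: scanning only part of buf requires string(buf[:10]),
	// which allocates a new string and copies those bytes.
	fmt.Println(scanString(string(buf[:10])))

	// []byte API: the sub-slice shares buf's backing array; no copy is made.
	fmt.Println(scanBytes(buf[:10]))
}
```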
Diffstat (limited to 'v2/tokenizer.go')
-rw-r--r-- | v2/tokenizer.go | 4 |
1 file changed, 2 insertions, 2 deletions
diff --git a/v2/tokenizer.go b/v2/tokenizer.go
index beb7961..d20c410 100644
--- a/v2/tokenizer.go
+++ b/v2/tokenizer.go
@@ -67,10 +67,10 @@ func cleanupToken(in string) string {
 }
 
 // tokenize produces a document from the input content.
-func tokenize(in string) *document {
+func tokenize(in []byte) *document {
 	// Apply the global transforms described in SPDX
-	norm := strings.ToLower(in)
+	norm := strings.ToLower(string(in))
 	norm = html.UnescapeString(norm)
 	norm = normalizePunctuation(norm)
 	norm = normalizeEquivalentWords(norm)
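As the commit message concedes, the patched body still converts with `string(in)` before lowercasing, so the []byte signature doesn't yet avoid the copy internally. A minimal standalone Go sketch of that conversion cost (not the classifier's code; the sample input is made up):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	in := []byte("MIT License\nPermission is hereby granted...")

	// string(in) copies the byte slice into a new string, and
	// strings.ToLower allocates again if any character changes case.
	// This mirrors the first transform in the patched tokenize and is
	// why the new signature doesn't yet save an allocation.
	norm := strings.ToLower(string(in))
	fmt.Println(norm)
}
```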