# StringClassifier

StringClassifier is a library to classify an unknown text against a set of known
texts. The classifier uses the [Levenshtein Distance] algorithm to determine
which of the known texts most closely matches the unknown text. The Levenshtein
Distance is normalized into a "confidence percentage" between 0.0 and 1.0, where
1.0 indicates an exact match and 0.0 indicates a complete mismatch.
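
To make the normalization concrete, here is a minimal, self-contained sketch of
one way to turn a Levenshtein distance into a confidence value. The helper names
(`levenshtein`, `confidence`, `min3`) and the choice to divide by the length of
the longer string are assumptions made for illustration; the library's own
formula may differ.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes.
// It keeps only two rows of the dynamic-programming table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// confidence maps the distance onto [0.0, 1.0]. Assumption: the distance is
// divided by the rune length of the longer input.
func confidence(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1.0 // two empty strings match exactly
	}
	return 1.0 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	fmt.Printf("distance: %d\n", levenshtein("kitten", "sitting"))
	fmt.Printf("confidence: %.2f\n", confidence("kitten", "sitting"))
}
```

Under this assumed formula, identical strings score 1.0 and strings of equal
length that share no characters at any position score 0.0.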

[Levenshtein Distance]: https://en.wikipedia.org/wiki/Levenshtein_distance

## Types of matching

The string classifier can perform two kinds of matching:

1. [Nearest matching](#nearest), and
2. [Multiple matching](#multiple).

### Normalization

To get the best match, normalizing functions can be applied to the texts. For
example, flattening whitespace removes many inconsequential formatting
differences that would otherwise lower the matching confidence percentage.

```go
sc := stringclassifier.New(stringclassifier.FlattenWhitespace, strings.ToLower)
```

The normalizing functions are run on all the known texts that are added to the
classifier. They're also run on the unknown text before classification.
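
The sketch below illustrates the idea of chaining normalizers, independent of
the library. The names `normalizeFunc`, `flattenWhitespace`, and `normalize` are
hypothetical stand-ins, and the whitespace handling shown here (collapsing runs
to a single space and trimming the ends) is an assumption rather than the
library's actual `FlattenWhitespace` behavior.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// normalizeFunc is a hypothetical type for a function that rewrites a text
// before it is compared.
type normalizeFunc func(string) string

// whitespaceRun matches one or more consecutive whitespace characters.
var whitespaceRun = regexp.MustCompile(`\s+`)

// flattenWhitespace is an illustrative stand-in: it collapses every run of
// whitespace into a single space and trims leading and trailing space.
func flattenWhitespace(s string) string {
	return strings.TrimSpace(whitespaceRun.ReplaceAllString(s, " "))
}

// normalize applies each function to s in the order given.
func normalize(s string, fns ...normalizeFunc) string {
	for _, fn := range fns {
		s = fn(s)
	}
	return s
}

func main() {
	known := normalize("Permission is hereby granted", flattenWhitespace, strings.ToLower)
	unknown := normalize("PERMISSION is\n    hereby   granted\t", flattenWhitespace, strings.ToLower)
	// After normalization the two texts differ only if their words differ.
	fmt.Println(known == unknown)
}
```

Because the same functions are applied to both sides, differences in line
wrapping, indentation, or letter case no longer count against the confidence.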

### Nearest matching {#nearest}

A nearest match returns the name of the known text that most closely matches the
full unknown text. This is most useful when the unknown text doesn't have
extraneous text around it.

Example:

```go
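// IdentifyText logs the known text that most closely matches unknown.
// name identifies the unknown text in the log output.
func IdentifyText(sc *stringclassifier.Classifier, name, unknown string) {
  m := sc.NearestMatch(unknown)
  log.Printf("The nearest match to %q is %q (confidence: %v)", name, m.Name, m.Confidence)
}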
```

### Multiple matching {#multiple}

Multiple matching identifies all of the known texts that may exist in the
unknown text. It can also detect a known text inside an unknown text even if the
known text is surrounded by extraneous text. As with nearest matching, a
confidence percentage is given for each match.

Example:

```go
log.Printf("The text %q contains:", name)
for _, m := range sc.MultipleMatch(unknown, false) {
  log.Printf("  %q (conf: %v, offset: %v)", m.Name, m.Confidence, m.Offset)
}
```

## Disclaimer

This is not an official Google product (experimental or otherwise); it is just
code that happens to be owned by Google.