bstr
====
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable
their use as byte strings, where byte strings are _conventionally_ UTF-8. Byte
strings differ from the standard library's `String` and `str` types in that
they are not required to be valid UTF-8, but may be fully or partially valid
UTF-8.

[![Build status](https://github.com/BurntSushi/bstr/workflows/ci/badge.svg)](https://github.com/BurntSushi/bstr/actions)
[![](https://meritbadge.herokuapp.com/bstr)](https://crates.io/crates/bstr)


### Documentation

https://docs.rs/bstr


### When should I use byte strings?

See this part of the documentation for more details:
https://docs.rs/bstr/0.2.*/bstr/#when-should-i-use-byte-strings.

The short story is that byte strings are useful when it is inconvenient or
incorrect to require valid UTF-8.
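
As a rough illustration of that trade-off, here is a minimal sketch using only
the standard library (the helper names `strict` and `lossy` are made up for
this example and are not part of `bstr`). When the input contains even one
invalid byte, strict conversion to `&str` fails outright, while lossy
conversion succeeds but no longer preserves the original bytes:

```rust
use std::borrow::Cow;

/// Strict conversion: returns `None` if `bytes` contains any invalid UTF-8.
fn strict(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// Lossy conversion: always succeeds, but every invalid sequence is
/// rewritten as U+FFFD, so the original bytes cannot be recovered.
fn lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

For an input such as `b"foo\xFFbar"`, `strict` yields `None` and `lossy`
yields a string that differs from the input. Treating the data as a byte
string sidesteps both outcomes.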


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
bstr = "0.2"
```


### Examples

The following examples exhibit both the API features of byte strings and
the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in
stdin, and print out lines containing a particular substring:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to count all of the words (Unicode-aware) in stdin,
line-by-line:

```rust
use std::error::Error;
use std::io;

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```

This example shows how to convert a stream on stdin to uppercase without
performing UTF-8 validation _and_ while amortizing allocation. On standard
ASCII text, this is quite a bit faster than what you can (easily) do with
standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through
unchanged.)

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to extract the first 10 visual characters (as grapheme
clusters) from each line, where invalid UTF-8 sequences are generally treated
as a single character and are passed through correctly:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```


### Cargo features

This crate comes with a few features that control standard library, serde
and Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included.
* `serde1` - **Disabled** by default. Enables implementations of serde traits
  for the `BStr` and `BString` types.
* `serde1-nostd` - **Disabled** by default. Enables implementations of serde
  traits for the `BStr` type only, intended for use without the standard
  library. Generally, you either want `serde1` or `serde1-nostd`, not both.


### Minimum Rust version policy

This crate's minimum supported `rustc` version (MSRV) is `1.41.1`.

In general, this crate will be conservative with respect to the minimum
supported version of Rust. MSRV may be bumped in minor version releases.


### Future work

Since this is meant to be a core crate, getting a `1.0` release is a priority.
My hope is to move to `1.0` within the next year and commit to its API so that
`bstr` can be used as a public dependency.

A large part of the API surface area was taken from the standard library, so
from an API design perspective, a good portion of this crate should be mature.
The main differences from the standard library are in how the various substring
search routines work. The standard library provides generic infrastructure for
supporting different types of searches with a single method, whereas this
library prefers to define new methods for each type of search and drop the
generic infrastructure.
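
To make the contrast concrete, here is a small sketch of the standard
library's side of that design (the function name `find_three_ways` is
invented for this example). A single `str::find` method accepts a substring,
a `char`, or a closure, and the argument's type selects the kind of search.
`bstr` instead exposes separate methods on byte strings, such as `find`,
`find_byte` and `find_char`.

```rust
/// Run three different kinds of search through the one `str::find` method.
/// The argument type (substring, char, or predicate) picks the strategy.
fn find_three_ways(haystack: &str) -> (Option<usize>, Option<usize>, Option<usize>) {
    let by_substring = haystack.find("bar");
    let by_char = haystack.find('b');
    let by_predicate = haystack.find(|c: char| c.is_ascii_digit());
    (by_substring, by_char, by_predicate)
}
```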

Some _probable_ future considerations for APIs include, but are not limited to:

* A convenience layer on top of the `aho-corasick` crate.
* Unicode normalization.
* More sophisticated support for dealing with Unicode case, perhaps by
  combining the use cases supported by [`caseless`](https://docs.rs/caseless)
  and [`unicase`](https://docs.rs/unicase).
* Add facilities for dealing with OS strings and file paths, probably via
  simple conversion routines.

Here are some examples that are _probably_ out of scope for this crate:

* Regular expressions.
* Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate is an experiment in bringing lots of
related APIs together into a single crate while simultaneously attempting to
keep the total number of dependencies low. Indeed, every dependency of `bstr`,
except for `memchr`, is optional.


### High level motivation

Strictly speaking, the `bstr` crate provides very little that can't already be
achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of
library crates. For example:

* The standard library's
  [`Utf8Error`](https://doc.rust-lang.org/std/str/struct.Utf8Error.html)
  can be used for incremental lossy decoding of `&[u8]`.
* The
  [`unicode-segmentation`](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/index.html)
  crate can be used for iterating over graphemes (or words), but is only
  implemented for `&str` types. One could use `Utf8Error` above to implement
  grapheme iteration with the same semantics as what `bstr` provides (automatic
  Unicode replacement codepoint substitution).
* The [`twoway`](https://docs.rs/twoway) crate can be used for
  fast substring searching on `&[u8]`.
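
As a sketch of the first point, incremental lossy decoding with `Utf8Error`
might look roughly like the following. The function name `decode_lossy` is
hypothetical, and this is not how `bstr` implements its own decoding; it only
shows the standard library building blocks (`valid_up_to` and `error_len`).

```rust
use std::str;

/// Decode `bytes` lossily, replacing each invalid sequence with U+FFFD.
fn decode_lossy(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // The prefix was just validated, so this cannot fail.
                out.push_str(str::from_utf8(valid).unwrap());
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}
```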

So why create `bstr`? Part of the point of the `bstr` crate is to provide a
uniform API of coupled components instead of relying on users to piece together
loosely coupled components from the crate ecosystem. For example, if you wanted
to perform a search and replace in a `Vec<u8>`, then writing the code to do
that with the `twoway` crate is not that difficult, but it's still additional
glue code you have to write. This work adds up depending on what you're doing.
Consider, for example, trimming and splitting, along with their different
variants.
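
To give a sense of that glue code, here is a minimal sketch of search and
replace on raw bytes using only the standard library. The name `replace_all`
is made up for this example, and the search is a naive scan with `windows`
rather than the optimized substring search a dedicated crate would provide.

```rust
/// Replace every non-overlapping occurrence of `needle` in `haystack`.
/// An empty needle leaves the input unchanged.
fn replace_all(haystack: &[u8], needle: &[u8], with: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(haystack.len());
    if needle.is_empty() {
        // `windows(0)` would panic, so handle the empty needle up front.
        out.extend_from_slice(haystack);
        return out;
    }
    let mut rest = haystack;
    // Naive O(n * m) scan; a real searcher would be considerably faster.
    while let Some(pos) = rest.windows(needle.len()).position(|w| w == needle) {
        out.extend_from_slice(&rest[..pos]);
        out.extend_from_slice(with);
        rest = &rest[pos + needle.len()..];
    }
    out.extend_from_slice(rest);
    out
}
```

Even this small helper has to decide what to do about empty needles and
overlapping matches, and it works on invalid UTF-8 only because it never
leaves `&[u8]`.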

In other words, `bstr` is partially a way of pushing back against the
micro-crate ecosystem that appears to be evolving. It's not clear to me whether
this experiment will be successful or not, but it is definitely a goal of
`bstr` to keep its dependency list lightweight. For example, `serde` is an
optional dependency because there is no feasible alternative to it, whereas
`twoway` is not a dependency at all: we instead prefer to implement our own
substring search. In service of this philosophy, the only required dependency
of `bstr` is currently `memchr`.


### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   https://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   https://opensource.org/licenses/MIT)

at your option.

The data in `src/unicode/data/` is licensed under the Unicode License Agreement
([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)), although
this data is only used in tests.