# Nom Recipes

These are short recipes for accomplishing common tasks with nom.

* [Whitespace](#whitespace)
  + [Wrapper combinators that eat whitespace before and after a parser](#wrapper-combinators-that-eat-whitespace-before-and-after-a-parser)
* [Comments](#comments)
  + [`// C++/EOL-style comments`](#-ceol-style-comments)
  + [`/* C-style comments */`](#-c-style-comments-)
* [Identifiers](#identifiers)
  + [`Rust-Style Identifiers`](#rust-style-identifiers)
* [Literal Values](#literal-values)
  + [Escaped Strings](#escaped-strings)
  + [Integers](#integers)
    - [Hexadecimal](#hexadecimal)
    - [Octal](#octal)
    - [Binary](#binary)
    - [Decimal](#decimal)
  + [Floating Point Numbers](#floating-point-numbers)

## Whitespace



### Wrapper combinators that eat whitespace before and after a parser

```rust
use nom::{
  IResult,
  error::ParseError,
  sequence::delimited,
  character::complete::multispace0,
};

/// A combinator that takes a parser `inner` and produces a parser that also consumes both leading and 
/// trailing whitespace, returning the output of `inner`.
fn ws<'a, F: 'a, O, E: ParseError<&'a str>>(inner: F) -> impl FnMut(&'a str) -> IResult<&'a str, O, E>
  where
  F: Fn(&'a str) -> IResult<&'a str, O, E>,
{
  delimited(
    multispace0,
    inner,
    multispace0
  )
}
```

To eat only trailing whitespace, replace `delimited(...)` with `terminated(inner, multispace0)`.
Likewise, to eat only leading whitespace, replace `delimited(...)` with `preceded(multispace0,
inner)`. In both cases, import the combinator from `nom::sequence`. You can use your own parser
instead of `multispace0` if you want to skip a different set of lexemes.

## Comments

### `// C++/EOL-style comments`

This version uses `%` to start a comment, does not consume the newline character, and returns an
output of `()`.

```rust
use nom::{
  IResult,
  error::ParseError,
  combinator::value,
  sequence::pair,
  bytes::complete::is_not,
  character::complete::char,
};

pub fn peol_comment<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, (), E>
{
  value(
    (), // Output is thrown away.
    pair(char('%'), is_not("\n\r"))
  )(i)
}
```

### `/* C-style comments */`

Inline comments surrounded with sentinel tags `(*` and `*)`. This version returns an output of `()`
and does not handle nested comments.

```rust
use nom::{
  IResult,
  error::ParseError,
  combinator::value,
  sequence::tuple,
  bytes::complete::{tag, take_until},
};

pub fn pinline_comment<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, (), E> {
  value(
    (), // Output is thrown away.
    tuple((
      tag("(*"),
      take_until("*)"),
      tag("*)")
    ))
  )(i)
}
```

## Identifiers

### `Rust-Style Identifiers`

Identifiers that may start with a letter (or underscore) and may contain underscores,
letters and numbers may be parsed like this:

```rust
use nom::{
  IResult,
  branch::alt,
  multi::many0,
  combinator::recognize,
  sequence::pair,
  character::complete::{alpha1, alphanumeric1},
  bytes::complete::tag,
};

pub fn identifier(input: &str) -> IResult<&str, &str> {
  recognize(
    pair(
      alt((alpha1, tag("_"))),
      many0(alt((alphanumeric1, tag("_"))))
    )
  )(input)
}
```

Let's say we apply this to the identifier `hello_world123abc`. The first `alt` parser would
recognize `hello`, because `alpha1` consumes one or more letters. The `pair` combinator ensures that
the remainder, `_world123abc`, is piped to the `many0(alt((alphanumeric1, tag("_"))))` parser, which
recognizes every remaining character. However, the `pair` combinator returns a tuple of the results
of its sub-parsers. The `recognize` parser instead produces a `&str` of the input text that was
parsed, which in this case is the entire `&str` `hello_world123abc`.
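
For comparison, the same grammar can be written by hand with the standard library only, which can be
handy as a reference when testing the nom parser. This is a rough sketch, not part of nom:
`identifier_by_hand` is a hypothetical name, and it assumes identifiers are limited to ASCII letters,
ASCII digits and `_`.

```rust
/// Hypothetical std-only counterpart of `identifier`: returns the remaining
/// input and the matched identifier, or `None` if the input does not start
/// with an ASCII letter or an underscore.
fn identifier_by_hand(input: &str) -> Option<(&str, &str)> {
    let mut chars = input.char_indices();
    // First character: a letter or `_`.
    let (_, first) = chars.next()?;
    if !(first.is_ascii_alphabetic() || first == '_') {
        return None;
    }
    // Following characters: letters, digits or `_`; stop at the first other one.
    let end = chars
        .find(|&(_, c)| !(c.is_ascii_alphanumeric() || c == '_'))
        .map_or(input.len(), |(i, _)| i);
    // Same output order as nom: (remaining input, matched text).
    Some((&input[end..], &input[..end]))
}
```

Applied to `"hello_world123abc rest"`, this returns `Some((" rest", "hello_world123abc"))`; applied to
`"1abc"`, it returns `None`.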

## Literal Values

### Escaped Strings

This is [one of the examples](https://github.com/Geal/nom/blob/master/examples/string.rs) in the
examples directory.

### Integers

The following recipes all return string slices rather than integer values. How to obtain an
integer value instead is demonstrated for hexadecimal integers. The others are similar.

The parsers allow the grouping character `_`, which allows one to group the digits by byte, for
example: `0xA4_3F_11_28`. If you prefer to exclude the `_` character, the lambda to convert from a
string slice to an integer value is slightly simpler. You can also strip the `_` from the string
slice that is returned, which is demonstrated in the second hexadecimal number parser.

If you wish to limit the number of digits in a valid integer literal, replace `many1` with
`many_m_n` in the recipes.

#### Hexadecimal

The parser outputs the string slice of the digits without the leading `0x`/`0X`.

```rust
use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::recognize,
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn hexadecimal(input: &str) -> IResult<&str, &str> { // <'a, E: ParseError<&'a str>>
  preceded(
    alt((tag("0x"), tag("0X"))),
    recognize(
      many1(
        terminated(one_of("0123456789abcdefABCDEF"), many0(char('_')))
      )
    )
  )(input)
}
```

If you want it to return the integer value instead, use `map_res`:

```rust
use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::{map_res, recognize},
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn hexadecimal_value(input: &str) -> IResult<&str, i64> {
  map_res(
    preceded(
      alt((tag("0x"), tag("0X"))),
      recognize(
        many1(
          terminated(one_of("0123456789abcdefABCDEF"), many0(char('_')))
        )
      )
    ),
    |out: &str| i64::from_str_radix(&str::replace(&out, "_", ""), 16)
  )(input)
}
```

#### Octal

```rust
use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::recognize,
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn octal(input: &str) -> IResult<&str, &str> {
  preceded(
    alt((tag("0o"), tag("0O"))),
    recognize(
      many1(
        terminated(one_of("01234567"), many0(char('_')))
      )
    )
  )(input)
}
```

#### Binary

```rust
use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::recognize,
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn binary(input: &str) -> IResult<&str, &str> {
  preceded(
    alt((tag("0b"), tag("0B"))),
    recognize(
      many1(
        terminated(one_of("01"), many0(char('_')))
      )
    )
  )(input)
}
```

#### Decimal

```rust
use nom::{
  IResult,
  multi::{many0, many1},
  combinator::recognize,
  sequence::terminated,
  character::complete::{char, one_of},
};

fn decimal(input: &str) -> IResult<&str, &str> {
  recognize(
    many1(
      terminated(one_of("0123456789"), many0(char('_')))
    )
  )(input)
}
```
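
The conversion shown for hexadecimal carries over to the other bases by changing the radix. The
following is a minimal sketch of that step using only the standard library; `digits_to_i64` is a
hypothetical helper name, not something nom provides, and it assumes the input is a digit slice as
returned by the recipes above (prefix already removed, `_` possibly present).

```rust
use std::num::ParseIntError;

/// Hypothetical helper (not part of nom): converts a digit slice such as
/// "1_000" into an integer, ignoring the `_` grouping character.
/// `radix` is 2 for binary, 8 for octal, 10 for decimal and 16 for hexadecimal.
fn digits_to_i64(digits: &str, radix: u32) -> Result<i64, ParseIntError> {
    // `from_str_radix` rejects `_`, so strip it before converting.
    i64::from_str_radix(&digits.replace('_', ""), radix)
}
```

For example, `digits_to_i64("7_55", 8)` yields `Ok(493)` and `digits_to_i64("1010_0101", 2)` yields
`Ok(165)`. A literal that does not fit in an `i64`, such as the hexadecimal digits
`FFFF_FFFF_FFFF_FFFF`, yields an `Err`, which `map_res` would turn into a parse error.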

### Floating Point Numbers

The following is adapted from [the Python parser by Valentin Lorentz (ProgVal)](https://github.com/ProgVal/rust-python-parser/blob/master/src/numbers.rs).

```rust
use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::{opt, recognize},
  sequence::{preceded, terminated, tuple},
  character::complete::{char, one_of},
};

fn float(input: &str) -> IResult<&str, &str> {
  alt((
    // Case one: .42
    recognize(
      tuple((
        char('.'),
        decimal,
        opt(tuple((
          one_of("eE"),
          opt(one_of("+-")),
          decimal
        )))
      ))
    )
    , // Case two: 42e42 and 42.42e42
    recognize(
      tuple((
        decimal,
        opt(preceded(
          char('.'),
          decimal,
        )),
        one_of("eE"),
        opt(one_of("+-")),
        decimal
      ))
    )
    , // Case three: 42. and 42.42
    recognize(
      tuple((
        decimal,
        char('.'),
        opt(decimal)
      ))
    )
  ))(input)
}

fn decimal(input: &str) -> IResult<&str, &str> {
  recognize(
    many1(
      terminated(one_of("0123456789"), many0(char('_')))
    )
  )(input)
}
```
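
Like the integer recipes, `float` returns a string slice. One possible way to turn that slice into a
value is sketched below with the standard library only; `float_value` is a hypothetical helper name,
and it assumes the slice came from the `float` parser above, so the only character that `f64`
parsing would reject is the `_` grouping character.

```rust
use std::num::ParseFloatError;

/// Hypothetical helper (not part of nom): converts a slice recognized by
/// `float` into an `f64`, dropping the `_` grouping character first.
fn float_value(text: &str) -> Result<f64, ParseFloatError> {
    // Forms such as ".42" and "42." are accepted by `str::parse::<f64>` as is.
    text.replace('_', "").parse::<f64>()
}
```

For example, `float_value("4_2.5e1")` yields `Ok(425.0)`. Inside a parser, the same closure could be
passed to `map_res` in the way the hexadecimal recipe does.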

## Implementing FromStr

The [FromStr trait](https://doc.rust-lang.org/std/str/trait.FromStr.html) provides
a common interface to parse from a string.

```rust
use nom::{
  IResult, Finish, error::Error,
  bytes::complete::{tag, take_while},
};
use std::str::FromStr;

// will recognize the name in "Hello, name!"
fn parse_name(input: &str) -> IResult<&str, &str> {
  let (i, _) = tag("Hello, ")(input)?;
  let (i, name) = take_while(|c:char| c.is_alphabetic())(i)?;
  let (i, _) = tag("!")(i)?;

  Ok((i, name))
}

// with FromStr, the result cannot be a reference to the input, it must be owned
#[derive(Debug)]
pub struct Name(pub String);

impl FromStr for Name {
  // the error must be owned as well
  type Err = Error<String>;

  fn from_str(s: &str) -> Result<Self, Self::Err> {
      match parse_name(s).finish() {
          Ok((_remaining, name)) => Ok(Name(name.to_string())),
          Err(Error { input, code }) => Err(Error {
              input: input.to_string(),
              code,
          })
      }
  }
}

fn main() {
  // parsed: Ok(Name("nom"))
  println!("parsed: {:?}", "Hello, nom!".parse::<Name>());

  // parsed: Err(Error { input: "123!", code: Tag })
  println!("parsed: {:?}", "Hello, 123!".parse::<Name>());
}
```