%
% Copyright (C) 2007 Alan D. Brunelle <Alan.Brunelle@hp.com>
%
%  This program is free software; you can redistribute it and/or modify
%  it under the terms of the GNU General Public License as published by
%  the Free Software Foundation; either version 2 of the License, or
%  (at your option) any later version.
%
%  This program is distributed in the hope that it will be useful,
%  but WITHOUT ANY WARRANTY; without even the implied warranty of
%  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%  GNU General Public License for more details.
%
%  You should have received a copy of the GNU General Public License
%  along with this program; if not, write to the Free Software
%  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
%
%  vi :set textwidth=75
%
\documentclass{article}
\usepackage{multirow,graphicx,placeins}

\begin{document}
%---------------------
\title{\texttt{btrecord} and \texttt{btreplay} User Guide}
\author{Alan D. Brunelle (Alan.Brunelle@hp.com)}
\date{\today}
\maketitle
\begin{abstract}
\input{abstract.tex}
\end{abstract}
\thispagestyle{empty}\newpage
%---------------------
\tableofcontents\thispagestyle{empty}\newpage
%---------------------
\section{Introduction}
\input{abstract.tex}

\bigskip 
This document gives an overview of the command line options for
\texttt{btrecord} and \texttt{btreplay}, and shows some common
examples of their use in everyday work here at OSLO's Scalability and
Performance Group.

\subsection*{Build Note}

To build these tools, place the source directory next to a valid
\texttt{blktrace}\footnote{\texttt{git://git.kernel.dk/blktrace.git}}
directory, because the \texttt{Makefile} refers to \texttt{../blktrace}.


%---------------------
\newpage\section{\texttt{btrecord} and \texttt{btreplay} Operating Model}

The \texttt{blktrace} utility provides the ability to collect detailed
traces from the kernel for each IO processed by the block IO layer. The
traces provide a complete timeline for each IO processed, including
detailed information concerning when an IO was first received by the block
IO layer -- indicating the device, CPU number, time stamp, IO direction,
sector number and IO size (number of sectors). Using this information,
one is able to \emph{replay} the IOs on the same machine or on an entirely
different setup.

\subsection{Basic Workflow}
The basic operating workflow to replay IOs is roughly as follows:

\begin{enumerate}
  \item Run \texttt{blktrace} to collect traces. Here you specify the
  device or devices that you wish to trace and later replay IOs upon. Note:
  the only traces you are interested in are \emph{QUEUE} requests --
  thus, to save system resources (including storage for traces), one could
  specify the \texttt{-a queue} command line option to \texttt{blktrace}.

  \item While \texttt{blktrace} is running, you run the workload that you
  are interested in. 

  \item When the workload has completed, you stop the \texttt{blktrace}
  utility (thus saving all traces over the complete workload). 

  \item You extract the pertinent IO information from the traces saved by
  \texttt{blktrace} using the \texttt{btrecord} utility. This will parse
  each trace file created by \texttt{blktrace}, and craft IO descriptions
  to be used in the next phase of the workload processing.

  \item Once \texttt{btrecord} has successfully created a series of data
  files to be processed, you can run the \texttt{btreplay} utility which
  attempts to generate the same IOs seen during the sample workload phase.
\end{enumerate}
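
The steps above can be sketched as a small shell function. This is only an
illustration and not part of the tools: the function name
\texttt{record\_and\_replay}, the one-second start-up delay and the way the
workload is passed in are all assumptions, and the sketch relies on the
default directories (\texttt{.}) and the default base name
(\texttt{replay}). It needs root privileges, so the function is defined but
not invoked here.

\begin{verbatim}
#!/bin/sh
# Hypothetical helper: record_and_replay <dev> <workload command...>
# Example (as root): record_and_replay sdab ./my-workload.sh
record_and_replay() {
    dev=$1; shift

    # Steps 1-2: trace QUEUE events only, while the workload runs.
    blktrace -d "/dev/$dev" -a queue &
    bt_pid=$!
    sleep 1        # assumed grace period for blktrace to start up
    "$@"

    # Step 3: stop blktrace so that it saves its per-CPU trace files.
    kill -INT "$bt_pid"
    wait "$bt_pid"

    # Step 4: turn <dev>.blktrace.<cpu> into <dev>.replay.<cpu>.
    btrecord "$dev" || return 1

    # Step 5: replay; write requests are skipped unless -W is added.
    btreplay "$dev"
}
\end{verbatim}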

\subsection{IO Stream Replay Characteristics}
  The major characteristics of the IO stream that are kept intact include:

  \begin{description}
    \item[Device] The IOs are replayed on the same device as was seen
    during the sample workload.

    \item[IO direction] The same IO direction (read/write) is maintained.

    \item[IO offset] The same device offset is maintained.

    \item[IO size] The same number of sectors are transferred.

    \item[Time differential] The time stamps stored during the
    \texttt{blktrace} run are used to determine the amount of time between
    IOs during the sample workload. \texttt{btreplay} \emph{attempts} to
    maintain the same time differential between IOs, but no guarantees as
    to complete accuracy are provided by the utility.

    \item[Device IO Stream Ordering] All IOs on a device are submitted in
    the precise order they were seen during the sample workload run. 
  \end{description}

  As noted above, the time between IOs may not be accurately maintained
  during replays. In addition the actual ordering of IOs \emph{between}
  devices is not necessarily maintained. (Each device with an IO stream
  maintains its own concept of time, and thus there may be slippage of the
  time kept between managing threads.)

  \begin{quotation}
    We have prototyped a different approach, wherein a single managing
    thread handles all IOs across all devices. This approach, while
    guaranteeing correct ordering of IOs across all devices, resulted in
    much worse timing on a per IO basis. 
  \end{quotation}

\subsection{\texttt{btrecord/btreplay} Method of Operation}

As noted above, \texttt{btrecord} extracts \texttt{QUEUE} operations from
\texttt{blktrace} output. These \texttt{QUEUE} operations indicate the
entrance of IOs into the block IO layer. In order to replay these IOs with
some accuracy in regards to ordering and timeliness, we decided to take
multiple sequential (in time) IOs and put them in a single \emph{bunch} of
IOs that will be processed as a single \emph{asynchronous IO} call to the
kernel\footnote{Attempts to do them individually resulted in too large a
turnaround time penalty (user-space to kernel and back). Note that in a
number of workloads, the IOs are coming in from the page cache handling
code, and thus are submitted to the block IO layer with \emph{very small}
time intervals between issues.}. To manage the size of the \emph{bunches},
the \texttt{btrecord} utility provides you with two controlling knobs:

\begin{description}
  \item[\texttt{--max-bunch-time}] This is the amount of time to encompass
  in one bunch -- only IOs within the time specified are eligible
  for \emph{bunching.} The default time is 10 milliseconds (10,000,000
  nanoseconds). Refer to section~\ref{sec:c-o-m} on page~\pageref{sec:c-o-m}
  for more information.

  \item[\texttt{--max-pkts}] A \emph{bunch} can hold anywhere from
  1 to 512 packets; by default a bunch contains no
  more than 8 individual IOs. With this option, one can increase or
  decrease the maximum \emph{bunch} size.  Refer to section~\ref{sec:c-o-M}
  on page~\pageref{sec:c-o-M} for more information.
\end{description}
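
To make the interaction of the two knobs concrete, here is a minimal
sketch of a bunching rule. It is not taken from the \texttt{btrecord}
source: the function name \texttt{bunch} is hypothetical, and the rule
itself (close the current bunch once it is full, or once the next IO lies
more than the maximum bunch time after the first IO of the bunch) is an
assumption made for illustration.

\begin{verbatim}
#!/bin/sh
# bunch MAX_NSEC MAX_PKTS: read one QUEUE time stamp (nsec) per line
# on stdin and print the size of each resulting bunch.
# Assumed rule: close the current bunch when it is full, or when the
# next IO lies more than MAX_NSEC after the first IO of the bunch.
bunch() {
    awk -v max_t="$1" -v max_n="$2" '
        n > 0 && (n >= max_n || $1 - first > max_t) { print n; n = 0 }
        n == 0 { first = $1 }
        { n++ }
        END { if (n > 0) print n }
    '
}

# Five IOs at 0, 1, 2, 15 and 16 msec with the default knobs:
printf '%s\n' 0 1000000 2000000 15000000 16000000 | bunch 10000000 8
# prints 3, then 2
\end{verbatim}

With a maximum of 2 packets per bunch, the same input would instead be
split into bunches of 2, 1 and 2 IOs.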

Each input data file (one per device per CPU) results in a new record
data file (again, one per device per CPU) which contains information
about \emph{bunches} of IOs to be replayed. \texttt{btreplay} operates on
these record data files by spawning a new pair of threads per file. One
thread manages the submitting of AIOs per bunch in the record data file,
while the other thread manages reclaiming AIOs completed\footnote{We
have found that having the same thread do both results in a further
reduction in replay timing accuracy.}.

Each submitting thread simply reads the input file of \emph{bunches}
recorded by \texttt{btrecord}, and attempts to faithfully reproduce the
ordering and timing of IOs seen during the sample workload. The reclaiming
thread simply waits for AIO completions, freeing up resources for the
submitting thread to utilize to submit new AIOs.

The number of CPUs being used on the replay system can be different from
the number on the recorded system. To help with mappings here the
\texttt{--cpus} option allows one to state how many CPUs on the replay
system to utilize. If the number of CPUs on the replay system is less than
on the recording system, we wrap CPU IDs. This \emph{may} result in an
overload of CPU processing capabilities on the replay system. (Refer to
section~\ref{sec:p-o-c} on page~\pageref{sec:p-o-c} for more details about the
\texttt{--cpus} option.)
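
The text above says only that CPU IDs wrap. As a minimal sketch, assuming
the wrap is a simple modulo of the recorded CPU ID by the replay CPU count
(the helper name \texttt{wrap\_cpu} is hypothetical), the mapping would look
like this:

\begin{verbatim}
#!/bin/sh
# Assumed wrap rule: replay CPU = recorded CPU modulo replay CPU count.
wrap_cpu() {
    echo $(( $1 % $2 ))
}

# Eight recorded CPUs (0-7) replayed with --cpus=4:
for cpu in 0 1 2 3 4 5 6 7; do
    printf '%s->%s ' "$cpu" "$(wrap_cpu "$cpu" 4)"
done
echo
# prints: 0->0 1->1 2->2 3->3 4->0 5->1 6->2 7->3
\end{verbatim}

Under this assumption each replay CPU handles the streams of two recorded
CPUs, which is the kind of overload the paragraph above warns about.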

\newpage\subsection{Known Deficiencies and Proposed Possible Fixes}

The known deficiencies of the current set of utilities are
outlined here. In some cases, ideas for additions and/or improvements are
included as well.

\begin{enumerate}
  \item Lack of IO ordering across devices. 

  \begin{quote}
    \emph{We could institute the notion of global time across threads,
    and thus ensure IO ordering across devices, with some reduction in
    timing accuracy.}
  \end{quote}

  \item Lack of IO timing accuracy -- additional time between IO bunches.

  \begin{quote}
    \emph{This is the primary problem with any IO replay mechanism -- how
    to guarantee per-IO timing accuracy with respect to other replayed IOs?
    One idea to reduce errors in this area would be to push the IO replay
    into the kernel, where you \emph{may} receive more responsive timings.}
  \end{quote}

  \item Bunching of IOs results in reduced time amongst IOs within a bunch.

  \begin{quote}
    \emph{The user has \emph{some} control over this (via the
    \texttt{--max-pkts} option). One \emph{could} simply specify
    \texttt{--max-pkts=1} and then each IO would be treated individually. Of
    course, this would probably then run into the problem of excessive
    inter-IO times.}
  \end{quote}

  \item 1-to-1 mapping of devices -- for now the devices on the replay
  machine must be the same as on the recording machine. 

  \begin{quote}
    \emph{It should be relatively trivial to add in the notion of
    mapping -- simply include a file that is read which maps devices
    on one machine to devices (with offsets and sizes) on the replay
    machine\footnote{The notion of an offset and device size to replay on
    could be used to both allow for a single device to masquerade as more
    than one device, and could be utilized in case the replay device is
    smaller than the recorded device.}.}
    
    \medskip\emph{One could also add in the notion of CPU mappings as well --
    device $D_{rec}$ managed by CPU $C_{rec}$ on the recorded system
    shall be replayed on device $D_{rep}$ and CPU $C_{rep}$ on the
    replay machine.}

    \bigskip
    \begin{quote}
      With version 0.9.1 we now support the \texttt{-M} option to do this
      -- see section~\ref{sec:p-o-M} on page~\pageref{sec:p-o-M} for more
      information on device mapping.
    \end{quote}
  \end{quote}

\end{enumerate}

%---------------------
\newpage\section{\label{sec:command-line}Command Line Options}
\subsection{\texttt{btrecord} Command Line Options}
\begin{figure}[h!]
\begin{verbatim}
Usage: btrecord -- version 0.9.3

	[ -d <dir>  : --input-directory=<dir> ] Default: .
	[ -D <dir>  : --output-directory=<dir>] Default: .
	[ -F        : --find-traces           ] Default: Off
	[ -h        : --help                  ] Default: Off
	[ -m <nsec> : --max-bunch-time=<nsec> ] Default: 10 msec
	[ -M <pkts> : --max-pkts=<pkts>       ] Default: 8
	[ -o <base> : --output-base=<base>    ] Default: replay
	[ -v        : --verbose               ] Default: Off
	[ -V        : --version               ] Default: Off
	<dev>...                                Default: None
\end{verbatim}
\caption{\label{fig:btrecord--help}\texttt{btrecord --help} Output}
\end{figure}
\FloatBarrier

\subsubsection{\label{sec:c-o-d}\texttt{-d} or
\texttt{--input-directory}\\Set Input Directory}

The \texttt{-d} option requires a single parameter providing the directory
name for where input files are to be found. The default directory is the
current directory (\texttt{.}).

\subsubsection{\label{sec:c-o-D}\texttt{-D} or
\texttt{--output-directory}\\Set Output Directory}

The \texttt{-D} option requires a single parameter providing the directory
name for where output files are to be placed. The default directory is the
current directory (\texttt{.}).

\subsubsection{\texttt{-F} or \texttt{--find-traces}\\Find Trace Files
Automatically}

The \texttt{-F} option instructs \texttt{btrecord} to find all the
trace files in the directory specified (either via the \texttt{-d}
option, or in the default directory \texttt{.}).

\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
\subsubsection{\texttt{-V} or \texttt{--version}\\Display
\texttt{btrecord} Version}

The \texttt{-h} option displays the command line options and
defaults, as presented in figure~\ref{fig:btrecord--help} on
page~\pageref{fig:btrecord--help}.

The \texttt{-V} option displays the \texttt{btrecord} version, as shown here:

\begin{verbatim}
$ btrecord --version
btrecord -- version 0.9.0
\end{verbatim}

Both commands exit immediately after processing the option.

\subsubsection{\label{sec:c-o-m}\texttt{-m} or
\texttt{--max-bunch-time}\\Set Maximum Time Per Bunch}

The \texttt{-m} option requires a single parameter which specifies an
amount of time (in nanoseconds) to include in any one bunch of IOs that
are to be processed. The smaller the value, the smaller the number of
IOs processed at one time, perhaps yielding a more realistic replay.
However, after a certain point the amount of overhead per bunch may result
in additional real replay time, thus yielding less accurate replay times.

The default value is 10,000,000 nanoseconds (10 milliseconds).

\subsubsection{\label{sec:c-o-M}\texttt{-M} or
\texttt{--max-pkts}\\Set Maximum Packets Per Bunch}

The \texttt{-M} option requires a single parameter which specifies the
maximum number of IOs to store in a single bunch. As with the \texttt{-m}
option (section~\ref{sec:c-o-m}), smaller values \emph{may} or \emph{may not}
yield more accurate replay times.

The default value is 8, with a maximum value of up to 512 being supported.
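
Since \texttt{-m} takes nanoseconds, a small conversion helper avoids
miscounting zeros. The sketch below only prints a command line rather than
running it; the helper name \texttt{ms\_to\_ns}, the device name
\texttt{sdab} and the chosen values (2 milliseconds, 16 IOs) are
illustrative assumptions, not recommendations.

\begin{verbatim}
#!/bin/sh
# Convert milliseconds to the nanoseconds expected by --max-bunch-time.
ms_to_ns() {
    echo $(( $1 * 1000000 ))
}

# Hypothetical tuning: 2 msec per bunch, at most 16 IOs per bunch.
max_ns=$(ms_to_ns 2)
echo "btrecord --max-bunch-time=$max_ns --max-pkts=16 sdab"
# prints: btrecord --max-bunch-time=2000000 --max-pkts=16 sdab
\end{verbatim}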

\subsubsection{\label{sec:c-o-o}\texttt{-o} or
\texttt{--output-base}\\Set Base Name for Output Files}

Each output file has 3 fields:

\begin{enumerate}
  \item Device identifier (taken directly from the device name of the
  \texttt{blktrace} output file).

  \item \texttt{btrecord} base name -- by default ``replay''.

  \item And the CPU number (again, taken directly from the
  \texttt{blktrace} output file name).
\end{enumerate}

This option requires a single parameter that will override the default name
(replay), and replace it with the specified value.
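
As a small illustration of the naming scheme, the sketch below joins the
three fields with dots, in the order listed above. The helper name
\texttt{record\_name} and the base name \texttt{run1} are hypothetical.

\begin{verbatim}
#!/bin/sh
# record_name DEV CPU [BASE]: build <device>.<base>.<cpu>.
record_name() {
    echo "$1.${3:-replay}.$2"
}

record_name sdab 3          # prints: sdab.replay.3
record_name sdab 3 run1     # prints: sdab.run1.3 (as with -o run1)
\end{verbatim}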

\subsubsection{\label{sec:c-o-v}\texttt{-v} or
\texttt{--verbose}\\Select Verbose Output}

This option will output some simple statistics at the end of a successful
run. Figure~\ref{fig:verb-out} (page~\pageref{fig:verb-out}) shows
an example of some output, while figure~\ref{fig:verb-defs}
(page~\pageref{fig:verb-defs}) shows what the fields mean.

\begin{figure}[h!]
\begin{verbatim}
sdab:0: 580661 pkts (tot), 126030 pkts (replay), 89809 bunches, 1.4 pkts/bunch
sdab:1: 2559775 pkts (tot), 430172 pkts (replay), 293029 bunches, 1.5 pkts/bunch
sdab:2: 653559 pkts (tot), 136522 pkts (replay), 102288 bunches, 1.3 pkts/bunch
sdab:3: 474773 pkts (tot), 117849 pkts (replay), 69572 bunches, 1.7 pkts/bunch
\end{verbatim}
\caption{\label{fig:verb-out}Verbose Output Example}
\end{figure}
\FloatBarrier

\begin{figure}[h!]
\begin{description}
  \item[Field 1] The first field contains the device name and CPU
  identifier. Thus: \texttt{sdab:0:} means the device \texttt{sdab} and
  traces on CPU 0. 

  \item[Field 2] The second field contains the total number of packets
  processed for each device file. 

  \item[Field 3] The next field shows the number of packets eligible for
  replay. 

  \item[Field 4] The fourth field contains the total number of IO bunches. 

  \item[Field 5] The last field shows the average number of IOs per bunch
  recorded.
\end{description}
\caption{\label{fig:verb-defs}Verbose Field Definitions}
\end{figure}
\FloatBarrier
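
The last column can be cross-checked from the others: it is the number of
replayable packets divided by the number of bunches. The sketch below
recomputes it with \texttt{awk}; the function name
\texttt{avg\_per\_bunch} is hypothetical, and the column positions
(whitespace-separated columns 5 and 8) are read off the example output
above rather than from any format specification.

\begin{verbatim}
#!/bin/sh
# Recompute the pkts/bunch column from btrecord -v output on stdin.
# Column 5 is the replayable packet count, column 8 the bunch count.
avg_per_bunch() {
    awk '{ printf "%s %.1f\n", $1, $5 / $8 }'
}

echo 'sdab:0: 580661 pkts (tot), 126030 pkts (replay), 89809 bunches, 1.4 pkts/bunch' |
    avg_per_bunch
# prints: sdab:0: 1.4
\end{verbatim}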

%---------------------
\newpage\subsection{\texttt{btreplay} Command Line Options}
\begin{figure}[h!]
\begin{verbatim}
Usage: btreplay -- version 0.9.3

	[ -c <cpus> : --cpus=<cpus>           ] Default: 1
	[ -d <dir>  : --input-directory=<dir> ] Default: .
	[ -F        : --find-records          ] Default: Off
	[ -h        : --help                  ] Default: Off
	[ -i <base> : --input-base=<base>     ] Default: replay
	[ -I <iters>: --iterations=<iters>    ] Default: 1
	[ -M <file> : --map-devs=<file>       ] Default: None
	[ -N        : --no-stalls             ] Default: Off
	[ -x <int>  : --acc-factor=<int>      ] Default: 1
	[ -v        : --verbose               ] Default: Off
	[ -V        : --version               ] Default: Off
	[ -W        : --write-enable          ] Default: Off
	<dev...>                                Default: None
\end{verbatim}
\caption{\label{fig:btreplay--help}\texttt{btreplay --help} Output}
\end{figure}
\FloatBarrier

\subsubsection{\label{sec:p-o-c}\texttt{-c} or
\texttt{--cpus}\\Set Number of CPUs to Use}

The \texttt{-c} option requires a single parameter specifying how many
CPUs on the replay system to utilize. If this is fewer than the number of
CPUs on the recording system, CPU IDs are wrapped, which \emph{may}
overload the CPUs on the replay system. The default, as listed in
figure~\ref{fig:btreplay--help}, is 1.

\subsubsection{\label{sec:p-o-d}\texttt{-d} or
\texttt{--input-directory}\\Set Input Directory}

The \texttt{-d} option requires a single parameter providing the directory
name for where input files are to be found. The default directory is the
current directory (\texttt{.}).

\subsubsection{\texttt{-F} or \texttt{--find-records}\\Find Record Files
Automatically}

The \texttt{-F} option instructs \texttt{btreplay} to find all the
record files in the directory specified (either via the \texttt{-d}
option, or in the default directory \texttt{.}).

\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
\subsubsection{\texttt{-V} or \texttt{--version}\\Display
\texttt{btreplay} Version}

The \texttt{-h} option displays the command line options and
defaults, as presented in figure~\ref{fig:btreplay--help} on
page~\pageref{fig:btreplay--help}.

The \texttt{-V} option displays the \texttt{btreplay} version, as shown here:

\begin{verbatim}
$ btreplay --version
btreplay -- version 0.9.0
\end{verbatim}

Both commands exit immediately after processing the option.

\subsubsection{\label{sec:p-o-i}\texttt{-i} or
\texttt{--input-base}\\Set Base Name for Input Files}

Each input file has 3 fields:

\begin{enumerate}
  \item Device identifier (taken directly from the device name of the
  \texttt{blktrace} output file).

  \item \texttt{btrecord} base name -- by default ``replay''.

  \item And the CPU number (again, taken directly from the
  \texttt{blktrace} output file name).
\end{enumerate}

This option requires a single parameter that will override the default name
(replay), and replace it with the specified value.

\subsubsection{\label{sec:p-o-I}\texttt{-I} or
\texttt{--iterations}\\Set Number of Iterations to Run}

This option requires a single parameter which specifies the number of times
to run through the input files. The default value is 1.

\subsubsection{\label{sec:p-o-M}\texttt{-M} or \texttt{--map-devs}\\
Specify Device Mappings}

This option requires a single parameter which specifies the name of a
file containing device mappings. The file format is very simple, with
just two pieces of data per line:

\begin{enumerate}
  \item The device name on the recorded system (with the \texttt{'/dev/'}
  removed). Example: \texttt{/dev/sda} would just be \texttt{sda}.

  \item The device name on the replay system to use (again, without the
  \texttt{'/dev/'} path prepended).
\end{enumerate}

An example file that maps devices \texttt{/dev/sda} and
\texttt{/dev/sdb} on the recorded system to \texttt{/dev/sdg} and
\texttt{/dev/sdh} on the replay system would be:

\begin{verbatim}
sda sdg
sdb sdh
\end{verbatim}

The only entries in the file that are allowed are these two element lines
-- we do not (yet?) support the notion of blank lines, or comment lines, or
the like.

The utility \emph{does} allow for multiple \texttt{-M} options to be
supplied on the command line.
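
Because the mapping file format is so strict, it can be worth checking a
file before handing it to \texttt{btreplay}. The sketch below writes the
two-line example from above and verifies that every line has exactly two
fields. The file name \texttt{devs.map} and the helper
\texttt{check\_map} are hypothetical and not part of the tools.

\begin{verbatim}
#!/bin/sh
# Write the two-device mapping from the example above.
cat > devs.map <<'EOF'
sda sdg
sdb sdh
EOF

# check_map FILE: succeed only if every line has exactly two fields
# (so blank lines and one-word lines are rejected).
check_map() {
    awk 'NF != 2 { bad = 1 } END { exit bad + 0 }' "$1"
}

check_map devs.map && echo "devs.map: ok"
# prints: devs.map: ok
\end{verbatim}

The checked file would then be passed to \texttt{btreplay} with
\texttt{-M devs.map}.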

\subsubsection{\label{sec:o-N}\texttt{-N} or \texttt{--no-stalls}\\Disable
Pre-bunch Stalls}

When specified on the command line, all pre-bunch stall indicators will be
ignored. IOs will be replayed without inter-bunch delays.

\subsubsection{\label{sec:o-x}\texttt{-x} or \texttt{--acc-factor}\\Acceleration
Factor}

  While the \texttt{--no-stalls} option allows the traces to be replayed
  with no waiting time, this option specifies an acceleration factor
  to be applied to the stalls. If a value of two is used, each stall time
  is divided by two, resulting in a correspondingly shorter execution
  time. Note that if this number is too high, the result will
  be equivalent to having no stalls at all.
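
The arithmetic can be sketched as follows. The helper name
\texttt{scaled\_stall} is hypothetical, and the use of integer division is
an assumption of this sketch rather than a statement about how
\texttt{btreplay} rounds.

\begin{verbatim}
#!/bin/sh
# scaled_stall STALL_NSEC FACTOR: stall after applying --acc-factor.
# Integer division is an assumption of this sketch.
scaled_stall() {
    echo $(( $1 / $2 ))
}

scaled_stall 10000000 1     # prints: 10000000 (default factor)
scaled_stall 10000000 2     # prints: 5000000
scaled_stall 10000000 100   # prints: 100000
\end{verbatim}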

\subsubsection{\label{sec:p-o-v}\texttt{-v} or
\texttt{--verbose}\\Select Verbose Output}

When specified on the command line, this option instructs \texttt{btreplay}
to store information concerning each \emph{stall} and IO operation
performed by \texttt{btreplay}. The name of each file so created will be
the input file name used with an extension of \texttt{.rep} appended onto
it. Thus, an input file of the name \texttt{sdab.replay.3} would generate a
verbose output file with the name \texttt{sdab.replay.3.rep} in the
directory specified for input files.

In addition, \texttt{btreplay} will also output to \texttt{stderr} the
names of the input files being processed.

\subsubsection{\label{sec:p-o-W}\texttt{-W} or
\texttt{--write-enable}\\Enable Writing During Replay}

As a precautionary measure, by default \texttt{btreplay} will \emph{not}
process \emph{write} requests. In order to enable \texttt{btreplay} to
actually \emph{write} to devices one must explicitly specify the
\texttt{-W} option.

\end{document}