Diffstat (limited to 'btreplay/doc/btreplay.tex')
-rw-r--r-- | btreplay/doc/btreplay.tex | 532 |
1 files changed, 532 insertions, 0 deletions
diff --git a/btreplay/doc/btreplay.tex b/btreplay/doc/btreplay.tex new file mode 100644 index 0000000..8b0ecf7 --- /dev/null +++ b/btreplay/doc/btreplay.tex @@ -0,0 +1,532 @@ +% +% Copyright (C) 2007 Alan D. Brunelle <Alan.Brunelle@hp.com> +% +% This program is free software; you can redistribute it and/or modify +% it under the terms of the GNU General Public License as published by +% the Free Software Foundation; either version 2 of the License, or +% (at your option) any later version. +% +% This program is distributed in the hope that it will be useful, +% but WITHOUT ANY WARRANTY; without even the implied warranty of +% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +% GNU General Public License for more details. +% +% You should have received a copy of the GNU General Public License +% along with this program; if not, write to the Free Software +% Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +% +% vi :set textwidth=75 +% +\documentclass{article} +\usepackage{multirow,graphicx,placeins} + +\begin{document} +%--------------------- +\title{\texttt{btrecord} and \texttt{btreplay} User Guide} +\author{Alan D. Brunelle (Alan.Brunelle@hp.com)} +\date{\today} +\maketitle +\begin{abstract} +\input{abstract.tex} +\end{abstract} +\thispagestyle{empty}\newpage +%--------------------- +\tableofcontents\thispagestyle{empty}\newpage +%--------------------- +\section{Introduction} +\input{abstract.tex} + +\bigskip +This document presents the command line overview for +\texttt{btrecord} and \texttt{btreplay}, and shows some commonly used +example usages of it in everyday work here at OSLO's Scalability and +Performance Group. + +\subsection*{Build Note} + +To build these tools, one needs to +place the source directory next to a valid +\texttt{blktrace}\footnote{\texttt{git://git.kernel.dk/blktrace.git}} +directory, as it includes \texttt{../blktrace} in the \texttt{Makefile}. 
+ + +%--------------------- +\newpage\section{\texttt{btrecord} and \texttt{btreplay} Operating Model} + +The \texttt{blktrace} utility provides the ability to collect detailed +traces from the kernel for each IO processed by the block IO layer. The +traces provide a complete timeline for each IO processed, including +detailed information concerning when an IO was first received by the block +IO layer -- indicating the device, CPU number, time stamp, IO direction, +sector number and IO size (number of sectors). Using this information, +one is able to \emph{replay} the IO again on the same machine or another +set up entirely. + +\subsection{Basic Workflow} +The basic operating work-flow to replay IOs would be something like: + +\begin{enumerate} + \item Run \texttt{blktrace} to collect traces. Here you specify the + device or devices that you wish to trace and later replay IOs upon. Note: + the only traces you are interested in are \emph{QUEUE} requests -- + thus, to save system resources (including storage for traces), one could + specify the \texttt{-a queue} command line option to \texttt{blktrace}. + + \item While \texttt{blktrace} is running, you run the workload that you + are interested in. + + \item When the work load has completed, you stop the \texttt{blktrace} + utility (thus saving all traces over the complete workload). + + \item You extract the pertinent IO information from the traces saved by + \texttt{blktrace} using the \texttt{btrecord} utility. This will parse + each trace file created by \texttt{blktrace}, and craft IO descriptions + to be used in the next phase of the workload processing. + + \item Once \texttt{btrecord} has successfully created a series of data + files to be processed, you can run the \texttt{btreplay} utility which + attempts to generate the same IOs seen during the sample workload phase. 
+\end{enumerate} + +\subsection{IO Stream Replay Characteristics} + The major characteristics of the IO stream that are kept intact include: + + \begin{description} + \item[Device] The IOs are replayed on the same device as was seen + during the sample workload. + + \item[IO direction] The same IO direction (read/write) is maintained. + + \item[IO offset] The same device offset is maintained. + + \item[IO size] The same number of sectors are transferred. + + \item[Time differential] The time stamps stored during the + \texttt{blktrace} run are used to determine the amount of time between + IOs during the sample workload. \texttt{btreplay} \emph{attempts} to + maintain the same time differential between IOs, but no guarantees as + to complete accuracy are provided by the utility. + + \item[Device IO Stream Ordering] All IOs on a device are submitted in + the precise order they were seen during the sample workload run. + \end{description} + + As noted above, the time between IOs may not be accurately maintained + during replays. In addition the actual ordering of IOs \emph{between} + devices is not necessarily maintained. (Each device with an IO stream + maintains its own concept of time, and thus there may be slippage of the + time kept between managing threads.) + + \begin{quotation} + We have prototyped a different approach, wherein a single managing + thread handles all IOs across all devices. This approach, while + guaranteeing correct ordering of IOs across all devices, resulted in + much worse timing on a per IO basis. + \end{quotation} + +\subsection{\texttt{btrecord/btreplay} Method of Operation} + +As noted above, \texttt{btrecord} extracts \texttt{QUEUE} operations from +\texttt{blktrace} output. These \texttt{QUEUE} operations indicate the +entrance of IOs into the block IO layer. 
In order to replay these IOs with +some accuracy in regards to ordering and timeliness, we decided to take +multiple sequential (in time) IOs and put them in a single \emph{bunch} of +IOs that will be processed as a single \emph{asynchronous IO} call to the +kernel\footnote{Attempts to do them individually resulted in too large of a +turnaround time penalty (user-space to kernel and back). Note that in a +number of workloads, the IOs are coming in from the page cache handling +code, and thus are submitted to the block IO layer with \emph{very small} +time intervals between issues.}. To manage the size of the \emph{bunches}, +the \texttt{btrecord} utility provides you with two controlling knobs: + +\begin{description} + \item[\texttt{--max-bunch-time}] This is the amount of time to encompass + in one bunch -- only IOs within the time specified are eligible + for \emph{bunching.} The default time is 10 milliseconds (10,000,000 + nanoseconds). Refer to section~\ref{sec:c-o-m} on page~\pageref{sec:c-o-m} + for more information. + + \item[\texttt{--max-pkts}] A \emph{bunch} size can be anywhere from + 1 to 512 packets in size and by default we max a bunch to contain no + more than 8 individual IOs. With this option, one can increase or + decrease the maximum \emph{bunch} size. Refer to section~\ref{sec:c-o-M} + on page~\pageref{sec:c-o-M} for more information. +\end{description} + +Each input data file (one per device per CPU) results in a new record +data file (again, one per device per CPU) which contains information +about \emph{bunches} of IOs to be replayed. \texttt{btreplay} operates on +these record data files by spawning a new pair of threads per file. One +thread manages the submitting of AIOs per bunch in the record data file, +while the other thread manages reclaiming AIOs completed\footnote{We +have found that having the same thread do both results in a further +reduction in replay timing accuracy.}. 
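The bunching rule described above can be sketched in Python as follows. This is only an illustration and is not taken from the \texttt{btrecord} source: the names \texttt{IO} and \texttt{make\_bunches} are hypothetical, and the sketch assumes that the time limit is measured from the first IO in a bunch and that an IO falling exactly on the limit still joins the bunch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IO:
    """One QUEUE event (hypothetical record layout, for illustration only)."""
    time_ns: int    # time stamp of the QUEUE trace, in nanoseconds
    sector: int     # starting sector on the device
    nsectors: int   # IO size in sectors
    write: bool     # True for a write, False for a read


def make_bunches(ios: List[IO],
                 max_bunch_time_ns: int = 10_000_000,
                 max_pkts: int = 8) -> List[List[IO]]:
    """Group time-ordered IOs into bunches.

    A bunch is closed when it already holds max_pkts IOs, or when the next
    IO lies more than max_bunch_time_ns after the first IO of the bunch
    (assumption: the window is anchored at the first IO).
    """
    if not 1 <= max_pkts <= 512:
        raise ValueError("max_pkts must be between 1 and 512")

    bunches: List[List[IO]] = []
    current: List[IO] = []
    for io in ios:
        too_many = len(current) >= max_pkts
        too_late = bool(current) and (
            io.time_ns - current[0].time_ns > max_bunch_time_ns)
        if current and (too_many or too_late):
            bunches.append(current)
            current = []
        current.append(io)
    if current:
        bunches.append(current)
    return bunches


if __name__ == "__main__":
    trace = [
        IO(0, 0, 8, False),
        IO(1_000_000, 8, 8, False),
        IO(2_000_000, 16, 8, True),
        IO(20_000_000, 24, 8, False),
    ]
    # Print the number of IOs that end up in each bunch.
    print([len(b) for b in make_bunches(trace)])
```

With the defaults, the first three IOs in the demo trace fall within 10 milliseconds of each other and share a bunch, while the fourth starts a new one. Passing \texttt{max\_pkts=1} to the sketch puts every IO in its own bunch.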
+ +Each submitting thread simply reads the input file of \emph{bunches} +recorded by \texttt{btrecord}, and attempts to faithfully reproduce the +ordering and timing of IOs seen during the sample workload. The reclaiming +thread simply waits for AIO completions, freeing up resources for the +submitting thread to use when submitting new AIOs. + +The number of CPUs being used on the replay system can be different from +the number on the recorded system. To help with this mapping, the +\texttt{--cpus} option allows one to state how many CPUs on the replay +system to utilize. If the number of CPUs on the replay system is less than +on the recording system, we wrap CPU IDs. This \emph{may} result in an +overload of CPU processing capabilities on the replay system. (Refer to +section~\ref{sec:p-o-c} on page~\pageref{sec:p-o-c} for more details about the +\texttt{--cpus} option.) + +\newpage\subsection{Known Deficiencies and Proposed Possible Fixes} + +The known deficiencies of the current set of utilities are +outlined here; in some cases, ideas for additions and/or improvements are +included as well. + +\begin{enumerate} + \item Lack of IO ordering across devices. + + \begin{quote} + \emph{We could institute the notion of global time across threads, + and thus ensure IO ordering across devices, with some reduction in + timing accuracy.} + \end{quote} + + \item Lack of IO timing accuracy -- additional time between IO bunches. + + \begin{quote} + \emph{This is the primary problem with any IO replay mechanism -- how + to guarantee per-IO timing accuracy with respect to other replayed IOs? + One idea to reduce errors in this area would be to push the IO replay + into the kernel, where you \emph{may} receive more responsive timings.} + \end{quote} + + \item Bunching of IOs results in reduced time amongst IOs within a bunch. + + \begin{quote} + \emph{The user has \emph{some} control over this (via the + \texttt{--max-pkts} option). 
One \emph{could} simply specify + \texttt{--max-pkts=1} and then each IO would be treated individually. Of + course, this would probably then run into the problem of excessive + inter-IO times.} + \end{quote} + + \item 1-to-1 mapping of devices -- for now the devices on the replay + machine must be the same as on the recording machine. + + \begin{quote} + \emph{It should be relatively trivial to add in the notion of + mapping -- simply include a file that is read which maps devices + on one machine to devices (with offsets and sizes) on the replay + machine\footnote{The notion of an offset and device size to replay on + could be used to both allow for a single device to masquerade as more + than one device, and could be utilized in case the replay device is + smaller than the recorded device.}.} + + \medskip\emph{One could also add in the notion of CPU mappings -- + device $D_{rec}$ managed by CPU $C_{rec}$ on the recorded system + shall be replayed on device $D_{rep}$ and CPU $C_{rep}$ on the + replay machine.} + + \bigskip + \begin{quote} + With version 0.9.1 we now support the \texttt{-M} option to do this + -- see section~\ref{sec:p-o-M} on page~\pageref{sec:p-o-M} for more + information on device mapping. + \end{quote} + \end{quote} + +\end{enumerate} + +%--------------------- +\newpage\section{\label{sec:command-line}Command Line Options} +\subsection{\texttt{btrecord} Command Line Options} +\begin{figure}[h!] +\begin{verbatim} +Usage: btrecord -- version 0.9.3 + + [ -d <dir> : --input-directory=<dir> ] Default: . + [ -D <dir> : --output-directory=<dir>] Default: . + [ -F : --find-traces ] Default: Off + [ -h : --help ] Default: Off + [ -m <nsec> : --max-bunch-time=<nsec> ] Default: 10 msec + [ -M <pkts> : --max-pkts=<pkts> ] Default: 8 + [ -o <base> : --output-base=<base> ] Default: replay + [ -v : --verbose ] Default: Off + [ -V : --version ] Default: Off + <dev>... 
Default: None +\end{verbatim} +\caption{\label{fig:btrecord--help}\texttt{btrecord --help} Output} +\end{figure} +\FloatBarrier + +\subsubsection{\label{sec:c-o-d}\texttt{-d} or +\texttt{--input-directory}\\Set Input Directory} + +The \texttt{-d} option requires a single parameter providing the name of +the directory in which input files are to be found. The default directory is the +current directory (\texttt{.}). + +\subsubsection{\label{sec:c-o-D}\texttt{-D} or +\texttt{--output-directory}\\Set Output Directory} + +The \texttt{-D} option requires a single parameter providing the name of +the directory in which output files are to be placed. The default directory is the +current directory (\texttt{.}). + +\subsubsection{\texttt{-F} or \texttt{--find-traces}\\Find Trace Files +Automatically} + +The \texttt{-F} option instructs \texttt{btrecord} to find all the +trace files in the directory specified (either via the \texttt{-d} +option, or in the default directory \texttt{.}). + +\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message} +\subsubsection{\texttt{-V} or \texttt{--version}\\Display +\texttt{btrecord} Version} + +The \texttt{-h} option displays the command line options and +defaults, as presented in figure~\ref{fig:btrecord--help} on +page~\pageref{fig:btrecord--help}. + +The \texttt{-V} option displays the \texttt{btrecord} version, as shown here: + +\begin{verbatim} +$ btrecord --version +btrecord -- version 0.9.0 +\end{verbatim} + +Both commands exit immediately after processing the option. + +\subsubsection{\label{sec:c-o-m}\texttt{-m} or +\texttt{--max-bunch-time}\\Set Maximum Time Per Bunch} + +The \texttt{-m} option requires a single parameter which specifies an +amount of time (in nanoseconds) to include in any one bunch of IOs that +are to be processed. The smaller the value, the smaller the number of +IOs processed at one time -- perhaps yielding a more realistic replay. 
+However, after a certain point the amount of overhead per bunch may result +in additional real replay time, thus yielding less accurate replay times. + +The default value is 10,000,000 nanoseconds (10 milliseconds). + +\subsubsection{\label{sec:c-o-M}\texttt{-M} or +\texttt{--max-pkts}\\Set Maximum Packets Per Bunch} + +The \texttt{-M} option requires a single parameter which specifies the +maximum number of IOs to store in a single bunch. As with the \texttt{-m} +option (section~\ref{sec:c-o-m}), smaller values \emph{may} or \emph{may not} +yield more accurate replay times. + +The default value is 8, with a maximum value of up to 512 being supported. + +\subsubsection{\label{sec:c-o-o}\texttt{-o} or +\texttt{--output-base}\\Set Base Name for Output Files} + +Each output file has 3 fields: + +\begin{enumerate} + \item Device identifier (taken directly from the device name of the + \texttt{blktrace} output file). + + \item \texttt{btrecord} base name -- by default ``replay''. + + \item And the CPU number (again, taken directly from the + \texttt{blktrace} output file name). +\end{enumerate} + +This option requires a single parameter that will override the default name +(replay), and replace it with the specified value. + +\subsubsection{\label{sec:c-o-v}\texttt{-v} or +\texttt{--verbose}\\Select Verbose Output} + +This option will output some simple statistics at the end of a successful +run. Figure~\ref{fig:verb-out} (page~\pageref{fig:verb-out}) shows +an example of some output, while figure~\ref{fig:verb-defs} +(page~\pageref{fig:verb-defs}) shows what the fields mean. + +\begin{figure}[h!] 
+\begin{verbatim} +sdab:0: 580661 pkts (tot), 126030 pkts (replay), 89809 bunches, 1.4 pkts/bunch +sdab:1: 2559775 pkts (tot), 430172 pkts (replay), 293029 bunches, 1.5 pkts/bunch +sdab:2: 653559 pkts (tot), 136522 pkts (replay), 102288 bunches, 1.3 pkts/bunch +sdab:3: 474773 pkts (tot), 117849 pkts (replay), 69572 bunches, 1.7 pkts/bunch +\end{verbatim} +\caption{\label{fig:verb-out}Verbose Output Example} +\end{figure} +\FloatBarrier + +\begin{figure}[h!] +\begin{description} + \item[Field 1] The first field contains the device name and CPU + identifier. Thus: \texttt{sdab:0:} means the device \texttt{sdab} and + traces on CPU 0. + + \item[Field 2] The second field contains the total number of packets + processed for each device file. + + \item[Field 3] The next field shows the number of packets eligible for + replay. + + \item[Field 4] The fourth field contains the total number of IO bunches. + + \item[Field 5] The last field shows the average number of IOs per bunch + recorded. +\end{description} +\caption{\label{fig:verb-defs}Verbose Field Definitions} +\end{figure} +\FloatBarrier + +%--------------------- +\newpage\subsection{\texttt{btreplay} Command Line Options} +\begin{figure}[h!] +\begin{verbatim} +Usage: btreplay -- version 0.9.3 + + [ -c <cpus> : --cpus=<cpus> ] Default: 1 + [ -d <dir> : --input-directory=<dir> ] Default: . 
+ [ -F : --find-records ] Default: Off + [ -h : --help ] Default: Off + [ -i <base> : --input-base=<base> ] Default: replay + [ -I <iters>: --iterations=<iters> ] Default: 1 + [ -M <file> : --map-devs=<file> ] Default: None + [ -N : --no-stalls ] Default: Off + [ -x <int> : --acc-factor=<int> ] Default: 1 + [ -v : --verbose ] Default: Off + [ -V : --version ] Default: Off + [ -W : --write-enable ] Default: Off + <dev...> Default: None +\end{verbatim} +\caption{\label{fig:btreplay--help}\texttt{btreplay --help} Output} +\end{figure} +\FloatBarrier + +\subsubsection{\label{sec:p-o-c}\texttt{-c} or +\texttt{--cpus}\\Set Number of CPUs to Use} + +The \texttt{-c} option requires a single parameter which specifies the number of +CPUs on the replay system to utilize. If this is fewer than the number of CPUs on +the recording system, CPU IDs are wrapped. The default value is 1. + +\subsubsection{\label{sec:p-o-d}\texttt{-d} or +\texttt{--input-directory}\\Set Input Directory} + +The \texttt{-d} option requires a single parameter providing the name of +the directory in which input files are to be found. The default directory is the +current directory (\texttt{.}). + +\subsubsection{\texttt{-F} or \texttt{--find-records}\\Find Record Files +Automatically} + +The \texttt{-F} option instructs \texttt{btreplay} to find all the +record files in the directory specified (either via the \texttt{-d} +option, or in the default directory \texttt{.}). + +\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message} +\subsubsection{\texttt{-V} or \texttt{--version}\\Display +\texttt{btreplay} Version} + +The \texttt{-h} option displays the command line options and +defaults, as presented in figure~\ref{fig:btreplay--help} on +page~\pageref{fig:btreplay--help}. + +The \texttt{-V} option displays the \texttt{btreplay} version, as shown here: + +\begin{verbatim} +$ btreplay --version +btreplay -- version 0.9.0 +\end{verbatim} + +Both commands exit immediately after processing the option. 
+ +\subsubsection{\label{sec:p-o-i}\texttt{-i} or +\texttt{--input-base}\\Set Base Name for Input Files} + +Each input file name has 3 fields: + +\begin{enumerate} + \item Device identifier (taken directly from the device name of the + \texttt{blktrace} output file). + + \item \texttt{btrecord} base name -- by default ``replay''. + + \item And the CPU number (again, taken directly from the + \texttt{blktrace} output file name). +\end{enumerate} + +This option requires a single parameter that will override the default name +(replay), and replace it with the specified value. + +\subsubsection{\label{sec:p-o-I}\texttt{-I} or +\texttt{--iterations}\\Set Number of Iterations to Run} + +This option requires a single parameter which specifies the number of times +to run through the input files. The default value is 1. + +\subsubsection{\label{sec:p-o-M}\texttt{-M} or \texttt{--map-devs}\\ +Specify Device Mappings} + +This option requires a single parameter which specifies the name of a +file containing device mappings. The file format is very simple, with +just two pieces of data per line: + +\begin{enumerate} + \item The device name on the recorded system (with the \texttt{'/dev/'} + removed). Example: \texttt{/dev/sda} would just be \texttt{sda}. + + \item The device name on the replay system to use (again, without the + \texttt{'/dev/'} path prepended). +\end{enumerate} + +An example file that maps devices \texttt{/dev/sda} and +\texttt{/dev/sdb} on the recorded system to \texttt{/dev/sdg} and +\texttt{/dev/sdh} on the replay system would be: + +\begin{verbatim} +sda sdg +sdb sdh +\end{verbatim} + +The only entries allowed in the file are these two-element lines +-- we do not (yet?) support the notion of blank lines, or comment lines, or +the like. + +The utility \emph{does} allow for multiple \texttt{-M} options to be +supplied on the command line. 
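The file format described above is simple enough to check before starting a replay run. The following Python sketch is not taken from \texttt{btreplay}; \texttt{parse\_map\_devs} is a hypothetical helper, and rejecting names that begin with \texttt{/dev/} is an added assumption based on the rule that the prefix is omitted.

```python
from typing import Dict


def parse_map_devs(text: str) -> Dict[str, str]:
    """Parse device mappings: one 'recorded replay' pair per line.

    Every line must hold exactly two whitespace-separated names, so a
    blank line is reported as an error rather than skipped.
    """
    mapping: Dict[str, str] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 fields, got {len(fields)}")
        for name in fields:
            # Assumption: names are given without the /dev/ prefix.
            if name.startswith("/dev/"):
                raise ValueError(
                    f"line {lineno}: drop the /dev/ prefix from {name!r}")
        recorded, replay = fields
        mapping[recorded] = replay
    return mapping


if __name__ == "__main__":
    # The example file from the text: sda -> sdg, sdb -> sdh.
    print(parse_map_devs("sda sdg\nsdb sdh\n"))
```

Because several \texttt{-M} options may be given, a caller of this sketch could parse each file in turn and merge the resulting dictionaries; how \texttt{btreplay} itself resolves duplicate entries is not covered here.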
+ +\subsubsection{\label{sec:o-N}\texttt{-N} or \texttt{--no-stalls}\\Disable +Pre-bunch Stalls} + +When specified on the command line, all pre-bunch stall indicators will be +ignored. IOs will be replayed without inter-bunch delays. + +\subsubsection{\label{sec:o-x}\texttt{-x} or \texttt{--acc-factor}\\Acceleration +Factor} + + While the \texttt{--no-stalls} option allows the traces to be replayed + with no waiting time, this option specifies an acceleration factor + by which each stall time is divided. If a value of two is used, then each stall time is + halved, reducing the time spent stalling by + this factor. Note that if this number is too high, the results will + be equivalent to replaying with no stalls at all. + +\subsubsection{\label{sec:p-o-v}\texttt{-v} or +\texttt{--verbose}\\Select Verbose Output} + +When specified on the command line, this option instructs \texttt{btreplay} +to store information concerning each \emph{stall} and IO operation +performed by \texttt{btreplay}. The name of each file so created will be +the input file name used with an extension of \texttt{.rep} appended onto +it. Thus, an input file of the name \texttt{sdab.replay.3} would generate a +verbose output file with the name \texttt{sdab.replay.3.rep} in the +directory specified for input files. + +In addition, \texttt{btreplay} will output to \texttt{stderr} the +names of the input files being processed. + +\subsubsection{\label{sec:p-o-W}\texttt{-W} or +\texttt{--write-enable}\\Enable Writing During Replay} + +As a precautionary measure, by default \texttt{btreplay} will \emph{not} +process \emph{write} requests. In order to enable \texttt{btreplay} to +actually \emph{write} to devices one must explicitly specify the +\texttt{-W} option. + +\end{document}