-rw-r--r--  cachegrind/docs/cg-manual.xml      130
-rw-r--r--  callgrind/docs/cl-manual.xml       194
-rw-r--r--  docs/internals/3_2_BUGSTATUS.txt     3
-rw-r--r--  docs/xml/FAQ.xml                    62
-rw-r--r--  docs/xml/manual-core.xml           251
-rw-r--r--  docs/xml/manual-intro.xml           66
-rw-r--r--  docs/xml/manual.xml                  2
-rw-r--r--  docs/xml/quick-start-guide.xml      17
-rw-r--r--  lackey/docs/lk-manual.xml           31
-rw-r--r--  massif/docs/ms-manual.xml           18
-rw-r--r--  memcheck/docs/mc-manual.xml        141
11 files changed, 478 insertions, 437 deletions
diff --git a/cachegrind/docs/cg-manual.xml b/cachegrind/docs/cg-manual.xml
index 80f2a8c28..9b4b0b9a2 100644
--- a/cachegrind/docs/cg-manual.xml
+++ b/cachegrind/docs/cg-manual.xml
@@ -59,9 +59,6 @@ and test coverage.</para>
additionally specify <computeroutput>--branch-sim=yes</computeroutput>
on the command line.</para>
-<para>Any feedback, bug-fixes, suggestions, etc, welcome.</para>
-
-
<sect2 id="cg-manual.overview" xreflabel="Overview">
<title>Overview</title>
@@ -119,7 +116,7 @@ outputs of multiple Cachegrind runs, into a single file which you then
use as the input for
<computeroutput>cg_annotate</computeroutput>.</para>
-<para>The steps are described in detail in the following
+<para>These steps are described in detail in the following
sections.</para>
</sect2>
@@ -128,14 +125,14 @@ sections.</para>
<sect2 id="cache-sim" xreflabel="Cache simulation specifics">
<title>Cache simulation specifics</title>
-<para>Cachegrind uses a simulation for a machine with a split L1
-cache and a unified L2 cache. This configuration is used for all
-(modern) x86-based machines we are aware of. Old Cyrix CPUs had
-a unified I and D L1 cache, but they are ancient history
-now.</para>
+<para>Cachegrind simulates a machine with independent
+first level instruction and data caches (I1 and D1), backed by a
+unified second level cache (L2). This configuration is used by almost
+all modern machines. Some old Cyrix CPUs had a unified I and D L1
+cache, but they are ancient history now.</para>
-<para>The more specific characteristics of the simulation are as
-follows.</para>
+<para>Specific characteristics of the simulation are as
+follows:</para>
<itemizedlist>
@@ -162,9 +159,9 @@ follows.</para>
<listitem>
<para>Inclusive L2 cache: the L2 cache replicates all the
entries of the L1 cache. This is standard on Pentium chips,
- but AMD Athlons use an exclusive L2 cache that only holds
- blocks evicted from L1. Ditto AMD Durons and most modern
- VIAs.</para>
+ but AMD Opterons, Athlons and Durons
+ use an exclusive L2 cache that only holds
+ blocks evicted from L1. Ditto most modern VIA CPUs.</para>
</listitem>
</itemizedlist>
@@ -182,6 +179,14 @@ happens. You can manually specify one, two or all three levels
<computeroutput>--D1</computeroutput> and
<computeroutput>--L2</computeroutput> options.</para>
+<para>On PowerPC platforms
+Cachegrind cannot automatically
+determine the cache configuration, so you will
+need to specify it with the
+<computeroutput>--I1</computeroutput>,
+<computeroutput>--D1</computeroutput> and
+<computeroutput>--L2</computeroutput> options.</para>
+
<para>Other noteworthy behaviour:</para>
@@ -385,9 +390,11 @@ programs that spawn child processes.</para>
<title>Cachegrind options</title>
<!-- start of xi:include in the manpage -->
-<para id="cg.opts.para">Manually specifies the I1/D1/L2 cache
-configuration, where <varname>size</varname> and
-<varname>line_size</varname> are measured in bytes. The three items
+<para id="cg.opts.para">Using command line options, you can
+manually specify the I1/D1/L2 cache
+configuration to simulate. For each cache, you can specify the
+size, associativity and line size. The size and line size
+are measured in bytes. The three items
must be comma-separated, but with no spaces, eg:
<literallayout> valgrind --tool=cachegrind --I1=65535,2,64</literallayout>
@@ -551,7 +558,7 @@ Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
<para>Events recorded: event abbreviations are:</para>
<itemizedlist>
<listitem>
- <para><computeroutput>Ir </computeroutput>: I cache reads
+ <para><computeroutput>Ir</computeroutput>: I cache reads
(ie. instructions executed)</para>
</listitem>
<listitem>
@@ -563,7 +570,7 @@ Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
instruction read misses</para>
</listitem>
<listitem>
- <para><computeroutput>Dr </computeroutput>: D cache reads
+ <para><computeroutput>Dr</computeroutput>: D cache reads
(ie. memory reads)</para>
</listitem>
<listitem>
@@ -575,7 +582,7 @@ Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
read misses</para>
</listitem>
<listitem>
- <para><computeroutput>Dw </computeroutput>: D cache writes
+ <para><computeroutput>Dw</computeroutput>: D cache writes
(ie. memory writes)</para>
</listitem>
<listitem>
@@ -613,8 +620,8 @@ Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
</listitem>
<listitem>
- <para>Events shown: the events shown (a subset of events
- gathered). This can be adjusted with the
+ <para>Events shown: the events shown, which is a subset of the events
+ gathered. This can be adjusted with the
<computeroutput>--show</computeroutput> option.</para>
</listitem>
@@ -637,8 +644,8 @@ Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw file:function
<listitem>
<para>Threshold: <computeroutput>cg_annotate</computeroutput>
- by default omits functions that cause very low numbers of
- misses to avoid drowning you in information. In this case,
+ by default omits functions that cause very low counts
+ to avoid drowning you in information. In this case,
cg_annotate shows summaries the functions that account for
99% of the <computeroutput>Ir</computeroutput> counts;
<computeroutput>Ir</computeroutput> is chosen as the
@@ -682,28 +689,9 @@ unloading of shared objects) its counts are aggregated into a
single cost centre written as
<computeroutput>(discarded):(discarded)</computeroutput>.</para>
-<para>It is worth noting that functions will come from three
-types of source files:</para>
-
-<orderedlist>
- <listitem>
- <para>From the profiled program
- (<filename>concord.c</filename> in this example).</para>
- </listitem>
- <listitem>
- <para>From libraries (eg. <filename>getc.c</filename>)</para>
- </listitem>
- <listitem>
- <para>From Valgrind's implementation of some libc functions
- (eg. <computeroutput>vg_clientmalloc.c:malloc</computeroutput>).
- These are recognisable because the filename begins with
- <computeroutput>vg_</computeroutput>, and is probably one of
- <filename>vg_main.c</filename>,
- <filename>vg_clientmalloc.c</filename> or
- <filename>vg_mylibc.c</filename>.</para>
- </listitem>
-
-</orderedlist>
+<para>It is worth noting that functions will come both from
+the profiled program (eg. <filename>concord.c</filename>)
+and from libraries (eg. <filename>getc.c</filename>).</para>
<para>There are two ways to annotate source files -- by choosing
them manually, or with the
@@ -759,7 +747,7 @@ found in one of the directories specified with the
and file are both given.</para>
<para>Each line is annotated with its event counts. Events not
-applicable for a line are represented by a `.'; this is useful
+applicable for a line are represented by a dot. This is useful
for distinguishing between an event which cannot happen, and one
which can but did not.</para>
@@ -1063,7 +1051,7 @@ warnings.</para>
<listitem>
<para>Files with more than 65,535 lines cause difficulties
- for the stabs debug info reader. This is because the line
+ for the Stabs-format debug info reader. This is because the line
number in the <computeroutput>struct nlist</computeroutput>
defined in <filename>a.out.h</filename> under Linux is only a
16-bit value. Valgrind can handle some files with more than
@@ -1071,6 +1059,11 @@ warnings.</para>
line number overflows. But some cases are beyond it, in
which case you'll get a warning message explaining that
annotations for the file might be incorrect.</para>
+
+ <para>If you are using gcc 3.1 or later, this is most likely
+ irrelevant, since gcc switched to using the more modern DWARF2
+ format by default at version 3.1. DWARF2 does not have any such
+ limitations on line numbers.</para>
</listitem>
<listitem>
@@ -1087,14 +1080,6 @@ warnings.</para>
<para>This list looks long, but these cases should be fairly
rare.</para>
-<formalpara>
- <title>Note:</title>
- <para><computeroutput>stabs</computeroutput> is not an easy
- format to read. If you come across bizarre annotations that
- look like might be caused by a bug in the stabs reader, please
- let us know.</para>
-</formalpara>
-
</sect2>
@@ -1112,16 +1097,17 @@ shortcomings:</para>
</listitem>
<listitem>
- <para>It doesn't account for other process activity (although
- this is probably desirable when considering a single
- program).</para>
+ <para>It doesn't account for other process activity.
+ This is probably desirable when considering a single
+ program.</para>
</listitem>
<listitem>
<para>It doesn't account for virtual-to-physical address
- mappings; hence the entire simulation is not a true
+ mappings. Hence the simulation is not a true
representation of what's happening in the
- cache.</para>
+ cache. Most caches are physically indexed, but Cachegrind
+ simulates caches using virtual addresses.</para>
</listitem>
<listitem>
@@ -1157,17 +1143,17 @@ shortcomings:</para>
</itemizedlist>
-<para>Another thing worth nothing is that results are very
-sensitive. Changing the size of the
-the executable being profiled, or the size of the the shared objects
-it uses, or even the length of its name can perturb the
-results. Variations will be small, but don't expect perfectly
-repeatable results if your program changes at all.</para>
-
-<para>Beware also of address space randomisation, which many Linux
-distros now do by default. This loads the program and its libraries
-at different randomly chosen address each run, and may also disturb
-the results.</para>
+<para>Another thing worth noting is that results are very sensitive.
+Changing the size of the executable being profiled, or the sizes
+of any of the shared libraries it uses, or even the length of their
+file names, can perturb the results. Variations will be small, but
+don't expect perfectly repeatable results if your program changes at
+all.</para>
+
+<para>More recent GNU/Linux distributions perform address space
+randomisation, in which identical runs of the same program have their
+shared libraries loaded at different locations, as a security measure.
+This also perturbs the results.</para>
<para>While these factors mean you shouldn't trust the results to
be super-accurate, hopefully they should be close enough to be
diff --git a/callgrind/docs/cl-manual.xml b/callgrind/docs/cl-manual.xml
index f33fdad98..b6318207b 100644
--- a/callgrind/docs/cl-manual.xml
+++ b/callgrind/docs/cl-manual.xml
@@ -10,13 +10,12 @@
<sect1 id="cl-manual.use" xreflabel="Overview">
<title>Overview</title>
-<para>Callgrind is a Valgrind tool for profiling programs
-with the ability to construct a call graph from the execution.
+<para>Callgrind is a profiling tool that can
+construct a call graph for a program's run.
By default, the collected data consists of
-the number of instructions executed, their attribution
-to source lines, and
-call relationship among functions together with number of
-actually executed calls.
+the number of instructions executed, their relationship
+to source lines, the caller/callee relationship between functions,
+and the numbers of such calls.
Optionally, a cache simulator (similar to cachegrind) can produce
further information about the memory access behavior of the application.
</para>
@@ -34,8 +33,10 @@ of the profiling, two command line tools are provided:</para>
<para>You can read the manpage here: <xref
linkend="callgrind-annotate"/>.</para>
-->
- <para>For graphical visualization of the data, check out
- <ulink url="&cl-gui;">KCachegrind</ulink>.</para>
+ <para>For graphical visualization of the data, try
+ <ulink url="&cl-gui;">KCachegrind</ulink>, which is a KDE/Qt based
+ GUI that makes it easy to navigate the large amount of data that
+ Callgrind produces.</para>
</listitem>
</varlistentry>
@@ -62,36 +63,48 @@ command line.</para>
<sect2 id="cl-manual.functionality" xreflabel="Functionality">
<title>Functionality</title>
-<para>Cachegrind provides a flat profile: event counts (reads, misses etc.)
-attributed to functions exactly represent events which happened while the
-function itself was running, which also is called <emphasis>self</emphasis>
-or <emphasis>exclusive</emphasis> cost. In addition, Callgrind further
-attributes call sites inside functions with event counts for events which
-happened while the call was active, ie. while code was executed which actually
-was called from the given call site. Adding these call costs to the self cost of
-a function gives the so called <emphasis>inclusive</emphasis> cost.
-As an example, inclusive cost of <computeroutput>main()</computeroutput> should
-be almost 100 percent (apart from any cost spent in startup before main, such as
-initialization of the run time linker or construction of global C++ objects).
-</para>
-
-<para>Together with the call graph, this allows you to see the call chains starting
-from <computeroutput>main()</computeroutput>, inside which most of the
-events were happening. This especially is useful for functions called from
-multiple call sites, and where any optimization makes sense only by changing
-code in the caller (e.g. by reducing the call count).</para>
+<para>Cachegrind collects flat profile data: event counts (data reads,
+cache misses, etc.) are attributed directly to the function they
+occurred in. This simple cost attribution mechanism is sometimes
+called <emphasis>self</emphasis> or <emphasis>exclusive</emphasis>
+attribution.</para>
+
+<para>Callgrind extends this functionality by propagating costs
+across function call boundaries. If function <code>foo</code> calls
+<code>bar</code>, the costs from <code>bar</code> are added into
+<code>foo</code>'s costs. When applied to the program as a whole,
+this builds up a picture of so-called <emphasis>inclusive</emphasis>
+costs, that is, where the cost of each function includes the costs of
+all functions it called, directly or indirectly.</para>
+
+<para>As an example, the inclusive cost of
+<computeroutput>main</computeroutput> should be almost 100 percent
+of the total program cost. Because of costs arising before
+<computeroutput>main</computeroutput> is run, such as
+initialization of the run time linker and construction of global C++
+objects, the inclusive cost of <computeroutput>main</computeroutput>
+is not exactly 100 percent of the total program cost.</para>
+
+<para>Together with the call graph, this allows you to find the
+specific call chains starting from
+<computeroutput>main</computeroutput> in which the majority of the
+program's costs occur. Caller/callee cost attribution is also useful
+for profiling functions called from multiple call sites, and where
+optimization opportunities depend on changing code in the callers, in
+particular by reducing the call count.</para>
<para>Callgrind's cache simulation is based on the
<ulink url="&cg-tool-url;">Cachegrind tool</ulink>. Read
-<ulink url="&cg-doc-url;">Cachegrind's documentation</ulink> first;
-this page describes the features supported in addition to
+<ulink url="&cg-doc-url;">Cachegrind's documentation</ulink> first.
+The material below describes the features supported in addition to
Cachegrind's features.</para>
-<para>Callgrinds ability to trace function call varies with the ISA of the
-platform it is run on. Its usage was specially tailored for x86 and amd64,
-and unfortunately, it currently happens to show quite bad call/return detection
-in PPC32/64 code (this is because there are only jump/branch instructions
-in the PPC ISA, and Callgrind has to rely on heuristics).</para>
+<para>Callgrind's ability to detect function calls and returns depends
+on the instruction set of the platform it is run on. It works best
+on x86 and amd64, and unfortunately currently does not work so well
+on PowerPC code. This is because there are no explicit call or return
+instructions in the PowerPC instruction set, so Callgrind has to rely
+on heuristics to detect calls and returns.</para>
</sect2>
@@ -114,8 +127,8 @@ in the PPC ISA, and Callgrind has to rely on heuristics).</para>
<para>After program termination, a profile data file named
<computeroutput>callgrind.out.pid</computeroutput>
- is generated with <emphasis>pid</emphasis> being the process ID
- of the execution of this profile run.
+ is generated, where <emphasis>pid</emphasis> is the process ID
+ of the program being profiled.
The data file contains information about the calls made in the
program among the functions executed, together with events of type
<command>Instruction Read Accesses</command> (Ir).</para>
@@ -138,11 +151,11 @@ in the PPC ISA, and Callgrind has to rely on heuristics).</para>
</listitem>
<listitem>
- <para><option>--tree=both</option>: Interleaved into the
- ordered list of function, show the callers and the callees
+ <para><option>--tree=both</option>: Interleave into the
+ top level list of functions, information on the callers and the callees
of each function. In these lines, which represents executed
calls, the cost gives the number of events spent in the call.
- Indented, above each given function, there is the list of callers,
+ Indented, above each function, there is the list of callers,
and below, the list of callees. The sum of events in calls to
a given function (caller lines), as well as the sum of events in
calls from the function (callee lines) together with the self
@@ -154,13 +167,15 @@ in the PPC ISA, and Callgrind has to rely on heuristics).</para>
for all relevant functions for which the source can be found. In
addition to source annotation as produced by
<computeroutput>cg_annotate</computeroutput>, you will see the
- annotated call sites with call counts. For all other options, look
- up the manual for <computeroutput>cg_annotate</computeroutput>.
+ annotated call sites with call counts. For all other options,
+ consult the (Cachegrind) documentation for
+ <computeroutput>cg_annotate</computeroutput>.
</para>
<para>For better call graph browsing experience, it is highly recommended
- to use <ulink url="&cl-gui;">KCachegrind</ulink>. If your code happens
- to spent relevant fractions of cost in <emphasis>cycles</emphasis> (sets
+ to use <ulink url="&cl-gui;">KCachegrind</ulink>.
+ If your code
+ has a significant fraction of its cost in <emphasis>cycles</emphasis> (sets
of functions calling each other in a recursive manner), you have to
use KCachegrind, as <computeroutput>callgrind_annotate</computeroutput>
currently does not do any cycle detection, which is important to get correct
@@ -175,19 +190,20 @@ in the PPC ISA, and Callgrind has to rely on heuristics).</para>
<para>If the program section you want to profile is somewhere in the
middle of the run, it is beneficial to
<emphasis>fast forward</emphasis> to this section without any
- profiling at all, and switch profiling on later. This is achieved by using
+ profiling, and then switch on profiling. This is achieved by using
+ the command line option
<option><xref linkend="opt.instr-atstart"/>=no</option>
- and interactively use
- <computeroutput>callgrind_control -i on</computeroutput> before the
- interesting code section is about to be executed. To exactly specify
+ and running, in a shell,
+ <computeroutput>callgrind_control -i on</computeroutput> just before the
+ interesting code section is executed. To exactly specify
the code position where profiling should start, use the client request
<computeroutput>CALLGRIND_START_INSTRUMENTATION</computeroutput>.</para>
- <para>If you want to be able to see assembler annotation, specify
+ <para>If you want to be able to see assembly code level annotation, specify
<option><xref linkend="opt.dump-instr"/>=yes</option>. This will produce
profile data at instruction granularity. Note that the resulting profile
data
- can only be viewed with KCachegrind. For assembler annotation, it also is
+ can only be viewed with KCachegrind. For assembly annotation, it also is
interesting to see more details of the control flow inside of functions,
ie. (conditional) jumps. This will be collected by further specifying
<option><xref linkend="opt.collect-jumps"/>=yes</option>.</para>
@@ -203,11 +219,11 @@ in the PPC ISA, and Callgrind has to rely on heuristics).</para>
xreflabel="Multiple dumps from one program run">
<title>Multiple profiling dumps from one program run</title>
- <para>Often, you are not interested in characteristics of a full
- program run, but only of a small part of it (e.g. execution of one
- algorithm). If there are multiple algorithms or one algorithm
- running with different input data, it's even useful to get different
- profile information for multiple parts of one program run.</para>
+ <para>Sometimes you are not interested in characteristics of a full
+ program run, but only of a small part of it, for example execution of one
+ algorithm. If there are multiple algorithms, or one algorithm
+ running with different input data, it may even be useful to get different
+ profile information for different parts of a single program run.</para>
<para>Profile data files have names of the form
<screen>
@@ -233,7 +249,7 @@ callgrind.out.<emphasis>pid</emphasis>.<emphasis>part</emphasis>-<emphasis>threa
<listitem>
<para><command>Dump on program termination.</command>
This method is the standard way and doesn't need any special
- action from your side.</para>
+ action on your part.</para>
</listitem>
<listitem>
@@ -245,7 +261,7 @@ callgrind.out.<emphasis>pid</emphasis>.<emphasis>part</emphasis>-<emphasis>threa
distinguish profile dumps. The control program will not terminate
before the dump is completely written. Note that the application
must be actively running for detection of the dump command. So,
- for a GUI application, resize the window or for a server send a
+ for a GUI application, resize the window, or for a server, send a
request.</para>
<para>If you are using <ulink url="&cl-gui;">KCachegrind</ulink>
for browsing of profile information, you can use the toolbar
@@ -348,7 +364,7 @@ callgrind.out.<emphasis>pid</emphasis>.<emphasis>part</emphasis>-<emphasis>threa
probably leading to many <emphasis>cold misses</emphasis>
which would not have happened in reality. If you do not want to see these,
start event collection a few million instructions after you have switched
- on instrumentation</para>.
+ on instrumentation.</para>
</sect2>
@@ -358,14 +374,21 @@ callgrind.out.<emphasis>pid</emphasis>.<emphasis>part</emphasis>-<emphasis>threa
<sect2 id="cl-manual.cycles" xreflabel="Avoiding cycles">
<title>Avoiding cycles</title>
- <para>Each group of functions with any two of them happening to have a
- call chain from one to the other, is called a cycle. For example,
- with A calling B, B calling C, and C calling A, the three functions
- A,B,C build up one cycle.</para>
+ <para>Informally speaking, a cycle is a group of functions which
+ call each other in a recursive way.</para>
+
+ <para>Formally speaking, a cycle is a nonempty set S of functions,
+ such that for every pair of functions F and G in S, it is possible
+ to call from F to G (possibly via intermediate functions) and also
+ from G to F. Furthermore, S must be maximal -- that is, be the
+ largest set of functions satisfying this property. For example, if
+ a third function H is called from inside S and calls back into S,
+ then H is also part of the cycle and should be included in S.</para>
- <para>If a call chain goes multiple times around inside of a cycle,
+ <para>If a call chain goes multiple times around inside a cycle,
with profiling, you can not distinguish event counts coming from the
- first round or the second. Thus, it makes no sense to attach any inclusive
+ first, second or subsequent rounds.
+ Thus, it makes no sense to attach any inclusive
cost to a call among functions inside of one cycle.
If "A &gt; B" appears multiple times in a call chain, you
have no way to partition the one big sum of all appearances of "A &gt;
@@ -383,11 +406,12 @@ callgrind.out.<emphasis>pid</emphasis>.<emphasis>part</emphasis>-<emphasis>threa
functions.</para>
<para>There is an option to ignore calls to a function with
- <option><xref linkend="opt.fn-skip"/>=funcprefix</option>. E.g., you
+ <option><xref linkend="opt.fn-skip"/>=funcprefix</option>. For
+   example, you
usually do not want to see the trampoline functions in the PLT sections
for calls to functions in shared libraries. You can see the difference
if you profile with <option><xref linkend="opt.skip-plt"/>=no</option>.
- If a call is ignored, cost events happening will be attached to the
+ If a call is ignored, its cost events will be propagated to the
enclosing function.</para>
<para>If you have a recursive function, you can distinguish the first
@@ -468,9 +492,10 @@ These options influence the name and format of the profile data files.
<computeroutput>.&lt;pid&gt;</computeroutput> is appended to the
base dump file name with
<computeroutput>&lt;pid&gt;</computeroutput> being the process ID
- of the profile run (with multiple dumps happening, the file name
- is modified further; see below).</para> <para>This option is
- especially usefull if your application changes its working
+ of the profiled program. When multiple dumps are made, the file name
+ is modified further; see below.</para>
+ <para>This option is
+ especially useful if your application changes its working
directory. Usually, the dump file is generated in the current
working directory of the application at program termination. By
giving an absolute path with the base specification, you can force
@@ -485,8 +510,9 @@ These options influence the name and format of the profile data files.
<listitem>
<para>This specifies that event counting should be performed at
per-instruction granularity.
- This allows for assembler code
- annotation, but currently the results can only be shown with KCachegrind.</para>
+ This allows for assembly code
+ annotation. Currently the results can only be
+ displayed by KCachegrind.</para>
</listitem>
</varlistentry>
@@ -508,11 +534,9 @@ These options influence the name and format of the profile data files.
<listitem>
<para>This option influences the output format of the profile data.
It specifies whether strings (file and function names) should be
- identified by numbers. This shrinks the file size, but makes it more difficult
- for humans to read (which is not recommand either way).</para>
- <para>However, this currently has to be switched off if
- the files are to be read by
- <computeroutput>callgrind_annotate</computeroutput>!</para>
+ identified by numbers. This shrinks the file,
+ but makes it more difficult
+ for humans to read (which is not recommended in any case).</para>
</listitem>
</varlistentry>
@@ -525,9 +549,6 @@ These options influence the name and format of the profile data files.
It specifies whether numerical positions are always specified as absolute
values or are allowed to be relative to previous numbers.
This shrinks the file size,</para>
- <para>However, this currently has to be switched off if
- the files are to be read by
- <computeroutput>callgrind_annotate</computeroutput>!</para>
</listitem>
</varlistentry>
@@ -538,7 +559,7 @@ These options influence the name and format of the profile data files.
<listitem>
<para>When multiple profile data parts are to be generated, these
parts are appended to the same output file if this option is set to
- "yes". Not recommand.</para>
+ "yes". Not recommended.</para>
</listitem>
</varlistentry>
@@ -690,7 +711,7 @@ Also see <xref linkend="cl-manual.limits"/>.</para>
</listitem>
</varlistentry>
- <varlistentry id="opt.collect-jumps" xreflabel="--collect-jumps=">
+ <varlistentry id="opt.collect-jumps" xreflabel="--collect-jumps">
<term>
<option><![CDATA[--collect-jumps=<no|yes> [default: no] ]]></option>
</term>
@@ -712,9 +733,9 @@ Also see <xref linkend="cl-manual.limits"/>.</para>
<para>
These options specify how event counts should be attributed to execution
contexts.
-More specifically, they specify e.g. if the recursion level or the
-call chain leading to a function should be accounted for, and whether the
-thread ID should be remembered.
+For example, they specify whether the recursion level or the
+call chain leading to a function should be taken into account,
+and whether the thread ID should be considered.
Also see <xref linkend="cl-manual.cycles"/>.</para>
<variablelist id="cmd-options.separation">
@@ -735,7 +756,7 @@ Also see <xref linkend="cl-manual.cycles"/>.</para>
<option><![CDATA[--fn-recursion=<level> [default: 2] ]]></option>
</term>
<listitem>
- <para>Separate function recursions, maximal &lt;level&gt;.
+ <para>Separate function recursions by at most &lt;level&gt; levels.
See <xref linkend="cl-manual.cycles"/>.</para>
</listitem>
</varlistentry>
@@ -745,7 +766,7 @@ Also see <xref linkend="cl-manual.cycles"/>.</para>
<option><![CDATA[--fn-caller=<callers> [default: 0] ]]></option>
</term>
<listitem>
- <para>Separate contexts by maximal &lt;callers&gt; functions in the
+ <para>Separate contexts by at most &lt;callers&gt; functions in the
call chain. See <xref linkend="cl-manual.cycles"/>.</para>
</listitem>
</varlistentry>
@@ -768,7 +789,8 @@ Also see <xref linkend="cl-manual.cycles"/>.</para>
call chain A &gt; B &gt; C, and you specify function B to be
ignored, you will only see A &gt; C.</para>
<para>This is very convenient to skip functions handling callback
- behaviour. E.g. for the SIGNAL/SLOT mechanism in QT, you only want
+     behaviour. For example, with the signal/slot mechanism in the
+     Qt toolkit, you only want
to see the function emitting a signal to call the slots connected
to that signal. First, determine the real call chain to see the
functions needed to be skipped, then use this option.</para>
@@ -781,7 +803,7 @@ Also see <xref linkend="cl-manual.cycles"/>.</para>
</term>
<listitem>
<para>Put a function into a separate group. This influences the
- context name for cycle avoidance. All functions inside of such a
+ context name for cycle avoidance. All functions inside such a
group are treated as being the same for context name building, which
resembles the call chain leading to a context. By specifying function
groups with this option, you can shorten the context name, as functions
diff --git a/docs/internals/3_2_BUGSTATUS.txt b/docs/internals/3_2_BUGSTATUS.txt
index 874fb6704..f6e0c3567 100644
--- a/docs/internals/3_2_BUGSTATUS.txt
+++ b/docs/internals/3_2_BUGSTATUS.txt
@@ -42,7 +42,7 @@ r6593 r6711 32 139363 callgrind: fix --collect-systime=yes
r6601 r6712 32 n-i-bz callgrind: Fix threads display
of "callgrind_control -s"
-r6734 pending n-i-nz Callgrind: improve documentation
+r6734 r6740 32 n-i-nz Callgrind: improve documentation
r6622 r6713 32 n-i-bz .eh_frame crud for m_trampoline.S fns
@@ -112,6 +112,7 @@ vx1759,r6722
XXX 143924: --db-attach=yes and --trace-children=yes
+pending r6743 32 n-i-bz Documentation overhaul
//// maybe do not fix in 3.2 branch
diff --git a/docs/xml/FAQ.xml b/docs/xml/FAQ.xml
index 720b64702..9ff0626e9 100644
--- a/docs/xml/FAQ.xml
+++ b/docs/xml/FAQ.xml
@@ -131,9 +131,9 @@ collect2: ld returned 1 exit status
<para>Problem is that running <literal>__libc_freeres()</literal> in
older glibc versions causes this crash.</para>
- <para>WORKAROUND FOR 1.1.X and later versions of Valgrind: use the
+ <para>Workaround for 1.1.X and later versions of Valgrind: use the
<option>--run-libc-freeres=no</option> flag. You may then get space
- leak reports for glibc-allocations (please _don't_ report these to
+ leak reports for glibc allocations (please don't report these to
the glibc people, since they are not real leaks), but at least the
program runs.</para>
</answer>
@@ -142,14 +142,14 @@ collect2: ld returned 1 exit status
<qandaentry id="faq.bugdeath">
<question id="q-bugdeath">
<para>My (buggy) program dies like this:</para>
-<screen>% valgrind: vg_malloc2.c:442 (bszW_to_pszW): Assertion 'pszW >= 0' failed.</screen>
+<screen>valgrind: m_mallocfree.c:442 (bszW_to_pszW): Assertion 'pszW >= 0' failed.</screen>
</question>
<answer id="a-bugdeath">
<para>If Memcheck (the memory checker) shows any invalid reads,
- invalid writes and invalid frees in your program, the above may
+ invalid writes or invalid frees in your program, the above may
happen. Reason is that your program may trash Valgrind's low-level
memory manager, which then dies with the above assertion, or
- something like this. The cure is to fix your program so that it
+ something similar. The cure is to fix your program so that it
doesn't do any illegal memory accesses. The above failure will
hopefully go away after that.</para>
</answer>
@@ -159,21 +159,18 @@ collect2: ld returned 1 exit status
<question id="q-msgdeath">
<para>My program dies, printing a message like this along the
way:</para>
-<screen>% disInstr: unhandled instruction bytes: 0x66 0xF 0x2E 0x5</screen>
+<screen>vex x86->IR: unhandled instruction bytes: 0x66 0xF 0x2E 0x5</screen>
</question>
<answer id="a-msgdeath">
- <para>Older versions did not support some x86 instructions,
- particularly SSE/SSE2 instructions. Try a newer Valgrind; we now
- support almost all instructions. If it still happens with newer
- versions, if the failing instruction is an SSE/SSE2 instruction, you
- might be able to recompile your program without it by using the flag
- <option>-march</option> to gcc. Either way, let us know and we'll
- try to fix it.</para>
+ <para>Older versions did not support some x86 and amd64 instructions,
+ particularly SSE/SSE2/SSE3 instructions. Try a newer Valgrind; we now
+ support almost all instructions. If it still breaks, file a bug
+ report.</para>
<para>Another possibility is that your program has a bug and
erroneously jumps to a non-code address, in which case you'll get a
SIGILL signal. Memcheck may issue a warning just before
- this happens, but they might not if the jump happens to land in
+ this happens, but it might not if the jump happens to land in
addressable memory.</para>
</answer>
</qandaentry>
@@ -189,9 +186,10 @@ collect2: ld returned 1 exit status
none of the generated code is later overwritten by other generated
code. If this happens, though, things will go wrong as Valgrind
will continue running its translations of the old code (this is true
- on x86 and AMD64, on PPC32 there are explicit cache flush
- instructions which Valgrind detects). You should try running with
- <option>--smc-check=all</option> in this case; Valgrind will run
+    on x86 and amd64; on PowerPC there are explicit cache flush
+ instructions which Valgrind detects and honours).
+ You should try running with
+ <option>--smc-check=all</option> in this case. Valgrind will run
much more slowly, but should detect the use of the out-of-date
code.</para>
@@ -243,7 +241,7 @@ collect2: ld returned 1 exit status
<itemizedlist>
<listitem>
<para>With gcc 2.91, 2.95, 3.0 and 3.1, compile all source using
- the STL with <literal>-D__USE_MALLOC</literal>. Beware! This is
+ the STL with <literal>-D__USE_MALLOC</literal>. Beware! This was
removed from gcc starting with version 3.3.</para>
</listitem>
<listitem>
@@ -262,22 +260,14 @@ collect2: ld returned 1 exit status
portable, but should work for gcc) or even writing your own memory
allocators. But all this goes beyond the scope of this FAQ. Start
by reading
- <ulink url="http://gcc.gnu.org/onlinedocs/libstdc++/ext/howto.html#3">
- http://gcc.gnu.org/onlinedocs/libstdc++/ext/howto.html#3</ulink> if
- you absolutely want to do that. But beware:</para>
-
- <orderedlist>
- <listitem>
- <para>there are currently changes underway for gcc which are not
- totally reflected in the docs right now ("now" == 26 Apr 03)</para>
- </listitem>
- <listitem>
- <para>allocators belong to the more messy parts of the STL and
- people went to great lengths to make it portable across
- platforms. Chances are good that your solution will work on your
- platform, but not on others.</para>
- </listitem>
- </orderedlist>
+ <ulink
+ url="http://gcc.gnu.org/onlinedocs/libstdc++/faq/index.html#4_4_leak">
+ http://gcc.gnu.org/onlinedocs/libstdc++/faq/index.html#4_4_leak</ulink> if
+ you absolutely want to do that. But beware:
+   allocators belong to the messier parts of the STL and
+ people went to great lengths to make the STL portable across
+ platforms. Chances are good that your solution will work on your
+ platform, but not on others.</para>
</answer>
</qandaentry>
@@ -407,7 +397,7 @@ Invalid write of size 1
<para>If you are tracing large trees of processes, it can be less
disruptive to have the output sent over the network. Give Valgrind
the flag <option>--log-socket=127.0.0.1:12345</option> (if you want
- logging output sent to <literal>port 12345</literal> on
+ logging output sent to port <literal>12345</literal> on
<literal>localhost</literal>). You can use the valgrind-listener
program to listen on that port:</para>
<programlisting>
@@ -476,7 +466,7 @@ int main(void)
<para>If you really want to write suppressions by hand, read the
manual carefully. Note particularly that C++ function names must be
- <literal>_mangled_</literal>.</para>
+ mangled (that is, not demangled).</para>
</answer>
</qandaentry>
diff --git a/docs/xml/manual-core.xml b/docs/xml/manual-core.xml
index 23b3f4a0f..d58788072 100644
--- a/docs/xml/manual-core.xml
+++ b/docs/xml/manual-core.xml
@@ -10,7 +10,7 @@
<para>This section describes the Valgrind core services, flags and
behaviours. That means it is relevant regardless of what particular
tool you are using. A point of terminology: most references to
-"valgrind" in the rest of this section (Section 2) refer to the Valgrind
+"Valgrind" in the rest of this section refer to the Valgrind
core services.</para>
<sect1 id="manual-core.whatdoes"
@@ -31,14 +31,14 @@ memory-checking tool Memcheck, issue the command:</para>
<programlisting><![CDATA[
valgrind --tool=memcheck ls -l]]></programlisting>
-<para>(Memcheck is the default, so if you want to use it you can
-actually omit the <option>--tool</option> flag.</para>
+<para>Memcheck is the default, so if you want to use it you can
+omit the <option>--tool</option> flag.</para>
<para>Regardless of which tool is in use, Valgrind takes control of your
program before it starts. Debugging information is read from the
executable and associated libraries, so that error messages and other
-outputs can be phrased in terms of source code locations (if that is
-appropriate).</para>
+outputs can be phrased in terms of source code locations, when
+appropriate.</para>
<para>Your program is then run on a synthetic CPU provided by the
Valgrind core. As new code is executed for the first time, the core
@@ -49,10 +49,11 @@ code.</para>
<para>The amount of instrumentation code added varies widely between
tools. At one end of the scale, Memcheck adds code to check every
-memory access and every value computed, increasing the size of the code
-at least 12 times, and making it run 25-50 times slower than natively.
+memory access and every value computed,
+making it run 10-50 times slower than natively.
At the other end of the spectrum, the ultra-trivial "none" tool
-(a.k.a. Nulgrind) adds no instrumentation at all and causes in total
+(also referred to as Nulgrind) adds no instrumentation at all
+and causes in total
"only" about a 4 times slowdown.</para>
<para>Valgrind simulates every single instruction your program executes.
@@ -62,17 +63,18 @@ in your application but also in all supporting dynamically-linked
GNU C library, the X client libraries, Qt, if you work with KDE, and so
on.</para>
-<para>If you're using one of the error-detection tools, Valgrind will
-often detect errors in libraries, for example the GNU C or X11
+<para>If you're using an error-detection tool, Valgrind may
+detect errors in libraries, for example the GNU C or X11
libraries, which you have to use. You might not be interested in these
errors, since you probably have no control over that code. Therefore,
Valgrind allows you to selectively suppress errors, by recording them in
a suppressions file which is read when Valgrind starts up. The build
mechanism attempts to select suppressions which give reasonable
-behaviour for the libc and XFree86 versions detected on your machine.
+behaviour for the C library
+and X11 client library versions detected on your machine.
To make it easier to write suppressions, you can use the
-<option>--gen-suppressions=yes</option> option which tells Valgrind to
-print out a suppression for each error that appears, which you can then
+<option>--gen-suppressions=yes</option> option. This tells Valgrind to
+print out a suppression for each reported error, which you can then
copy into a suppressions file.</para>
<para>Different error-checking tools report different kinds of errors.
@@ -90,13 +92,13 @@ your application and supporting libraries with debugging info enabled
(the <option>-g</option> flag). Without debugging info, the best
Valgrind tools will be able to do is guess which function a particular
piece of code belongs to, which makes both error messages and profiling
-output nearly useless. With <option>-g</option>, you'll hopefully get
+output nearly useless. With <option>-g</option>, you'll get
messages which point directly to the relevant source code lines.</para>
<para>Another flag you might like to consider, if you are working with
C++, is <option>-fno-inline</option>. That makes it easier to see the
function-call chain, which can help reduce confusion when navigating
-around large C++ apps. For whatever it's worth, debugging
+around large C++ apps. For example, debugging
OpenOffice.org with Memcheck is a bit easier when using this flag. You
don't have to do this, but doing so helps Valgrind produce more accurate
and less confusing error reports. Chances are you're set up like this
@@ -110,17 +112,18 @@ wrongly reporting uninitialised value errors. We have looked in detail
into fixing this, and unfortunately the result is that doing so would
give a further significant slowdown in what is already a slow tool. So
the best solution is to turn off optimisation altogether. Since this
-often makes things unmanagably slow, a plausible compromise is to use
+often makes things unmanageably slow, a reasonable compromise is to use
<computeroutput>-O</computeroutput>. This gets you the majority of the
benefits of higher optimisation levels whilst keeping relatively small
the chances of false complaints from Memcheck. All other tools (as far
as we know) are unaffected by optimisation level.</para>
<para>Valgrind understands both the older "stabs" debugging format, used
-by gcc versions prior to 3.1, and the newer DWARF2 format used by gcc
-3.1 and later. We continue to refine and debug our debug-info readers,
+by gcc versions prior to 3.1, and the newer DWARF2 and DWARF3 formats
+used by gcc
+3.1 and later. We continue to develop our debug-info readers,
although the majority of effort will naturally enough go into the newer
-DWARF2 reader.</para>
+DWARF2/3 reader.</para>
<para>When you're ready to roll, just run your application as you
would normally, but place
@@ -175,7 +178,7 @@ re-run, passing the <option>-v</option> flag to Valgrind. A second
<option>--log-fd=9</option>.</para>
<para>This is the simplest and most common arrangement, but can
- cause problems when valgrinding entire trees of processes which
+ cause problems when Valgrinding entire trees of processes which
expect specific file descriptors, particularly stdin/stdout/stderr,
to be available for their own use.</para>
</listitem>
@@ -187,7 +190,7 @@ re-run, passing the <option>-v</option> flag to Valgrind. A second
commentary is <command>not</command> written to the file you
specify, but instead to one called
<filename>filename.12345</filename>, if for example the pid of the
- traced process is 12345. This is helpful when valgrinding a whole
+ traced process is 12345. This is helpful when Valgrinding a whole
tree of processes at once, since it means that each process writes
to its own logfile, rather than the result being jumbled up in one
big logfile. If <filename>filename.12345</filename> already exists,
@@ -199,12 +202,12 @@ re-run, passing the <option>-v</option> flag to Valgrind. A second
instead use <option>--log-file-exactly=filename</option>.</para>
<para>You can also use the
- <option>--log-file-qualifier=&lt;VAR&gt;</option> option to modify
- the filename via according to the environment variable
+ <option>--log-file-qualifier=&lt;VAR&gt;</option> option to
+ incorporate into the filename the contents of environment variable
<varname>VAR</varname>. This is rarely needed, but very useful in
certain circumstances (eg. when running MPI programs). In this
case, the trailing <computeroutput>.12345</computeroutput> part is
- replaced by the contents of <varname>$VAR</varname>. The idea is
+ replaced by (the contents of) <varname>$VAR</varname>. The idea is
that you specify a variable which will be set differently for each
process in the job, for example
<computeroutput>BPROC_RANK</computeroutput> or whatever is
@@ -216,7 +219,8 @@ re-run, passing the <option>-v</option> flag to Valgrind. A second
least intrusive option is to send the commentary to a network
socket. The socket is specified as an IP address and port number
pair, like this: <option>--log-socket=192.168.0.1:12345</option> if
- you want to send the output to host IP 192.168.0.1 port 12345 (I
+ you want to send the output to host IP 192.168.0.1 port 12345
+ (note: we
have no idea if 12345 is a port of pre-existing significance). You
can also omit the port number:
<option>--log-socket=192.168.0.1</option>, in which case a default
@@ -227,7 +231,7 @@ re-run, passing the <option>-v</option> flag to Valgrind. A second
<para>Note, unfortunately, that you have to use an IP address here,
rather than a hostname.</para>
- <para>Writing to a network socket is pretty useless if you don't
+ <para>Writing to a network socket is pointless if you don't
have something listening at the other end. We provide a simple
listener program,
<computeroutput>valgrind-listener</computeroutput>, which accepts
@@ -237,7 +241,7 @@ re-run, passing the <option>-v</option> flag to Valgrind. A second
listeners in the fullness of time.</para>
<para>valgrind-listener can accept simultaneous connections from up
- to 50 valgrinded processes. In front of each line of output it
+ to 50 Valgrinded processes. In front of each line of output it
prints the current number of active connections in round
brackets.</para>
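The listener's line-prefixing behaviour can be sketched in Python. This is an illustrative model only, not valgrind-listener's actual source; the `Listener` class and its method names are invented here.

```python
import socket
import threading

MAX_CONNECTIONS = 50   # the documented connection limit

class Listener:
    def __init__(self, host="127.0.0.1", port=0):
        self.active = 0                 # current number of connections
        self.lock = threading.Lock()
        self.output = []                # lines as they would be printed
        self.sock = socket.socket()
        self.sock.bind((host, port))    # port 0 = pick any free port
        self.sock.listen(MAX_CONNECTIONS)
        self.port = self.sock.getsockname()[1]

    def serve_one(self):
        # Accept a single connection and prefix each received line with
        # the active-connection count in round brackets.
        conn, _ = self.sock.accept()
        with self.lock:
            self.active += 1
        with conn, conn.makefile() as lines:
            for line in lines:
                with self.lock:
                    self.output.append(f"({self.active}) {line.rstrip()}")
        with self.lock:
            self.active -= 1
```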
@@ -258,7 +262,7 @@ re-run, passing the <option>-v</option> flag to Valgrind. A second
</listitem>
</itemizedlist>
- <para>If a valgrinded process fails to connect to a listener, for
+ <para>If a Valgrinded process fails to connect to a listener, for
whatever reason (the listener isn't running, invalid or unreachable
host or port, etc), Valgrind switches back to writing the commentary
to stderr. The same goes for any process which loses an established
@@ -285,9 +289,9 @@ further processing, which is why we have chosen this arrangement.</para>
<sect1 id="manual-core.report" xreflabel="Reporting of errors">
<title>Reporting of errors</title>
-<para>When one of the error-checking tools (Memcheck,
-Helgrind) detects something bad happening in the program, an error
-message is written to the commentary. For example:</para>
+<para>When an error-checking tool
+detects something bad happening in the program, an error
+message is written to the commentary. Here's an example from Memcheck:</para>
<programlisting><![CDATA[
==25832== Invalid read of size 4
@@ -297,7 +301,7 @@ message is written to the commentary. For example:</para>
<para>This message says that the program did an illegal 4-byte read of
address 0xBFFFF74C, which, as far as Memcheck can tell, is not a valid
-stack address, nor corresponds to any currently malloc'd or free'd
+stack address, nor does it correspond to any currently malloc'd or free'd
blocks. The read is happening at line 45 of
<filename>bogon.cpp</filename>, called from line 66 of the same file,
etc. For errors associated with an identified malloc'd/free'd block,
@@ -317,7 +321,7 @@ counts. This makes it easy to see which errors have occurred most
frequently.</para>
<para>Errors are reported before the associated operation actually
-happens. If you're using a tool (Memcheck) which does
+happens. If you're using a tool (eg. Memcheck) which does
address checking, and your program attempts to read from address zero,
the tool will emit a message to this effect, and the program will then
duly die with a segmentation fault.</para>
@@ -333,10 +337,10 @@ root cause of the problem.</para>
expensive one and can become a significant performance overhead
if your program generates huge quantities of errors. To avoid
serious problems, Valgrind will simply stop collecting
-errors after 1000 different errors have been seen, or 10000000 errors
+errors after 1,000 different errors have been seen, or 10,000,000 errors
in total have been seen. In this situation you might as well
stop your program and fix it, because Valgrind won't tell you
-anything else useful after this. Note that the 1000/10000000 limits
+anything else useful after this. Note that the 1,000/10,000,000 limits
apply after suppressed errors are removed. These limits are
defined in <filename>m_errormgr.c</filename> and can be increased
if necessary.</para>
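The two limits interact as sketched below — a Python model with invented names; the real logic lives in m_errormgr.c and is C code.

```python
# Sketch of Valgrind's error-collection limits: collection stops after
# 1,000 distinct errors or 10,000,000 errors in total, counted after
# suppressed errors are removed.
MAX_DISTINCT_ERRORS = 1_000
MAX_TOTAL_ERRORS = 10_000_000

class ErrorManager:
    def __init__(self, suppressed=()):
        self.suppressed = set(suppressed)
        self.counts = {}          # distinct error -> occurrence count
        self.total = 0
        self.collecting = True

    def report(self, error):
        """Record one error; returns False once collection has stopped."""
        if error in self.suppressed or not self.collecting:
            return False
        self.counts[error] = self.counts.get(error, 0) + 1
        self.total += 1
        if (len(self.counts) >= MAX_DISTINCT_ERRORS
                or self.total >= MAX_TOTAL_ERRORS):
            self.collecting = False   # "won't tell you anything else useful"
        return True
```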
@@ -353,11 +357,11 @@ since it may have a bad effect on performance.</para>
<title>Suppressing errors</title>
<para>The error-checking tools detect numerous problems in the base
-libraries, such as the GNU C library, and the XFree86 client libraries,
+libraries, such as the GNU C library, and the X11 client libraries,
which come pre-installed on your GNU/Linux system. You can't easily fix
these, but you don't want to see these errors (and yes, there are many!)
So Valgrind reads a list of errors to suppress at startup. A default
-suppression file is cooked up by the
+suppression file is created by the
<computeroutput>./configure</computeroutput> script when the system is
built.</para>
@@ -381,7 +385,7 @@ specification of errors to suppress.</para>
<para>If you use the <option>-v</option> flag, at the end of execution,
Valgrind prints out one line for each used suppression, giving its name
and the number of times it got used. Here's the suppressions used by a
-run of <computeroutput>valgrind --tool=memcheck ls l</computeroutput>:</para>
+run of <computeroutput>valgrind --tool=memcheck ls -l</computeroutput>:</para>
<programlisting><![CDATA[
--27579-- supp: 1 socketcall.connect(serv_addr)/__libc_connect/__nscd_getgrgid_r
@@ -396,7 +400,7 @@ ask to add suppressions from another file, by specifying
<para>If you want to understand more about suppressions, look at an
existing suppressions file whilst reading the following documentation.
-The file <filename>glibc-2.2.supp</filename>, in the source
+The file <filename>glibc-2.3.supp</filename>, in the source
distribution, provides some good examples.</para>
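For orientation before the component-by-component description, this is the overall shape of one entry (the suppression name and the locations below are made up for illustration):

```
{
   hypothetical-X11-suppression-name
   Memcheck:Value4
   fun:badly_behaved_function
   obj:/usr/X11R6/lib/libX11.so.6.2
   fun:main
}
```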
<para>Each suppression has the following components:</para>
@@ -427,12 +431,6 @@ tool_name1,tool_name2:suppression_name]]></programlisting>
any suppression directed to it. Tools ignore suppressions which are
not directed to them. As a result, it is quite practical to put
suppressions for all tools into the same suppression file.</para>
-
- <para>Valgrind's core can detect certain PThreads API errors, for
- which this line reads:</para>
-
-<programlisting><![CDATA[
-core:PThread]]></programlisting>
</listitem>
<listitem>
@@ -443,8 +441,8 @@ core:PThread]]></programlisting>
<listitem>
<para>Remaining lines: This is the calling context for the error --
- the chain of function calls that led to it. There can be up to
- twenty-four of these lines.</para>
+ the chain of function calls that led to it. There can be up to 24
+ of these lines.</para>
<para>Locations may be either names of shared objects/executables or
wildcards matching function names. They begin
@@ -511,13 +509,12 @@ anywhere in <filename>libX11.so.6.2</filename>, when called from
anywhere in the same library, when called from anywhere in
<filename>libXaw.so.7.0</filename>. The inexact specification of
locations is regrettable, but is about all you can hope for, given that
-the X11 libraries shipped with Red Hat 7.2 have had their symbol tables
-removed.</para>
+the X11 libraries shipped with the Linux distro used for this example
+have had their symbol tables removed.
-<para>Note: since the above two examples did not make it clear, you can
-freely mix the <computeroutput>obj:</computeroutput> and
-<computeroutput>fun:</computeroutput> styles of description within a
-single suppression record.</para>
+<para>Although the above two examples do not make this clear, you can
+freely mix <computeroutput>obj:</computeroutput> and
+<computeroutput>fun:</computeroutput> lines in a suppression.</para>
</sect1>
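To make the matching rules concrete, here is a rough Python model of how a suppression's location lines could be compared against an error's call chain. All helper names are invented, and Valgrind's real matcher is C code with its own wildcard semantics; fnmatch-style globbing stands in for them here.

```python
from fnmatch import fnmatchcase

def frame_matches(pattern, frame):
    # pattern is e.g. "fun:*libc_write" or "obj:/usr/*/libX11.so.6.2";
    # a frame carries the function name and the object it came from.
    kind, _, glob = pattern.partition(":")
    if kind == "fun":
        return fnmatchcase(frame["fun"], glob)
    if kind == "obj":
        return fnmatchcase(frame["obj"], glob)
    return False

def suppression_matches(patterns, call_chain):
    # Innermost frame first; fun: and obj: lines may be freely mixed.
    return (len(patterns) <= len(call_chain) and
            all(frame_matches(p, f) for p, f in zip(patterns, call_chain)))
```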
@@ -631,7 +628,7 @@ categories.</para>
</term>
<listitem>
<para>Run the Valgrind tool called <varname>toolname</varname>,
- e.g. Memcheck, Addrcheck, Cachegrind, etc.</para>
+ e.g. Memcheck, Cachegrind, etc.</para>
</listitem>
</varlistentry>
@@ -713,9 +710,7 @@ categories.</para>
</term>
<listitem>
<para>Just like <option>--log-file</option>, but the suffix
- <computeroutput>".pid"</computeroutput> is not added. If you
- trace multiple processes with Valgrind when using this option the
- log file may get all messed up.</para>
+ <computeroutput>".pid"</computeroutput> is not added.</para>
</listitem>
</varlistentry>
@@ -912,7 +907,7 @@ that can report errors, e.g. Memcheck, but not Cachegrind.</para>
<para>Note that the suppressions printed are as specific as
possible. You may want to common up similar ones, eg. by adding
- wildcards to function names. Also, sometimes two different errors
+ wildcards to function names. Sometimes two different errors
are suppressed by the same suppression, in which case Valgrind
will output the suppression more than once, but you only need to
have one copy in your suppression file (but having more than one
@@ -992,9 +987,9 @@ that can report errors, e.g. Memcheck, but not Cachegrind.</para>
<option><![CDATA[--input-fd=<number> [default: 0, stdin] ]]></option>
</term>
<listitem>
- <para>When using <option>--db-attach=yes</option> and
+ <para>When using <option>--db-attach=yes</option> or
<option>--gen-suppressions=yes</option>, Valgrind will stop so as
- to read keyboard input from you, when each error occurs. By
+ to read keyboard input from you when each error occurs. By
default it reads from the standard input (stdin), which is
problematic for programs which close stdin. This option allows
you to specify an alternative file descriptor from which to read
@@ -1007,7 +1002,7 @@ that can report errors, e.g. Memcheck, but not Cachegrind.</para>
<option><![CDATA[--max-stackframe=<number> [default: 2000000] ]]></option>
</term>
<listitem>
- <para>The maximum size of a stack frame - if the stack pointer moves by
+ <para>The maximum size of a stack frame. If the stack pointer moves by
more than this amount then Valgrind will assume that
the program is switching to a different stack.</para>
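The heuristic itself is simple; a sketch follows, with an invented helper name and the default threshold taken from the option above.

```python
MAX_STACKFRAME = 2_000_000   # the documented default

def is_stack_switch(old_sp, new_sp, limit=MAX_STACKFRAME):
    # A stack-pointer move larger than the threshold is taken to mean
    # the program has switched to a different stack.
    return abs(new_sp - old_sp) > limit
```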
@@ -1028,9 +1023,9 @@ that can report errors, e.g. Memcheck, but not Cachegrind.</para>
the new threshold you should specify.</para>
<para>In general, allocating large structures on the stack is a
- bad idea, because (1) you can easily run out of stack space,
+ bad idea, because you can easily run out of stack space,
especially on systems with limited memory or which expect to
- support large numbers of threads each with a small stack, and (2)
+ support large numbers of threads each with a small stack, and also
because the error checking performed by Memcheck is more effective
for heap-allocated data than for stack-allocated data. If you
have to use this flag, you may wish to consider rewriting your
@@ -1104,9 +1099,9 @@ need to use these.</para>
Memcheck therefore tries to run
<function>__libc_freeres</function> at exit.</para>
- <para>Unfortunately, in some versions of glibc,
+ <para>Unfortunately, in some very old versions of glibc,
<function>__libc_freeres</function> is sufficiently buggy to cause
- segmentation faults. This is particularly noticeable on Red Hat
+ segmentation faults. This was particularly noticeable on Red Hat
7.1. So this flag is provided in order to inhibit the run of
<function>__libc_freeres</function>. If your program seems to run
fine on Valgrind, but segfaults at exit, you may find that
@@ -1186,9 +1181,17 @@ need to use these.</para>
the stack, or detect self-modifying code anywhere. Note that the
default option will catch the vast majority of cases, as far as we
know. Running with <varname>all</varname> will slow Valgrind down
- greatly (but running with <varname>none</varname> will rarely
+ greatly. Running with <varname>none</varname> will rarely
speed things up, since very little code gets put on the stack for
- most programs).</para>
+ most programs.</para>
+
+ <para>Some architectures (including ppc32 and ppc64) require
+ programs which create code at runtime to flush the instruction
+ cache in between code generation and first use. Valgrind
+ observes and honours such instructions. Hence, on ppc32/Linux
+ and ppc64/Linux, Valgrind always provides complete, transparent
+ support for self-modifying code. It is only on x86/Linux
+ and amd64/Linux that you need to use this flag.</para>
</listitem>
</varlistentry>
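Why stale translations matter can be modelled in a few lines of Python. This is a conceptual toy, not Valgrind's translation machinery: `check_before_run` loosely plays the role of <computeroutput>--smc-check=all</computeroutput>, and `discard` that of an icache flush or <computeroutput>VALGRIND_DISCARD_TRANSLATIONS</computeroutput>.

```python
class TranslationCache:
    def __init__(self, check_before_run=False):
        self.cache = {}               # addr -> (code bytes seen, translation)
        self.check = check_before_run

    def run(self, addr, memory):
        code = memory[addr]
        if addr in self.cache:
            cached_code, translation = self.cache[addr]
            if not self.check or cached_code == code:
                return translation    # possibly stale if the code changed!
        translation = f"translated({code})"   # stand-in for real JIT work
        self.cache[addr] = (code, translation)
        return translation

    def discard(self, addr):
        # What an explicit flush/discard request achieves.
        self.cache.pop(addr, None)
```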
@@ -1294,8 +1297,8 @@ are not forced to run your program under Valgrind just because you
use the macros in this file. Also, you are not required to link your
program with any extra supporting libraries.</para>
-<para>The code left in your binary has negligible performance impact:
-on x86, amd64 and ppc32, the overhead is 6 simple integer instructions
+<para>The code added to your binary has negligible performance impact:
+on x86, amd64, ppc32 and ppc64, the overhead is 6 simple integer instructions
and is probably undetectable except in tight loops.
However, if you really wish to compile out the client requests, you can
compile with <computeroutput>-DNVALGRIND</computeroutput> (analogous to
@@ -1319,9 +1322,9 @@ tool-specific macros).</para>
<varlistentry>
<term><command><computeroutput>RUNNING_ON_VALGRIND</computeroutput></command>:</term>
<listitem>
- <para>returns 1 if running on Valgrind, 0 if running on the
- real CPU. If you are running Valgrind on itself, it will return the
- number of layers of Valgrind emulation we're running on.
+ <para>Returns 1 if running on Valgrind, 0 if running on the
+ real CPU. If you are running Valgrind on itself, returns the
+ number of layers of Valgrind emulation you're running on.
</para>
</listitem>
</varlistentry>
@@ -1329,8 +1332,8 @@ tool-specific macros).</para>
<varlistentry>
<term><command><computeroutput>VALGRIND_DISCARD_TRANSLATIONS</computeroutput>:</command></term>
<listitem>
- <para>discard translations of code in the specified address
- range. Useful if you are debugging a JITter or some other
+ <para>Discards translations of code in the specified address
+ range. Useful if you are debugging a JIT compiler or some other
dynamic code generation system. After this call, attempts to
execute code in the invalidated address range will cause
Valgrind to make new translations of that code, which is
@@ -1345,7 +1348,8 @@ tool-specific macros).</para>
once.</para>
<para>
Alternatively, for transparent self-modifying-code support,
- use<computeroutput>--smc-check=all</computeroutput>.
+      use <computeroutput>--smc-check=all</computeroutput>, or run
+ on ppc32/Linux or ppc64/Linux.
</para>
</listitem>
</varlistentry>
@@ -1353,7 +1357,7 @@ tool-specific macros).</para>
<varlistentry>
<term><command><computeroutput>VALGRIND_COUNT_ERRORS</computeroutput>:</command></term>
<listitem>
- <para>returns the number of errors found so far by Valgrind. Can be
+ <para>Returns the number of errors found so far by Valgrind. Can be
useful in test harness code when combined with the
<option>--log-fd=-1</option> option; this runs Valgrind silently,
but the client program can detect when errors occur. Only useful
@@ -1434,7 +1438,7 @@ tool-specific macros).</para>
<varlistentry>
<term><command><computeroutput>VALGRIND_NON_SIMD_CALL[0123]</computeroutput>:</command></term>
<listitem>
- <para>executes a function of 0, 1, 2 or 3 args in the client
+ <para>Executes a function of 0, 1, 2 or 3 args in the client
program on the <emphasis>real</emphasis> CPU, not the virtual
CPU that Valgrind normally runs code on. These are used in
various ways internally to Valgrind. They might be useful to
@@ -1467,7 +1471,7 @@ tool-specific macros).</para>
<varlistentry>
<term><command><computeroutput>VALGRIND_STACK_REGISTER(start, end)</computeroutput>:</command></term>
<listitem>
- <para>Register a new stack. Informs Valgrind that the memory range
+ <para>Registers a new stack. Informs Valgrind that the memory range
between start and end is a unique stack. Returns a stack identifier
that can be used with other
<computeroutput>VALGRIND_STACK_*</computeroutput> calls.</para>
@@ -1482,7 +1486,7 @@ tool-specific macros).</para>
<varlistentry>
<term><command><computeroutput>VALGRIND_STACK_DEREGISTER(id)</computeroutput>:</command></term>
<listitem>
- <para>Deregister a previously registered stack. Informs
+ <para>Deregisters a previously registered stack. Informs
Valgrind that previously registered memory range with stack id
<computeroutput>id</computeroutput> is no longer a stack.</para>
</listitem>
@@ -1491,7 +1495,7 @@ tool-specific macros).</para>
<varlistentry>
<term><command><computeroutput>VALGRIND_STACK_CHANGE(id, start, end)</computeroutput>:</command></term>
<listitem>
- <para>Change a previously registered stack. Informs
+ <para>Changes a previously registered stack. Informs
-      Valgrind that the previously registerer stack with stack id
-      <computeroutput>id</computeroutput> has changed it's start and end
+      Valgrind that the previously registered stack with stack id
+      <computeroutput>id</computeroutput> has changed its start and end
values. Use this if your user-level thread package implements
@@ -1514,11 +1518,11 @@ in your client if you include a tool-specific header.</para>
<title>Support for Threads</title>
<para>Valgrind supports programs which use POSIX pthreads.
-Getting this to work was technically challenging but it all works
+Getting this to work was technically challenging but it now works
well enough for significant threaded applications to work.</para>
<para>The main thing to point out is that although Valgrind works
-with the built-in threads system (eg. NPTL or LinuxThreads), it
+with the standard Linux threads library (eg. NPTL or LinuxThreads), it
serialises execution so that only one thread is running at a time. This
approach avoids the horrible implementation problems of implementing a
truly multiprocessor version of Valgrind, but it does mean that threaded
@@ -1527,7 +1531,7 @@ machine.</para>
<para>Valgrind schedules your program's threads in a round-robin fashion,
with all threads having equal priority. It switches threads
-every 50000 basic blocks (on x86, typically around 300000
+every 100000 basic blocks (on x86, typically around 600000
instructions), which means you'll get a much finer interleaving
of thread executions than when run natively. This in itself may
cause your program to behave differently if you have some kind of
@@ -1539,8 +1543,8 @@ will work. In particular, synchonisation of processes via shared-memory
segments will not work. This relies on special atomic instruction sequences
which Valgrind does not emulate in a way which works between processes.
Unfortunately there's no way for Valgrind to warn when this is happening,
-and such calls will mostly work; it's only when there's a race that
-it will fail.
+and such calls will mostly work. Only when there's a race will
+it fail.
</para>
<para>Valgrind also supports direct use of the
@@ -1559,7 +1563,7 @@ memory between processes will not work reliably.
<title>Handling of Signals</title>
<para>Valgrind has a fairly complete signal implementation. It should be
-able to cope with any valid use of signals.</para>
+able to cope with any POSIX-compliant use of signals.</para>
<para>If you're using signals in clever ways (for example, catching
SIGSEGV, modifying page state and restarting the instruction), you're
@@ -1575,7 +1579,7 @@ similar. (Note: it will not generate a core if your core dump size limit is
0.) At the time of writing the core dumps do not include all the floating
point register information.</para>
-<para>If Valgrind itself crashes (hopefully not) the operating system
+<para>In the unlikely event that Valgrind itself crashes, the operating system
will create a core dump in the usual way.</para>
</sect1>
@@ -1591,7 +1595,7 @@ supported targets. In function wrapping, calls to some specified
function are intercepted and rerouted to a different, user-supplied
function. This can do whatever it likes, typically examining the
arguments, calling onwards to the original, and possibly examining the
-result. Any number of different functions may be wrapped.</para>
+result. Any number of functions may be wrapped.</para>
<para>
Function wrapping is useful for instrumenting an API in some way. For
@@ -1653,7 +1657,7 @@ an ELF shared object with an empty
("<computeroutput>NONE</computeroutput>") soname field. The specification
mechanism is powerful in
that wildcards are allowed for both sonames and function names.
-The fine details are discussed below.</para>
+The details are discussed below.</para>
<para><computeroutput>VALGRIND_GET_ORIG_FN</computeroutput>:
once in the wrapper, the first priority is
@@ -1694,7 +1698,7 @@ generally regarded as valid C identifier names.</para>
<para>This flexibility is needed to write robust wrappers for POSIX pthread
functions, where typically we are not completely sure of either the
function name or the soname, or alternatively we want to wrap a whole
-bunch of functions at once.</para>
+set of functions at once.</para>
<para>For example, <computeroutput>pthread_create</computeroutput>
in GNU libpthread is usually a
@@ -1724,6 +1728,8 @@ a capital Z acts as an escape character, with the following encoding:</para>
Zs (space)
ZA @
ZZ Z
+ ZL ( # only in valgrind 3.3.0 and later
+ ZR ) # only in valgrind 3.3.0 and later
]]></programlisting>
<para>Hence <computeroutput>libpthreadZdsoZd0</computeroutput> is an
@@ -1741,7 +1747,7 @@ often unnecessary, so a second macro,
used instead. The <computeroutput>_ZU</computeroutput> variant is
also useful for writing wrappers for
C++ functions, in which the function name is usually already mangled
-using some other convention in which Z plays an important role; having
+using some other convention in which Z plays an important role. Having
to encode a second time quickly becomes confusing.</para>
<para>Since the function name field may contain wildcards, it can be
@@ -1752,6 +1758,11 @@ have sonames. Any object lacking a soname is treated as if its soname
was <computeroutput>NONE</computeroutput>, which is why the original
example above had a name
<computeroutput>I_WRAP_SONAME_FNNAME_ZU(NONE,foo)</computeroutput>.</para>
+
+<para>Note that the soname of an ELF object is not the same as its
+file name, although it is often similar. You can find the soname of
+an object <computeroutput>libfoo.so</computeroutput> using the command
+<computeroutput>readelf -a libfoo.so | grep soname</computeroutput>.</para>
</sect2>
<sect2 id="manual-core.wrapping.semantics" xreflabel="Wrapping Semantics">
@@ -1872,7 +1883,7 @@ the <computeroutput>OrigFn</computeroutput> information using
<computeroutput>VALGRIND_GET_ORIG_FN</computeroutput> before calling any
other wrapped function. Once you have the
<computeroutput>OrigFn</computeroutput>, arbitrary
-intercalling, recursion between, and longjumping out of wrappers
+calls between, recursion between, and longjumps out of wrappers
should work correctly. There is never any interaction between wrapped
functions and merely replaced functions
(eg <computeroutput>malloc</computeroutput>), so you can call
@@ -1968,14 +1979,16 @@ almost 300 different wrappers.</para>
<sect1 id="manual-core.install" xreflabel="Building and Installing">
-<title>Building and Installing</title>
+<title>Building and Installing Valgrind</title>
<para>We use the standard Unix
<computeroutput>./configure</computeroutput>,
<computeroutput>make</computeroutput>, <computeroutput>make
install</computeroutput> mechanism, and we have attempted to
ensure that it works on machines with kernel 2.4 or 2.6 and glibc
-2.2.X or 2.3.X. You may then want to run the regression tests
+2.2.X to 2.5.X. Once you have completed
+<computeroutput>make install</computeroutput> you may then want
+to run the regression tests
with <computeroutput>make regtest</computeroutput>.
</para>
@@ -2028,7 +2041,7 @@ with <computeroutput>make regtest</computeroutput>.
<para>The <computeroutput>configure</computeroutput> script tests
the version of the X server currently indicated by the current
<computeroutput>$DISPLAY</computeroutput>. This is a known bug.
-The intention was to detect the version of the current XFree86
+The intention was to detect the version of the current X
client libraries, so that correct suppressions could be selected
for them, but instead the test checks the server version. This
is just plain wrong.</para>
@@ -2058,10 +2071,8 @@ known not to work on it.</para>
internal self-checks. They are permanently enabled, and we have no
plans to disable them. If one of them breaks, please mail us!</para>
-<para>If you get an assertion failure on the expression
-<computeroutput>blockSane(ch)</computeroutput> in
-<computeroutput>VG_(free)()</computeroutput> in
-<filename>m_mallocfree.c</filename>, this may have happened because
+<para>If you get an assertion failure
+in <filename>m_mallocfree.c</filename>, this may have happened because
your program wrote off the end of a malloc'd block, or before its
beginning. Valgrind hopefully will have emitted a proper message to that
effect before dying in this way. This is a known problem which
@@ -2089,9 +2100,8 @@ following constraints:</para>
<para>On x86 and amd64, there is no support for 3DNow! instructions.
If the translator encounters these, Valgrind will generate a SIGILL
when the instruction is executed. Apart from that, on x86 and amd64,
- essentially all instructions are supported, up to and including SSE2.
- Version 3.1.0 includes limited support for SSE3 on x86. This could
- be improved if necessary.</para>
+ essentially all instructions are supported, up to and including SSE3.
+ </para>
<para>On ppc32 and ppc64, almost all integer, floating point and Altivec
instructions are supported. Specifically: integer and FP insns that are
@@ -2130,7 +2140,7 @@ following constraints:</para>
<para>Machine instructions, and system calls, have been implemented
on demand. So it's possible, although unlikely, that a program will
fall over with a message to that effect. If this happens, please
- report ALL the details printed out, so we can try and implement the
+ report all the details printed out, so we can try and implement the
missing feature.</para>
</listitem>
@@ -2207,7 +2217,7 @@ following constraints:</para>
precision control), it can print a message giving a traceback of
where this has happened, and continue execution. This behaviour used
to be the default, but the messages are annoying and so showing them
- is now optional. Use <option>--show-emwarns=yes</option> to see
+ is now disabled by default. Use <option>--show-emwarns=yes</option> to see
them.</para>
<para>The above limitations define precisely the IEEE754 'default'
@@ -2277,14 +2287,13 @@ following constraints:</para>
<sect1 id="manual-core.example" xreflabel="An Example Run">
<title>An Example Run</title>
-<para>This is the log for a run of a small program using Memcheck
+<para>This is the log for a run of a small program using Memcheck.
The program is in fact correct, and the reported error is the
result of a potentially serious code generation bug in GNU g++
(snapshot 20010527).</para>
<programlisting><![CDATA[
-sewardj@phoenix:~/newmat10$
-~/Valgrind-6/valgrind -v ./bogon
+sewardj@phoenix:~/newmat10$ ~/Valgrind-6/valgrind -v ./bogon
==25832== Valgrind 0.10, a memory error detector for x86 RedHat 7.1.
==25832== Copyright (C) 2000-2001, and GNU GPL'd, by Julian Seward.
==25832== Startup, with flags:
@@ -2355,14 +2364,14 @@ shipped.</para>
<listitem>
<para><computeroutput>Warning: client switching stacks?</computeroutput></para>
- <para>Valgrind spotted such a large change in the stack pointer,
- <literal>%esp</literal>, that it guesses the client is switching to
+ <para>Valgrind spotted such a large change in the stack pointer
+ that it guesses the client is switching to
a different stack. At this point it makes a kludgey guess where the
base of the new stack is, and sets memory permissions accordingly.
You may get many bogus error messages following this, if Valgrind
guesses wrong. At the moment "large change" is defined as a change
- of more that 2000000 in the value of the <literal>%esp</literal>
- (stack pointer) register.</para>
+ of more than 2000000 in the value of the
+ stack pointer register.</para>
</listitem>
<listitem>
@@ -2382,8 +2391,8 @@ shipped.</para>
<para>Valgrind observed a call to one of the vast family of
<computeroutput>ioctl</computeroutput> system calls, but did not
- modify its memory status info (because I have not yet got round to
- it). The call will still have gone through, but you may get
+ modify its memory status info (because nobody has yet written a
+ suitable wrapper). The call will still have gone through, but you may get
spurious errors after this as a result of the non-update of the
memory info.</para>
</listitem>
@@ -2426,7 +2435,7 @@ buffer which is too small.</para>
<para>Unlike most of the rest of Valgrind, the wrapper library is subject to a
BSD-style license, so you can link it into any code base you like.
See the top of <computeroutput>auxprogs/libmpiwrap.c</computeroutput>
-for details.</para>
+for license details.</para>
<sect2 id="manual-core.mpiwrap.build" xreflabel="Building MPI Wrappers">
@@ -2614,6 +2623,8 @@ PMPI_Sendrecv
PMPI_Type_commit PMPI_Type_free
+PMPI_Pack PMPI_Unpack
+
PMPI_Bcast PMPI_Gather PMPI_Scatter PMPI_Alltoall
PMPI_Reduce PMPI_Allreduce PMPI_Op_create
@@ -2758,6 +2769,18 @@ which lack proper wrappers but which are nevertheless used. You can
then write wrappers for them.
</para>
+<para>A known source of potential false errors is the
+<computeroutput>PMPI_Reduce</computeroutput> family of functions, when
+using a custom (user-defined) reduction function. In a reduction
+operation, each node notionally sends data to a "central point" which
+uses the specified reduction function to merge the data items into a
+single item. Hence, in general, data is passed between nodes and fed
+to the reduction function, but the wrapper library cannot mark the
+transferred data as initialised before it is handed to the reduction
+function, because all that happens "inside" the
+<computeroutput>PMPI_Reduce</computeroutput> call. As a result you
+may see false positives reported in your reduction function.</para>
+
</sect2>
</sect1>
diff --git a/docs/xml/manual-intro.xml b/docs/xml/manual-intro.xml
index eac13d64c..a4b1b84f4 100644
--- a/docs/xml/manual-intro.xml
+++ b/docs/xml/manual-intro.xml
@@ -23,7 +23,7 @@ summary, these are:</para>
<listitem>
<para><command>Memcheck</command> detects memory-management problems
- in your programs. All reads and writes of memory are checked, and
+ in programs. All reads and writes of memory are checked, and
calls to malloc/new/free/delete are intercepted. As a result,
Memcheck can detect the following problems:</para>
@@ -59,7 +59,7 @@ summary, these are:</para>
</itemizedlist>
<para>Problems like these can be difficult to find by other means,
- often lying undetected for long periods, then causing occasional,
+ often remaining undetected for long periods, then causing occasional,
difficult-to-diagnose crashes.</para>
</listitem>
@@ -67,46 +67,43 @@ summary, these are:</para>
<para><command>Cachegrind</command> is a cache profiler. It
performs detailed simulation of the I1, D1 and L2 caches in your CPU
and so can accurately pinpoint the sources of cache misses in your
- code. If you desire, it will show the number of cache misses,
+ code. It will show the number of cache misses,
memory references and instructions accruing to each line of source
code, with per-function, per-module and whole-program summaries. If
you ask really nicely it will even show counts for each individual
machine instruction.</para>
- <para>On x86 and AMD64, Cachegrind auto-detects your machine's cache
+  <para>On x86 and amd64, Cachegrind auto-detects your machine's cache
configuration using the <computeroutput>CPUID</computeroutput>
instruction, and so needs no further configuration info, in most
cases.</para>
+ </listitem>
- <para>Cachegrind is nicely complemented by Josef Weidendorfer's
- amazing KCacheGrind visualisation tool
- (<ulink url="http://kcachegrind.sourceforge.net/cgi-bin/show.cgi/KcacheGrindIndex">http://kcachegrind.sourceforge.net</ulink>),
- a KDE application which presents these profiling results in a
- graphical and easier-to-understand form.</para>
+ <listitem>
+ <para><command>Callgrind</command> is a profiler similar in
+ concept to Cachegrind, but which also tracks caller-callee
+ relationships. By doing so it is able to show how instruction,
+ memory reference and cache miss costs flow between callers and
+ callees. Callgrind collects a large amount of data which is best
+ navigated using Josef Weidendorfer's amazing KCachegrind
+ visualisation tool (<ulink
+ url="http://kcachegrind.sourceforge.net/cgi-bin/show.cgi/KcacheGrindIndex">http://kcachegrind.sourceforge.net</ulink>).
+ KCachegrind is a KDE application which presents
+ these profiling results in a
+ graphical and easy-to-understand form.</para>
</listitem>
<listitem>
- <para><command>Helgrind</command> finds data races in multithreaded
- programs. Helgrind looks for memory locations which are accessed by
- more than one (POSIX p-)thread, but for which no consistently used
- (pthread_mutex_)lock can be found. Such locations are indicative of
- missing synchronisation between threads, and could cause
- hard-to-find timing-dependent problems.</para>
-
- <para>Helgrind ("Hell's Gate", in Norse mythology) implements the
- so-called "Eraser" data-race-detection algorithm, along with various
- refinements (thread-segment lifetimes) which reduce the number of
- false errors it reports. It is as yet somewhat of an experimental
- tool, so your feedback is especially welcomed here.</para>
-
- <para>Helgrind has been hacked on extensively by Jeremy
- Fitzhardinge, and we have him to thank for getting it to a
- releasable state.</para>
-
- <para>NOTE: Helgrind is, unfortunately, not available in Valgrind
- 3.2.X, as a result of threading changes that happened in the 2.4.0
- release. We hope to reinstate its functionality in the future.
- </para>
+ <para><command>Massif</command> is a heap profiler.
+ It measures how much heap memory programs use. In particular,
+ it can give you information about heap blocks, heap
+ administration overheads, and stack sizes.</para>
+
+ <para>Heap profiling can help you reduce the amount of
+ memory your program uses. On modern machines with virtual
+ memory, this reduces the chances that your program will run out
+ of memory, and may make it faster by reducing the amount of
+ paging needed.</para>
</listitem>
</orderedlist>
@@ -123,13 +120,12 @@ integer and floating point operations your program does.</para>
<para>Valgrind is closely tied to details of the CPU and operating
system, and to a lesser extent, the compiler and basic C libraries.
Nonetheless, as of version 3.2.0 it supports several platforms:
-x86/Linux (mature), AMD64/Linux (maturing), PPC32/Linux and
-PPC64/Linux (less mature but work well in practice).
-Valgrind uses the standard Unix
+x86/Linux (mature), amd64/Linux (maturing), ppc32/Linux and
+ppc64/Linux (less mature but work well). Valgrind uses the standard Unix
<computeroutput>./configure</computeroutput>,
<computeroutput>make</computeroutput>, <computeroutput>make
install</computeroutput> mechanism, and we have attempted to ensure that
-it works on machines with kernel 2.4 or 2.6 and glibc
+it works on machines with Linux kernel 2.4.X or 2.6.X and glibc
2.2.X to 2.5.X.</para>
<para>Valgrind is licensed under the <xref linkend="license.gpl"/>,
@@ -150,7 +146,7 @@ Inc.</para>
<title>How to navigate this manual</title>
<para>The Valgrind distribution consists of the Valgrind core, upon
-which are built Valgrind tools, which do different kinds of debugging
+which are built Valgrind tools. The tools do different kinds of debugging
and profiling. This manual is structured similarly.</para>
<para>First, we describe the Valgrind core, how to use it, and the flags
diff --git a/docs/xml/manual.xml b/docs/xml/manual.xml
index e65218194..7cb29e522 100644
--- a/docs/xml/manual.xml
+++ b/docs/xml/manual.xml
@@ -30,8 +30,10 @@
xmlns:xi="http://www.w3.org/2001/XInclude" />
<xi:include href="../../massif/docs/ms-manual.xml" parse="xml"
xmlns:xi="http://www.w3.org/2001/XInclude" />
+<!--
<xi:include href="../../helgrind/docs/hg-manual.xml" parse="xml"
xmlns:xi="http://www.w3.org/2001/XInclude" />
+-->
<xi:include href="../../none/docs/nl-manual.xml" parse="xml"
xmlns:xi="http://www.w3.org/2001/XInclude" />
<xi:include href="../../lackey/docs/lk-manual.xml" parse="xml"
diff --git a/docs/xml/quick-start-guide.xml b/docs/xml/quick-start-guide.xml
index 771e06318..69655bdbf 100644
--- a/docs/xml/quick-start-guide.xml
+++ b/docs/xml/quick-start-guide.xml
@@ -25,8 +25,9 @@
<sect1 id="quick-start.intro" xreflabel="Introduction">
<title>Introduction</title>
-<para>The Valgrind distribution has multiple tools. The most popular is
-the memory checking tool (called Memcheck) which can detect many common
+<para>The Valgrind tool suite provides a number of debugging and
+profiling tools. The most popular is
+Memcheck, a memory checking tool which can detect many common
memory errors such as:</para>
<itemizedlist>
@@ -48,7 +49,7 @@ memory errors such as:</para>
<para>What follows is the minimum information you need to start
detecting memory errors in your program with Memcheck. Note that this
-guide applies to Valgrind version 2.4.0 and later; some of the
+guide applies to Valgrind version 2.4.0 and later. Some of the
information is not quite right for earlier versions.</para>
</sect1>
@@ -209,13 +210,13 @@ However, it is typically right 99% of the time, so you should be wary of
ignoring its error messages. After all, you wouldn't ignore warning
messages produced by a compiler, right? The suppression mechanism is
also useful if Memcheck is reporting errors in library code that you
-cannot change; the default suppression set hides a lot of these, but you
+cannot change. The default suppression set hides a lot of these, but you
may come across more.</para>
-<para>Memcheck also cannot detect every memory error your program has.
-For example, it can't detect if you overrun the bounds of an array that
-is allocated statically or on the stack. But it should detect every
-error that could crash your program (eg. cause a segmentation
+<para>Memcheck cannot detect every memory error your program has.
+For example, it can't detect out-of-range reads or writes to arrays
+that are allocated statically or on the stack. But it should detect many
+errors that could crash your program (eg. cause a segmentation
fault).</para>
</sect1>
diff --git a/lackey/docs/lk-manual.xml b/lackey/docs/lk-manual.xml
index 8949b5805..028035a73 100644
--- a/lackey/docs/lk-manual.xml
+++ b/lackey/docs/lk-manual.xml
@@ -17,7 +17,8 @@ command line.</para>
<para>Lackey is a simple valgrind tool that does some basic program
measurement. It adds quite a lot of simple instrumentation to the
program's code. It is primarily intended to be of use as an example
-tool.</para>
+tool, and consequently emphasises clarity of implementation
+over performance.</para>
<para>It measures and reports various things.</para>
@@ -36,7 +37,7 @@ tool.</para>
<computeroutput>_dl_runtime_resolve()</computeroutput>, the
function in glibc's dynamic linker that resolves function
references to shared objects.</para>
- <para>You can change the name of the function tracekd with command line
+ <para>You can change the name of the function tracked with command line
option <computeroutput>--fnname=&lt;name&gt;</computeroutput>.</para>
</listitem>
@@ -46,9 +47,9 @@ tool.</para>
</listitem>
<listitem>
- <para>The number of basic blocks entered and completed by the
+ <para>The number of superblocks entered and completed by the
program. Note that due to optimisations done by the JIT, this
- is not really an accurate value.</para>
+ is not at all an accurate value.</para>
</listitem>
<listitem>
@@ -127,12 +128,12 @@ to run absolutely utterly unbelievably slowly.</para>
<!-- start of xi:include in the manpage -->
<variablelist id="lk.opts.list">
- <varlistentry id="opt.fnname" xreflabel="--fnname">
+ <varlistentry id="opt.basic-counts" xreflabel="--basic-counts">
<term>
- <option><![CDATA[--fnname=<name> [default: _dl_runtime_resolve()] ]]></option>
+ <option><![CDATA[--basic-counts=<no|yes> [default: yes] ]]></option>
</term>
<listitem>
- <para>Count calls to &lt;name&gt;.</para>
+ <para>Count basic events, as described above.</para>
</listitem>
</varlistentry>
@@ -141,7 +142,17 @@ to run absolutely utterly unbelievably slowly.</para>
<option><![CDATA[--detailed-counts=<no|yes> [default: no] ]]></option>
</term>
<listitem>
- <para>Count loads, stores and alu ops.</para>
+ <para>Count loads, stores and alu ops, differentiated by their
+ IR types.</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="opt.fnname" xreflabel="--fnname">
+ <term>
+ <option><![CDATA[--fnname=<name> [default: _dl_runtime_resolve()] ]]></option>
+ </term>
+ <listitem>
+ <para>Count calls to the function &lt;name&gt;.</para>
</listitem>
</varlistentry>
@@ -150,8 +161,8 @@ to run absolutely utterly unbelievably slowly.</para>
<option><![CDATA[--trace-mem=<no|yes> [default: no] ]]></option>
</term>
<listitem>
- <para>Print a line of text giving the address and size of each
- data and instruction memory access done by the program.</para>
+ <para>Produce a log of all memory references, as described
+ above.</para>
</listitem>
</varlistentry>
diff --git a/massif/docs/ms-manual.xml b/massif/docs/ms-manual.xml
index f04f0d64b..4bbae501b 100644
--- a/massif/docs/ms-manual.xml
+++ b/massif/docs/ms-manual.xml
@@ -14,7 +14,7 @@ command line.</para>
<sect1 id="ms-manual.spaceprof" xreflabel="Heap profiling">
<title>Heap profiling</title>
-<para>Massif is a heap profiler, i.e. it measures how much heap
+<para>Massif is a heap profiler. It measures how much heap
memory programs use. In particular, it can give you information
about:</para>
@@ -113,8 +113,8 @@ be normally run.</para>
<para>Then, run your program with <computeroutput>valgrind
--tool=massif</computeroutput> in front of the normal command
line invocation. When the program finishes, Massif will print
-summary space statistics. It also creates a graph representing
-the program's heap usage in a file called
+summary space statistics. It also creates a graph showing
+the program's overall heap usage in a file called
<filename>massif.pid.ps</filename>, which can be read by any
PostScript viewer, such as Ghostview.</para>
@@ -181,8 +181,8 @@ possible parts of memory:</para>
<title>Spacetime Graphs</title>
<para>As well as printing summary information, Massif also
-creates a file representing a spacetime graph,
-<filename>massif.pid.hp</filename>. It will produce a file
+creates a graph showing the overall spacetime behaviour of the
+program, in a file
called <filename>massif.pid.ps</filename>, which can be viewed in
a PostScript viewer.</para>
@@ -290,9 +290,9 @@ spacetime</para>
<para>The first part shows the total spacetime due to heap
allocations, and the places in the program where most memory was
-allocated (Nb: if this program had been compiled with
+allocated. If this program had been compiled with
<computeroutput>-g</computeroutput>, actual line numbers would be
-given). These places are sorted, from most significant to least,
+given. These places are sorted, from most significant to least,
and correspond to the bands seen in the graph. Insignificant
sites (accounting for less than 0.5% of total spacetime) are
omitted.</para>
@@ -374,8 +374,8 @@ the real code address is.</para>
default, or HTML with the
<computeroutput>--format=html</computeroutput> option. The plain
text version obviously doesn't have the links, but a similar
-effect can be achieved by searching on the code addresses. (In
-Vim, the '*' and '#' searches are ideal for this.)</para>
+effect can be achieved by searching on the code addresses. In
+the Vim editor, the '*' and '#' searches are ideal for this.</para>
<sect2 id="ms-manual.accuracy" xreflabel="Accuracy">
diff --git a/memcheck/docs/mc-manual.xml b/memcheck/docs/mc-manual.xml
index 26aafe005..ac7f7d5fc 100644
--- a/memcheck/docs/mc-manual.xml
+++ b/memcheck/docs/mc-manual.xml
@@ -168,15 +168,15 @@ the following problems:</para>
<para>Controls how <constant>memcheck</constant> handles word-sized,
word-aligned loads from addresses for which some bytes are
addressable and others are not. When <varname>yes</varname>, such
- loads do not elicit an address error. Instead, the loaded V bytes
- corresponding to the illegal addresses indicate Undefined, and
- those corresponding to legal addresses are loaded from shadow
- memory, as usual.</para>
+ loads do not produce an address error. Instead, loaded bytes
+ originating from illegal addresses are marked as uninitialised, and
+ those corresponding to legal addresses are handled in the normal
+ way.</para>
<para>When <varname>no</varname>, loads from partially invalid
addresses are treated the same as loads from completely invalid
- addresses: an illegal-address error is issued, and the resulting V
- bytes indicate valid data.</para>
+ addresses: an illegal-address error is issued, and the resulting
+ bytes are marked as initialised.</para>
<para>Note that code that behaves in this way is in violation of
the ISO C/C++ standards, and should be considered broken. If
@@ -212,7 +212,7 @@ the following problems:</para>
<para>Despite considerable sophistication under the hood, Memcheck can
only really detect two kinds of errors: use of illegal addresses, and
use of undefined values. Nevertheless, this is enough to help you
-discover all sorts of memory-management nasties in your code. This
+discover all sorts of memory-management problems in your code. This
section presents a quick summary of what error messages mean. The
precise behaviour of the error-checking machinery is described in
<xref linkend="mc-manual.machine"/>.</para>
@@ -227,7 +227,7 @@ precise behaviour of the error-checking machinery is described in
Invalid read of size 4
at 0x40F6BBCC: (within /usr/lib/libpng.so.2.1.0.9)
by 0x40F6B804: (within /usr/lib/libpng.so.2.1.0.9)
- by 0x40B07FF4: read_png_image__FP8QImageIO (kernel/qpngio.cpp:326)
+ by 0x40B07FF4: read_png_image(QImageIO *) (kernel/qpngio.cpp:326)
by 0x40AC751B: QImageIO::read() (kernel/qimage.cpp:3621)
Address 0xBFFFF0E0 is not stack'd, malloc'd or free'd
]]></programlisting>
@@ -296,7 +296,8 @@ issued only when your program attempts to make use of uninitialised
data. In this example, x is uninitialised. Memcheck observes the value
being passed to <literal>_IO_printf</literal> and thence to
<literal>_IO_vfprintf</literal>, but makes no comment. However,
-_IO_vfprintf has to examine the value of x so it can turn it into the
+<literal>_IO_vfprintf</literal> has to examine the value of
+x so it can turn it into the
corresponding ASCII string, and it is at this point that Memcheck
complains.</para>
@@ -310,8 +311,7 @@ complains.</para>
<para>The contents of malloc'd blocks, before you write something
there. In C++, the new operator is a wrapper round malloc, so if
you create an object with new, its fields will be uninitialised
- until you (or the constructor) fill them in, which is only Right and
- Proper.</para>
+ until you (or the constructor) fill them in.</para>
</listitem>
</itemizedlist>
@@ -359,7 +359,7 @@ Mismatched free() / delete / delete []
by 0x4C261C41: PptDoc::~PptDoc(void) (include/qmemarray.h:60)
by 0x4C261F0E: PptXml::~PptXml(void) (pptxml.cc:44)
Address 0x4BB292A8 is 0 bytes inside a block of size 64 alloc'd
- at 0x4004318C: __builtin_vec_new (vg_clientfuncs.c:152)
+ at 0x4004318C: operator new[](unsigned int) (vg_clientfuncs.c:152)
by 0x4C21BC15: KLaola::readSBStream(int) const (klaola.cc:314)
by 0x4C21C155: KLaola::stream(KLaola::OLENode const *) (klaola.cc:416)
by 0x4C21788F: OLEFilter::convert(QCString const &) (olefilter.cc:272)
@@ -388,18 +388,18 @@ way compatible with how it was allocated. The deal is:</para>
</itemizedlist>
<para>The worst thing is that on Linux apparently it doesn't matter if
-you do muddle these up, and it all seems to work ok, but the same
-program may then crash on a different platform, Solaris for example. So
-it's best to fix it properly. According to the KDE folks "it's amazing
-how many C++ programmers don't know this".</para>
-
-<para>Pascal Massimino adds the following clarification:
-<function>delete[]</function> must be used for objects allocated by
-<function>new[]</function> because the compiler stores the size of the
-array and the pointer-to-member to the destructor of the array's content
-just before the pointer actually returned. This implies a
-variable-sized overhead in what's returned by <function>new</function>
-or <function>new[]</function>.</para>
+you do mix these up, but the same program may then crash on a
+different platform, Solaris for example. So it's best to fix it
+properly. According to the KDE folks "it's amazing how many C++
+programmers don't know this".</para>
+
+<para>The reason behind the requirement is as follows. In some C++
+implementations, <function>delete[]</function> must be used for
+objects allocated by <function>new[]</function> because the compiler
+stores the size of the array and the pointer-to-member to the
+destructor of the array's content just before the pointer that is
+actually returned. This implies a variable-sized overhead in what's returned
+by <function>new</function> or <function>new[]</function>.</para>
</sect2>
@@ -466,8 +466,8 @@ uninitialised value to <function>exit</function>. Note that the first
error refers to the memory pointed to by
<computeroutput>buf</computeroutput> (not
<computeroutput>buf</computeroutput> itself), but the second error
-refers to the argument <computeroutput>error_code</computeroutput>
-itself.</para>
+refers directly to <computeroutput>exit</computeroutput>'s argument
+<computeroutput>arr2[0]</computeroutput>.</para>
</sect2>
@@ -492,11 +492,10 @@ Memcheck checks for this.</para>
==27492== Source and destination overlap in memcpy(0xbffff294, 0xbffff280, 21)
==27492== at 0x40026CDC: memcpy (mc_replace_strmem.c:71)
==27492== by 0x804865A: main (overlap.c:40)
-==27492==
]]></programlisting>
<para>You don't want the two blocks to overlap because one of them could
-get partially trashed by the copying.</para>
+get partially overwritten by the copying.</para>
<para>You might think that Memcheck is being overly pedantic reporting
this in the case where <computeroutput>dst</computeroutput> is less than
@@ -508,6 +507,11 @@ Also, some implementations of <function>memcpy()</function> zero
<computeroutput>dst</computeroutput> before copying, because zeroing the
destination's cache line(s) can improve performance.</para>
+<para>In addition, for many of these functions, the POSIX standards
+have wording along the lines of "If copying takes place between
+objects that overlap, the behavior is undefined." Hence overlapping
+copies violate the standard.</para>
+
<para>The moral of the story is: if you want to write truly portable
code, don't make any assumptions about the language
implementation.</para>
@@ -585,9 +589,9 @@ which has no pointers to it. An indirect leak is a block which is only
pointed to by other leaked blocks. Both kinds of leak are bad.</para>
<para>The precise area of memory in which Memcheck searches for pointers
-is: all naturally-aligned machine-word-sized words for which all A bits
-indicate addressibility and all V bits indicated that the stored value
-is actually valid.</para>
+is: all naturally-aligned machine-word-sized words found in memory
+that Memcheck's records indicate are both accessible and initialised.
+</para>
</sect2>
@@ -601,7 +605,7 @@ is actually valid.</para>
<para>The basic suppression format is described in
<xref linkend="manual-core.suppress"/>.</para>
-<para>The suppression (2nd) line should have the form:</para>
+<para>The suppression-type (second) line should have the form:</para>
<programlisting><![CDATA[
Memcheck:suppression_type]]></programlisting>
@@ -619,13 +623,13 @@ Memcheck:suppression_type]]></programlisting>
</listitem>
<listitem>
- <para>Or: <varname>Cond</varname> (or its old
+ <para><varname>Cond</varname> (or its old
name, <varname>Value0</varname>), meaning use
of an uninitialised CPU condition code.</para>
</listitem>
<listitem>
- <para>Or: <varname>Addr1</varname>,
+ <para><varname>Addr1</varname>,
<varname>Addr2</varname>,
<varname>Addr4</varname>,
<varname>Addr8</varname>,
@@ -635,36 +639,37 @@ Memcheck:suppression_type]]></programlisting>
</listitem>
<listitem>
- <para>Or: <varname>Jump</varname>, meaning an
+  <para><varname>Jump</varname>, meaning a
jump to an unaddressable location error.</para>
</listitem>
<listitem>
- <para>Or: <varname>Param</varname>, meaning an
+ <para><varname>Param</varname>, meaning an
invalid system call parameter error.</para>
</listitem>
<listitem>
- <para>Or: <varname>Free</varname>, meaning an
+ <para><varname>Free</varname>, meaning an
invalid or mismatching free.</para>
</listitem>
<listitem>
- <para>Or: <varname>Overlap</varname>, meaning a
+ <para><varname>Overlap</varname>, meaning a
<computeroutput>src</computeroutput> /
<computeroutput>dst</computeroutput> overlap in
<function>memcpy()</function> or a similar function.</para>
</listitem>
<listitem>
- <para>Or: <varname>Leak</varname>, meaning
+ <para><varname>Leak</varname>, meaning
a memory leak.</para>
</listitem>
</itemizedlist>
-<para>The extra information line: for Param errors, is the name of the
-offending system call parameter. No other error kinds have this extra
+<para><computeroutput>Param</computeroutput> errors have an extra
+information line at this point, which is the name of the offending
+system call parameter. No other error kinds have this extra
line.</para>
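Putting the pieces together, a complete <computeroutput>Param</computeroutput> suppression might look like this (the suppression name, function names and object path are invented for illustration):

```
{
   hypothetical-libfoo-write-suppression
   Memcheck:Param
   write(buf)
   fun:__write_nocancel
   fun:flush_buffers
   obj:/usr/lib/libfoo.so.1
}
```

Here <computeroutput>write(buf)</computeroutput> is the extra information line naming the offending system call parameter, followed by the calling-context frames.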
<para>The first line of the calling context: for Value and Addr errors,
@@ -732,7 +737,8 @@ for ( i = 0; i < 10; i++ ) {
<para>Memcheck emits no complaints about this, since it merely copies
uninitialised values from <varname>a[]</varname> into
-<varname>b[]</varname>, and doesn't use them in any way. However, if
+<varname>b[]</varname>, and doesn't use them in a way which could
+affect the behaviour of the program. However, if
the loop is changed to:</para>
<programlisting><![CDATA[
for ( i = 0; i < 10; i++ ) {
@@ -742,7 +748,7 @@ if ( j == 77 )
printf("hello there\n");
]]></programlisting>
-<para>then Valgrind will complain, at the
+<para>then Memcheck will complain, at the
<computeroutput>if</computeroutput>, that the condition depends on
uninitialised values. Note that it <command>doesn't</command> complain
at the <varname>j += a[i];</varname>, since at that point the
@@ -757,13 +763,15 @@ complain.</para>
<para>Checks on definedness only occur in three places: when a value is
used to generate a memory address, when control flow decision needs to
-be made, and when a system call is detected, Valgrind checks definedness
+be made, and when a system call is detected, Memcheck checks definedness
of parameters as required.</para>
<para>If a check should detect undefinedness, an error message is
issued. The resulting value is subsequently regarded as well-defined.
-To do otherwise would give long chains of error messages. In effect, we
-say that undefined values are non-infectious.</para>
+To do otherwise would give long chains of error messages. In other
+words, once Memcheck reports an undefined value error, it tries to
+avoid reporting further errors derived from that same undefined
+value.</para>
<para>This sounds overcomplicated. Why not just check all reads from
memory, and complain if an undefined value is loaded into a CPU
@@ -782,18 +790,18 @@ s2 = s1;
<para>The question to ask is: how large is <varname>struct S</varname>,
in bytes? An <varname>int</varname> is 4 bytes and a
<varname>char</varname> one byte, so perhaps a <varname>struct
-S</varname> occupies 5 bytes? Wrong. All (non-toy) compilers we know
+S</varname> occupies 5 bytes? Wrong. All non-toy compilers we know
of will round the size of <varname>struct S</varname> up to a whole
number of words, in this case 8 bytes. Not doing this forces compilers
-to generate truly appalling code for subscripting arrays of
-<varname>struct S</varname>'s.</para>
+to generate truly appalling code for accessing arrays of
+<varname>struct S</varname>'s on some architectures.</para>
<para>So <varname>s1</varname> occupies 8 bytes, yet only 5 of them will
be initialised. For the assignment <varname>s2 = s1</varname>, gcc
generates code to copy all 8 bytes wholesale into <varname>s2</varname>
without regard for their meaning. If Memcheck simply checked values as
they came out of memory, it would yelp every time a structure assignment
-like this happened. So the more complicated semantics described above
+like this happened. So the more complicated behaviour described above
is necessary. This allows <literal>gcc</literal> to copy
<varname>s1</varname> into <varname>s2</varname> any way it likes, and a
warning will only be emitted if the uninitialised values are later
@@ -808,7 +816,7 @@ used.</para>
<para>Notice that the previous subsection describes how the validity of
values is established and maintained without having to say whether the
program does or does not have the right to access any particular memory
-location. We now consider the latter issue.</para>
+location. We now consider the latter question.</para>
<para>As described above, every bit in memory or in the CPU has an
associated valid-value (V) bit. In addition, all bytes in memory, but
@@ -853,13 +861,14 @@ themselves do not change the A bits, only consult them.</para>
<listitem>
<para>When doing system calls, A bits are changed appropriately.
- For example, mmap() magically makes files appear in the process'
- address space, so the A bits must be updated if mmap()
+ For example, <literal>mmap</literal>
+ magically makes files appear in the process'
+ address space, so the A bits must be updated if <literal>mmap</literal>
succeeds.</para>
</listitem>
<listitem>
- <para>Optionally, your program can tell Valgrind about such changes
+ <para>Optionally, your program can tell Memcheck about such changes
explicitly, using the client request mechanism described
above.</para>
</listitem>
@@ -885,7 +894,7 @@ follows:</para>
<listitem>
<para>When memory is read or written, the relevant A bits are
- consulted. If they indicate an invalid address, Valgrind emits an
+ consulted. If they indicate an invalid address, Memcheck emits an
Invalid read or Invalid write error.</para>
</listitem>
@@ -909,18 +918,18 @@ follows:</para>
<listitem>
<para>When values in CPU registers are used for any other purpose,
- Valgrind computes the V bits for the result, but does not check
+ Memcheck computes the V bits for the result, but does not check
them.</para>
</listitem>
<listitem>
- <para>One the V bits for a value in the CPU have been checked, they
+ <para>Once the V bits for a value in the CPU have been checked, they
are then set to indicate validity. This avoids long chains of
errors.</para>
</listitem>
<listitem>
- <para>When values are loaded from memory, valgrind checks the A bits
+ <para>When values are loaded from memory, Memcheck checks the A bits
for that location and issues an illegal-address warning if needed.
In that case, the V bits loaded are forced to indicate Valid,
despite the location being invalid.</para>
@@ -950,13 +959,13 @@ is:</para>
<listitem>
<para>malloc/new/new[]: the returned memory is marked as addressible
- but not having valid values. This means you have to write on it
+ but not having valid values. This means you have to write to it
before you can read it.</para>
</listitem>
<listitem>
<para>calloc: returned memory is marked both addressible and valid,
- since calloc() clears the area to zero.</para>
+ since calloc clears the area to zero.</para>
</listitem>
<listitem>
@@ -973,8 +982,8 @@ is:</para>
<listitem>
<para>free/delete/delete[]: you may only pass to these functions a
pointer previously issued to you by the corresponding allocation
- function. Otherwise, Valgrind complains. If the pointer is indeed
- valid, Valgrind marks the entire area it points at as unaddressible,
+ function. Otherwise, Memcheck complains. If the pointer is indeed
+  valid, Memcheck marks the entire area it points at as unaddressable,
and places the block in the freed-blocks-queue. The aim is to defer
as long as possible reallocation of this block. Until that happens,
all attempts to access it will elicit an invalid-address error, as
@@ -1050,10 +1059,10 @@ arguments.</para>
</listitem>
<listitem>
- <para><varname>VALGRIND_DO_LEAK_CHECK</varname>: run the memory leak
- detector right now. Returns no value. I guess this could be used
- to incrementally check for leaks between arbitrary places in the
- program's execution. Warning: not properly tested!</para>
+ <para><varname>VALGRIND_DO_LEAK_CHECK</varname>: runs the memory
+  leak detector right now. This is useful for incrementally checking
+  for leaks between arbitrary places in the program's execution.
+  Returns no value.</para>
</listitem>
<listitem>