author Android Build Coastguard Worker <android-build-coastguard-worker@google.com> 2023-07-07 05:05:03 +0000
committer Android Build Coastguard Worker <android-build-coastguard-worker@google.com> 2023-07-07 05:05:03 +0000
commit ff8bed5b235d6f3634157c1b49fbb400d630bf1a
tree 95826d3e96d80ddeb3a3585438d34344a22a0178 /doc/pcre2.txt
parent 747b015be24e8ef9bd83056f0527218d197d0536
parent 1ff4be9c7a25a26d3217db330589ce6c472a60a6
Change-Id: I0d1b301baa18c44c37de4492c282e121970a8696
Diffstat (limited to 'doc/pcre2.txt')
-rw-r--r-- doc/pcre2.txt 472
1 files changed, 246 insertions, 226 deletions
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index 641a1f9d..a997f73b 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -1028,7 +1028,7 @@ PCRE2 CONTEXTS
pcre2jit documentation for more details). If the limit is reached, the
negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default
limit can be set when PCRE2 is built; if it is not, the default is set
- very large and is essentially "unlimited".
+ very large and is essentially unlimited.
A value for the heap limit may also be supplied by an item at the start
of a pattern of the form
@@ -1039,19 +1039,15 @@ PCRE2 CONTEXTS
less ddd is less than the limit set by the caller of pcre2_match() or,
if no such limit is set, less than the default.
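As a rough illustration of the rule above (not PCRE2's own code), the limit that ends up in force can be sketched as follows. The helper name and its flag parameters are hypothetical, and the default of 20 million KiB is the value the build documentation gives for a library built without --with-heap-limit.

```c
#include <stdint.h>

/* Default heap limit in KiB when no build-time value was given
   (20 million, per the build documentation). */
#define DEFAULT_HEAP_LIMIT_KIB 20000000u

/* Hypothetical helper: returns the heap limit (in KiB) that applies to a
   match. A limit given in the pattern only takes effect when it is lower
   than the caller's limit or, if the caller set none, lower than the
   default. */
uint32_t effective_heap_limit_kib(int caller_set, uint32_t caller_kib,
                                  int pattern_set, uint32_t pattern_kib)
{
    uint32_t limit = caller_set ? caller_kib : DEFAULT_HEAP_LIMIT_KIB;

    if (pattern_set && pattern_kib < limit)
        limit = pattern_kib;   /* a pattern may lower the limit, never raise it */

    return limit;
}
```

For example, with a caller limit of 500 KiB, a pattern asking for 1000 KiB is ignored, while a pattern asking for 100 KiB reduces the limit to 100 KiB.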
- The pcre2_match() function starts out using a 20KiB vector on the sys-
- tem stack for recording backtracking points. The more nested backtrack-
- ing points there are (that is, the deeper the search tree), the more
- memory is needed. Heap memory is used only if the initial vector is
- too small. If the heap limit is set to a value less than 21 (in partic-
- ular, zero) no heap memory will be used. In this case, only patterns
- that do not have a lot of nested backtracking can be successfully pro-
- cessed.
+ The pcre2_match() function always needs some heap memory, so setting a
+ value of zero guarantees a "heap limit exceeded" error. Details of how
+ pcre2_match() uses the heap are given in the pcre2perform documenta-
+ tion.
- Similarly, for pcre2_dfa_match(), a vector on the system stack is used
- when processing pattern recursions, lookarounds, or atomic groups, and
- only if this is not big enough is heap memory used. In this case, too,
- setting a value of zero disables the use of the heap.
+ For pcre2_dfa_match(), a vector on the system stack is used when pro-
+ cessing pattern recursions, lookarounds, or atomic groups, and only if
+ this is not big enough is heap memory used. In this case, setting a
+ value of zero disables the use of the heap.
int pcre2_set_match_limit(pcre2_match_context *mcontext,
uint32_t value);
@@ -1093,12 +1089,12 @@ PCRE2 CONTEXTS
This parameter limits the depth of nested backtracking in
pcre2_match(). Each time a nested backtracking point is passed, a new
- memory "frame" is used to remember the state of matching at that point.
+ memory frame is used to remember the state of matching at that point.
Thus, this parameter indirectly limits the amount of memory that is
- used in a match. However, because the size of each memory "frame" de-
- pends on the number of capturing parentheses, the actual memory limit
- varies from pattern to pattern. This limit was more useful in versions
- before 10.30, where function recursion was used for backtracking.
+ used in a match. However, because the size of each memory frame depends
+ on the number of capturing parentheses, the actual memory limit varies
+ from pattern to pattern. This limit was more useful in versions before
+ 10.30, where function recursion was used for backtracking.
The depth limit is not relevant, and is ignored, when matching is done
using JIT compiled code. However, it is supported by pcre2_dfa_match(),
@@ -1372,27 +1368,29 @@ COMPILING A PATTERN
diately. Otherwise, the variables to which these point are set to an
error code and an offset (number of code units) within the pattern, re-
spectively, when pcre2_compile() returns NULL because a compilation er-
- ror has occurred. The values are not defined when compilation is suc-
- cessful and pcre2_compile() returns a non-NULL value.
+ ror has occurred.
- There are nearly 100 positive error codes that pcre2_compile() may re-
- turn if it finds an error in the pattern. There are also some negative
- error codes that are used for invalid UTF strings when validity check-
- ing is in force. These are the same as given by pcre2_match() and
+ There are nearly 100 positive error codes that pcre2_compile() may re-
+ turn if it finds an error in the pattern. There are also some negative
+ error codes that are used for invalid UTF strings when validity check-
+ ing is in force. These are the same as given by pcre2_match() and
pcre2_dfa_match(), and are described in the pcre2unicode documentation.
- There is no separate documentation for the positive error codes, be-
- cause the textual error messages that are obtained by calling the
+ There is no separate documentation for the positive error codes, be-
+ cause the textual error messages that are obtained by calling the
pcre2_get_error_message() function (see "Obtaining a textual error mes-
- sage" below) should be self-explanatory. Macro names starting with
- PCRE2_ERROR_ are defined for both positive and negative error codes in
- pcre2.h.
+ sage" below) should be self-explanatory. Macro names starting with
+ PCRE2_ERROR_ are defined for both positive and negative error codes in
+ pcre2.h. When compilation is successful errorcode is set to a value
+ that returns the message "no error" if passed to pcre2_get_error_mes-
+ sage().
The value returned in erroroffset is an indication of where in the pat-
- tern the error occurred. It is not necessarily the furthest point in
- the pattern that was read. For example, after the error "lookbehind as-
- sertion is not fixed length", the error offset points to the start of
- the failing assertion. For an invalid UTF-8 or UTF-16 string, the off-
- set is that of the first code unit of the failing character.
+ tern an error occurred. When there is no error, zero is returned. A
+ non-zero value is not necessarily the furthest point in the pattern
+ that was read. For example, after the error "lookbehind assertion is
+ not fixed length", the error offset points to the start of the failing
+ assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of
+ the first code unit of the failing character.
Some errors are not detected until the whole pattern has been scanned;
in these cases, the offset passed back is the length of the pattern.
@@ -2496,7 +2494,9 @@ THE MATCH DATA BLOCK
A minimum of at least 1 pair is imposed by pcre2_match_data_create(),
so it is always possible to return the overall matched string in the
case of pcre2_match() or the longest match in the case of
- pcre2_dfa_match().
+ pcre2_dfa_match(). The maximum number of pairs is 65535; if the first
+ argument of pcre2_match_data_create() is greater than this, 65535 is
+ used.
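The bounds described here amount to a simple clamp. The sketch below is only an illustration of that arithmetic; the function name is hypothetical and this is not the library's implementation.

```c
#include <stdint.h>

#define MAX_OVECTOR_PAIRS 65535u   /* documented maximum number of pairs */

/* Hypothetical helper: the number of offset pairs a match data block would
   hold for a given first argument to pcre2_match_data_create(). */
uint32_t effective_ovector_pairs(uint32_t requested)
{
    if (requested < 1u)
        return 1u;                     /* a minimum of one pair is imposed */
    if (requested > MAX_OVECTOR_PAIRS)
        return MAX_OVECTOR_PAIRS;      /* larger requests are capped */
    return requested;
}
```

A request for 0 pairs therefore yields 1, and a request for 70000 yields 65535.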
The second argument of pcre2_match_data_create() is a pointer to a gen-
eral context, which can specify custom memory management for obtaining
@@ -3049,12 +3049,12 @@ ERROR RETURNS FROM pcre2_match()
PCRE2_ERROR_NOMEMORY
- If a pattern contains many nested backtracking points, heap memory is
- used to remember them. This error is given when the memory allocation
- function (default or custom) fails. Note that a different error,
- PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds
- the heap limit. PCRE2_ERROR_NOMEMORY is also returned if
- PCRE2_COPY_MATCHED_SUBJECT is set and memory allocation fails.
+ Heap memory is used to remember backtracking points. This error is
+ given when the memory allocation function (default or custom) fails.
+ Note that a different error, PCRE2_ERROR_HEAPLIMIT, is given if the
+ amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is
+ also returned if PCRE2_COPY_MATCHED_SUBJECT is set and memory alloca-
+ tion fails.
PCRE2_ERROR_NULL
@@ -3858,8 +3858,8 @@ AUTHOR
REVISION
- Last updated: 14 December 2021
- Copyright (c) 1997-2021 University of Cambridge.
+ Last updated: 27 July 2022
+ Copyright (c) 1997-2022 University of Cambridge.
------------------------------------------------------------------------------
@@ -4116,41 +4116,40 @@ LIMITING PCRE2 RESOURCE USAGE
pcre2_dfa_match() matching function, and to JIT matching (though the
counting is done differently).
- The pcre2_match() function starts out using a 20KiB vector on the sys-
- tem stack to record backtracking points. The more nested backtracking
- points there are (that is, the deeper the search tree), the more memory
- is needed. If the initial vector is not large enough, heap memory is
- used, up to a certain limit, which is specified in kibibytes (units of
- 1024 bytes). The limit can be changed at run time, as described in the
- pcre2api documentation. The default limit (in effect unlimited) is 20
- million. You can change this by a setting such as
+ The pcre2_match() function uses heap memory to record backtracking
+ points. The more nested backtracking points there are (that is, the
+ deeper the search tree), the more memory is needed. There is an upper
+ limit, specified in kibibytes (units of 1024 bytes). This limit can be
+ changed at run time, as described in the pcre2api documentation. The
+ default limit (in effect unlimited) is 20 million. You can change this
+ by a setting such as
--with-heap-limit=500
- which limits the amount of heap to 500 KiB. This limit applies only to
+ which limits the amount of heap to 500 KiB. This limit applies only to
interpretive matching in pcre2_match() and pcre2_dfa_match(), which may
- also use the heap for internal workspace when processing complicated
- patterns. This limit does not apply when JIT (which has its own memory
+ also use the heap for internal workspace when processing complicated
+ patterns. This limit does not apply when JIT (which has its own memory
arrangements) is used.
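Because the heap limit is expressed in kibibytes, it can help to see the byte values involved. This is a small sketch with a hypothetical helper name; a 64-bit result is used because the default of 20 million KiB does not fit in 32 bits once converted to bytes.

```c
#include <stdint.h>

/* Hypothetical helper: convert a heap limit given in KiB (units of 1024
   bytes) to bytes. */
uint64_t heap_limit_bytes(uint32_t limit_kib)
{
    return (uint64_t)limit_kib * 1024u;
}
```

With --with-heap-limit=500 the limit is 512000 bytes; the default of 20 million KiB is 20480000000 bytes, which is why it is described as in effect unlimited.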
- You can also explicitly limit the depth of nested backtracking in the
+ You can also explicitly limit the depth of nested backtracking in the
pcre2_match() interpreter. This limit defaults to the value that is set
- for --with-match-limit. You can set a lower default limit by adding,
+ for --with-match-limit. You can set a lower default limit by adding,
for example,
--with-match-limit-depth=10000
- to the configure command. This value can be overridden at run time.
- This depth limit indirectly limits the amount of heap memory that is
- used, but because the size of each backtracking "frame" depends on the
- number of capturing parentheses in a pattern, the amount of heap that
- is used before the limit is reached varies from pattern to pattern.
+ to the configure command. This value can be overridden at run time.
+ This depth limit indirectly limits the amount of heap memory that is
+ used, but because the size of each backtracking "frame" depends on the
+ number of capturing parentheses in a pattern, the amount of heap that
+ is used before the limit is reached varies from pattern to pattern.
This limit was more useful in versions before 10.30, where function re-
cursion was used for backtracking.
As well as applying to pcre2_match(), the depth limit also controls the
- depth of recursive function calls in pcre2_dfa_match(). These are used
- for lookaround assertions, atomic groups, and recursion within pat-
+ depth of recursive function calls in pcre2_dfa_match(). These are used
+ for lookaround assertions, atomic groups, and recursion within pat-
terns. The limit does not apply to JIT matching.
@@ -4158,67 +4157,67 @@ CREATING CHARACTER TABLES AT BUILD TIME
PCRE2 uses fixed tables for processing characters whose code points are
less than 256. By default, PCRE2 is built with a set of tables that are
- distributed in the file src/pcre2_chartables.c.dist. These tables are
+ distributed in the file src/pcre2_chartables.c.dist. These tables are
for ASCII codes only. If you add
--enable-rebuild-chartables
- to the configure command, the distributed tables are no longer used.
+ to the configure command, the distributed tables are no longer used.
Instead, a program called pcre2_dftables is compiled and run. This out-
puts the source for a new set of tables, created in the default locale of
- your C run-time system. This method of replacing the tables does not
+ your C run-time system. This method of replacing the tables does not
work if you are cross compiling, because pcre2_dftables needs to be run
on the local host and therefore not compiled with the cross compiler.
If you need to create alternative tables when cross compiling, you will
- have to do so "by hand". There may also be other reasons for creating
- tables manually. To cause pcre2_dftables to be built on the local
+ have to do so "by hand". There may also be other reasons for creating
+ tables manually. To cause pcre2_dftables to be built on the local
host, run a normal compiling command, and then run the program with the
output file as its argument, for example:
cc src/pcre2_dftables.c -o pcre2_dftables
./pcre2_dftables src/pcre2_chartables.c
- This builds the tables in the default locale of the local host. If you
+ This builds the tables in the default locale of the local host. If you
want to specify a locale, you must use the -L option:
LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c
You can also specify -b (with or without -L). This causes the tables to
- be written in binary instead of as source code. A set of binary tables
- can be loaded into memory by an application and passed to pcre2_com-
+ be written in binary instead of as source code. A set of binary tables
+ can be loaded into memory by an application and passed to pcre2_com-
pile() in the same way as tables created by calling pcre2_maketables().
- The tables are just a string of bytes, independent of hardware charac-
- teristics such as endianness. This means they can be bundled with an
- application that runs in different environments, to ensure consistent
+ The tables are just a string of bytes, independent of hardware charac-
+ teristics such as endianness. This means they can be bundled with an
+ application that runs in different environments, to ensure consistent
behaviour.
USING EBCDIC CODE
- PCRE2 assumes by default that it will run in an environment where the
- character code is ASCII or Unicode, which is a superset of ASCII. This
+ PCRE2 assumes by default that it will run in an environment where the
+ character code is ASCII or Unicode, which is a superset of ASCII. This
is the case for most computer operating systems. PCRE2 can, however, be
compiled to run in an 8-bit EBCDIC environment by adding
--enable-ebcdic --disable-unicode
to the configure command. This setting implies --enable-rebuild-charta-
- bles. You should only use it if you know that you are in an EBCDIC en-
+ bles. You should only use it if you know that you are in an EBCDIC en-
vironment (for example, an IBM mainframe operating system).
- It is not possible to support both EBCDIC and UTF-8 codes in the same
- version of the library. Consequently, --enable-unicode and --enable-
+ It is not possible to support both EBCDIC and UTF-8 codes in the same
+ version of the library. Consequently, --enable-unicode and --enable-
ebcdic are mutually exclusive.
The EBCDIC character that corresponds to an ASCII LF is assumed to have
- the value 0x15 by default. However, in some EBCDIC environments, 0x25
+ the value 0x15 by default. However, in some EBCDIC environments, 0x25
is used. In such an environment you should use
--enable-ebcdic-nl25
as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
- has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
+ has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
acter (which, in Unicode, is 0x85).
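The LF/NEL pairing described above can be written out as a tiny function. This is an illustrative sketch only; the function name is hypothetical, and it assumes its argument is one of the two values named in the text (0x15 or 0x25).

```c
/* CR has the same value in EBCDIC as in ASCII. */
#define EBCDIC_CR 0x0d

/* Hypothetical helper: given the EBCDIC code chosen for LF (0x15 by
   default, 0x25 with --enable-ebcdic-nl25), return the code that is made
   to correspond to NEL, which is whichever of the two was not chosen. */
unsigned char ebcdic_nel_for_lf(unsigned char lf)
{
    return (lf == 0x15) ? 0x25 : 0x15;
}
```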
@@ -4230,47 +4229,47 @@ USING EBCDIC CODE
PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
By default pcre2grep supports the use of callouts with string arguments
- within the patterns it is matching. There are two kinds: one that gen-
+ within the patterns it is matching. There are two kinds: one that gen-
erates output using local code, and another that calls an external pro-
- gram or script. If --disable-pcre2grep-callout-fork is added to the
- configure command, only the first kind of callout is supported; if
- --disable-pcre2grep-callout is used, all callouts are completely ig-
- nored. For more details of pcre2grep callouts, see the pcre2grep docu-
+ gram or script. If --disable-pcre2grep-callout-fork is added to the
+ configure command, only the first kind of callout is supported; if
+ --disable-pcre2grep-callout is used, all callouts are completely ig-
+ nored. For more details of pcre2grep callouts, see the pcre2grep docu-
mentation.
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
- By default, pcre2grep reads all files as plain text. You can build it
- so that it recognizes files whose names end in .gz or .bz2, and reads
+ By default, pcre2grep reads all files as plain text. You can build it
+ so that it recognizes files whose names end in .gz or .bz2, and reads
them with libz or libbz2, respectively, by adding one or both of
--enable-pcre2grep-libz
--enable-pcre2grep-libbz2
to the configure command. These options naturally require that the rel-
- evant libraries are installed on your system. Configuration will fail
+ evant libraries are installed on your system. Configuration will fail
if they are not.
PCRE2GREP BUFFER SIZE
- pcre2grep uses an internal buffer to hold a "window" on the file it is
+ pcre2grep uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when
it finds a match. The default starting size of the buffer is 20KiB. The
- buffer itself is three times this size, but because of the way it is
+ buffer itself is three times this size, but because of the way it is
used for holding "before" lines, the longest line that is guaranteed to
be processable is the notional buffer size. If a longer line is encoun-
- tered, pcre2grep automatically expands the buffer, up to a specified
- maximum size, whose default is 1MiB or the starting size, whichever is
- the larger. You can change the default parameter values by adding, for
+ tered, pcre2grep automatically expands the buffer, up to a specified
+ maximum size, whose default is 1MiB or the starting size, whichever is
+ the larger. You can change the default parameter values by adding, for
example,
--with-pcre2grep-bufsize=51200
--with-pcre2grep-max-bufsize=2097152
- to the configure command. The caller of pcre2grep can override these
- values by using --buffer-size and --max-buffer-size on the command
+ to the configure command. The caller of pcre2grep can override these
+ values by using --buffer-size and --max-buffer-size on the command
line.
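The sizes mentioned above are related by simple arithmetic, sketched here with hypothetical helper names (this is not pcre2grep's own code). The notional buffer size is the value given by --with-pcre2grep-bufsize or --buffer-size.

```c
#include <stddef.h>

#define ONE_MIB ((size_t)1024 * 1024)

/* Hypothetical helper: memory actually allocated for the buffer, which is
   three times the notional size. */
size_t grep_allocated_bytes(size_t bufsize)
{
    return 3 * bufsize;
}

/* Hypothetical helper: default maximum size when none is specified,
   namely 1MiB or the starting size, whichever is the larger. */
size_t grep_default_max_bufsize(size_t bufsize)
{
    return (bufsize > ONE_MIB) ? bufsize : ONE_MIB;
}
```

With the default starting size of 20KiB (20480 bytes), the allocated buffer is 61440 bytes and the default maximum is 1048576 bytes.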
@@ -4281,26 +4280,26 @@ PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
--enable-pcre2test-libreadline
--enable-pcre2test-libedit
- to the configure command, pcre2test is linked with the libreadline or-
- libedit library, respectively, and when its input is from a terminal,
- it reads it using the readline() function. This provides line-editing
- and history facilities. Note that libreadline is GPL-licensed, so if
- you distribute a binary of pcre2test linked in this way, there may be
+ to the configure command, pcre2test is linked with the libreadline or
+ libedit library, respectively, and when its input is from a terminal,
+ it reads it using the readline() function. This provides line-editing
+ and history facilities. Note that libreadline is GPL-licensed, so if
+ you distribute a binary of pcre2test linked in this way, there may be
licensing issues. These can be avoided by linking instead with libedit,
which has a BSD licence.
- Setting --enable-pcre2test-libreadline causes the -lreadline option to
- be added to the pcre2test build. In many operating environments with a
- sytem-installed readline library this is sufficient. However, in some
+ Setting --enable-pcre2test-libreadline causes the -lreadline option to
+ be added to the pcre2test build. In many operating environments with a
+ system-installed readline library this is sufficient. However, in some
environments (e.g. if an unmodified distribution version of readline is
- in use), some extra configuration may be necessary. The INSTALL file
+ in use), some extra configuration may be necessary. The INSTALL file
for libreadline says this:
"Readline uses the termcap functions, but does not link with
the termcap or curses library itself, allowing applications
which link with readline the to choose an appropriate library."
- If your environment has not been set up so that an appropriate library
+ If your environment has not been set up so that an appropriate library
is automatically included, you may need to add something like
LIBS="-lncurses"
@@ -4314,7 +4313,7 @@ INCLUDING DEBUGGING CODE
--enable-debug
- to the configure command, additional debugging code is included in the
+ to the configure command, additional debugging code is included in the
build. This feature is intended for use by the PCRE2 maintainers.
@@ -4324,14 +4323,14 @@ DEBUGGING WITH VALGRIND SUPPORT
--enable-valgrind
- to the configure command, PCRE2 will use valgrind annotations to mark
- certain memory regions as unaddressable. This allows it to detect in-
+ to the configure command, PCRE2 will use valgrind annotations to mark
+ certain memory regions as unaddressable. This allows it to detect in-
valid memory accesses, and is mostly useful for debugging PCRE2 itself.
CODE COVERAGE REPORTING
- If your C compiler is gcc, you can build a version of PCRE2 that can
+ If your C compiler is gcc, you can build a version of PCRE2 that can
generate a code coverage report for its test suite. To enable this, you
must install lcov version 1.6 or above. Then specify
@@ -4340,20 +4339,20 @@ CODE COVERAGE REPORTING
to the configure command and build PCRE2 in the usual way.
Note that using ccache (a caching C compiler) is incompatible with code
- coverage reporting. If you have configured ccache to run automatically
+ coverage reporting. If you have configured ccache to run automatically
on your system, you must set the environment variable
CCACHE_DISABLE=1
before running make to build PCRE2, so that ccache is not used.
- When --enable-coverage is used, the following addition targets are
+ When --enable-coverage is used, the following additional targets are
added to the Makefile:
make coverage
- This creates a fresh coverage report for the PCRE2 test suite. It is
- equivalent to running "make coverage-reset", "make coverage-baseline",
+ This creates a fresh coverage report for the PCRE2 test suite. It is
+ equivalent to running "make coverage-reset", "make coverage-baseline",
"make check", and then "make coverage-report".
make coverage-reset
@@ -4370,73 +4369,73 @@ CODE COVERAGE REPORTING
make coverage-clean-report
- This removes the generated coverage report without cleaning the cover-
+ This removes the generated coverage report without cleaning the cover-
age data itself.
make coverage-clean-data
- This removes the captured coverage data without removing the coverage
+ This removes the captured coverage data without removing the coverage
files created at compile time (*.gcno).
make coverage-clean
- This cleans all coverage data including the generated coverage report.
- For more information about code coverage, see the gcov and lcov docu-
+ This cleans all coverage data including the generated coverage report.
+ For more information about code coverage, see the gcov and lcov docu-
mentation.
DISABLING THE Z AND T FORMATTING MODIFIERS
- The C99 standard defines formatting modifiers z and t for size_t and
- ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers
+ The C99 standard defines formatting modifiers z and t for size_t and
+ ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers
in environments other than old versions of Microsoft Visual Studio when
- __STDC_VERSION__ is defined and has a value greater than or equal to
- 199901L (indicating support for C99). However, there is at least one
+ __STDC_VERSION__ is defined and has a value greater than or equal to
+ 199901L (indicating support for C99). However, there is at least one
environment that claims to be C99 but does not support these modifiers.
If
--disable-percent-zt
is specified, no use is made of the z or t modifiers. Instead of %td or
- %zu, a suitable format is used depending in the size of long for the
+ %zu, a suitable format is used depending on the size of long for the
platform.
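The following sketch shows one way such a choice of format can be made at compile time. The macro DISABLE_PERCENT_ZT is a hypothetical stand-in for whatever the configure option defines, and the fallback through long is an assumption for illustration, not a copy of PCRE2's source.

```c
#include <stdio.h>
#include <stddef.h>

#ifndef DISABLE_PERCENT_ZT
/* C99 modifiers: z for size_t, t for ptrdiff_t. */
#define SIZE_FMT        "%zu"
#define SIZE_ARG(x)     (x)
#define PTRDIFF_FMT     "%td"
#define PTRDIFF_ARG(x)  (x)
#else
/* Fallback for environments without z and t: convert to long types. */
#define SIZE_FMT        "%lu"
#define SIZE_ARG(x)     ((unsigned long)(x))
#define PTRDIFF_FMT     "%ld"
#define PTRDIFF_ARG(x)  ((long)(x))
#endif

/* Format a size_t into buf; returns the snprintf() result. */
int format_size(char *buf, size_t buflen, size_t value)
{
    return snprintf(buf, buflen, SIZE_FMT, SIZE_ARG(value));
}

/* Format a ptrdiff_t into buf; returns the snprintf() result. */
int format_ptrdiff(char *buf, size_t buflen, ptrdiff_t value)
{
    return snprintf(buf, buflen, PTRDIFF_FMT, PTRDIFF_ARG(value));
}
```

The fallback assumes that long is wide enough for the values being printed, which is not true on every platform; that is why the text speaks of a format chosen according to the size of long.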
SUPPORT FOR FUZZERS
- There is a special option for use by people who want to run fuzzing
+ There is a special option for use by people who want to run fuzzing
tests on PCRE2:
--enable-fuzz-support
At present this applies only to the 8-bit library. If set, it causes an
- extra library called libpcre2-fuzzsupport.a to be built, but not in-
- stalled. This contains a single function called LLVMFuzzerTestOneIn-
- put() whose arguments are a pointer to a string and the length of the
- string. When called, this function tries to compile the string as a
- pattern, and if that succeeds, to match it. This is done both with no
- options and with some random options bits that are generated from the
+ extra library called libpcre2-fuzzsupport.a to be built, but not in-
+ stalled. This contains a single function called LLVMFuzzerTestOneIn-
+ put() whose arguments are a pointer to a string and the length of the
+ string. When called, this function tries to compile the string as a
+ pattern, and if that succeeds, to match it. This is done both with no
+ options and with some random options bits that are generated from the
string.
- Setting --enable-fuzz-support also causes a binary called pcre2fuz-
- zcheck to be created. This is normally run under valgrind or used when
+ Setting --enable-fuzz-support also causes a binary called pcre2fuz-
+ zcheck to be created. This is normally run under valgrind or used when
PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
- function and outputs information about what it is doing. The input
- strings are specified by arguments: if an argument starts with "=" the
- rest of it is a literal input string. Otherwise, it is assumed to be a
+ function and outputs information about what it is doing. The input
+ strings are specified by arguments: if an argument starts with "=" the
+ rest of it is a literal input string. Otherwise, it is assumed to be a
file name, and the contents of the file are the test string.
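The argument convention just described can be illustrated with a short sketch. The function below is hypothetical and is not taken from pcre2fuzzcheck; it only shows the "=" rule.

```c
#include <stddef.h>

/* Hypothetical helper: if arg starts with "=", return a pointer to the
   literal input string that follows it; otherwise return NULL, meaning
   that arg should be treated as a file name. */
const char *literal_input(const char *arg)
{
    if (arg != NULL && arg[0] == '=')
        return arg + 1;
    return NULL;
}
```

Under this rule the argument "=a(b)c" supplies the test string "a(b)c" directly, whereas "tests/input1" would be opened as a file.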
OBSOLETE OPTION
- In versions of PCRE2 prior to 10.30, there were two ways of handling
- backtracking in the pcre2_match() function. The default was to use the
+ In versions of PCRE2 prior to 10.30, there were two ways of handling
+ backtracking in the pcre2_match() function. The default was to use the
system stack, but if
--disable-stack-for-recursion
- was set, memory on the heap was used. From release 10.30 onwards this
- has changed (the stack is no longer used) and this option now does
+ was set, memory on the heap was used. From release 10.30 onwards this
+ has changed (the stack is no longer used) and this option now does
nothing except give a warning.
@@ -4448,14 +4447,14 @@ SEE ALSO
AUTHOR
Philip Hazel
- University Computing Service
+ Retired from University Computing Service
Cambridge, England.
REVISION
- Last updated: 08 December 2021
- Copyright (c) 1997-2021 University of Cambridge.
+ Last updated: 27 July 2022
+ Copyright (c) 1997-2022 University of Cambridge.
------------------------------------------------------------------------------
@@ -5594,18 +5593,22 @@ SIZE AND OTHER LIMITATIONS
The maximum length of a string argument to a callout is the largest
number a 32-bit unsigned integer can hold.
+ The maximum amount of heap memory used for matching is controlled by
+ the heap limit, which can be set in a pattern or in a match context.
+ The default is a very large number, effectively unlimited.
+
AUTHOR
Philip Hazel
- University Computing Service
+ Retired from University Computing Service
Cambridge, England.
REVISION
- Last updated: 02 February 2019
- Copyright (c) 1997-2019 University of Cambridge.
+ Last updated: 26 July 2022
+ Copyright (c) 1997-2022 University of Cambridge.
------------------------------------------------------------------------------
@@ -9771,152 +9774,169 @@ STACK AND HEAP USAGE AT RUN TIME
sive function calls could use a great deal of stack, and this could
cause problems, but this usage has been eliminated. Backtracking posi-
tions are now explicitly remembered in memory frames controlled by the
- code. An initial 20KiB vector of frames is allocated on the system
- stack (enough for about 100 frames for small patterns), but if this is
- insufficient, heap memory is used. The amount of heap memory can be
- limited; if the limit is set to zero, only the initial stack vector is
- used. Rewriting patterns to be time-efficient, as described below, may
- also reduce the memory requirements.
-
- In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
- function calls, but only for processing atomic groups, lookaround as-
+ code.
+
+ The size of each frame depends on the size of pointer variables and the
+ number of capturing parenthesized groups in the pattern being matched.
+ On a 64-bit system the frame size for a pattern with no captures is 128
+ bytes. For each capturing group the size increases by 16 bytes.
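As a worked example of the figures just given, the frame size on a 64-bit system can be computed as below. The function name is hypothetical, and the constants are the ones quoted in the text rather than values read from the library.

```c
#include <stddef.h>

/* Hypothetical helper: backtracking frame size in bytes on a 64-bit
   system, using the documented figures of 128 bytes for a pattern with no
   captures plus 16 bytes per capturing group. */
size_t frame_size_64bit(size_t capture_groups)
{
    return 128 + 16 * capture_groups;
}
```

A pattern with three capturing groups therefore uses 176-byte frames, and one with ten groups uses 288-byte frames.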
+
+ Until release 10.41, an initial 20KiB frames vector was allocated on
+ the system stack, but this still caused some issues for multi-thread
+ applications where each thread has a very small stack. From release
+ 10.41 backtracking memory frames are always held in heap memory. An
+ initial heap allocation is obtained the first time any match data block
+ is passed to pcre2_match(). This is remembered with the match data
+ block and re-used if that block is used for another match. It is freed
+ when the match data block itself is freed.
+
+ The size of the initial block is the larger of 20KiB or ten times the
+ pattern's frame size, unless the heap limit is less than this, in which
+ case the heap limit is used. If the initial block proves to be too
+ small during matching, it is replaced by a larger block, subject to the
+ heap limit. The heap limit is checked only when a new block is to be
+ allocated. Reducing the heap limit between calls to pcre2_match() with
+ the same match data block does not affect the saved block.
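The sizing rule for the initial block can be sketched as follows. This is an illustration of the description above under the assumption that the heap limit is supplied in KiB; the function name is hypothetical and the code is not taken from pcre2_match().

```c
#include <stdint.h>

/* Hypothetical helper: size in bytes of the initial heap block for
   backtracking frames. It is the larger of 20KiB and ten times the frame
   size, unless the heap limit (given in KiB) is smaller, in which case the
   heap limit is used. */
uint64_t initial_frames_block_bytes(uint64_t frame_size,
                                    uint64_t heap_limit_kib)
{
    uint64_t size = 20 * 1024;                  /* 20KiB baseline */
    uint64_t ten_frames = 10 * frame_size;
    uint64_t limit_bytes = heap_limit_kib * 1024;

    if (ten_frames > size)
        size = ten_frames;
    if (limit_bytes < size)
        size = limit_bytes;                     /* capped by the heap limit */

    return size;
}
```

For 128-byte frames and the default limit this gives 20480 bytes; for 4096-byte frames it gives 40960 bytes; with a heap limit of 10 KiB it gives 10240 bytes. A result of zero corresponds to a heap limit of zero, the case in which the API documentation says a "heap limit exceeded" error is guaranteed.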
+
+ In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
+ function calls, but only for processing atomic groups, lookaround as-
sertions, and recursion within the pattern. The original version of the
- code used to allocate quite large internal workspace vectors on the
- stack, which caused some problems for some patterns in environments
- with small stacks. From release 10.32 the code for pcre2_dfa_match()
- has been re-factored to use heap memory when necessary for internal
- workspace when recursing, though recursive function calls are still
+ code used to allocate quite large internal workspace vectors on the
+ stack, which caused some problems for some patterns in environments
+ with small stacks. From release 10.32 the code for pcre2_dfa_match()
+ has been re-factored to use heap memory when necessary for internal
+ workspace when recursing, though recursive function calls are still
used.
- The "match depth" parameter can be used to limit the depth of function
- recursion, and the "match heap" parameter to limit heap memory in
+ The "match depth" parameter can be used to limit the depth of function
+ recursion, and the "match heap" parameter to limit heap memory in
pcre2_dfa_match().
PROCESSING TIME
- Certain items in regular expression patterns are processed more effi-
+ Certain items in regular expression patterns are processed more effi-
ciently than others. It is more efficient to use a character class like
- [aeiou] than a set of single-character alternatives such as
- (a|e|i|o|u). In general, the simplest construction that provides the
+ [aeiou] than a set of single-character alternatives such as
+ (a|e|i|o|u). In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey Friedl's book
- contains a lot of useful general discussion about optimizing regular
+ contains a lot of useful general discussion about optimizing regular
expressions for efficient performance. This document contains a few ob-
servations about PCRE2.
- Using Unicode character properties (the \p, \P, and \X escapes) is
- slow, because PCRE2 has to use a multi-stage table lookup whenever it
- needs a character's property. If you can find an alternative pattern
+ Using Unicode character properties (the \p, \P, and \X escapes) is
+ slow, because PCRE2 has to use a multi-stage table lookup whenever it
+ needs a character's property. If you can find an alternative pattern
that does not use character properties, it will probably be faster.
- By default, the escape sequences \b, \d, \s, and \w, and the POSIX
- character classes such as [:alpha:] do not use Unicode properties,
+ By default, the escape sequences \b, \d, \s, and \w, and the POSIX
+ character classes such as [:alpha:] do not use Unicode properties,
partly for backwards compatibility, and partly for performance reasons.
- However, you can set the PCRE2_UCP option or start the pattern with
- (*UCP) if you want Unicode character properties to be used. This can
- double the matching time for items such as \d, when matched with
- pcre2_match(); the performance loss is less with a DFA matching func-
+ However, you can set the PCRE2_UCP option or start the pattern with
+ (*UCP) if you want Unicode character properties to be used. This can
+ double the matching time for items such as \d, when matched with
+ pcre2_match(); the performance loss is less with a DFA matching func-
tion, and in both cases there is not much difference for \b.
- When a pattern begins with .* not in atomic parentheses, nor in paren-
- theses that are the subject of a backreference, and the PCRE2_DOTALL
- option is set, the pattern is implicitly anchored by PCRE2, since it
- can match only at the start of a subject string. If the pattern has
+ When a pattern begins with .* not in atomic parentheses, nor in paren-
+ theses that are the subject of a backreference, and the PCRE2_DOTALL
+ option is set, the pattern is implicitly anchored by PCRE2, since it
+ can match only at the start of a subject string. If the pattern has
multiple top-level branches, they must all be anchorable. The optimiza-
- tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
+ tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au-
tomatically disabled if the pattern contains (*PRUNE) or (*SKIP).
- If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, be-
- cause the dot metacharacter does not then match a newline, and if the
- subject string contains newlines, the pattern may match from the char-
+ If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, be-
+ cause the dot metacharacter does not then match a newline, and if the
+ subject string contains newlines, the pattern may match from the char-
acter immediately following one of them instead of from the very start.
For example, the pattern
.*second
- matches the subject "first\nand second" (where \n stands for a newline
- character), with the match starting at the seventh character. In order
- to do this, PCRE2 has to retry the match starting after every newline
+ matches the subject "first\nand second" (where \n stands for a newline
+ character), with the match starting at the seventh character. In order
+ to do this, PCRE2 has to retry the match starting after every newline
in the subject.
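The effect can be reproduced with any engine whose dot does not match a newline by default. The sketch below uses Python's re module rather than PCRE2 itself, on the assumption that the two agree on this point; re.DOTALL stands in for PCRE2_DOTALL.

```python
import re

subject = "first\nand second"

# Without DOTALL the dot stops at the newline, so the match can only
# begin after it.
default_match = re.search(r".*second", subject)

# With DOTALL the dot also matches the newline, so the match begins at
# the very start of the subject.
dotall_match = re.search(r".*second", subject, re.DOTALL)

print(default_match.start(), dotall_match.start())
```

The first offset is 6, counting from zero, which is the seventh character; the second is 0.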
- If you are using such a pattern with subject strings that do not con-
- tain newlines, the best performance is obtained by setting
- PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate ex-
- plicit anchoring. That saves PCRE2 from having to scan along the sub-
+ If you are using such a pattern with subject strings that do not con-
+ tain newlines, the best performance is obtained by setting
+ PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate ex-
+ plicit anchoring. That saves PCRE2 from having to scan along the sub-
ject looking for a newline to restart at.
- Beware of patterns that contain nested indefinite repeats. These can
- take a long time to run when applied to a string that does not match.
+ Beware of patterns that contain nested indefinite repeats. These can
+ take a long time to run when applied to a string that does not match.
Consider the pattern fragment
^(a+)*
- This can match "aaaa" in 16 different ways, and this number increases
- very rapidly as the string gets longer. (The * repeat can match 0, 1,
- 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
- repeats can match different numbers of times.) When the remainder of
- the pattern is such that the entire match is going to fail, PCRE2 has
- in principle to try every possible variation, and this can take an ex-
+ This can match "aaaa" in 16 different ways, and this number increases
+ very rapidly as the string gets longer. (The * repeat can match 0, 1,
+ 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
+ repeats can match different numbers of times.) When the remainder of
+ the pattern is such that the entire match is going to fail, PCRE2 has
+ in principle to try every possible variation, and this can take an ex-
tremely long time, even for relatively short strings.
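The figure of 16 can be checked by counting. The Python sketch below assumes that a "way" is a choice of how many leading characters are matched, together with a division of those characters into non-empty runs, one per iteration of the group. The function names are made up for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def splits(k):
    """Count the ordered divisions of k characters into non-empty runs."""
    if k == 0:
        return 1  # the group iterates zero times
    # Choose the length of the first run, then divide up the rest.
    return sum(splits(k - first) for first in range(1, k + 1))


def count_ways(n):
    """Count the ways ^(a+)* can match a prefix of a string of n "a"s."""
    return sum(splits(k) for k in range(n + 1))


print(count_ways(4))
print(count_ways(20))
```

Under this assumption the count doubles with each additional character, which is why a failing match can become so expensive even for short subjects.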
An optimization catches some of the more simple cases such as
(a+)*b
- where a literal character follows. Before embarking on the standard
- matching procedure, PCRE2 checks that there is a "b" later in the sub-
- ject string, and if there is not, it fails the match immediately. How-
- ever, when there is no following literal this optimization cannot be
+ where a literal character follows. Before embarking on the standard
+ matching procedure, PCRE2 checks that there is a "b" later in the sub-
+ ject string, and if there is not, it fails the match immediately. How-
+ ever, when there is no following literal this optimization cannot be
used. You can see the difference by comparing the behaviour of
(a+)*\d
- with the pattern above. The former gives a failure almost instantly
- when applied to a whole line of "a" characters, whereas the latter
+ with the pattern above. The former gives a failure almost instantly
+ when applied to a whole line of "a" characters, whereas the latter
takes an appreciable time with strings longer than about 20 characters.
In many cases, the solution to this kind of performance issue is to use
- an atomic group or a possessive quantifier. This can often reduce mem-
+ an atomic group or a possessive quantifier. This can often reduce mem-
ory requirements as well. As another example, consider this pattern:
([^<]|<(?!inet))+
- It matches from wherever it starts until it encounters "<inet" or the
- end of the data, and is the kind of pattern that might be used when
+ It matches from wherever it starts until it encounters "<inet" or the
+ end of the data, and is the kind of pattern that might be used when
processing an XML file. Each iteration of the outer parentheses matches
- either one character that is not "<" or a "<" that is not followed by
- "inet". However, each time a parenthesis is processed, a backtracking
- position is passed, so this formulation uses a memory frame for each
+ either one character that is not "<" or a "<" that is not followed by
+ "inet". However, each time a parenthesis is processed, a backtracking
+ position is passed, so this formulation uses a memory frame for each
matched character. For a long string, a lot of memory is required. Con-
- sider now this rewritten pattern, which matches exactly the same
+ sider now this rewritten pattern, which matches exactly the same
strings:
([^<]++|<(?!inet))+
This runs much faster, because sequences of characters that do not con-
tain "<" are "swallowed" in one item inside the parentheses, and a pos-
- sessive quantifier is used to stop any backtracking into the runs of
- non-"<" characters. This version also uses a lot less memory because
- entry to a new set of parentheses happens only when a "<" character
- that is not followed by "inet" is encountered (and we assume this is
+ sessive quantifier is used to stop any backtracking into the runs of
+ non-"<" characters. This version also uses a lot less memory because
+ entry to a new set of parentheses happens only when a "<" character
+ that is not followed by "inet" is encountered (and we assume this is
relatively rare).
This example shows that one way of optimizing performance when matching
- long subject strings is to write repeated parenthesized subpatterns to
+ long subject strings is to write repeated parenthesized subpatterns to
match more than one character whenever possible.
SETTING RESOURCE LIMITS
- You can set limits on the amount of processing that takes place when
- matching, and on the amount of heap memory that is used. The default
+ You can set limits on the amount of processing that takes place when
+ matching, and on the amount of heap memory that is used. The default
values of the limits are very large, and unlikely ever to operate. They
- can be changed when PCRE2 is built, and they can also be set when
- pcre2_match() or pcre2_dfa_match() is called. For details of these in-
- terfaces, see the pcre2build documentation and the section entitled
+ can be changed when PCRE2 is built, and they can also be set when
+ pcre2_match() or pcre2_dfa_match() is called. For details of these in-
+ terfaces, see the pcre2build documentation and the section entitled
"The match context" in the pcre2api documentation.
- The pcre2test test program has a modifier called "find_limits" which,
- if applied to a subject line, causes it to find the smallest limits
+ The pcre2test test program has a modifier called "find_limits" which,
+ if applied to a subject line, causes it to find the smallest limits
that allow a pattern to match. This is done by repeatedly matching with
different limits.
@@ -9924,14 +9944,14 @@ PROCESSING TIME
AUTHOR
Philip Hazel
- University Computing Service
+ Retired from University Computing Service
Cambridge, England.
REVISION
- Last updated: 03 February 2019
- Copyright (c) 1997-2019 University of Cambridge.
+ Last updated: 27 July 2022
+ Copyright (c) 1997-2022 University of Cambridge.
------------------------------------------------------------------------------
@@ -10434,7 +10454,7 @@ SAVING COMPILED PATTERNS
PCRE2_ERROR_BADDATA the number of patterns is zero or less
PCRE2_ERROR_BADMAGIC mismatch of id bytes in one of the patterns
- PCRE2_ERROR_MEMORY memory allocation failed
+ PCRE2_ERROR_NOMEMORY memory allocation failed
PCRE2_ERROR_MIXEDTABLES the patterns do not all use the same tables
PCRE2_ERROR_NULL the 1st, 3rd, or 4th argument is NULL