aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorPhilip Hazel <Philip.Hazel@gmail.com>2023-11-30 16:05:33 +0000
committerPhilip Hazel <Philip.Hazel@gmail.com>2023-11-30 16:05:33 +0000
commit8e83acc59946c15e3faa8be8617e4fb438370704 (patch)
treeffe4a8164087570f9867f32e60d0c9e339ba9f55
parent86919c90182854b5369bc10d5ad43b637f464f50 (diff)
downloadpcre-8e83acc59946c15e3faa8be8617e4fb438370704.tar.gz
Upgrade interpreter to match JIT in handling of nested pattern recursions
-rw-r--r--ChangeLog8
-rw-r--r--doc/html/pcre2compat.html8
-rw-r--r--doc/pcre2.txt9
-rw-r--r--doc/pcre2compat.39
-rw-r--r--doc/pcre2demo.32
-rw-r--r--maint/README5
-rw-r--r--src/pcre2_dfa_match.c17
-rw-r--r--src/pcre2_intmodedep.h34
-rw-r--r--src/pcre2_match.c22
-rw-r--r--testdata/testinput13
-rw-r--r--testdata/testinput212
-rw-r--r--testdata/testoutput14
-rw-r--r--testdata/testoutput216
13 files changed, 87 insertions, 62 deletions
diff --git a/ChangeLog b/ChangeLog
index 995d02bc..b274adf1 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -172,8 +172,12 @@ undefined behaviour.
47. Refactor the handling of whole-pattern recursion (?0) in pcre2_match() so
that its end is handled similarly to other recursions. This has altered the
-behaviour of /|(?0)./endanchored which was previously not right. However,
-it still differs from JIT.
+behaviour of /|(?0)./endanchored which was previously not right.
+
+48. Improved the test for looping recursion by checking the last referenced
+character as well as the current character. This allows some patterns that
+previously triggered the check to run to completion instead of giving the loop
+error.
Version 10.42 11-December-2022
diff --git a/doc/html/pcre2compat.html b/doc/html/pcre2compat.html
index b3165bbb..d60182ed 100644
--- a/doc/html/pcre2compat.html
+++ b/doc/html/pcre2compat.html
@@ -246,6 +246,12 @@ to set characters by context just like Perl's "/d". A regular expression using
PCRE2_UTF and PCRE2_UCP will use similar rules to Perl's "/u"; something closer
to "/a" could be selected by adding other PCRE2_EXTRA_ASCII* options on top.
</P>
+<P>
+22. Some recursive patterns that Perl diagnoses as infinite recursions can be
+handled by PCRE2, either by the interpreter or the JIT. An example is
+/(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number of repeated
+"abcd" substrings at the end of the subject.
+</P>
<br><b>
AUTHOR
</b><br>
@@ -261,7 +267,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
-Last updated: 12 October 2023
+Last updated: 30 November 2023
<br>
Copyright &copy; 1997-2023 University of Cambridge.
<br>
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index 325495a1..bee12b8b 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -5256,6 +5256,11 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
to Perl's "/u"; something closer to "/a" could be selected by adding
other PCRE2_EXTRA_ASCII* options on top.
+ 22. Some recursive patterns that Perl diagnoses as infinite recursions
+ can be handled by PCRE2, either by the interpreter or the JIT. An exam-
+ ple is /(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number
+ of repeated "abcd" substrings at the end of the subject.
+
AUTHOR
@@ -5266,11 +5271,11 @@ AUTHOR
REVISION
- Last updated: 12 October 2023
+ Last updated: 30 November 2023
Copyright (c) 1997-2023 University of Cambridge.
-PCRE2 10.43 19 September 2023 PCRE2COMPAT(3)
+PCRE2 10.43 30 November 2023 PCRE2COMPAT(3)
------------------------------------------------------------------------------
diff --git a/doc/pcre2compat.3 b/doc/pcre2compat.3
index fa13fc63..8313e03a 100644
--- a/doc/pcre2compat.3
+++ b/doc/pcre2compat.3
@@ -1,4 +1,4 @@
-.TH PCRE2COMPAT 3 "19 September 2023" "PCRE2 10.43"
+.TH PCRE2COMPAT 3 "30 November 2023" "PCRE2 10.43"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@@ -210,6 +210,11 @@ fall into any stack-overflow limit. PCRE2 made a similar change at release
to set characters by context just like Perl's "/d". A regular expression using
PCRE2_UTF and PCRE2_UCP will use similar rules to Perl's "/u"; something closer
to "/a" could be selected by adding other PCRE2_EXTRA_ASCII* options on top.
+.P
+22. Some recursive patterns that Perl diagnoses as infinite recursions can be
+handled by PCRE2, either by the interpreter or the JIT. An example is
+/(?:|(?0)abcd)(?(R)|\ez)/, which matches a sequence of any number of repeated
+"abcd" substrings at the end of the subject.
.
.
.SH AUTHOR
@@ -226,6 +231,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 12 October 2023
+Last updated: 30 November 2023
Copyright (c) 1997-2023 University of Cambridge.
.fi
diff --git a/doc/pcre2demo.3 b/doc/pcre2demo.3
index 87f73cde..b134dda3 100644
--- a/doc/pcre2demo.3
+++ b/doc/pcre2demo.3
@@ -1,4 +1,4 @@
-.TH PCRE2DEMO 3 "24 November 2023" "PCRE2 10.43-DEV"
+.TH PCRE2DEMO 3 "30 November 2023" "PCRE2 10.43-DEV"
.\"AUTOMATICALLY GENERATED BY PrepareRelease - do not EDIT!
.SH NAME
// - A demonstration C program for PCRE2 - //
diff --git a/maint/README b/maint/README
index df4bce18..a5be5961 100644
--- a/maint/README
+++ b/maint/README
@@ -448,12 +448,9 @@ years.
gigabyte memory, but perhaps another implementation might be considered.
Needs coordination between the interpreters and JIT.
-. There are regular requests for variable-length lookbehinds. An implementation
- exists but is missing JIT support.
-
. See also any suggestions in the GitHub issues.
Philip Hazel
Email local part: Philip.Hazel
Email domain: gmail.com
-Last updated: 30 September 2023
+Last updated: 30 November 2023
diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c
index 57684073..1c48ad67 100644
--- a/src/pcre2_dfa_match.c
+++ b/src/pcre2_dfa_match.c
@@ -2913,7 +2913,6 @@ for (;;)
int *local_workspace;
PCRE2_SIZE *local_offsets;
RWS_anchor *rws = (RWS_anchor *)RWS;
- dfa_recursion_info *ri;
PCRE2_SPTR callpat = start_code + GET(code, 1);
uint32_t recno = (callpat == mb->start_code)? 0 :
GET2(callpat, 1 + LINK_SIZE);
@@ -2930,18 +2929,24 @@ for (;;)
rws->free -= RWS_RSIZE + RWS_OVEC_RSIZE;
/* Check for repeating a recursion without advancing the subject
- pointer. This should catch convoluted mutual recursions. (Some simple
- cases are caught at compile time.) */
+ pointer or last used character. This should catch convoluted mutual
+ recursions. (Some simple cases are caught at compile time.) */
- for (ri = mb->recursive; ri != NULL; ri = ri->prevrec)
- if (recno == ri->group_num && ptr == ri->subject_position)
+ for (dfa_recursion_info *ri = mb->recursive;
+ ri != NULL;
+ ri = ri->prevrec)
+ {
+ if (recno == ri->group_num && ptr == ri->subject_position &&
+ mb->last_used_ptr == ri->last_used_ptr)
return PCRE2_ERROR_RECURSELOOP;
+ }
/* Remember this recursion and where we started it so as to
catch infinite loops. */
new_recursive.group_num = recno;
new_recursive.subject_position = ptr;
+ new_recursive.last_used_ptr = mb->last_used_ptr;
new_recursive.prevrec = mb->recursive;
mb->recursive = &new_recursive;
@@ -4015,7 +4020,7 @@ for (;;)
}
match_data->subject_length = length;
match_data->leftchar = (PCRE2_SIZE)(mb->start_used_ptr - subject);
- match_data->rightchar = (PCRE2_SIZE)( mb->last_used_ptr - subject);
+ match_data->rightchar = (PCRE2_SIZE)(mb->last_used_ptr - subject);
match_data->startchar = (PCRE2_SIZE)(start_match - subject);
match_data->rc = rc;
diff --git a/src/pcre2_intmodedep.h b/src/pcre2_intmodedep.h
index 423764d2..5fcddce5 100644
--- a/src/pcre2_intmodedep.h
+++ b/src/pcre2_intmodedep.h
@@ -677,8 +677,8 @@ typedef struct pcre2_real_match_data {
#ifndef PCRE2_PCRE2TEST
-/* Structures for checking for mutual recursion when scanning compiled or
-parsed code. */
+/* Structures for checking for mutual function recursion when scanning compiled
+or parsed code. */
typedef struct recurse_check {
struct recurse_check *prev;
@@ -690,7 +690,7 @@ typedef struct parsed_recurse_check {
uint32_t *groupptr;
} parsed_recurse_check;
-/* Structure for building a cache when filling in recursion offsets. */
+/* Structure for building a cache when filling in pattern recursion offsets. */
typedef struct recurse_cache {
PCRE2_SPTR group;
@@ -757,7 +757,7 @@ typedef struct compile_block {
int max_lookbehind; /* Maximum lookbehind encountered (characters) */
BOOL had_accept; /* (*ACCEPT) encountered */
BOOL had_pruneorskip; /* (*PRUNE) or (*SKIP) encountered */
- BOOL had_recurse; /* Had a recursion or subroutine call */
+ BOOL had_recurse; /* Had a pattern recursion or subroutine call */
BOOL dupnames; /* Duplicate names exist */
} compile_block;
@@ -775,6 +775,7 @@ call within the pattern when running pcre2_dfa_match(). */
typedef struct dfa_recursion_info {
struct dfa_recursion_info *prevrec;
PCRE2_SPTR subject_position;
+ PCRE2_SPTR last_used_ptr;
uint32_t group_num;
} dfa_recursion_info;
@@ -795,7 +796,7 @@ typedef struct heapframe {
PCRE2_SIZE length; /* Used for character, string, or code lengths */
PCRE2_SIZE back_frame; /* Amount to subtract on RRETURN */
PCRE2_SIZE temp_size; /* Used for short-term PCRE2_SIZE values */
- uint32_t rdepth; /* "Recursion" depth */
+ uint32_t rdepth; /* Function "recursion" depth within pcre2_match() */
uint32_t group_frame_type; /* Type information for group frames */
uint32_t temp_32[4]; /* Used for short-term 32-bit or BOOL values */
uint8_t return_id; /* Where to go on in internal "return" */
@@ -828,14 +829,15 @@ typedef struct heapframe {
allows for exactly the right size ovector for the number of capturing
parentheses. (See also the comment for pcre2_real_match_data above.) */
- PCRE2_SPTR eptr; /* MUST BE FIRST */
- PCRE2_SPTR start_match; /* Can be adjusted by \K */
- PCRE2_SPTR mark; /* Most recent mark on the success path */
- uint32_t current_recurse; /* Current (deepest) recursion number */
- uint32_t capture_last; /* Most recent capture */
- PCRE2_SIZE last_group_offset; /* Saved offset to most recent group frame */
- PCRE2_SIZE offset_top; /* Offset after highest capture */
- PCRE2_SIZE ovector[131072]; /* Must be last in the structure */
+ PCRE2_SPTR eptr; /* MUST BE FIRST */
+ PCRE2_SPTR start_match; /* Can be adjusted by \K */
+ PCRE2_SPTR mark; /* Most recent mark on the success path */
+ PCRE2_SPTR recurse_last_used; /* Last character used at time of pattern recursion */
+ uint32_t current_recurse; /* Group number of current (deepest) pattern recursion */
+ uint32_t capture_last; /* Most recent capture */
+ PCRE2_SIZE last_group_offset; /* Saved offset to most recent group frame */
+ PCRE2_SIZE offset_top; /* Offset after highest capture */
+ PCRE2_SIZE ovector[131072]; /* Must be last in the structure */
} heapframe;
/* This typedef is a check that the size of the heapframe structure is a
@@ -877,7 +879,7 @@ typedef struct match_block {
uint16_t name_count; /* Number of names in name table */
uint16_t name_entry_size; /* Size of entry in names table */
PCRE2_SPTR name_table; /* Table of group names */
- PCRE2_SPTR start_code; /* For use when recursing */
+ PCRE2_SPTR start_code; /* For use in pattern recursion */
PCRE2_SPTR start_subject; /* Start of the subject string */
PCRE2_SPTR check_subject; /* Where UTF-checked from */
PCRE2_SPTR end_subject; /* Usable end of the subject string */
@@ -889,7 +891,7 @@ typedef struct match_block {
PCRE2_SPTR nomatch_mark; /* Mark pointer to pass back on failure */
PCRE2_SPTR verb_ecode_ptr; /* For passing back info */
PCRE2_SPTR verb_skip_ptr; /* For passing back a (*SKIP) name */
- uint32_t verb_current_recurse; /* Current recurse when (*VERB) happens */
+ uint32_t verb_current_recurse; /* Current recursion group when (*VERB) happens */
uint32_t moptions; /* Match options */
uint32_t poptions; /* Pattern options */
uint32_t skip_arg_count; /* For counting SKIP_ARGs */
@@ -929,7 +931,7 @@ typedef struct dfa_match_block {
pcre2_callout_block *cb; /* Points to a callout block */
void *callout_data; /* To pass back to callouts */
int (*callout)(pcre2_callout_block *,void *); /* Callout function or NULL */
- dfa_recursion_info *recursive; /* Linked list of recursion data */
+ dfa_recursion_info *recursive; /* Linked list of pattern recursion data */
} dfa_match_block;
#endif /* PCRE2_PCRE2TEST */
diff --git a/src/pcre2_match.c b/src/pcre2_match.c
index 5e9facc5..d162e707 100644
--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@@ -5383,9 +5383,11 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode,
/* ===================================================================== */
- /* Recursion either matches the current regex, or some subexpression. The
- offset data is the offset to the starting bracket from the start of the
- whole pattern. (This is so that it works from duplicated subpatterns.) */
+ /* Pattern recursion either matches the current regex, or some
+ subexpression. The offset data is the offset to the starting bracket from
+ the start of the whole pattern. This is so that it works from duplicated
+ subpatterns. For a whole-pattern recursion, we have to infer the number
+ zero. */
#define Lframe_type F->temp_32[0]
#define Lstart_branch F->temp_sptr[0]
@@ -5394,9 +5396,10 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode,
bracode = mb->start_code + GET(Fecode, 1);
number = (bracode == mb->start_code)? 0 : GET2(bracode, 1 + LINK_SIZE);
- /* If we are already in a recursion, check for repeating the same one
- without advancing the subject pointer. This should catch convoluted mutual
- recursions. (Some simple cases are caught at compile time.) */
+ /* If we are already in a pattern recursion, check for repeating the same
+ one without changing the subject pointer or the last referenced character
+ in the subject. This should catch convoluted mutual recursions. (Some
+ simple cases are caught at compile time.) */
if (Fcurrent_recurse != RECURSE_UNSET)
{
@@ -5407,15 +5410,18 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode,
P = (heapframe *)((char *)N - frame_size);
if (N->group_frame_type == (GF_RECURSE | number))
{
- if (Feptr == P->eptr) return PCRE2_ERROR_RECURSELOOP;
+ if (Feptr == P->eptr && mb->last_used_ptr == P->recurse_last_used)
+ return PCRE2_ERROR_RECURSELOOP;
break;
}
offset = P->last_group_offset;
}
}
- /* Now run the recursion, branch by branch. */
+ /* Remember the current last referenced character and then run the
+ recursion branch by branch. */
+ F->recurse_last_used = mb->last_used_ptr;
Lstart_branch = bracode;
Lframe_type = GF_RECURSE | number;
diff --git a/testdata/testinput1 b/testdata/testinput1
index c0da415e..e7140601 100644
--- a/testdata/testinput1
+++ b/testdata/testinput1
@@ -6648,4 +6648,7 @@ $/x
a[]b
(a)(?(1)a|b|c)
+/^..A(*SKIP)B|C/
+ 12ADC
+
# End of testinput1
diff --git a/testdata/testinput2 b/testdata/testinput2
index bfa088c7..f1d369a0 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -6066,19 +6066,13 @@ a)"xI
/\G(?:(?=(\1.|)(.))){1,13}?(?!.*\2.*\2)\1\K\2/g
aaabcccdeee
-# This currently doesn't match JIT
-
-/|(?0)./endanchored,aftertext
-\= Expect error
- abcd\=no_jit
+/|(?0)./endanchored
+ abcd
/|a(?0)/endanchored
aaaa
-# This currently doesn't match JIT
-
/(?:|(?0).)(?(R)|\z)/
-\= Expect error
- abcd\=no_jit
+ abcd
# End of testinput2
diff --git a/testdata/testoutput1 b/testdata/testoutput1
index 84fe0c6f..645866ce 100644
--- a/testdata/testoutput1
+++ b/testdata/testoutput1
@@ -10495,4 +10495,8 @@ No match
(a)(?(1)a|b|c)
No match
+/^..A(*SKIP)B|C/
+ 12ADC
+ 0: C
+
# End of testinput1
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index e5bf4be9..775d38f4 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -17965,23 +17965,17 @@ No match
1: ccc
2: d
-# This currently doesn't match JIT
-
-/|(?0)./endanchored,aftertext
-\= Expect error
- abcd\=no_jit
-Failed: error -52: nested recursion at the same subject position
+/|(?0)./endanchored
+ abcd
+ 0: abcd
/|a(?0)/endanchored
aaaa
0: aaaa
-# This currently doesn't match JIT
-
/(?:|(?0).)(?(R)|\z)/
-\= Expect error
- abcd\=no_jit
-Failed: error -52: nested recursion at the same subject position
+ abcd
+ 0: abcd
# End of testinput2
Error -70: PCRE2_ERROR_BADDATA (unknown error number)