diff options
author | Philip Hazel <Philip.Hazel@gmail.com> | 2023-11-30 16:05:33 +0000 |
---|---|---|
committer | Philip Hazel <Philip.Hazel@gmail.com> | 2023-11-30 16:05:33 +0000 |
commit | 8e83acc59946c15e3faa8be8617e4fb438370704 (patch) | |
tree | ffe4a8164087570f9867f32e60d0c9e339ba9f55 | |
parent | 86919c90182854b5369bc10d5ad43b637f464f50 (diff) | |
download | pcre-8e83acc59946c15e3faa8be8617e4fb438370704.tar.gz |
Upgrade interpreter to match JIT in handling of nested pattern recursions
-rw-r--r-- | ChangeLog | 8 | ||||
-rw-r--r-- | doc/html/pcre2compat.html | 8 | ||||
-rw-r--r-- | doc/pcre2.txt | 9 | ||||
-rw-r--r-- | doc/pcre2compat.3 | 9 | ||||
-rw-r--r-- | doc/pcre2demo.3 | 2 | ||||
-rw-r--r-- | maint/README | 5 | ||||
-rw-r--r-- | src/pcre2_dfa_match.c | 17 | ||||
-rw-r--r-- | src/pcre2_intmodedep.h | 34 | ||||
-rw-r--r-- | src/pcre2_match.c | 22 | ||||
-rw-r--r-- | testdata/testinput1 | 3 | ||||
-rw-r--r-- | testdata/testinput2 | 12 | ||||
-rw-r--r-- | testdata/testoutput1 | 4 | ||||
-rw-r--r-- | testdata/testoutput2 | 16 |
13 files changed, 87 insertions, 62 deletions
@@ -172,8 +172,12 @@ undefined behaviour. 47. Refactor the handling of whole-pattern recursion (?0) in pcre2_match() so that its end is handled similarly to other recursions. This has altered the -behaviour of /|(?0)./endanchored which was previously not right. However, -it still differs from JIT. +behaviour of /|(?0)./endanchored which was previously not right. + +48. Improved the test for looping recursion by checking the last referenced +character as well as the current character. This allows some patterns that +previously triggered the check to run to completion instead of giving the loop +error. Version 10.42 11-December-2022 diff --git a/doc/html/pcre2compat.html b/doc/html/pcre2compat.html index b3165bbb..d60182ed 100644 --- a/doc/html/pcre2compat.html +++ b/doc/html/pcre2compat.html @@ -246,6 +246,12 @@ to set characters by context just like Perl's "/d". A regular expression using PCRE2_UTF and PCRE2_UCP will use similar rules to Perl's "/u"; something closer to "/a" could be selected by adding other PCRE2_EXTRA_ASCII* options on top. </P> +<P> +22. Some recursive patterns that Perl diagnoses as infinite recursions can be +handled by PCRE2, either by the interpreter or the JIT. An example is +/(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number of repeated +"abcd" substrings at the end of the subject. +</P> <br><b> AUTHOR </b><br> @@ -261,7 +267,7 @@ Cambridge, England. REVISION </b><br> <P> -Last updated: 12 October 2023 +Last updated: 30 November 2023 <br> Copyright © 1997-2023 University of Cambridge. <br> diff --git a/doc/pcre2.txt b/doc/pcre2.txt index 325495a1..bee12b8b 100644 --- a/doc/pcre2.txt +++ b/doc/pcre2.txt @@ -5256,6 +5256,11 @@ DIFFERENCES BETWEEN PCRE2 AND PERL to Perl's "/u"; something closer to "/a" could be selected by adding other PCRE2_EXTRA_ASCII* options on top. + 22. Some recursive patterns that Perl diagnoses as infinite recursions + can be handled by PCRE2, either by the interpreter or the JIT. An exam- + ple is /(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number + of repeated "abcd" substrings at the end of the subject. + AUTHOR @@ -5266,11 +5271,11 @@ AUTHOR REVISION - Last updated: 12 October 2023 + Last updated: 30 November 2023 Copyright (c) 1997-2023 University of Cambridge. -PCRE2 10.43 19 September 2023 PCRE2COMPAT(3) +PCRE2 10.43 30 November 2023 PCRE2COMPAT(3) ------------------------------------------------------------------------------ diff --git a/doc/pcre2compat.3 b/doc/pcre2compat.3 index fa13fc63..8313e03a 100644 --- a/doc/pcre2compat.3 +++ b/doc/pcre2compat.3 @@ -1,4 +1,4 @@ -.TH PCRE2COMPAT 3 "19 September 2023" "PCRE2 10.43" +.TH PCRE2COMPAT 3 "30 November 2023" "PCRE2 10.43" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "DIFFERENCES BETWEEN PCRE2 AND PERL" @@ -210,6 +210,11 @@ fall into any stack-overflow limit. PCRE2 made a similar change at release to set characters by context just like Perl's "/d". A regular expression using PCRE2_UTF and PCRE2_UCP will use similar rules to Perl's "/u"; something closer to "/a" could be selected by adding other PCRE2_EXTRA_ASCII* options on top. +.P +22. Some recursive patterns that Perl diagnoses as infinite recursions can be +handled by PCRE2, either by the interpreter or the JIT. An example is +/(?:|(?0)abcd)(?(R)|\ez)/, which matches a sequence of any number of repeated +"abcd" substrings at the end of the subject. . . .SH AUTHOR @@ -226,6 +231,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 12 October 2023 +Last updated: 30 November 2023 Copyright (c) 1997-2023 University of Cambridge. .fi diff --git a/doc/pcre2demo.3 b/doc/pcre2demo.3 index 87f73cde..b134dda3 100644 --- a/doc/pcre2demo.3 +++ b/doc/pcre2demo.3 @@ -1,4 +1,4 @@ -.TH PCRE2DEMO 3 "24 November 2023" "PCRE2 10.43-DEV" +.TH PCRE2DEMO 3 "30 November 2023" "PCRE2 10.43-DEV" .\"AUTOMATICALLY GENERATED BY PrepareRelease - do not EDIT! .SH NAME // - A demonstration C program for PCRE2 - // diff --git a/maint/README b/maint/README index df4bce18..a5be5961 100644 --- a/maint/README +++ b/maint/README @@ -448,12 +448,9 @@ years. gigabyte memory, but perhaps another implementation might be considered. Needs coordination between the interpreters and JIT. -. There are regular requests for variable-length lookbehinds. An implementation - exists but is missing JIT support. - . See also any suggestions in the GitHub issues. Philip Hazel Email local part: Philip.Hazel Email domain: gmail.com -Last updated: 30 September 2023 +Last updated: 30 November 2023 diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c index 57684073..1c48ad67 100644 --- a/src/pcre2_dfa_match.c +++ b/src/pcre2_dfa_match.c @@ -2913,7 +2913,6 @@ for (;;) int *local_workspace; PCRE2_SIZE *local_offsets; RWS_anchor *rws = (RWS_anchor *)RWS; - dfa_recursion_info *ri; PCRE2_SPTR callpat = start_code + GET(code, 1); uint32_t recno = (callpat == mb->start_code)? 0 : GET2(callpat, 1 + LINK_SIZE); @@ -2930,18 +2929,24 @@ for (;;) rws->free -= RWS_RSIZE + RWS_OVEC_RSIZE; /* Check for repeating a recursion without advancing the subject - pointer. This should catch convoluted mutual recursions. (Some simple - cases are caught at compile time.) */ + pointer or last used character. This should catch convoluted mutual + recursions. (Some simple cases are caught at compile time.) */ - for (ri = mb->recursive; ri != NULL; ri = ri->prevrec) - if (recno == ri->group_num && ptr == ri->subject_position) + for (dfa_recursion_info *ri = mb->recursive; + ri != NULL; + ri = ri->prevrec) + { + if (recno == ri->group_num && ptr == ri->subject_position && + mb->last_used_ptr == ri->last_used_ptr) return PCRE2_ERROR_RECURSELOOP; + } /* Remember this recursion and where we started it so as to catch infinite loops. */ new_recursive.group_num = recno; new_recursive.subject_position = ptr; + new_recursive.last_used_ptr = mb->last_used_ptr; new_recursive.prevrec = mb->recursive; mb->recursive = &new_recursive; @@ -4015,7 +4020,7 @@ for (;;) } match_data->subject_length = length; match_data->leftchar = (PCRE2_SIZE)(mb->start_used_ptr - subject); - match_data->rightchar = (PCRE2_SIZE)( mb->last_used_ptr - subject); + match_data->rightchar = (PCRE2_SIZE)(mb->last_used_ptr - subject); match_data->startchar = (PCRE2_SIZE)(start_match - subject); match_data->rc = rc; diff --git a/src/pcre2_intmodedep.h b/src/pcre2_intmodedep.h index 423764d2..5fcddce5 100644 --- a/src/pcre2_intmodedep.h +++ b/src/pcre2_intmodedep.h @@ -677,8 +677,8 @@ typedef struct pcre2_real_match_data { #ifndef PCRE2_PCRE2TEST -/* Structures for checking for mutual recursion when scanning compiled or -parsed code. */ +/* Structures for checking for mutual function recursion when scanning compiled +or parsed code. */ typedef struct recurse_check { struct recurse_check *prev; @@ -690,7 +690,7 @@ typedef struct parsed_recurse_check { uint32_t *groupptr; } parsed_recurse_check; -/* Structure for building a cache when filling in recursion offsets. */ +/* Structure for building a cache when filling in pattern recursion offsets. */ typedef struct recurse_cache { PCRE2_SPTR group; @@ -757,7 +757,7 @@ typedef struct compile_block { int max_lookbehind; /* Maximum lookbehind encountered (characters) */ BOOL had_accept; /* (*ACCEPT) encountered */ BOOL had_pruneorskip; /* (*PRUNE) or (*SKIP) encountered */ - BOOL had_recurse; /* Had a recursion or subroutine call */ + BOOL had_recurse; /* Had a pattern recursion or subroutine call */ BOOL dupnames; /* Duplicate names exist */ } compile_block; @@ -775,6 +775,7 @@ call within the pattern when running pcre2_dfa_match(). */ typedef struct dfa_recursion_info { struct dfa_recursion_info *prevrec; PCRE2_SPTR subject_position; + PCRE2_SPTR last_used_ptr; uint32_t group_num; } dfa_recursion_info; @@ -795,7 +796,7 @@ typedef struct heapframe { PCRE2_SIZE length; /* Used for character, string, or code lengths */ PCRE2_SIZE back_frame; /* Amount to subtract on RRETURN */ PCRE2_SIZE temp_size; /* Used for short-term PCRE2_SIZE values */ - uint32_t rdepth; /* "Recursion" depth */ + uint32_t rdepth; /* Function "recursion" depth within pcre2_match() */ uint32_t group_frame_type; /* Type information for group frames */ uint32_t temp_32[4]; /* Used for short-term 32-bit or BOOL values */ uint8_t return_id; /* Where to go on in internal "return" */ @@ -828,14 +829,15 @@ typedef struct heapframe { allows for exactly the right size ovector for the number of capturing parentheses. (See also the comment for pcre2_real_match_data above.) */ - PCRE2_SPTR eptr; /* MUST BE FIRST */ - PCRE2_SPTR start_match; /* Can be adjusted by \K */ - PCRE2_SPTR mark; /* Most recent mark on the success path */ - uint32_t current_recurse; /* Current (deepest) recursion number */ - uint32_t capture_last; /* Most recent capture */ - PCRE2_SIZE last_group_offset; /* Saved offset to most recent group frame */ - PCRE2_SIZE offset_top; /* Offset after highest capture */ - PCRE2_SIZE ovector[131072]; /* Must be last in the structure */ + PCRE2_SPTR eptr; /* MUST BE FIRST */ + PCRE2_SPTR start_match; /* Can be adjusted by \K */ + PCRE2_SPTR mark; /* Most recent mark on the success path */ + PCRE2_SPTR recurse_last_used; /* Last character used at time of pattern recursion */ + uint32_t current_recurse; /* Group number of current (deepest) pattern recursion */ + uint32_t capture_last; /* Most recent capture */ + PCRE2_SIZE last_group_offset; /* Saved offset to most recent group frame */ + PCRE2_SIZE offset_top; /* Offset after highest capture */ + PCRE2_SIZE ovector[131072]; /* Must be last in the structure */ } heapframe; /* This typedef is a check that the size of the heapframe structure is a @@ -877,7 +879,7 @@ typedef struct match_block { uint16_t name_count; /* Number of names in name table */ uint16_t name_entry_size; /* Size of entry in names table */ PCRE2_SPTR name_table; /* Table of group names */ - PCRE2_SPTR start_code; /* For use when recursing */ + PCRE2_SPTR start_code; /* For use in pattern recursion */ PCRE2_SPTR start_subject; /* Start of the subject string */ PCRE2_SPTR check_subject; /* Where UTF-checked from */ PCRE2_SPTR end_subject; /* Usable end of the subject string */ @@ -889,7 +891,7 @@ typedef struct match_block { PCRE2_SPTR nomatch_mark; /* Mark pointer to pass back on failure */ PCRE2_SPTR verb_ecode_ptr; /* For passing back info */ PCRE2_SPTR verb_skip_ptr; /* For passing back a (*SKIP) name */ - uint32_t verb_current_recurse; /* Current recurse when (*VERB) happens */ + uint32_t verb_current_recurse; /* Current recursion group when (*VERB) happens */ uint32_t moptions; /* Match options */ uint32_t poptions; /* Pattern options */ uint32_t skip_arg_count; /* For counting SKIP_ARGs */ @@ -929,7 +931,7 @@ typedef struct dfa_match_block { pcre2_callout_block *cb; /* Points to a callout block */ void *callout_data; /* To pass back to callouts */ int (*callout)(pcre2_callout_block *,void *); /* Callout function or NULL */ - dfa_recursion_info *recursive; /* Linked list of recursion data */ + dfa_recursion_info *recursive; /* Linked list of pattern recursion data */ } dfa_match_block; #endif /* PCRE2_PCRE2TEST */ diff --git a/src/pcre2_match.c b/src/pcre2_match.c index 5e9facc5..d162e707 100644 --- a/src/pcre2_match.c +++ b/src/pcre2_match.c @@ -5383,9 +5383,11 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode, /* ===================================================================== */ - /* Recursion either matches the current regex, or some subexpression. The - offset data is the offset to the starting bracket from the start of the - whole pattern. (This is so that it works from duplicated subpatterns.) */ + /* Pattern recursion either matches the current regex, or some + subexpression. The offset data is the offset to the starting bracket from + the start of the whole pattern. This is so that it works from duplicated + subpatterns. For a whole-pattern recursion, we have to infer the number + zero. */ #define Lframe_type F->temp_32[0] #define Lstart_branch F->temp_sptr[0] @@ -5394,9 +5396,10 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode, bracode = mb->start_code + GET(Fecode, 1); number = (bracode == mb->start_code)? 0 : GET2(bracode, 1 + LINK_SIZE); - /* If we are already in a recursion, check for repeating the same one - without advancing the subject pointer. This should catch convoluted mutual - recursions. (Some simple cases are caught at compile time.) */ + /* If we are already in a pattern recursion, check for repeating the same + one without changing the subject pointer or the last referenced character + in the subject. This should catch convoluted mutual recursions. (Some + simple cases are caught at compile time.) */ if (Fcurrent_recurse != RECURSE_UNSET) { @@ -5407,15 +5410,18 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode, P = (heapframe *)((char *)N - frame_size); if (N->group_frame_type == (GF_RECURSE | number)) { - if (Feptr == P->eptr) return PCRE2_ERROR_RECURSELOOP; + if (Feptr == P->eptr && mb->last_used_ptr == P->recurse_last_used) + return PCRE2_ERROR_RECURSELOOP; break; } offset = P->last_group_offset; } } - /* Now run the recursion, branch by branch. */ + /* Remember the current last referenced character and then run the + recursion branch by branch. */ + F->recurse_last_used = mb->last_used_ptr; Lstart_branch = bracode; Lframe_type = GF_RECURSE | number; diff --git a/testdata/testinput1 b/testdata/testinput1 index c0da415e..e7140601 100644 --- a/testdata/testinput1 +++ b/testdata/testinput1 @@ -6648,4 +6648,7 @@ $/x a[]b (a)(?(1)a|b|c) +/^..A(*SKIP)B|C/ + 12ADC + # End of testinput1 diff --git a/testdata/testinput2 b/testdata/testinput2 index bfa088c7..f1d369a0 100644 --- a/testdata/testinput2 +++ b/testdata/testinput2 @@ -6066,19 +6066,13 @@ a)"xI /\G(?:(?=(\1.|)(.))){1,13}?(?!.*\2.*\2)\1\K\2/g aaabcccdeee -# This currently doesn't match JIT - -/|(?0)./endanchored,aftertext -\= Expect error - abcd\=no_jit +/|(?0)./endanchored + abcd /|a(?0)/endanchored aaaa -# This currently doesn't match JIT - /(?:|(?0).)(?(R)|\z)/ -\= Expect error - abcd\=no_jit + abcd # End of testinput2 diff --git a/testdata/testoutput1 b/testdata/testoutput1 index 84fe0c6f..645866ce 100644 --- a/testdata/testoutput1 +++ b/testdata/testoutput1 @@ -10495,4 +10495,8 @@ No match (a)(?(1)a|b|c) No match +/^..A(*SKIP)B|C/ + 12ADC + 0: C + # End of testinput1 diff --git a/testdata/testoutput2 b/testdata/testoutput2 index e5bf4be9..775d38f4 100644 --- a/testdata/testoutput2 +++ b/testdata/testoutput2 @@ -17965,23 +17965,17 @@ No match 1: ccc 2: d -# This currently doesn't match JIT - -/|(?0)./endanchored,aftertext -\= Expect error - abcd\=no_jit -Failed: error -52: nested recursion at the same subject position +/|(?0)./endanchored + abcd + 0: abcd /|a(?0)/endanchored aaaa 0: aaaa -# This currently doesn't match JIT - /(?:|(?0).)(?(R)|\z)/ -\= Expect error - abcd\=no_jit -Failed: error -52: nested recursion at the same subject position + abcd + 0: abcd # End of testinput2 Error -70: PCRE2_ERROR_BADDATA (unknown error number) |