This is a partial revert of b7e368fb. If SVE assembly is guarded by
__ARM_FEATURE_SVE, it cannot build when SVE is not enabled by the build
system. This is ok on AOR, but because Android (bionic) uses ifuncs to
select the appropriate assembly at runtime, these need to compile
regardless of whether the target actually supports the instructions.
Check for AArch64 and GCC >= 8 or Clang >= 5 so that SVE is not used on
compilers that do not support it. This condition will always be true on
future builds of Android for AArch64.
|
|
Optimize strcpy main loop - large strings are ~22% faster.
|
|
Use shrn for narrowing the mask which simplifies code. Unroll the
strchr search loop which improves performance on large strings.
|
|
The branch out of the core memchr loop to label 60 jumps over the
popping of registers r4-r7. The restoration of the cfi state at 60 is
adjusted to reflect this fact, avoiding restoring a state where r4-r7
have already been popped off the stack.
Built w/ arm-none-linux-gnueabihf, ran make check-string w/ qemu-arm-static.
|
|
Move code fragment corresponding to L(fastpath_exit) to after function
entry so that a .cfi_remember_state/.cfi_restore_state pair are not
needed prior to strcmp start.
The resulting reshuffle of code cleans up the entry part, fixing the
.size directive calculation, which at present calculates the function
size based on the address of __strcmp_arm and not L(strcmp_start_addr).
|
|
The .fnstart/.fnend directives can be inlined now that asmdefs.h is
arm specific.
|
|
This is preprocessed asm code, so /**/ style comments are most
appropriate.
|
|
Currently this is not expected to change behaviour, but if global
directives are added in asmdefs.h (like .thumb) those should be in
all asm files in case the link ABI is affected.
|
|
The definitions in this header are necessarily target specific, so
better to have a separate version in each target directory.
|
|
asmdefs.h ifdef logic was wrong: arm-only macro definitions were
outside of defined(__arm__).
Also indented the ifdefs to make the code more readable.
|
|
This patch adds options to the prologue and epilogue assembler macros
for enforcing stack alignment and for pushing/popping the lr register,
while making the pushing of the ip register optional for PACBTI.
Furthermore, as the use of these macros is independent of PACBTI and
may be used on architectures without the feature, the macros are moved
to a common header.
Improvements are also made to cfi handling. Where absolute cfi offset
calculation is complicated by optional function prologue
parameters (e.g. the pushing of pac-codes to the stack for M-profile
pacbti on function entry and pushing of dummy register when alignment
required), replace .cfi_offset with .cfi_rel_offset, simplifying
cfi calculations by basing offsets on SP rather than the CFA.
Finally, extensive in-source documentation is added to these macros to
facilitate their use and further development.
Built w/ arm-none-linux-gnueabihf, ran make check-string w/ qemu-arm-static.
|
|
Optimize the main loop - large strings are 40% faster.
|
|
Optimize the main loop - large strings are 43% faster.
|
|
Simplify calculation of the mask using shrn. Unroll the main loop.
Small strings are 20% faster.
|
|
Unroll the main loop, which gives a small gain.
|
|
Use shrn for the mask, merge tst+bne into cbnz, tweak code alignment.
The random strlen test improves by 2%.
|
|
Optimize strlen by unrolling the main loop. Large strings are 64% faster.
|
|
Optimize strnlen using the shrn instruction and improve the main loop.
Small strings are 10% faster, large strings are 40% faster.
|
|
The use of the PAC_CFI_ADJ macro for calculating the effect of pushing
the IP register onto the stack assumes that pushing the register is
always optional and is always suppressed when PAC_LEAF_PUSH_IP is set
to 0. This leads to CFI alignment issues for functions where the IP
register is clobbered and thus where IP is always pushed to the stack
in the function prologue.
This patch introduces a new macro PAC_CFI_ADJ_DEFAULT whose value is
never zeroed when PAC signing is requested, irrespective of the
PAC_LEAF_PUSH_IP settings.
Example:
* HAVE_PAC_LEAF == 1 && PAC_LEAF_PUSH_IP == 1:
PAC_CFI_ADJ = 4
PAC_CFI_ADJ_DEFAULT = 4
* HAVE_PAC_LEAF == 1 && PAC_LEAF_PUSH_IP == 0:
PAC_CFI_ADJ = 0
PAC_CFI_ADJ_DEFAULT = 4
Built w/ arm-none-linux-gnueabihf, ran make check-string w/ qemu-arm-static.
|
|
Modify previously defined PACBTI macros to allow for
more flexible push/pop expressions at function prologues/epilogues,
allowing further simplification of code predicated on the use of
M-profile PACBTI hardware features.
This patch also allows for the specification of whether generated pac
keys are pushed onto the stack for leaf functions where this may not
be necessary.
It defines the following preprocessor macros:
* HAVE_PAC_LEAF: Indicates whether pac-signing has been requested for
leaf functions.
* PAC_LEAF_PUSH_IP: Whether leaf functions should push the pac code
to the stack irrespective of whether the ip register is clobbered in
the function or not.
* PAC_CFI_ADJ: Given values for the above two parameters, this
holds the calculated offset applied to default CFI address/offset
values as a consequence of potentially pushing the pac-code to the
stack.
It also defines the following assembler macros:
* prologue: In addition to pushing any callee-saved registers onto
the stack, it generates any requested pacbti instructions.
Pushed registers are specified via the optional `first', `last' and
`savepac' macro argument parameters.
When a single register number is provided, it pushes that
register. When two register numbers are provided, they specify a
range to save. If savepac is non-zero, the ip register is also
saved.
For example:
prologue savepac=1 -> push {ip}
prologue 1 -> push {r1}
prologue 1 savepac=1 -> push {r1, ip}
prologue 1 4 -> push {r1-r4}
prologue 1 4 savepac=1 -> push {r1-r4, ip}
* epilogue: pops registers off the stack and emits the pac
verification instruction if requested. The optional `first', `last'
and `savepac' arguments function as per the prologue macro,
generating pop instead of push instructions.
* cfisavelist - prologue macro helper, generating the
necessary .cfi_offset directives associated with the push instruction.
Therefore, the net effect of calling `prologue 1 2 savepac=1' is
to generate the following:
push {r1-r2, ip}
.cfi_adjust_cfa_offset 12
.cfi_offset 143, -12
.cfi_offset 2, -8
.cfi_offset 1, -4
* cfirestorelist - epilogue macro helper function, emitting
.cfi_restore instructions prior to resetting the cfa offset. As
such, calling `epilogue 1 2 savepac=1' will produce:
pop {r1-r2, ip}
.cfi_restore 143
.cfi_restore 2
.cfi_restore 1
.cfi_def_cfa_offset 0
|
|
Since leaf functions cannot throw exceptions (EHABI only supports
synchronous exceptions), add support for emitting a `.cantunwind'
directive prior to `.fnend' in the ARM_FNEND preprocessor macro.
This ensures no personality routine or exception table data is
generated. Existing `.save' directives used in leaf functions are also
removed.
Built w/ arm-none-linux-gnueabihf, ran make check-string w/ qemu-arm-static.
|
|
Add the `.cfi_register 143, 12' directive immediately after the pac
instruction is emitted.
This ensures unwind info consumers know immediately that, if they
need the PAC for the function, they can find it in the ip register.
|
|
Move away from the non-portable __ARM_ARCH_8M_MAIN__ feature test
macro in favour of __ARM_ARCH >= 8 for target architecture
selection.
|
|
Adjust the criterion for M-profile PACBTI signing of leaf functions
to be contingent on the +leaf option being passed to the
-mbranch-protection compilation option.
|
|
Optimize
__memchr_aarch64_mte
__memrchr_aarch64
__strchrnul_aarch64_mte
__stpcpy_aarch64
__strcpy_aarch64
__strlen_aarch64_mte
using the shrn instruction for computing the nibble mask instead of
and + addp, which reduces instruction count.
|
|
Merge stack pop instructions prior to returning from function. This also
introduces fixes to CFI offset calculations to reflect the register
ordering on push and pop instructions, with the lowest-numbered register
saved to the lowest memory address.
|
|
Merge stack pop instructions prior to returning from function. This also
introduces fixes to CFI offset calculations to reflect the register
ordering on push and pop instructions, with the lowest-numbered register
saved to the lowest memory address.
|
|
Fix build failure introduced by
commit 40b662ce7b65d5eaefa40fd8046d6f3c6b3238c1
string: add .fnstart and .fnend directives to ENTRY/END macros
|
|
Ensure BTI indirect branch landing pads (BTI) and pointer authentication
code generation (PAC) and verification instructions (BXAUT) are
conditionally added to assembly when branch protection is requested.
|
|
Ensure BTI indirect branch landing pads (BTI) and pointer authentication
code generation (PAC) and verification instructions (BXAUT) are
conditionally added to assembly when branch protection is requested.
|
|
Ensure BTI indirect branch landing pads (BTI) and pointer authentication
code generation (PAC) and verification instructions (BXAUT) are
conditionally added to assembly when branch protection is requested.
NOTE: the ENTRY_ALIGN() macro is factored out, as the .fnstart and
.cfi_startproc directives needed to be moved to before L(fastpath_exit).
|
|
Header adds assembler macros to handle Pointer Authentication and
Branch Target Identification assembly instructions in function
prologues and epilogues according to flags selected at compile-time.
|
|
Modify the ENTRY_ALIGN and END assembler macros to mark the start and
end of functions for arm unwind tables.
Enables the pacbti epilogue function to emit .save{} directives for
stack unwinding.
|
|
Fix missing include directive for use of ENTRY_ALIGN and END macros.
|
|
Remove unnecessary sys/mman.h dependency.
|
|
Document contributor requirements.
|
|
The outgoing license was MIT only. The new dual license allows
using the code under Apache-2.0 WITH LLVM-exception license too.
|
|
Merge the MTE and non-MTE versions of strcpy and stpcpy since the MTE
versions are faster.
|
|
Merge the MTE and non-MTE versions of strcmp and strncmp since the MTE
versions are faster.
|
|
Add an initial SVE memcpy implementation. Copies up to 32 bytes use SVE
vectors which improves the random memcpy benchmark significantly.
|
|
Rewrite memcmp to improve performance. On small and medium inputs
performance is typically 25% better. Large inputs use a SIMD loop
processing 64 bytes per iteration, which is 50% faster than the
previous version.
|
|
Improve memcpy benchmark. Double the number of random tests and the memory size.
Add separate tests using a direct call to memcpy to compare with an indirect call to
GLIBC memcpy. Add a test for small aligned and unaligned memcpy.
|
|
Increase the number of iterations of the random test. Minor code cleanup.
|
|
Add a randomized memset benchmark using string length and alignment distribution
based on SPEC2017.
|
|
Scripted copyright year updates based on git committer date.
|
|
Add optimized __mtag_tag_zero_region(dst, len) operation to AOR. It tags
the memory according to the tag of the dst pointer then memsets it to 0
and returns dst. It requires MTE support. The memory remains untagged if
tagging is not enabled for it. The dst must be 16-byte aligned and
len must be a multiple of 16.
Similar to __mtag_tag_region, but uses the zeroing instructions.
|
|
Add optimized __mtag_tag_region(dst, len) operation to AOR. It tags the
given memory region according to the tag of the dst pointer and returns
dst. It requires MTE support. The memory remains untagged if tagging is
not enabled for it. The dst must be 16-byte aligned and len must be
a multiple of 16.
|
|
Cleanup spurious .text and .arch. Use ENTRY rather than ENTRY_ALIGN.
|
|
The error report was copied from the seekchar test above,
and needs adjustment to match the gating IF.
|
|
There were nops before the beginning of the function to place
the main loop on a 64-byte boundary, but the addition of BTI
and instructions for ILP32 has corrupted that.
As per review, drop 64-byte alignment entirely, and use the
default 16-byte alignment from ENTRY.
|