The outgoing license was MIT only. The new dual license also allows
using the code under the Apache-2.0 WITH LLVM-exception license.
|
|
Merge the MTE and non-MTE versions of strcpy and stpcpy since the MTE
versions are faster.
|
|
Merge the MTE and non-MTE versions of strcmp and strncmp since the MTE
versions are faster.
|
|
Add an initial SVE memcpy implementation. Copies of up to 32 bytes use
SVE vectors, which improves the random memcpy benchmark significantly.
|
|
Scripted copyright year updates based on git committer date.
|
|
Add optimized __mtag_tag_zero_region(dst, len) operation to AOR. It tags
the memory according to the tag of the dst pointer, then memsets it to 0
and returns dst. It requires MTE support. The memory remains untagged if
tagging is not enabled for it. dst must be 16-byte aligned and len must
be a multiple of 16.
Similar to __mtag_tag_region, but uses the zeroing instructions.
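The contract above can be sketched in portable C. This is a hypothetical
stand-in, not the AOR implementation: where tagging is not enabled for the
region, the documented behaviour reduces to checking the alignment
preconditions, zeroing the region, and returning dst.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable stand-in illustrating the documented contract of
 * __mtag_tag_zero_region: dst must be 16-byte aligned and len a multiple
 * of 16. Without MTE tagging enabled for the region, the observable
 * effect is zeroing the memory and returning dst. */
static void *tag_zero_region_fallback(void *dst, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0);  /* 16-byte aligned destination */
    assert((len & 15) == 0);             /* length is a multiple of 16  */
    memset(dst, 0, len);                 /* tagging itself needs MTE hw */
    return dst;
}
```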
|
|
Add optimized __mtag_tag_region(dst, len) operation to AOR. It tags the
given memory region according to the tag of the dst pointer and returns
dst. It requires MTE support. The memory remains untagged if tagging is
not enabled for it. dst must be 16-byte aligned and len must be a
multiple of 16.
|
|
Add optimized MTE-compatible strcpy-mte and stpcpy-mte. On various
microarchitectures the speedup over the non-MTE version is 53% on large
strings and 20-60% on small strings.
|
|
Add optimized MTE-compatible memrchr. This walks the input backwards
using the same algorithm as memchr-mte.
|
|
Reading outside the range of the string is only allowed within 16-byte
aligned granules when MTE is enabled.
This implementation is based on string/aarch64/strncmp.S.
Change the case when the strings are misaligned: align the pointers
down, ignore bytes before the start of the string, and carry the part
that is not compared over to the next comparison.
Testing done:
string/test/strncmp.c on big endian, little endian, and with MTE support.
Booted nanodroid with MTE enabled.
Benchmarked on Pixel 4.
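The align-down trick above can be illustrated in portable C. This is a
conceptual sketch, not the assembly: under MTE a load may not stray out of
bounds across a 16-byte granule boundary, but reading the whole granule
that contains the string start is safe. first_nul_in_granule is a
hypothetical helper that aligns the pointer down and then inspects only
the in-bounds bytes.

```c
#include <stdint.h>
#include <stddef.h>

/* Conceptual sketch (portable C, not the assembly): align the pointer
 * down to the 16-byte granule boundary, then ignore the bytes that
 * precede the start of the string. Returns the offset of the first NUL
 * within the string's first granule, or the number of in-bounds bytes
 * inspected if no NUL is found there. */
static size_t first_nul_in_granule(const char *s)
{
    uintptr_t addr = (uintptr_t)s;
    const char *granule = (const char *)(addr & ~(uintptr_t)15);
    size_t skip = addr & 15;           /* bytes before the string start  */
    for (size_t i = skip; i < 16; i++) /* granule + skip == s: in bounds */
        if (granule[i] == '\0')
            return i - skip;
    return 16 - skip;                  /* no NUL in this granule */
}
```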
|
|
Reading outside the range of the string is only allowed within 16-byte
aligned granules when MTE is enabled.
This implementation is based on string/aarch64/strcmp.S.
Change the case when the strings are misaligned: align the pointers
down, ignore bytes before the start of the string, and carry the part
that is not compared over to the next comparison.
Testing done:
optimized-routines/string/test/strcmp.c on big and little endian.
Booted nanodroid with MTE enabled.
bionic string tests with MTE enabled.
Benchmark results:
Ran both bionic and glibc benchmarks on Pixel 4 (Cortex-A76 and
Cortex-A55 cores).
|
|
Reading outside the range of the string is only allowed within
16-byte aligned granules when MTE is enabled.
This implementation is based on string/aarch64/strrchr.S.
Testing done:
optimized-routines/string/test/strrchr.c
Booted nanodroid with MTE enabled.
Bionic string tests with MTE enabled.
Big endian with QEMU: qemu-aarch64_be
|
|
Reading outside the range of the string is only allowed within
16-byte aligned granules when MTE is enabled.
This implementation is based on string/aarch64/strchr-mte.S and
string/aarch64/strchrnul.S
Testing done:
optimized-routines/string/test/strchrnul.c
Booted nanodroid with MTE enabled.
bionic string tests with MTE enabled.
Big endian with QEMU: qemu-aarch64_be
|
|
Reading outside the range of the string is only allowed within 16-byte
aligned granules when MTE is enabled.
This implementation is based on string/aarch64/memchr.S.
The 64-bit syndrome value is changed to contain only 16 bytes of data.
The 32-byte loop is unrolled into two 16-byte reads.
Testing done:
optimized-routines/string/test/memchr.c
Booted nanodroid with MTE enabled.
bionic string tests with MTE enabled.
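The syndrome idea can be modeled in portable C. A sketch, not the NEON
code: the assembly packs per-byte comparison results into a 64-bit
syndrome; this hypothetical helper models it with one bit per byte and
uses count-trailing-zeros to locate the first match (__builtin_ctz
assumes GCC/Clang).

```c
#include <stdint.h>

/* Portable model of the "syndrome" used by memchr-mte (a sketch, not the
 * NEON code): compare each byte of a 16-byte granule against the target,
 * pack the results into a bitmask, then count trailing zeros to find the
 * index of the first matching byte, or -1 if there is none. */
static int first_match_in_granule(const unsigned char *granule, unsigned char c)
{
    uint32_t syndrome = 0;
    for (int i = 0; i < 16; i++)
        if (granule[i] == c)
            syndrome |= 1u << i;       /* bit i set => byte i matches */
    return syndrome ? __builtin_ctz(syndrome) : -1;
}
```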
|
|
Add memcpy benchmark based on size and alignment distribution of SPEC2017.
|
|
Reading outside the range of the string is only allowed within 16-byte
aligned granules when MTE is enabled.
This implementation is based on string/aarch64/strchr.S.
The 64-bit syndrome value is changed to contain only 16 bytes of data.
The 32-byte loop is unrolled into two 16-byte reads.
|
|
Add support for stpcpy on AArch64.
|
|
Reading outside the range of the string is only allowed within 16-byte
aligned granules when MTE is enabled.
This implementation is based on string/aarch64/strlen.S.
Merged the page-cross code into the main path and optimized it.
Modified the zeroones mask to ignore the bytes that are loaded but are
not part of the string. Made a special case for when there are 8 bytes
or fewer to check before the alignment boundary.
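The zeroones mask refers to the classic word-at-a-time NUL detector. A
portable sketch (an illustration, not the strlen.S code; skip_bytes
models ignoring loaded bytes that precede the string start on a
little-endian load, and is assumed to be less than 8):

```c
#include <stdint.h>

/* Sketch of the word-at-a-time NUL check behind the zeroones mask (an
 * illustration, not strlen.S): a 64-bit word contains a zero byte iff
 * (x - 0x0101..01) & ~x & 0x8080..80 is nonzero. Bytes loaded from
 * before the string start are ignored by clearing their detection bits;
 * skip_bytes < 8 counts those leading (low, little-endian) bytes. */
static int word_has_nul(uint64_t x, unsigned skip_bytes)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    uint64_t found = (x - ones) & ~x & highs;       /* per-byte NUL flags */
    if (skip_bytes)                                 /* assume skip_bytes < 8 */
        found &= ~(uint64_t)0 << (8 * skip_bytes);  /* mask leading bytes */
    return found != 0;
}
```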
|
|
This was a placeholder for testing the build system before optimized
string code was added, and is thus no longer needed.
|
|
Add strrchr for AArch64. Originally written by Richard Earnshaw; the
same code is present in newlib. This copy has minor edits for inclusion
into the optimized-routines repo.
|
|
Modify integer and SIMD versions of memcpy to handle overlaps correctly.
Make __memmove_aarch64 and __memmove_aarch64_simd alias to
__memcpy_aarch64 and __memcpy_aarch64_simd respectively.
Complete sharing of code between the memcpy and memmove implementations
is possible without a noticeable performance penalty. This is achieved
by moving the source/destination overlap detection after the code for
handling small and medium copies, which are overlap-safe anyway.
Benchmarking shows that keeping two versions of memcpy is necessary
because newer platforms favor aligning src over destination for large
copies. Using NEON registers also gives a small speedup. However,
aligning dst and using general-purpose registers works best for older
platforms. Consequently, memcpy.S and memcpy_simd.S contain memcpy
code which is identical except for the registers used and src vs dst
alignment.
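The overlap-safety of the small-copy paths can be seen in a portable
sketch (an illustration of the technique, not the memcpy.S code;
copy_8_to_16 is a hypothetical helper for 8 <= n <= 16): loading both
ends of the source before storing anything means an overlapping
destination cannot clobber source bytes that are still needed.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of why small copies need no overlap check (not the memcpy.S
 * code; a hypothetical helper for 8 <= n <= 16): read both ends of the
 * source into temporaries first, then store both ends, so even a fully
 * overlapping destination never clobbers unread source bytes. The two
 * 8-byte spans overlap in the middle when n < 16, which is harmless. */
static void copy_8_to_16(unsigned char *dst, const unsigned char *src, size_t n)
{
    uint64_t head, tail;
    memcpy(&head, src, 8);          /* read everything first ...         */
    memcpy(&tail, src + n - 8, 8);
    memcpy(dst, &head, 8);          /* ... then write, overlap-safe      */
    memcpy(dst + n - 8, &tail, 8);
}
```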
|
|
Create a new memcpy implementation for targets with the NEON extension.
__memcpy_aarch64_simd has been tested on a range of modern
microarchitectures. It turned out to be faster than __memcpy_aarch64 on
all of them, with a performance improvement of 3-11% depending on the
platform.
|
|
The only difference is changing the symbol name from strrchr
to __strrchr_aarch64_sve.
|
|
The only difference is changing the symbol name from strnlen
to __strnlen_aarch64_sve.
|
|
The only difference is changing the symbol name from strncmp
to __strncmp_aarch64_sve.
|
|
The only difference is changing the symbol name from strlen
to __strlen_aarch64_sve.
|
|
The only difference is changing the symbol name from strcpy
to __strcpy_aarch64_sve.
|
|
The only difference is changing the symbol name from strcmp
to __strcmp_aarch64_sve.
|
|
The only difference is changing the symbol name from strchr/strchrnul
to __strchr_aarch64_sve and __strchrnul_aarch64_sve.
|
|
The only difference is changing the symbol name from memcmp
to __memcmp_aarch64_sve.
|
|
The only difference is changing the symbol name from memchr
to __memchr_aarch64_sve.
|
|
The only difference is changing the symbol name from strlen
to __strlen_armv6t2.
|
|
The only difference is changing the symbol name from strcmp
to __strcmp_armv6m.
|
|
The only difference is changing the symbol name from strcmp
to __strcmp_arm.
|
|
The differences from cortex-strings are:
- Simplified the thumb-2/thumb selection by removing the use of
PREFER_SIZE_OVER_SPEED and __OPTIMIZE_SIZE__.
- Removed the naive byte-by-byte loops.
|
|
The only difference is changing the symbol name from memchr
to __memchr_arm and the final .size directive.
|
|
The only difference is changing the symbol name from memset
to __memset_arm and the final .size directive.
|
|
The only difference is changing the symbol name from memcpy
to __memcpy_arm.
|
|
The only difference is changing the symbol name from strncmp
to __strncmp_aarch64.
|
|
The only difference is changing the symbol name from strnlen
to __strnlen_aarch64.
|
|
The only difference is changing the symbol name from strlen
to __strlen_aarch64.
|
|
The only difference is changing the symbol name from strchrnul
to __strchrnul_aarch64.
|
|
The only difference is changing the symbol name from strchr
to __strchr_aarch64.
|
|
The only difference is changing the symbol name from strcmp
to __strcmp_aarch64.
|
|
The only difference is changing the symbol name from strcpy
to __strcpy_aarch64.
|
|
The only difference is changing the symbol name from memcmp
to __memcmp_aarch64.
|
|
The only difference is changing the symbol name from memchr
to __memchr_aarch64.
|
|
The only difference is changing the symbol name from memset
to __memset_aarch64.
|
|
The only difference is changing the symbol name from memmove
to __memmove_aarch64 and the memcpy branch to __memcpy_aarch64.
|
|
The only difference is changing the symbol name from memcpy
to __memcpy_aarch64.
|