path: root/win-aarch64
author     Ellen Arteca <emarteca@google.com>  2024-04-30 18:52:22 +0000
committer  Ellen Arteca <emarteca@google.com>  2024-05-14 17:49:03 +0000
commit     a0eaaf6c26907580f4232bc573ede3df9f706658 (patch)
tree       3a46bcc2911e3087ae06ede7a7fb74a8b80faf67 /win-aarch64
parent     41be05a790e708f9b1f7841f791b3bb73b434136 (diff)
download   boringssl-a0eaaf6c26907580f4232bc573ede3df9f706658.tar.gz
external/boringssl: Sync to 4d50a595b49a2e7b7017060a4d402c4ee9fe28a2.
This includes the following changes:
https://boringssl.googlesource.com/boringssl/+log/538b2a6cf0497cf8bb61ae726a484a3d7a34e54e..4d50a595b49a2e7b7017060a4d402c4ee9fe28a2

* Make googletest a full dependency, not a dev_dependency
* Rename function pointers to avoid shadowing global declaration
* Don't add extra 'informational' errors in the delegate
* Remove remnants of C++ runtime workarounds
* Add a standalone Bazel build
* Reset DTLS1_BITMAP without resorting to memset
* Add an OUT_DIR option for finding bindgen output for Android
  Update-Note: When this rolls into Android, remove the sed logic from Android.bp and instead set up the OUT_DIR cargo emulation.
* Discuss pointer rules in even more detail in API-CONVENTIONS
* short-circuit verification on invalid SPKI
* Add certificates to a couple of tests
* Change unsupported KEM identifier
* Add a CLIENT_AUTH_STRICT_LEAF and SERVER_AUTH_STRICT_LEAF which do STRICT requirements on the leaf certificate, and not STRICT on the rest of the chain.
* Make SSL_CTX_set_keylog_callback constant time
* clarify a few tests
* Add some tests for SSL_CTX_set_keylog_callback
* Switch some pointer arithmetic to spans
* Disable fork detection for Zephyr and CrOS EC
* Enable thread support for Zephyr RTOS
* Fix Zephyr define and description
* Remove unnecessary NULL checks
* Avoid strdup in crypto/err/err.c
* Increase DTLS window size from 64 to 256
* delocate: handle more SVE2 forms.
* Disable `-Wcast-function-type-strict` for `BORINGSSL_DEFINE_STACK_OF_IMPL.`
* Set service indicator for TLS 1.3 KDF.
* Rewrite RAND_enable_fork_unsafe_buffering documentation
* Document that our Unicode APIs reject noncharacters
* Add missing public header for libpki
* Switch EVP_CIPHERs to C99 initializers
* Add a PrivacyInfo plist file
* Make Go an optional build dependency for the CMake build
* Install the Windows toolchain under util/bot
* Reflect latest FIPS updates, including 186-5.
* Update CI build tools
* [rust] Tell Cargo to link cpp runtime library
* Update run_android_tests to exit on invalid ABI
* Move fips_fragments into bcm.internal_hdrs in build.json
* Move internal headers to build.json
* Flatten crypto/CMakeLists.txt into the top-level
* Move crypto_sources to build.json
* Specify public headers in build.json
* Rework the test data story
  Update-Note: This will require some tweaks to downstream builds. We no longer emit an (unwieldy) crypto_test_data.cc file. Instead, tests will expect test data be available at the current working directory. This can be overridden with the BORINGSSL_TEST_DATA_ROOT environment variable. (A sketch of this lookup follows after this change list.)
* Move the rest of sources.cmake into util/pregenerate
* Use source lists to find pki_test data in run_android_tests.go
* Move test data lists to util/pregenerate
* Support glob patterns in build.json
* Correctly sort err_data.c inputs
* Regenerate err_data.c
* Check in pre-generated perlasm and error data files
  Update-Note: generate_build_files.py no longer generates assembly files or err_data.c. Those are now checked into the tree directly.
* Flatten crypto/fipsmodule/CMakeLists.txt up a layer
* Document that null STACK_OF(T) can be used with several functions
* Remove unused flags argument from trust handlers
* Build fips_shared_support.c as part of libcrypto
* Make it plainly obvious this is experimental code.
* Add some barebones support for DH in EVP
* Add verify_errors as public error API
* Fix EVP_PKEY_CTX_dup with EC generation
* Start making asserts constant-time too
* Clear some more false positives from constant-time validation
* Fix X509_ALGOR_set_md()
* Trim unused files from PKI_TEST_DATA
* Remove unnecessary LINKER_LANGUAGE setting in CMake build
* Move ssl and decrepit sources to sources.cmake
* Add threading documentation to DH and DSA
* Make EVP_PKEY_type into the identity function
  Update-Note: EVP_PKEY_type used to return NID_undef when given a garbage key type. Given it is only ever used in concert with EVP_PKEY_id, this is unlikely to impact anyone. If it does, we can do the more tedious option.
* Move EVP_PKEY setters to their corresponding type-specific files
* Avoid EVP_PKEY_set_type in EVP_PKEY_new_raw_*_key
* Remove some unnecessary dependencies on EVP_PKEY_set_type
* Gate -Wframe-larger-than on Clang 13
* Make ninja run_tests output less confusing
* X509_ALGOR_set_md is a mess, document it
* Filter out DW.ref.__gxx_personality_v0 in read_symbols.go
* Remove unused app_data from EVP_CIPHER
* Re-remove unnecesary stat calls from by_dir.c
* Add a regression test for error handling and hash_dir
* Fix spelling of Identifier
* Revert "Remove unnecessary stat calls from by_dir.c"
* Don't dereference hs->credential on TLS 1.2 PSK ciphers
* Add ERR_lib_symbol_name and ERR_reason_symbol_name
* Add BIO_FP_TEXT
* Fix a number of cases overwriting certificates, keys, etc. with SSL_CREDENTIAL
* Set -Wframe-larger-than=25344 for a typical cmake clang compile.
* Make crypto_test build with -Wframe-larger-than=25344
* Revert "Add a Dilithium implementation."
* Fix sha1 dynamic dispatch issues.
* Remove an unused runner/shim flag in SSL tests
* Only negotiate ECDHE curves and sigalgs once
* Add an SSL_CREDENTIAL API for ECDSA/RSA and delegated credentials
  Update-Note: The delegated credentials API has been revamped. Previously, it worked by configuring an optional delegated credential and key with your normal certificate chain. This has the side effect of forcing your DC issuer and your fallback certificate to be the same. The SSL_CREDENTIAL API lifts this restriction.
* Rename CRYPTO_get_ex_new_index to CRYPTO_get_ex_new_index_ex
* Remove unused group_id parameter in TLS 1.3 cipher suite selection
* Check ECDSA curves in TLS 1.2 servers
  Update-Note: A TLS 1.2 (or below) server, using an ECDSA certificate, connecting to a client which doesn't advertise its ECDSA curve will now fail the connection slightly earlier, rather than sending the certificate and waiting for the client to reject it. The connection should fail either way, but now it will fail earlier with SSL_R_WRONG_CURVE. If the client was buggy and did not correctly advertise its own capabilities, this may cause a connection to fail despite previously succeeding. We have included a temporary API, SSL_set_check_ecdsa_curve, to disable this behavior in the event this has any impact, but please contact the BoringSSL team if you need it, as it will interfere with improvements down the line.
* Inline CBS_init, CBS_data, and CBS_len
* Check client certificate types in TLS <= 1.2
  Update-Note: A TLS 1.2 (or below) client, using client certificates, connecting to a TLS server which doesn't support its certificate type will now fail the connection slightly earlier, rather than sending the certificate and waiting for the server to reject it. The connection should fail either way, but now it will fail earlier with SSL_R_UNKNOWN_CERTIFICATE_TYPE. If the server was buggy and did not correctly advertise its own capabilities (very very unlikely), this may cause a connection to fail despite previously succeeding. We have included a temporary API, SSL_set_check_client_certificate_type, to disable this behavior in the unlikely event this has any impact, but please contact the BoringSSL team if you need it, as it will interfere with improvements down the line.
* runner: Add a test for hint mismatch due to public key
* Add a Dilithium implementation.
* Tidy up Rust HPKE binding.
* Move spx from internal to include/openssl/experimental
* runner: Configure all relevant fields from the Credential type
* runner: Rename CertificateChain to Credential
* Align CRYPTO_get_ex_new_index with the public API's calling convention
* Make bssl_shim's setup logic infallible
* Slightly simplify ssl_x509.cc
* Forbid RSA delegated credentials
* Fix delegated credential signature algorithm handling
* Make DelegatedCredentials-KeyMismatch test less confusing
* Use slices.Contains in ssl/test/runner
* Fold ssl_add_cert_chain into its caller
* runner: Remove the ability to configure multiple certificates
* runner: Use go:embed
* Generate certs on the fly in runner, pass trusted cert to shim
* Make pki_sources available to Soong
* Finish documenting x509.h
* Add safety coments to bssl-sys
* Test X509_verify_cert with CAs that share a name
* Document the remaining struct types in x509.h
* Expand and document the remaining DECLARE_ASN1_* macros
* Unexport i2d, d2i, and ASN1_ITEM for X.509 interior types
  Update-Note: Some interior ASN.1 types no longer have d2i and i2d functions or ASN1_ITEMs. I checked code search and no one was using any of these. We can restore them as needed.
* Document filesystem-based X509_STORE APIs
* Document APIs relating to built-in and custom extensions
* Add tests for what happens when no certificate is configured
* Introduce a test helper for asserting on the error
* Make an include/openssl/experimental. Move kyber to it for now.
  Update-Note: <openssl/kyber.h> has moved to <openssl/experimental/kyber.h>
* Deprecate and simplify SSL_CTX_check_private_key
* Use a more fine-grained lock in by_dir.c
* Remove unnecessary stat calls from by_dir.c
* Use std::copy instead of OPENSSL_memcpy for the internal bssl::Array::CopyFrom
* Consistently open files in binary mode on Windows
  Update-Note: BIO_read_filename, etc., now open in binary mode on Windows. This matches OpenSSL behavior.
* Add some tests for X509_LOOKUP_hash_dir
* Add some utilities for testing temporary files
* Remove redundant piece of DC state
* Test an unusual split between context and connection configuration
* Remove redundant bssl_sys import
* Remove some impossible null checks
* Remove some indirection in SSL_certs_clear
* Make an internal RefCounted base class for libssl
* Const-correct the 'kstr' parameter of PEM functions
* Implement Hybrid Public Key Encryption in Rust.
* Use BIO_TYPE_* constants for flags
* Move capability checks in sha256-586.pl to C
* Integrate TLS 1.2 sigalg and cipher suite selection
  Update-Note: TLS 1.2 servers will now consider RSA key exchange when the signature algorithm portion of ECDHE_RSA fails. Previously, the connection would just fail. This change will not impact any connections that previously succeeded, only make some previously failing connections start to succeed. It also changes the error returned in some cases from NO_COMMON_SIGNATURE_ALGORITHMS to NO_SHARED_CIPHER.
* Remove old "check for P4" in sha256-586.pl
* Document some miscellaneous x509.h functions
* Move capability checks in sha1-586.pl to C
* Write down the bounds for the sha*_block_data_order functions
* Move capability checks in chacha-x86.pl to C
* Remove OPENSSL_IA32_SSE2 checks in x86 perlasm
* Update delegated credentials to the final RFC
* Don't report libpki headers as part of libcrypto
  Update-Note: Downstream Bazel and GN builds that build libpki may need to also list the pki_headers variable.
* Add a no-op OPENSSL_INIT_NO_ATEXIT
* bssl-crypto: remove unused code.
* Add x509.h to doc.config
* Unexport DIST_POINT_set_dpname
* Allow a C++ runtime dependency in libssl
  Update-Note: libssl now requires a C++ runtime, in addition to the pre-existing C++ requirement. Contact the BoringSSL team if this causes an issue. Some projects may need to switch the final link to use a C++ linker rather than a C linker.
* Rewrite the warning about X509_AUX
* Remove pki/tag.h
  Update-Note: pki/tag.h no longer exists. Use CBS_ASN1_TAG instead of bssl::der::Tag and CBS_ASN1_* instead of bssl::der::k*.
* Work around bindgen bug around constants
* Guard C++ headers.
* Include verify_unittest files in PKI_TEST_DATA
* Switch to bindgen's static inline support
  Update-Note: Rust support now requires your build correctly handle --wrap-static-fns. On Android, you will need to enable the unsupported_inline_wrappers cfg option until b/290347127 is fixed. Chromium doesn't actually use any of the inline functions yet, so we can handle --wrap-static-fns asynchronously, but I have a CL ready to enable that.
* Document X509_V_FLAG_*
* Merge X509_PURPOSE/X509_TRUST IDs and indices
* Unexport most of X509_TRUST and X509_PURPOSE and simplify
* Remove X509_TRUST_DEFAULT
* Add X509_STORE_get1_objects
* Mark ASN1_STRFLAGS_* and XN_FLAG_* with the right type
* Remove unused include in now public header
* Move signature_verify_cache.h to openssl/pki as public api
* Make ContainsError look only for Errors, not Warnings.
* Don't assume that Fiat assembly is available on Windows.
* Add public API for a certificate.
* Allow the delegate to indicate it wishes to accept PreCertificates when building chains.
* Use uint64_t for num_read and num_write in BIO
* Add functions to convert from Span<const uint8> and std::string_view
* Minor formatting fixes
* Expose OPENSSL_timegm in posix_time.h
* Add SSL_get0_chain method
* Tighten up the warning about RSAES-PKCS1-v1_5
* Avoid conversion overflow from struct tm.
* Ensure additions in this call can't overflow.
* Create a new NameConstraints constructor that takes in an already constructed GeneralNames object for permitted names.
* Fix strict aliasing issues with DES_cblock
* Require SSE2 when targetting 32-bit x86
  Update-Note: Building for 32-bit x86 may require fixing your builds to pass -msse2 to the compiler. This will also speed up the rest of the code in your project. If your project needs to support the Pentium III, please contact BoringSSL maintainers.
* Remove unused files from pki
* Move NEON dispatch in bn_mul_mont to C
* Rewrite bn_big_endian_to_words to avoid a GCC false positive
* Enable SSE2 intrinsics on MSVC
* Rename <openssl/time.h> to <openssl/posix_time.h>
  Update-Note: <openssl/time.h> has moved to <openssl/posix_time.h>
* Tweak generate_build_files.py output to pass gn's formatter
* Remove remnants of the old Android CMake toolchain
* bn: Move ia32cap_P references from x86_64-mont.pl to C.
* Stop generating unused assembly for 32-bit iOS
* Fix SHA ABI tests
* sha: Move Armv7 dispatching to C (reland)
* bn: Move x86-64 argument-based dispatching of bn_mul_mont to C.
* Import upstream's tests for DES_ede3_cfb_encrypt
* Move single-use macros from internal.h to des.c
* Unexport uint32_t-based DES APIs
* Import upstream tests for CVE-2024-0727
* aes gcm: Remove Atom Silvermont optimizations.
* Arrange other X509_STORE, etc. symbols into sections
* Simplify purpose checks
* Stop processing the Netscape cert type extension
  Update-Note: Certificates with a critical Netscape cert type extension will now be rejected by the certificate verifier, matching the behavior of the Chromium verifier. Non-critical extensions will continue to work fine. They will instead be ignored.
* Remove X509_STORE_CTX_purpose_inherit
* Document and test X509_PURPOSE and X509_TRUST machinery
* Fix threads detection for CROS_EC/CROS_ZEPHYR
* Stop passing der::Input by const-ref
* Make der::Input a little closer to Span
* Remove pki/patches
* Document assumptions made by bssl-crypto's unboxed HMAC_CTX
* delocate: update to handle SVE2
* Use four-iterator std::equal for bssl::Span::operator==
* Avoid unions in CCM
* Reworking bssl_crypto: don't use zero keys in examples.
* Fix AES-GCM-SIV with huge inputs on 32-bit.
* Reworking bssl_crypto: support AES-GCM-SIV open_gather.
* Reworking bssl_crypto: bump version and fix license.
* Reworking bssl_crypto: Sync+Send for ECC and RSA.
* Reworking bssl_crypto: tidy up module list.
* Reworking bssl_crypto: Add RSA support
* Reworking bssl_crypto: Ed25519
* Reworking bssl_crypto: add ECDSA support
* Reworking bssl_crypto: rand
* Reworking bssl_crypto: ECDH
* Reworking bssl_crypto: make with_output_array_fallible use a bool.
* Reworking bssl_crypto: AES
* Reworking bssl_crypto: AEAD
* Clarify that X509_NAME_hash(_old) are specific to hash-dir
* Reduce the BER conversion recursion depth
* Fix a bug detecting BER deeply nested inside DER
* Replace CONF's internal representation with something more typesafe
* Elaborate a bit on static vs dynamic EC_GROUPs in documentation
* Have generate_build_files.py output Rust sources.
* Make the debug vs release build note in BUILDING.md more prominent
* Simplify Montgomery RR precomputation.
* Update build tools on CI
* Disable the __SHA__ static check for now
* Update Go dependencies
* Clear some false positives in constant-time validation
* Fix segfault if CRYPTO_set_thread_local fails and calls rand_thread_state_free.
* Move CRL_REASON_* back to x509v3.h
* Reworking bssl_crypto: HMAC
* Reworking bssl_crypto: HKDF
* Reworking bssl_crypto: imports_granularity = "Crate"
* Reworking bssl_crypto: digest
* Reworking bssl_crypto: x25519
* Revert "sha: Move Armv7 dispatching to C"
* acvp: test with internal nonce generation.
* chacha: Move x86-64 CPU & length dispatching from assembly to C.
* Do not condition CRYPTO_is_RDRAND_capable on __RDRND__
* Fix PKI test data list in sources.cmake
* Remove all -1 returns from X509_check_purpose
* Add some more TSan tests for crypto/x509
* Don't define OPENSSL_LINUX for CROS_EC and CROS_ZEPHYR
* Remove X509_TRUST_OCSP_SIGN and X509_TRUST_OCSP_REQUEST
* Remove X509_{PURPOSE,TRUST}_{MIN,MAX}
* Some miscellaneous openssl/x509.h documentation fixes
* Const-correct a bunch of X509_STORE_CTX functions
* Move some deprecated X.509 functions into the deprecated section
* Const-correct X509_alias_get0 and X509_keyid_get0
  Update-Note: The above functions are now const-correct. Store the result in a const pointer to avoid compatibility issues.
* Add a missing error check for sk_X509_push
* Fix error-handling convention in x509_vfy.c and avoid -1 returns
  Update-Note: X509_verify_cert no longer returns -1 on some error conditions, only zero.
* Forbid unusual return values out of verify_cb
  Update-Note: If the verify callback returns anything other than 0 or 1, X509_verify_cert will now crash in BSSL_CHECK. If this happens, fix the callback to use the correct return value.
* get_issuer can never return -1
* Make X509_V_FLAG_NOTIFY_POLICY into a no-op
  Update-Note: X509_V_FLAG_NOTIFY_POLICY is now a no-op. This is not expected to impact anyone.
* Remove X509_STORE_CTX_get0_current_issuer
  Update-Note: Removed an unused function.
  Change-Id: I545e654d6c8f0a7973636217f3da27d05c0ef831
  Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/65068
  Commit-Queue: David Benjamin <davidben@google.com>
  Reviewed-by: Bob Beck <bbe@google.com>
* Test the X509_V_ERR_UNABLE_TO_VERIFY_LEAF_SIGNATURE codepath
* Remove remnants of Netscape Server Gated Crypto from the new verifier
  Update-Note: SHA-1 certificates with the Netscape SGC OID will no longer skip their EKU check in the new verifier. By default, SHA-1 certificates are rejected, in which case this only impacts error reporting, not which certificates are ultimately accepted.
* Make configure_callback in x509_test.cc take the X509_STORE_CTX
* Use X509_get0_pubkey to simplify things slightly
* Eagerly compute the cached EVP_PKEY in X509_PUBKEY
* Test signature verification in X509_verify_cert
* Fix X509_PUBKEY_set0_param to clear the cached EVP_PKEY
* Do a better job testing expiration checks
* Allow for the path builder to limit the number of valid paths.
* Warn more explicitly not to use the callback in SSL_set_verify
* Simplify some logic around X509_verify_cert callbacks
* Remove X509_STORE_set_get_issuer
  Update-Note: Removed a handful of unused functions.
* chacha: Move 32-bit Arm CPU dispatch from assembly to C
* chacha: Move ARMv8 OPENSSL_armcap_P dispatching from assembly to C.
* Move dispatch from sha512-586.pl to C
* Allow creation of HKDF using PRK byteet.
* Remove SSE2 checks in 32-bit x86 assembly
* [DEPS] Migrate from Chromium git to CIPD
* Add re-exports for making inline functions available
* Add HPKE secret export and implement Send for EvpHpkeCtx.
* Make Dilithium pass constant-time validation
* bn: Move dispatching logic from x86_64-mont5.pl to C.
* Add verify.cc and verify.h as top level public API.
* Add certificates to the remaining ECH client tests
* Re-apply dilithium and make it work with a limited stack
* Add tests for some odd escaping behavior in the CONF parser
* Test some more CONF edge cases

Test: treehugger
Test: atest boringssl_crypto_test
Test: atest boringssl_ssl_test
Change-Id: I99443e9ead57e854ccb77e47bbab1c1f892be480
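To make the "Rework the test data story" item above concrete, here is a minimal sketch, assuming a downstream test harness written in plain C, of resolving test files against the BORINGSSL_TEST_DATA_ROOT environment variable with a fallback to the current working directory. The helper name GetTestDataPath, the use of stdio, and the example file path are illustrative assumptions, not BoringSSL's actual test plumbing; only the environment variable name and the CWD default come from the Update-Note above.

    #include <stdio.h>
    #include <stdlib.h>

    // Hypothetical helper: build a path to a test data file, honoring
    // BORINGSSL_TEST_DATA_ROOT and defaulting to the current working
    // directory, as the Update-Note describes.
    static void GetTestDataPath(char *out, size_t out_len, const char *name) {
      const char *root = getenv("BORINGSSL_TEST_DATA_ROOT");
      if (root == NULL || root[0] == '\0') {
        root = ".";  // Tests default to the current working directory.
      }
      snprintf(out, out_len, "%s/%s", root, name);
    }

    int main(void) {
      char path[1024];
      // The file name below is only an example input for this sketch.
      GetTestDataPath(path, sizeof(path),
                      "crypto/cipher_extra/test/chacha20_poly1305_tests.txt");
      FILE *f = fopen(path, "rb");
      if (f == NULL) {
        fprintf(stderr, "missing test data: %s\n", path);
        return 1;
      }
      fclose(f);
      printf("found test data at %s\n", path);
      return 0;
    }

For builds that cannot run tests from the source root, exporting BORINGSSL_TEST_DATA_ROOT to point at the checkout before invoking the tests is the override the Update-Note describes.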
Diffstat (limited to 'win-aarch64')
-rw-r--r--  win-aarch64/crypto/chacha/chacha-armv8-win.S                    1990
-rw-r--r--  win-aarch64/crypto/cipher_extra/chacha20_poly1305_armv8-win.S   3015
-rw-r--r--  win-aarch64/crypto/fipsmodule/aesv8-armv8-win.S                  803
-rw-r--r--  win-aarch64/crypto/fipsmodule/aesv8-gcm-armv8-win.S             1559
-rw-r--r--  win-aarch64/crypto/fipsmodule/armv8-mont-win.S                  1431
-rw-r--r--  win-aarch64/crypto/fipsmodule/bn-armv8-win.S                      89
-rw-r--r--  win-aarch64/crypto/fipsmodule/ghash-neon-armv8-win.S             341
-rw-r--r--  win-aarch64/crypto/fipsmodule/ghashv8-armv8-win.S                573
-rw-r--r--  win-aarch64/crypto/fipsmodule/p256-armv8-asm-win.S              1766
-rw-r--r--  win-aarch64/crypto/fipsmodule/p256_beeu-armv8-asm-win.S          309
-rw-r--r--  win-aarch64/crypto/fipsmodule/sha1-armv8-win.S                  1222
-rw-r--r--  win-aarch64/crypto/fipsmodule/sha256-armv8-win.S                1197
-rw-r--r--  win-aarch64/crypto/fipsmodule/sha512-armv8-win.S                1600
-rw-r--r--  win-aarch64/crypto/fipsmodule/vpaes-armv8-win.S                 1262
-rw-r--r--  win-aarch64/crypto/test/trampoline-armv8-win.S                   750
15 files changed, 0 insertions, 17907 deletions
diff --git a/win-aarch64/crypto/chacha/chacha-armv8-win.S b/win-aarch64/crypto/chacha/chacha-armv8-win.S
deleted file mode 100644
index 1aae896f..00000000
--- a/win-aarch64/crypto/chacha/chacha-armv8-win.S
+++ /dev/null
@@ -1,1990 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-
-
-
-
-.section .rodata
-
-.align 5
-Lsigma:
-.quad 0x3320646e61707865,0x6b20657479622d32 // endian-neutral
-Lone:
-.long 1,0,0,0
-.byte 67,104,97,67,104,97,50,48,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.align 2
-
-.text
-
-.globl ChaCha20_ctr32
-
-.def ChaCha20_ctr32
- .type 32
-.endef
-.align 5
-ChaCha20_ctr32:
- AARCH64_VALID_CALL_TARGET
- cbz x2,Labort
-#if defined(OPENSSL_HWASAN) && __clang_major__ >= 10
- adrp x5,:pg_hi21_nc:OPENSSL_armcap_P
-#else
- adrp x5,OPENSSL_armcap_P
-#endif
- cmp x2,#192
- b.lo Lshort
- ldr w17,[x5,:lo12:OPENSSL_armcap_P]
- tst w17,#ARMV7_NEON
- b.ne ChaCha20_neon
-
-Lshort:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-96]!
- add x29,sp,#0
-
- adrp x5,Lsigma
- add x5,x5,:lo12:Lsigma
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- stp x27,x28,[sp,#80]
- sub sp,sp,#64
-
- ldp x22,x23,[x5] // load sigma
- ldp x24,x25,[x3] // load key
- ldp x26,x27,[x3,#16]
- ldp x28,x30,[x4] // load counter
-#ifdef __AARCH64EB__
- ror x24,x24,#32
- ror x25,x25,#32
- ror x26,x26,#32
- ror x27,x27,#32
- ror x28,x28,#32
- ror x30,x30,#32
-#endif
-
-Loop_outer:
- mov w5,w22 // unpack key block
- lsr x6,x22,#32
- mov w7,w23
- lsr x8,x23,#32
- mov w9,w24
- lsr x10,x24,#32
- mov w11,w25
- lsr x12,x25,#32
- mov w13,w26
- lsr x14,x26,#32
- mov w15,w27
- lsr x16,x27,#32
- mov w17,w28
- lsr x19,x28,#32
- mov w20,w30
- lsr x21,x30,#32
-
- mov x4,#10
- subs x2,x2,#64
-Loop:
- sub x4,x4,#1
- add w5,w5,w9
- add w6,w6,w10
- add w7,w7,w11
- add w8,w8,w12
- eor w17,w17,w5
- eor w19,w19,w6
- eor w20,w20,w7
- eor w21,w21,w8
- ror w17,w17,#16
- ror w19,w19,#16
- ror w20,w20,#16
- ror w21,w21,#16
- add w13,w13,w17
- add w14,w14,w19
- add w15,w15,w20
- add w16,w16,w21
- eor w9,w9,w13
- eor w10,w10,w14
- eor w11,w11,w15
- eor w12,w12,w16
- ror w9,w9,#20
- ror w10,w10,#20
- ror w11,w11,#20
- ror w12,w12,#20
- add w5,w5,w9
- add w6,w6,w10
- add w7,w7,w11
- add w8,w8,w12
- eor w17,w17,w5
- eor w19,w19,w6
- eor w20,w20,w7
- eor w21,w21,w8
- ror w17,w17,#24
- ror w19,w19,#24
- ror w20,w20,#24
- ror w21,w21,#24
- add w13,w13,w17
- add w14,w14,w19
- add w15,w15,w20
- add w16,w16,w21
- eor w9,w9,w13
- eor w10,w10,w14
- eor w11,w11,w15
- eor w12,w12,w16
- ror w9,w9,#25
- ror w10,w10,#25
- ror w11,w11,#25
- ror w12,w12,#25
- add w5,w5,w10
- add w6,w6,w11
- add w7,w7,w12
- add w8,w8,w9
- eor w21,w21,w5
- eor w17,w17,w6
- eor w19,w19,w7
- eor w20,w20,w8
- ror w21,w21,#16
- ror w17,w17,#16
- ror w19,w19,#16
- ror w20,w20,#16
- add w15,w15,w21
- add w16,w16,w17
- add w13,w13,w19
- add w14,w14,w20
- eor w10,w10,w15
- eor w11,w11,w16
- eor w12,w12,w13
- eor w9,w9,w14
- ror w10,w10,#20
- ror w11,w11,#20
- ror w12,w12,#20
- ror w9,w9,#20
- add w5,w5,w10
- add w6,w6,w11
- add w7,w7,w12
- add w8,w8,w9
- eor w21,w21,w5
- eor w17,w17,w6
- eor w19,w19,w7
- eor w20,w20,w8
- ror w21,w21,#24
- ror w17,w17,#24
- ror w19,w19,#24
- ror w20,w20,#24
- add w15,w15,w21
- add w16,w16,w17
- add w13,w13,w19
- add w14,w14,w20
- eor w10,w10,w15
- eor w11,w11,w16
- eor w12,w12,w13
- eor w9,w9,w14
- ror w10,w10,#25
- ror w11,w11,#25
- ror w12,w12,#25
- ror w9,w9,#25
- cbnz x4,Loop
-
- add w5,w5,w22 // accumulate key block
- add x6,x6,x22,lsr#32
- add w7,w7,w23
- add x8,x8,x23,lsr#32
- add w9,w9,w24
- add x10,x10,x24,lsr#32
- add w11,w11,w25
- add x12,x12,x25,lsr#32
- add w13,w13,w26
- add x14,x14,x26,lsr#32
- add w15,w15,w27
- add x16,x16,x27,lsr#32
- add w17,w17,w28
- add x19,x19,x28,lsr#32
- add w20,w20,w30
- add x21,x21,x30,lsr#32
-
- b.lo Ltail
-
- add x5,x5,x6,lsl#32 // pack
- add x7,x7,x8,lsl#32
- ldp x6,x8,[x1,#0] // load input
- add x9,x9,x10,lsl#32
- add x11,x11,x12,lsl#32
- ldp x10,x12,[x1,#16]
- add x13,x13,x14,lsl#32
- add x15,x15,x16,lsl#32
- ldp x14,x16,[x1,#32]
- add x17,x17,x19,lsl#32
- add x20,x20,x21,lsl#32
- ldp x19,x21,[x1,#48]
- add x1,x1,#64
-#ifdef __AARCH64EB__
- rev x5,x5
- rev x7,x7
- rev x9,x9
- rev x11,x11
- rev x13,x13
- rev x15,x15
- rev x17,x17
- rev x20,x20
-#endif
- eor x5,x5,x6
- eor x7,x7,x8
- eor x9,x9,x10
- eor x11,x11,x12
- eor x13,x13,x14
- eor x15,x15,x16
- eor x17,x17,x19
- eor x20,x20,x21
-
- stp x5,x7,[x0,#0] // store output
- add x28,x28,#1 // increment counter
- stp x9,x11,[x0,#16]
- stp x13,x15,[x0,#32]
- stp x17,x20,[x0,#48]
- add x0,x0,#64
-
- b.hi Loop_outer
-
- ldp x19,x20,[x29,#16]
- add sp,sp,#64
- ldp x21,x22,[x29,#32]
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- ldp x29,x30,[sp],#96
- AARCH64_VALIDATE_LINK_REGISTER
-Labort:
- ret
-
-.align 4
-Ltail:
- add x2,x2,#64
-Less_than_64:
- sub x0,x0,#1
- add x1,x1,x2
- add x0,x0,x2
- add x4,sp,x2
- neg x2,x2
-
- add x5,x5,x6,lsl#32 // pack
- add x7,x7,x8,lsl#32
- add x9,x9,x10,lsl#32
- add x11,x11,x12,lsl#32
- add x13,x13,x14,lsl#32
- add x15,x15,x16,lsl#32
- add x17,x17,x19,lsl#32
- add x20,x20,x21,lsl#32
-#ifdef __AARCH64EB__
- rev x5,x5
- rev x7,x7
- rev x9,x9
- rev x11,x11
- rev x13,x13
- rev x15,x15
- rev x17,x17
- rev x20,x20
-#endif
- stp x5,x7,[sp,#0]
- stp x9,x11,[sp,#16]
- stp x13,x15,[sp,#32]
- stp x17,x20,[sp,#48]
-
-Loop_tail:
- ldrb w10,[x1,x2]
- ldrb w11,[x4,x2]
- add x2,x2,#1
- eor w10,w10,w11
- strb w10,[x0,x2]
- cbnz x2,Loop_tail
-
- stp xzr,xzr,[sp,#0]
- stp xzr,xzr,[sp,#16]
- stp xzr,xzr,[sp,#32]
- stp xzr,xzr,[sp,#48]
-
- ldp x19,x20,[x29,#16]
- add sp,sp,#64
- ldp x21,x22,[x29,#32]
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- ldp x29,x30,[sp],#96
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-.def ChaCha20_neon
- .type 32
-.endef
-.align 5
-ChaCha20_neon:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-96]!
- add x29,sp,#0
-
- adrp x5,Lsigma
- add x5,x5,:lo12:Lsigma
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- stp x27,x28,[sp,#80]
- cmp x2,#512
- b.hs L512_or_more_neon
-
- sub sp,sp,#64
-
- ldp x22,x23,[x5] // load sigma
- ld1 {v24.4s},[x5],#16
- ldp x24,x25,[x3] // load key
- ldp x26,x27,[x3,#16]
- ld1 {v25.4s,v26.4s},[x3]
- ldp x28,x30,[x4] // load counter
- ld1 {v27.4s},[x4]
- ld1 {v31.4s},[x5]
-#ifdef __AARCH64EB__
- rev64 v24.4s,v24.4s
- ror x24,x24,#32
- ror x25,x25,#32
- ror x26,x26,#32
- ror x27,x27,#32
- ror x28,x28,#32
- ror x30,x30,#32
-#endif
- add v27.4s,v27.4s,v31.4s // += 1
- add v28.4s,v27.4s,v31.4s
- add v29.4s,v28.4s,v31.4s
- shl v31.4s,v31.4s,#2 // 1 -> 4
-
-Loop_outer_neon:
- mov w5,w22 // unpack key block
- lsr x6,x22,#32
- mov v0.16b,v24.16b
- mov w7,w23
- lsr x8,x23,#32
- mov v4.16b,v24.16b
- mov w9,w24
- lsr x10,x24,#32
- mov v16.16b,v24.16b
- mov w11,w25
- mov v1.16b,v25.16b
- lsr x12,x25,#32
- mov v5.16b,v25.16b
- mov w13,w26
- mov v17.16b,v25.16b
- lsr x14,x26,#32
- mov v3.16b,v27.16b
- mov w15,w27
- mov v7.16b,v28.16b
- lsr x16,x27,#32
- mov v19.16b,v29.16b
- mov w17,w28
- mov v2.16b,v26.16b
- lsr x19,x28,#32
- mov v6.16b,v26.16b
- mov w20,w30
- mov v18.16b,v26.16b
- lsr x21,x30,#32
-
- mov x4,#10
- subs x2,x2,#256
-Loop_neon:
- sub x4,x4,#1
- add v0.4s,v0.4s,v1.4s
- add w5,w5,w9
- add v4.4s,v4.4s,v5.4s
- add w6,w6,w10
- add v16.4s,v16.4s,v17.4s
- add w7,w7,w11
- eor v3.16b,v3.16b,v0.16b
- add w8,w8,w12
- eor v7.16b,v7.16b,v4.16b
- eor w17,w17,w5
- eor v19.16b,v19.16b,v16.16b
- eor w19,w19,w6
- rev32 v3.8h,v3.8h
- eor w20,w20,w7
- rev32 v7.8h,v7.8h
- eor w21,w21,w8
- rev32 v19.8h,v19.8h
- ror w17,w17,#16
- add v2.4s,v2.4s,v3.4s
- ror w19,w19,#16
- add v6.4s,v6.4s,v7.4s
- ror w20,w20,#16
- add v18.4s,v18.4s,v19.4s
- ror w21,w21,#16
- eor v20.16b,v1.16b,v2.16b
- add w13,w13,w17
- eor v21.16b,v5.16b,v6.16b
- add w14,w14,w19
- eor v22.16b,v17.16b,v18.16b
- add w15,w15,w20
- ushr v1.4s,v20.4s,#20
- add w16,w16,w21
- ushr v5.4s,v21.4s,#20
- eor w9,w9,w13
- ushr v17.4s,v22.4s,#20
- eor w10,w10,w14
- sli v1.4s,v20.4s,#12
- eor w11,w11,w15
- sli v5.4s,v21.4s,#12
- eor w12,w12,w16
- sli v17.4s,v22.4s,#12
- ror w9,w9,#20
- add v0.4s,v0.4s,v1.4s
- ror w10,w10,#20
- add v4.4s,v4.4s,v5.4s
- ror w11,w11,#20
- add v16.4s,v16.4s,v17.4s
- ror w12,w12,#20
- eor v20.16b,v3.16b,v0.16b
- add w5,w5,w9
- eor v21.16b,v7.16b,v4.16b
- add w6,w6,w10
- eor v22.16b,v19.16b,v16.16b
- add w7,w7,w11
- ushr v3.4s,v20.4s,#24
- add w8,w8,w12
- ushr v7.4s,v21.4s,#24
- eor w17,w17,w5
- ushr v19.4s,v22.4s,#24
- eor w19,w19,w6
- sli v3.4s,v20.4s,#8
- eor w20,w20,w7
- sli v7.4s,v21.4s,#8
- eor w21,w21,w8
- sli v19.4s,v22.4s,#8
- ror w17,w17,#24
- add v2.4s,v2.4s,v3.4s
- ror w19,w19,#24
- add v6.4s,v6.4s,v7.4s
- ror w20,w20,#24
- add v18.4s,v18.4s,v19.4s
- ror w21,w21,#24
- eor v20.16b,v1.16b,v2.16b
- add w13,w13,w17
- eor v21.16b,v5.16b,v6.16b
- add w14,w14,w19
- eor v22.16b,v17.16b,v18.16b
- add w15,w15,w20
- ushr v1.4s,v20.4s,#25
- add w16,w16,w21
- ushr v5.4s,v21.4s,#25
- eor w9,w9,w13
- ushr v17.4s,v22.4s,#25
- eor w10,w10,w14
- sli v1.4s,v20.4s,#7
- eor w11,w11,w15
- sli v5.4s,v21.4s,#7
- eor w12,w12,w16
- sli v17.4s,v22.4s,#7
- ror w9,w9,#25
- ext v2.16b,v2.16b,v2.16b,#8
- ror w10,w10,#25
- ext v6.16b,v6.16b,v6.16b,#8
- ror w11,w11,#25
- ext v18.16b,v18.16b,v18.16b,#8
- ror w12,w12,#25
- ext v3.16b,v3.16b,v3.16b,#12
- ext v7.16b,v7.16b,v7.16b,#12
- ext v19.16b,v19.16b,v19.16b,#12
- ext v1.16b,v1.16b,v1.16b,#4
- ext v5.16b,v5.16b,v5.16b,#4
- ext v17.16b,v17.16b,v17.16b,#4
- add v0.4s,v0.4s,v1.4s
- add w5,w5,w10
- add v4.4s,v4.4s,v5.4s
- add w6,w6,w11
- add v16.4s,v16.4s,v17.4s
- add w7,w7,w12
- eor v3.16b,v3.16b,v0.16b
- add w8,w8,w9
- eor v7.16b,v7.16b,v4.16b
- eor w21,w21,w5
- eor v19.16b,v19.16b,v16.16b
- eor w17,w17,w6
- rev32 v3.8h,v3.8h
- eor w19,w19,w7
- rev32 v7.8h,v7.8h
- eor w20,w20,w8
- rev32 v19.8h,v19.8h
- ror w21,w21,#16
- add v2.4s,v2.4s,v3.4s
- ror w17,w17,#16
- add v6.4s,v6.4s,v7.4s
- ror w19,w19,#16
- add v18.4s,v18.4s,v19.4s
- ror w20,w20,#16
- eor v20.16b,v1.16b,v2.16b
- add w15,w15,w21
- eor v21.16b,v5.16b,v6.16b
- add w16,w16,w17
- eor v22.16b,v17.16b,v18.16b
- add w13,w13,w19
- ushr v1.4s,v20.4s,#20
- add w14,w14,w20
- ushr v5.4s,v21.4s,#20
- eor w10,w10,w15
- ushr v17.4s,v22.4s,#20
- eor w11,w11,w16
- sli v1.4s,v20.4s,#12
- eor w12,w12,w13
- sli v5.4s,v21.4s,#12
- eor w9,w9,w14
- sli v17.4s,v22.4s,#12
- ror w10,w10,#20
- add v0.4s,v0.4s,v1.4s
- ror w11,w11,#20
- add v4.4s,v4.4s,v5.4s
- ror w12,w12,#20
- add v16.4s,v16.4s,v17.4s
- ror w9,w9,#20
- eor v20.16b,v3.16b,v0.16b
- add w5,w5,w10
- eor v21.16b,v7.16b,v4.16b
- add w6,w6,w11
- eor v22.16b,v19.16b,v16.16b
- add w7,w7,w12
- ushr v3.4s,v20.4s,#24
- add w8,w8,w9
- ushr v7.4s,v21.4s,#24
- eor w21,w21,w5
- ushr v19.4s,v22.4s,#24
- eor w17,w17,w6
- sli v3.4s,v20.4s,#8
- eor w19,w19,w7
- sli v7.4s,v21.4s,#8
- eor w20,w20,w8
- sli v19.4s,v22.4s,#8
- ror w21,w21,#24
- add v2.4s,v2.4s,v3.4s
- ror w17,w17,#24
- add v6.4s,v6.4s,v7.4s
- ror w19,w19,#24
- add v18.4s,v18.4s,v19.4s
- ror w20,w20,#24
- eor v20.16b,v1.16b,v2.16b
- add w15,w15,w21
- eor v21.16b,v5.16b,v6.16b
- add w16,w16,w17
- eor v22.16b,v17.16b,v18.16b
- add w13,w13,w19
- ushr v1.4s,v20.4s,#25
- add w14,w14,w20
- ushr v5.4s,v21.4s,#25
- eor w10,w10,w15
- ushr v17.4s,v22.4s,#25
- eor w11,w11,w16
- sli v1.4s,v20.4s,#7
- eor w12,w12,w13
- sli v5.4s,v21.4s,#7
- eor w9,w9,w14
- sli v17.4s,v22.4s,#7
- ror w10,w10,#25
- ext v2.16b,v2.16b,v2.16b,#8
- ror w11,w11,#25
- ext v6.16b,v6.16b,v6.16b,#8
- ror w12,w12,#25
- ext v18.16b,v18.16b,v18.16b,#8
- ror w9,w9,#25
- ext v3.16b,v3.16b,v3.16b,#4
- ext v7.16b,v7.16b,v7.16b,#4
- ext v19.16b,v19.16b,v19.16b,#4
- ext v1.16b,v1.16b,v1.16b,#12
- ext v5.16b,v5.16b,v5.16b,#12
- ext v17.16b,v17.16b,v17.16b,#12
- cbnz x4,Loop_neon
-
- add w5,w5,w22 // accumulate key block
- add v0.4s,v0.4s,v24.4s
- add x6,x6,x22,lsr#32
- add v4.4s,v4.4s,v24.4s
- add w7,w7,w23
- add v16.4s,v16.4s,v24.4s
- add x8,x8,x23,lsr#32
- add v2.4s,v2.4s,v26.4s
- add w9,w9,w24
- add v6.4s,v6.4s,v26.4s
- add x10,x10,x24,lsr#32
- add v18.4s,v18.4s,v26.4s
- add w11,w11,w25
- add v3.4s,v3.4s,v27.4s
- add x12,x12,x25,lsr#32
- add w13,w13,w26
- add v7.4s,v7.4s,v28.4s
- add x14,x14,x26,lsr#32
- add w15,w15,w27
- add v19.4s,v19.4s,v29.4s
- add x16,x16,x27,lsr#32
- add w17,w17,w28
- add v1.4s,v1.4s,v25.4s
- add x19,x19,x28,lsr#32
- add w20,w20,w30
- add v5.4s,v5.4s,v25.4s
- add x21,x21,x30,lsr#32
- add v17.4s,v17.4s,v25.4s
-
- b.lo Ltail_neon
-
- add x5,x5,x6,lsl#32 // pack
- add x7,x7,x8,lsl#32
- ldp x6,x8,[x1,#0] // load input
- add x9,x9,x10,lsl#32
- add x11,x11,x12,lsl#32
- ldp x10,x12,[x1,#16]
- add x13,x13,x14,lsl#32
- add x15,x15,x16,lsl#32
- ldp x14,x16,[x1,#32]
- add x17,x17,x19,lsl#32
- add x20,x20,x21,lsl#32
- ldp x19,x21,[x1,#48]
- add x1,x1,#64
-#ifdef __AARCH64EB__
- rev x5,x5
- rev x7,x7
- rev x9,x9
- rev x11,x11
- rev x13,x13
- rev x15,x15
- rev x17,x17
- rev x20,x20
-#endif
- ld1 {v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64
- eor x5,x5,x6
- eor x7,x7,x8
- eor x9,x9,x10
- eor x11,x11,x12
- eor x13,x13,x14
- eor v0.16b,v0.16b,v20.16b
- eor x15,x15,x16
- eor v1.16b,v1.16b,v21.16b
- eor x17,x17,x19
- eor v2.16b,v2.16b,v22.16b
- eor x20,x20,x21
- eor v3.16b,v3.16b,v23.16b
- ld1 {v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64
-
- stp x5,x7,[x0,#0] // store output
- add x28,x28,#4 // increment counter
- stp x9,x11,[x0,#16]
- add v27.4s,v27.4s,v31.4s // += 4
- stp x13,x15,[x0,#32]
- add v28.4s,v28.4s,v31.4s
- stp x17,x20,[x0,#48]
- add v29.4s,v29.4s,v31.4s
- add x0,x0,#64
-
- st1 {v0.16b,v1.16b,v2.16b,v3.16b},[x0],#64
- ld1 {v0.16b,v1.16b,v2.16b,v3.16b},[x1],#64
-
- eor v4.16b,v4.16b,v20.16b
- eor v5.16b,v5.16b,v21.16b
- eor v6.16b,v6.16b,v22.16b
- eor v7.16b,v7.16b,v23.16b
- st1 {v4.16b,v5.16b,v6.16b,v7.16b},[x0],#64
-
- eor v16.16b,v16.16b,v0.16b
- eor v17.16b,v17.16b,v1.16b
- eor v18.16b,v18.16b,v2.16b
- eor v19.16b,v19.16b,v3.16b
- st1 {v16.16b,v17.16b,v18.16b,v19.16b},[x0],#64
-
- b.hi Loop_outer_neon
-
- ldp x19,x20,[x29,#16]
- add sp,sp,#64
- ldp x21,x22,[x29,#32]
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- ldp x29,x30,[sp],#96
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-Ltail_neon:
- add x2,x2,#256
- cmp x2,#64
- b.lo Less_than_64
-
- add x5,x5,x6,lsl#32 // pack
- add x7,x7,x8,lsl#32
- ldp x6,x8,[x1,#0] // load input
- add x9,x9,x10,lsl#32
- add x11,x11,x12,lsl#32
- ldp x10,x12,[x1,#16]
- add x13,x13,x14,lsl#32
- add x15,x15,x16,lsl#32
- ldp x14,x16,[x1,#32]
- add x17,x17,x19,lsl#32
- add x20,x20,x21,lsl#32
- ldp x19,x21,[x1,#48]
- add x1,x1,#64
-#ifdef __AARCH64EB__
- rev x5,x5
- rev x7,x7
- rev x9,x9
- rev x11,x11
- rev x13,x13
- rev x15,x15
- rev x17,x17
- rev x20,x20
-#endif
- eor x5,x5,x6
- eor x7,x7,x8
- eor x9,x9,x10
- eor x11,x11,x12
- eor x13,x13,x14
- eor x15,x15,x16
- eor x17,x17,x19
- eor x20,x20,x21
-
- stp x5,x7,[x0,#0] // store output
- add x28,x28,#4 // increment counter
- stp x9,x11,[x0,#16]
- stp x13,x15,[x0,#32]
- stp x17,x20,[x0,#48]
- add x0,x0,#64
- b.eq Ldone_neon
- sub x2,x2,#64
- cmp x2,#64
- b.lo Less_than_128
-
- ld1 {v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64
- eor v0.16b,v0.16b,v20.16b
- eor v1.16b,v1.16b,v21.16b
- eor v2.16b,v2.16b,v22.16b
- eor v3.16b,v3.16b,v23.16b
- st1 {v0.16b,v1.16b,v2.16b,v3.16b},[x0],#64
- b.eq Ldone_neon
- sub x2,x2,#64
- cmp x2,#64
- b.lo Less_than_192
-
- ld1 {v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64
- eor v4.16b,v4.16b,v20.16b
- eor v5.16b,v5.16b,v21.16b
- eor v6.16b,v6.16b,v22.16b
- eor v7.16b,v7.16b,v23.16b
- st1 {v4.16b,v5.16b,v6.16b,v7.16b},[x0],#64
- b.eq Ldone_neon
- sub x2,x2,#64
-
- st1 {v16.16b,v17.16b,v18.16b,v19.16b},[sp]
- b Last_neon
-
-Less_than_128:
- st1 {v0.16b,v1.16b,v2.16b,v3.16b},[sp]
- b Last_neon
-Less_than_192:
- st1 {v4.16b,v5.16b,v6.16b,v7.16b},[sp]
- b Last_neon
-
-.align 4
-Last_neon:
- sub x0,x0,#1
- add x1,x1,x2
- add x0,x0,x2
- add x4,sp,x2
- neg x2,x2
-
-Loop_tail_neon:
- ldrb w10,[x1,x2]
- ldrb w11,[x4,x2]
- add x2,x2,#1
- eor w10,w10,w11
- strb w10,[x0,x2]
- cbnz x2,Loop_tail_neon
-
- stp xzr,xzr,[sp,#0]
- stp xzr,xzr,[sp,#16]
- stp xzr,xzr,[sp,#32]
- stp xzr,xzr,[sp,#48]
-
-Ldone_neon:
- ldp x19,x20,[x29,#16]
- add sp,sp,#64
- ldp x21,x22,[x29,#32]
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- ldp x29,x30,[sp],#96
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-.def ChaCha20_512_neon
- .type 32
-.endef
-.align 5
-ChaCha20_512_neon:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-96]!
- add x29,sp,#0
-
- adrp x5,Lsigma
- add x5,x5,:lo12:Lsigma
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- stp x27,x28,[sp,#80]
-
-L512_or_more_neon:
- sub sp,sp,#128+64
-
- ldp x22,x23,[x5] // load sigma
- ld1 {v24.4s},[x5],#16
- ldp x24,x25,[x3] // load key
- ldp x26,x27,[x3,#16]
- ld1 {v25.4s,v26.4s},[x3]
- ldp x28,x30,[x4] // load counter
- ld1 {v27.4s},[x4]
- ld1 {v31.4s},[x5]
-#ifdef __AARCH64EB__
- rev64 v24.4s,v24.4s
- ror x24,x24,#32
- ror x25,x25,#32
- ror x26,x26,#32
- ror x27,x27,#32
- ror x28,x28,#32
- ror x30,x30,#32
-#endif
- add v27.4s,v27.4s,v31.4s // += 1
- stp q24,q25,[sp,#0] // off-load key block, invariant part
- add v27.4s,v27.4s,v31.4s // not typo
- str q26,[sp,#32]
- add v28.4s,v27.4s,v31.4s
- add v29.4s,v28.4s,v31.4s
- add v30.4s,v29.4s,v31.4s
- shl v31.4s,v31.4s,#2 // 1 -> 4
-
- stp d8,d9,[sp,#128+0] // meet ABI requirements
- stp d10,d11,[sp,#128+16]
- stp d12,d13,[sp,#128+32]
- stp d14,d15,[sp,#128+48]
-
- sub x2,x2,#512 // not typo
-
-Loop_outer_512_neon:
- mov v0.16b,v24.16b
- mov v4.16b,v24.16b
- mov v8.16b,v24.16b
- mov v12.16b,v24.16b
- mov v16.16b,v24.16b
- mov v20.16b,v24.16b
- mov v1.16b,v25.16b
- mov w5,w22 // unpack key block
- mov v5.16b,v25.16b
- lsr x6,x22,#32
- mov v9.16b,v25.16b
- mov w7,w23
- mov v13.16b,v25.16b
- lsr x8,x23,#32
- mov v17.16b,v25.16b
- mov w9,w24
- mov v21.16b,v25.16b
- lsr x10,x24,#32
- mov v3.16b,v27.16b
- mov w11,w25
- mov v7.16b,v28.16b
- lsr x12,x25,#32
- mov v11.16b,v29.16b
- mov w13,w26
- mov v15.16b,v30.16b
- lsr x14,x26,#32
- mov v2.16b,v26.16b
- mov w15,w27
- mov v6.16b,v26.16b
- lsr x16,x27,#32
- add v19.4s,v3.4s,v31.4s // +4
- mov w17,w28
- add v23.4s,v7.4s,v31.4s // +4
- lsr x19,x28,#32
- mov v10.16b,v26.16b
- mov w20,w30
- mov v14.16b,v26.16b
- lsr x21,x30,#32
- mov v18.16b,v26.16b
- stp q27,q28,[sp,#48] // off-load key block, variable part
- mov v22.16b,v26.16b
- str q29,[sp,#80]
-
- mov x4,#5
- subs x2,x2,#512
-Loop_upper_neon:
- sub x4,x4,#1
- add v0.4s,v0.4s,v1.4s
- add w5,w5,w9
- add v4.4s,v4.4s,v5.4s
- add w6,w6,w10
- add v8.4s,v8.4s,v9.4s
- add w7,w7,w11
- add v12.4s,v12.4s,v13.4s
- add w8,w8,w12
- add v16.4s,v16.4s,v17.4s
- eor w17,w17,w5
- add v20.4s,v20.4s,v21.4s
- eor w19,w19,w6
- eor v3.16b,v3.16b,v0.16b
- eor w20,w20,w7
- eor v7.16b,v7.16b,v4.16b
- eor w21,w21,w8
- eor v11.16b,v11.16b,v8.16b
- ror w17,w17,#16
- eor v15.16b,v15.16b,v12.16b
- ror w19,w19,#16
- eor v19.16b,v19.16b,v16.16b
- ror w20,w20,#16
- eor v23.16b,v23.16b,v20.16b
- ror w21,w21,#16
- rev32 v3.8h,v3.8h
- add w13,w13,w17
- rev32 v7.8h,v7.8h
- add w14,w14,w19
- rev32 v11.8h,v11.8h
- add w15,w15,w20
- rev32 v15.8h,v15.8h
- add w16,w16,w21
- rev32 v19.8h,v19.8h
- eor w9,w9,w13
- rev32 v23.8h,v23.8h
- eor w10,w10,w14
- add v2.4s,v2.4s,v3.4s
- eor w11,w11,w15
- add v6.4s,v6.4s,v7.4s
- eor w12,w12,w16
- add v10.4s,v10.4s,v11.4s
- ror w9,w9,#20
- add v14.4s,v14.4s,v15.4s
- ror w10,w10,#20
- add v18.4s,v18.4s,v19.4s
- ror w11,w11,#20
- add v22.4s,v22.4s,v23.4s
- ror w12,w12,#20
- eor v24.16b,v1.16b,v2.16b
- add w5,w5,w9
- eor v25.16b,v5.16b,v6.16b
- add w6,w6,w10
- eor v26.16b,v9.16b,v10.16b
- add w7,w7,w11
- eor v27.16b,v13.16b,v14.16b
- add w8,w8,w12
- eor v28.16b,v17.16b,v18.16b
- eor w17,w17,w5
- eor v29.16b,v21.16b,v22.16b
- eor w19,w19,w6
- ushr v1.4s,v24.4s,#20
- eor w20,w20,w7
- ushr v5.4s,v25.4s,#20
- eor w21,w21,w8
- ushr v9.4s,v26.4s,#20
- ror w17,w17,#24
- ushr v13.4s,v27.4s,#20
- ror w19,w19,#24
- ushr v17.4s,v28.4s,#20
- ror w20,w20,#24
- ushr v21.4s,v29.4s,#20
- ror w21,w21,#24
- sli v1.4s,v24.4s,#12
- add w13,w13,w17
- sli v5.4s,v25.4s,#12
- add w14,w14,w19
- sli v9.4s,v26.4s,#12
- add w15,w15,w20
- sli v13.4s,v27.4s,#12
- add w16,w16,w21
- sli v17.4s,v28.4s,#12
- eor w9,w9,w13
- sli v21.4s,v29.4s,#12
- eor w10,w10,w14
- add v0.4s,v0.4s,v1.4s
- eor w11,w11,w15
- add v4.4s,v4.4s,v5.4s
- eor w12,w12,w16
- add v8.4s,v8.4s,v9.4s
- ror w9,w9,#25
- add v12.4s,v12.4s,v13.4s
- ror w10,w10,#25
- add v16.4s,v16.4s,v17.4s
- ror w11,w11,#25
- add v20.4s,v20.4s,v21.4s
- ror w12,w12,#25
- eor v24.16b,v3.16b,v0.16b
- add w5,w5,w10
- eor v25.16b,v7.16b,v4.16b
- add w6,w6,w11
- eor v26.16b,v11.16b,v8.16b
- add w7,w7,w12
- eor v27.16b,v15.16b,v12.16b
- add w8,w8,w9
- eor v28.16b,v19.16b,v16.16b
- eor w21,w21,w5
- eor v29.16b,v23.16b,v20.16b
- eor w17,w17,w6
- ushr v3.4s,v24.4s,#24
- eor w19,w19,w7
- ushr v7.4s,v25.4s,#24
- eor w20,w20,w8
- ushr v11.4s,v26.4s,#24
- ror w21,w21,#16
- ushr v15.4s,v27.4s,#24
- ror w17,w17,#16
- ushr v19.4s,v28.4s,#24
- ror w19,w19,#16
- ushr v23.4s,v29.4s,#24
- ror w20,w20,#16
- sli v3.4s,v24.4s,#8
- add w15,w15,w21
- sli v7.4s,v25.4s,#8
- add w16,w16,w17
- sli v11.4s,v26.4s,#8
- add w13,w13,w19
- sli v15.4s,v27.4s,#8
- add w14,w14,w20
- sli v19.4s,v28.4s,#8
- eor w10,w10,w15
- sli v23.4s,v29.4s,#8
- eor w11,w11,w16
- add v2.4s,v2.4s,v3.4s
- eor w12,w12,w13
- add v6.4s,v6.4s,v7.4s
- eor w9,w9,w14
- add v10.4s,v10.4s,v11.4s
- ror w10,w10,#20
- add v14.4s,v14.4s,v15.4s
- ror w11,w11,#20
- add v18.4s,v18.4s,v19.4s
- ror w12,w12,#20
- add v22.4s,v22.4s,v23.4s
- ror w9,w9,#20
- eor v24.16b,v1.16b,v2.16b
- add w5,w5,w10
- eor v25.16b,v5.16b,v6.16b
- add w6,w6,w11
- eor v26.16b,v9.16b,v10.16b
- add w7,w7,w12
- eor v27.16b,v13.16b,v14.16b
- add w8,w8,w9
- eor v28.16b,v17.16b,v18.16b
- eor w21,w21,w5
- eor v29.16b,v21.16b,v22.16b
- eor w17,w17,w6
- ushr v1.4s,v24.4s,#25
- eor w19,w19,w7
- ushr v5.4s,v25.4s,#25
- eor w20,w20,w8
- ushr v9.4s,v26.4s,#25
- ror w21,w21,#24
- ushr v13.4s,v27.4s,#25
- ror w17,w17,#24
- ushr v17.4s,v28.4s,#25
- ror w19,w19,#24
- ushr v21.4s,v29.4s,#25
- ror w20,w20,#24
- sli v1.4s,v24.4s,#7
- add w15,w15,w21
- sli v5.4s,v25.4s,#7
- add w16,w16,w17
- sli v9.4s,v26.4s,#7
- add w13,w13,w19
- sli v13.4s,v27.4s,#7
- add w14,w14,w20
- sli v17.4s,v28.4s,#7
- eor w10,w10,w15
- sli v21.4s,v29.4s,#7
- eor w11,w11,w16
- ext v2.16b,v2.16b,v2.16b,#8
- eor w12,w12,w13
- ext v6.16b,v6.16b,v6.16b,#8
- eor w9,w9,w14
- ext v10.16b,v10.16b,v10.16b,#8
- ror w10,w10,#25
- ext v14.16b,v14.16b,v14.16b,#8
- ror w11,w11,#25
- ext v18.16b,v18.16b,v18.16b,#8
- ror w12,w12,#25
- ext v22.16b,v22.16b,v22.16b,#8
- ror w9,w9,#25
- ext v3.16b,v3.16b,v3.16b,#12
- ext v7.16b,v7.16b,v7.16b,#12
- ext v11.16b,v11.16b,v11.16b,#12
- ext v15.16b,v15.16b,v15.16b,#12
- ext v19.16b,v19.16b,v19.16b,#12
- ext v23.16b,v23.16b,v23.16b,#12
- ext v1.16b,v1.16b,v1.16b,#4
- ext v5.16b,v5.16b,v5.16b,#4
- ext v9.16b,v9.16b,v9.16b,#4
- ext v13.16b,v13.16b,v13.16b,#4
- ext v17.16b,v17.16b,v17.16b,#4
- ext v21.16b,v21.16b,v21.16b,#4
- add v0.4s,v0.4s,v1.4s
- add w5,w5,w9
- add v4.4s,v4.4s,v5.4s
- add w6,w6,w10
- add v8.4s,v8.4s,v9.4s
- add w7,w7,w11
- add v12.4s,v12.4s,v13.4s
- add w8,w8,w12
- add v16.4s,v16.4s,v17.4s
- eor w17,w17,w5
- add v20.4s,v20.4s,v21.4s
- eor w19,w19,w6
- eor v3.16b,v3.16b,v0.16b
- eor w20,w20,w7
- eor v7.16b,v7.16b,v4.16b
- eor w21,w21,w8
- eor v11.16b,v11.16b,v8.16b
- ror w17,w17,#16
- eor v15.16b,v15.16b,v12.16b
- ror w19,w19,#16
- eor v19.16b,v19.16b,v16.16b
- ror w20,w20,#16
- eor v23.16b,v23.16b,v20.16b
- ror w21,w21,#16
- rev32 v3.8h,v3.8h
- add w13,w13,w17
- rev32 v7.8h,v7.8h
- add w14,w14,w19
- rev32 v11.8h,v11.8h
- add w15,w15,w20
- rev32 v15.8h,v15.8h
- add w16,w16,w21
- rev32 v19.8h,v19.8h
- eor w9,w9,w13
- rev32 v23.8h,v23.8h
- eor w10,w10,w14
- add v2.4s,v2.4s,v3.4s
- eor w11,w11,w15
- add v6.4s,v6.4s,v7.4s
- eor w12,w12,w16
- add v10.4s,v10.4s,v11.4s
- ror w9,w9,#20
- add v14.4s,v14.4s,v15.4s
- ror w10,w10,#20
- add v18.4s,v18.4s,v19.4s
- ror w11,w11,#20
- add v22.4s,v22.4s,v23.4s
- ror w12,w12,#20
- eor v24.16b,v1.16b,v2.16b
- add w5,w5,w9
- eor v25.16b,v5.16b,v6.16b
- add w6,w6,w10
- eor v26.16b,v9.16b,v10.16b
- add w7,w7,w11
- eor v27.16b,v13.16b,v14.16b
- add w8,w8,w12
- eor v28.16b,v17.16b,v18.16b
- eor w17,w17,w5
- eor v29.16b,v21.16b,v22.16b
- eor w19,w19,w6
- ushr v1.4s,v24.4s,#20
- eor w20,w20,w7
- ushr v5.4s,v25.4s,#20
- eor w21,w21,w8
- ushr v9.4s,v26.4s,#20
- ror w17,w17,#24
- ushr v13.4s,v27.4s,#20
- ror w19,w19,#24
- ushr v17.4s,v28.4s,#20
- ror w20,w20,#24
- ushr v21.4s,v29.4s,#20
- ror w21,w21,#24
- sli v1.4s,v24.4s,#12
- add w13,w13,w17
- sli v5.4s,v25.4s,#12
- add w14,w14,w19
- sli v9.4s,v26.4s,#12
- add w15,w15,w20
- sli v13.4s,v27.4s,#12
- add w16,w16,w21
- sli v17.4s,v28.4s,#12
- eor w9,w9,w13
- sli v21.4s,v29.4s,#12
- eor w10,w10,w14
- add v0.4s,v0.4s,v1.4s
- eor w11,w11,w15
- add v4.4s,v4.4s,v5.4s
- eor w12,w12,w16
- add v8.4s,v8.4s,v9.4s
- ror w9,w9,#25
- add v12.4s,v12.4s,v13.4s
- ror w10,w10,#25
- add v16.4s,v16.4s,v17.4s
- ror w11,w11,#25
- add v20.4s,v20.4s,v21.4s
- ror w12,w12,#25
- eor v24.16b,v3.16b,v0.16b
- add w5,w5,w10
- eor v25.16b,v7.16b,v4.16b
- add w6,w6,w11
- eor v26.16b,v11.16b,v8.16b
- add w7,w7,w12
- eor v27.16b,v15.16b,v12.16b
- add w8,w8,w9
- eor v28.16b,v19.16b,v16.16b
- eor w21,w21,w5
- eor v29.16b,v23.16b,v20.16b
- eor w17,w17,w6
- ushr v3.4s,v24.4s,#24
- eor w19,w19,w7
- ushr v7.4s,v25.4s,#24
- eor w20,w20,w8
- ushr v11.4s,v26.4s,#24
- ror w21,w21,#16
- ushr v15.4s,v27.4s,#24
- ror w17,w17,#16
- ushr v19.4s,v28.4s,#24
- ror w19,w19,#16
- ushr v23.4s,v29.4s,#24
- ror w20,w20,#16
- sli v3.4s,v24.4s,#8
- add w15,w15,w21
- sli v7.4s,v25.4s,#8
- add w16,w16,w17
- sli v11.4s,v26.4s,#8
- add w13,w13,w19
- sli v15.4s,v27.4s,#8
- add w14,w14,w20
- sli v19.4s,v28.4s,#8
- eor w10,w10,w15
- sli v23.4s,v29.4s,#8
- eor w11,w11,w16
- add v2.4s,v2.4s,v3.4s
- eor w12,w12,w13
- add v6.4s,v6.4s,v7.4s
- eor w9,w9,w14
- add v10.4s,v10.4s,v11.4s
- ror w10,w10,#20
- add v14.4s,v14.4s,v15.4s
- ror w11,w11,#20
- add v18.4s,v18.4s,v19.4s
- ror w12,w12,#20
- add v22.4s,v22.4s,v23.4s
- ror w9,w9,#20
- eor v24.16b,v1.16b,v2.16b
- add w5,w5,w10
- eor v25.16b,v5.16b,v6.16b
- add w6,w6,w11
- eor v26.16b,v9.16b,v10.16b
- add w7,w7,w12
- eor v27.16b,v13.16b,v14.16b
- add w8,w8,w9
- eor v28.16b,v17.16b,v18.16b
- eor w21,w21,w5
- eor v29.16b,v21.16b,v22.16b
- eor w17,w17,w6
- ushr v1.4s,v24.4s,#25
- eor w19,w19,w7
- ushr v5.4s,v25.4s,#25
- eor w20,w20,w8
- ushr v9.4s,v26.4s,#25
- ror w21,w21,#24
- ushr v13.4s,v27.4s,#25
- ror w17,w17,#24
- ushr v17.4s,v28.4s,#25
- ror w19,w19,#24
- ushr v21.4s,v29.4s,#25
- ror w20,w20,#24
- sli v1.4s,v24.4s,#7
- add w15,w15,w21
- sli v5.4s,v25.4s,#7
- add w16,w16,w17
- sli v9.4s,v26.4s,#7
- add w13,w13,w19
- sli v13.4s,v27.4s,#7
- add w14,w14,w20
- sli v17.4s,v28.4s,#7
- eor w10,w10,w15
- sli v21.4s,v29.4s,#7
- eor w11,w11,w16
- ext v2.16b,v2.16b,v2.16b,#8
- eor w12,w12,w13
- ext v6.16b,v6.16b,v6.16b,#8
- eor w9,w9,w14
- ext v10.16b,v10.16b,v10.16b,#8
- ror w10,w10,#25
- ext v14.16b,v14.16b,v14.16b,#8
- ror w11,w11,#25
- ext v18.16b,v18.16b,v18.16b,#8
- ror w12,w12,#25
- ext v22.16b,v22.16b,v22.16b,#8
- ror w9,w9,#25
- ext v3.16b,v3.16b,v3.16b,#4
- ext v7.16b,v7.16b,v7.16b,#4
- ext v11.16b,v11.16b,v11.16b,#4
- ext v15.16b,v15.16b,v15.16b,#4
- ext v19.16b,v19.16b,v19.16b,#4
- ext v23.16b,v23.16b,v23.16b,#4
- ext v1.16b,v1.16b,v1.16b,#12
- ext v5.16b,v5.16b,v5.16b,#12
- ext v9.16b,v9.16b,v9.16b,#12
- ext v13.16b,v13.16b,v13.16b,#12
- ext v17.16b,v17.16b,v17.16b,#12
- ext v21.16b,v21.16b,v21.16b,#12
- cbnz x4,Loop_upper_neon
-
- add w5,w5,w22 // accumulate key block
- add x6,x6,x22,lsr#32
- add w7,w7,w23
- add x8,x8,x23,lsr#32
- add w9,w9,w24
- add x10,x10,x24,lsr#32
- add w11,w11,w25
- add x12,x12,x25,lsr#32
- add w13,w13,w26
- add x14,x14,x26,lsr#32
- add w15,w15,w27
- add x16,x16,x27,lsr#32
- add w17,w17,w28
- add x19,x19,x28,lsr#32
- add w20,w20,w30
- add x21,x21,x30,lsr#32
-
- add x5,x5,x6,lsl#32 // pack
- add x7,x7,x8,lsl#32
- ldp x6,x8,[x1,#0] // load input
- add x9,x9,x10,lsl#32
- add x11,x11,x12,lsl#32
- ldp x10,x12,[x1,#16]
- add x13,x13,x14,lsl#32
- add x15,x15,x16,lsl#32
- ldp x14,x16,[x1,#32]
- add x17,x17,x19,lsl#32
- add x20,x20,x21,lsl#32
- ldp x19,x21,[x1,#48]
- add x1,x1,#64
-#ifdef __AARCH64EB__
- rev x5,x5
- rev x7,x7
- rev x9,x9
- rev x11,x11
- rev x13,x13
- rev x15,x15
- rev x17,x17
- rev x20,x20
-#endif
- eor x5,x5,x6
- eor x7,x7,x8
- eor x9,x9,x10
- eor x11,x11,x12
- eor x13,x13,x14
- eor x15,x15,x16
- eor x17,x17,x19
- eor x20,x20,x21
-
- stp x5,x7,[x0,#0] // store output
- add x28,x28,#1 // increment counter
- mov w5,w22 // unpack key block
- lsr x6,x22,#32
- stp x9,x11,[x0,#16]
- mov w7,w23
- lsr x8,x23,#32
- stp x13,x15,[x0,#32]
- mov w9,w24
- lsr x10,x24,#32
- stp x17,x20,[x0,#48]
- add x0,x0,#64
- mov w11,w25
- lsr x12,x25,#32
- mov w13,w26
- lsr x14,x26,#32
- mov w15,w27
- lsr x16,x27,#32
- mov w17,w28
- lsr x19,x28,#32
- mov w20,w30
- lsr x21,x30,#32
-
- mov x4,#5
-Loop_lower_neon:
- sub x4,x4,#1
- add v0.4s,v0.4s,v1.4s
- add w5,w5,w9
- add v4.4s,v4.4s,v5.4s
- add w6,w6,w10
- add v8.4s,v8.4s,v9.4s
- add w7,w7,w11
- add v12.4s,v12.4s,v13.4s
- add w8,w8,w12
- add v16.4s,v16.4s,v17.4s
- eor w17,w17,w5
- add v20.4s,v20.4s,v21.4s
- eor w19,w19,w6
- eor v3.16b,v3.16b,v0.16b
- eor w20,w20,w7
- eor v7.16b,v7.16b,v4.16b
- eor w21,w21,w8
- eor v11.16b,v11.16b,v8.16b
- ror w17,w17,#16
- eor v15.16b,v15.16b,v12.16b
- ror w19,w19,#16
- eor v19.16b,v19.16b,v16.16b
- ror w20,w20,#16
- eor v23.16b,v23.16b,v20.16b
- ror w21,w21,#16
- rev32 v3.8h,v3.8h
- add w13,w13,w17
- rev32 v7.8h,v7.8h
- add w14,w14,w19
- rev32 v11.8h,v11.8h
- add w15,w15,w20
- rev32 v15.8h,v15.8h
- add w16,w16,w21
- rev32 v19.8h,v19.8h
- eor w9,w9,w13
- rev32 v23.8h,v23.8h
- eor w10,w10,w14
- add v2.4s,v2.4s,v3.4s
- eor w11,w11,w15
- add v6.4s,v6.4s,v7.4s
- eor w12,w12,w16
- add v10.4s,v10.4s,v11.4s
- ror w9,w9,#20
- add v14.4s,v14.4s,v15.4s
- ror w10,w10,#20
- add v18.4s,v18.4s,v19.4s
- ror w11,w11,#20
- add v22.4s,v22.4s,v23.4s
- ror w12,w12,#20
- eor v24.16b,v1.16b,v2.16b
- add w5,w5,w9
- eor v25.16b,v5.16b,v6.16b
- add w6,w6,w10
- eor v26.16b,v9.16b,v10.16b
- add w7,w7,w11
- eor v27.16b,v13.16b,v14.16b
- add w8,w8,w12
- eor v28.16b,v17.16b,v18.16b
- eor w17,w17,w5
- eor v29.16b,v21.16b,v22.16b
- eor w19,w19,w6
- ushr v1.4s,v24.4s,#20
- eor w20,w20,w7
- ushr v5.4s,v25.4s,#20
- eor w21,w21,w8
- ushr v9.4s,v26.4s,#20
- ror w17,w17,#24
- ushr v13.4s,v27.4s,#20
- ror w19,w19,#24
- ushr v17.4s,v28.4s,#20
- ror w20,w20,#24
- ushr v21.4s,v29.4s,#20
- ror w21,w21,#24
- sli v1.4s,v24.4s,#12
- add w13,w13,w17
- sli v5.4s,v25.4s,#12
- add w14,w14,w19
- sli v9.4s,v26.4s,#12
- add w15,w15,w20
- sli v13.4s,v27.4s,#12
- add w16,w16,w21
- sli v17.4s,v28.4s,#12
- eor w9,w9,w13
- sli v21.4s,v29.4s,#12
- eor w10,w10,w14
- add v0.4s,v0.4s,v1.4s
- eor w11,w11,w15
- add v4.4s,v4.4s,v5.4s
- eor w12,w12,w16
- add v8.4s,v8.4s,v9.4s
- ror w9,w9,#25
- add v12.4s,v12.4s,v13.4s
- ror w10,w10,#25
- add v16.4s,v16.4s,v17.4s
- ror w11,w11,#25
- add v20.4s,v20.4s,v21.4s
- ror w12,w12,#25
- eor v24.16b,v3.16b,v0.16b
- add w5,w5,w10
- eor v25.16b,v7.16b,v4.16b
- add w6,w6,w11
- eor v26.16b,v11.16b,v8.16b
- add w7,w7,w12
- eor v27.16b,v15.16b,v12.16b
- add w8,w8,w9
- eor v28.16b,v19.16b,v16.16b
- eor w21,w21,w5
- eor v29.16b,v23.16b,v20.16b
- eor w17,w17,w6
- ushr v3.4s,v24.4s,#24
- eor w19,w19,w7
- ushr v7.4s,v25.4s,#24
- eor w20,w20,w8
- ushr v11.4s,v26.4s,#24
- ror w21,w21,#16
- ushr v15.4s,v27.4s,#24
- ror w17,w17,#16
- ushr v19.4s,v28.4s,#24
- ror w19,w19,#16
- ushr v23.4s,v29.4s,#24
- ror w20,w20,#16
- sli v3.4s,v24.4s,#8
- add w15,w15,w21
- sli v7.4s,v25.4s,#8
- add w16,w16,w17
- sli v11.4s,v26.4s,#8
- add w13,w13,w19
- sli v15.4s,v27.4s,#8
- add w14,w14,w20
- sli v19.4s,v28.4s,#8
- eor w10,w10,w15
- sli v23.4s,v29.4s,#8
- eor w11,w11,w16
- add v2.4s,v2.4s,v3.4s
- eor w12,w12,w13
- add v6.4s,v6.4s,v7.4s
- eor w9,w9,w14
- add v10.4s,v10.4s,v11.4s
- ror w10,w10,#20
- add v14.4s,v14.4s,v15.4s
- ror w11,w11,#20
- add v18.4s,v18.4s,v19.4s
- ror w12,w12,#20
- add v22.4s,v22.4s,v23.4s
- ror w9,w9,#20
- eor v24.16b,v1.16b,v2.16b
- add w5,w5,w10
- eor v25.16b,v5.16b,v6.16b
- add w6,w6,w11
- eor v26.16b,v9.16b,v10.16b
- add w7,w7,w12
- eor v27.16b,v13.16b,v14.16b
- add w8,w8,w9
- eor v28.16b,v17.16b,v18.16b
- eor w21,w21,w5
- eor v29.16b,v21.16b,v22.16b
- eor w17,w17,w6
- ushr v1.4s,v24.4s,#25
- eor w19,w19,w7
- ushr v5.4s,v25.4s,#25
- eor w20,w20,w8
- ushr v9.4s,v26.4s,#25
- ror w21,w21,#24
- ushr v13.4s,v27.4s,#25
- ror w17,w17,#24
- ushr v17.4s,v28.4s,#25
- ror w19,w19,#24
- ushr v21.4s,v29.4s,#25
- ror w20,w20,#24
- sli v1.4s,v24.4s,#7
- add w15,w15,w21
- sli v5.4s,v25.4s,#7
- add w16,w16,w17
- sli v9.4s,v26.4s,#7
- add w13,w13,w19
- sli v13.4s,v27.4s,#7
- add w14,w14,w20
- sli v17.4s,v28.4s,#7
- eor w10,w10,w15
- sli v21.4s,v29.4s,#7
- eor w11,w11,w16
- ext v2.16b,v2.16b,v2.16b,#8
- eor w12,w12,w13
- ext v6.16b,v6.16b,v6.16b,#8
- eor w9,w9,w14
- ext v10.16b,v10.16b,v10.16b,#8
- ror w10,w10,#25
- ext v14.16b,v14.16b,v14.16b,#8
- ror w11,w11,#25
- ext v18.16b,v18.16b,v18.16b,#8
- ror w12,w12,#25
- ext v22.16b,v22.16b,v22.16b,#8
- ror w9,w9,#25
- ext v3.16b,v3.16b,v3.16b,#12
- ext v7.16b,v7.16b,v7.16b,#12
- ext v11.16b,v11.16b,v11.16b,#12
- ext v15.16b,v15.16b,v15.16b,#12
- ext v19.16b,v19.16b,v19.16b,#12
- ext v23.16b,v23.16b,v23.16b,#12
- ext v1.16b,v1.16b,v1.16b,#4
- ext v5.16b,v5.16b,v5.16b,#4
- ext v9.16b,v9.16b,v9.16b,#4
- ext v13.16b,v13.16b,v13.16b,#4
- ext v17.16b,v17.16b,v17.16b,#4
- ext v21.16b,v21.16b,v21.16b,#4
- add v0.4s,v0.4s,v1.4s
- add w5,w5,w9
- add v4.4s,v4.4s,v5.4s
- add w6,w6,w10
- add v8.4s,v8.4s,v9.4s
- add w7,w7,w11
- add v12.4s,v12.4s,v13.4s
- add w8,w8,w12
- add v16.4s,v16.4s,v17.4s
- eor w17,w17,w5
- add v20.4s,v20.4s,v21.4s
- eor w19,w19,w6
- eor v3.16b,v3.16b,v0.16b
- eor w20,w20,w7
- eor v7.16b,v7.16b,v4.16b
- eor w21,w21,w8
- eor v11.16b,v11.16b,v8.16b
- ror w17,w17,#16
- eor v15.16b,v15.16b,v12.16b
- ror w19,w19,#16
- eor v19.16b,v19.16b,v16.16b
- ror w20,w20,#16
- eor v23.16b,v23.16b,v20.16b
- ror w21,w21,#16
- rev32 v3.8h,v3.8h
- add w13,w13,w17
- rev32 v7.8h,v7.8h
- add w14,w14,w19
- rev32 v11.8h,v11.8h
- add w15,w15,w20
- rev32 v15.8h,v15.8h
- add w16,w16,w21
- rev32 v19.8h,v19.8h
- eor w9,w9,w13
- rev32 v23.8h,v23.8h
- eor w10,w10,w14
- add v2.4s,v2.4s,v3.4s
- eor w11,w11,w15
- add v6.4s,v6.4s,v7.4s
- eor w12,w12,w16
- add v10.4s,v10.4s,v11.4s
- ror w9,w9,#20
- add v14.4s,v14.4s,v15.4s
- ror w10,w10,#20
- add v18.4s,v18.4s,v19.4s
- ror w11,w11,#20
- add v22.4s,v22.4s,v23.4s
- ror w12,w12,#20
- eor v24.16b,v1.16b,v2.16b
- add w5,w5,w9
- eor v25.16b,v5.16b,v6.16b
- add w6,w6,w10
- eor v26.16b,v9.16b,v10.16b
- add w7,w7,w11
- eor v27.16b,v13.16b,v14.16b
- add w8,w8,w12
- eor v28.16b,v17.16b,v18.16b
- eor w17,w17,w5
- eor v29.16b,v21.16b,v22.16b
- eor w19,w19,w6
- ushr v1.4s,v24.4s,#20
- eor w20,w20,w7
- ushr v5.4s,v25.4s,#20
- eor w21,w21,w8
- ushr v9.4s,v26.4s,#20
- ror w17,w17,#24
- ushr v13.4s,v27.4s,#20
- ror w19,w19,#24
- ushr v17.4s,v28.4s,#20
- ror w20,w20,#24
- ushr v21.4s,v29.4s,#20
- ror w21,w21,#24
- sli v1.4s,v24.4s,#12
- add w13,w13,w17
- sli v5.4s,v25.4s,#12
- add w14,w14,w19
- sli v9.4s,v26.4s,#12
- add w15,w15,w20
- sli v13.4s,v27.4s,#12
- add w16,w16,w21
- sli v17.4s,v28.4s,#12
- eor w9,w9,w13
- sli v21.4s,v29.4s,#12
- eor w10,w10,w14
- add v0.4s,v0.4s,v1.4s
- eor w11,w11,w15
- add v4.4s,v4.4s,v5.4s
- eor w12,w12,w16
- add v8.4s,v8.4s,v9.4s
- ror w9,w9,#25
- add v12.4s,v12.4s,v13.4s
- ror w10,w10,#25
- add v16.4s,v16.4s,v17.4s
- ror w11,w11,#25
- add v20.4s,v20.4s,v21.4s
- ror w12,w12,#25
- eor v24.16b,v3.16b,v0.16b
- add w5,w5,w10
- eor v25.16b,v7.16b,v4.16b
- add w6,w6,w11
- eor v26.16b,v11.16b,v8.16b
- add w7,w7,w12
- eor v27.16b,v15.16b,v12.16b
- add w8,w8,w9
- eor v28.16b,v19.16b,v16.16b
- eor w21,w21,w5
- eor v29.16b,v23.16b,v20.16b
- eor w17,w17,w6
- ushr v3.4s,v24.4s,#24
- eor w19,w19,w7
- ushr v7.4s,v25.4s,#24
- eor w20,w20,w8
- ushr v11.4s,v26.4s,#24
- ror w21,w21,#16
- ushr v15.4s,v27.4s,#24
- ror w17,w17,#16
- ushr v19.4s,v28.4s,#24
- ror w19,w19,#16
- ushr v23.4s,v29.4s,#24
- ror w20,w20,#16
- sli v3.4s,v24.4s,#8
- add w15,w15,w21
- sli v7.4s,v25.4s,#8
- add w16,w16,w17
- sli v11.4s,v26.4s,#8
- add w13,w13,w19
- sli v15.4s,v27.4s,#8
- add w14,w14,w20
- sli v19.4s,v28.4s,#8
- eor w10,w10,w15
- sli v23.4s,v29.4s,#8
- eor w11,w11,w16
- add v2.4s,v2.4s,v3.4s
- eor w12,w12,w13
- add v6.4s,v6.4s,v7.4s
- eor w9,w9,w14
- add v10.4s,v10.4s,v11.4s
- ror w10,w10,#20
- add v14.4s,v14.4s,v15.4s
- ror w11,w11,#20
- add v18.4s,v18.4s,v19.4s
- ror w12,w12,#20
- add v22.4s,v22.4s,v23.4s
- ror w9,w9,#20
- eor v24.16b,v1.16b,v2.16b
- add w5,w5,w10
- eor v25.16b,v5.16b,v6.16b
- add w6,w6,w11
- eor v26.16b,v9.16b,v10.16b
- add w7,w7,w12
- eor v27.16b,v13.16b,v14.16b
- add w8,w8,w9
- eor v28.16b,v17.16b,v18.16b
- eor w21,w21,w5
- eor v29.16b,v21.16b,v22.16b
- eor w17,w17,w6
- ushr v1.4s,v24.4s,#25
- eor w19,w19,w7
- ushr v5.4s,v25.4s,#25
- eor w20,w20,w8
- ushr v9.4s,v26.4s,#25
- ror w21,w21,#24
- ushr v13.4s,v27.4s,#25
- ror w17,w17,#24
- ushr v17.4s,v28.4s,#25
- ror w19,w19,#24
- ushr v21.4s,v29.4s,#25
- ror w20,w20,#24
- sli v1.4s,v24.4s,#7
- add w15,w15,w21
- sli v5.4s,v25.4s,#7
- add w16,w16,w17
- sli v9.4s,v26.4s,#7
- add w13,w13,w19
- sli v13.4s,v27.4s,#7
- add w14,w14,w20
- sli v17.4s,v28.4s,#7
- eor w10,w10,w15
- sli v21.4s,v29.4s,#7
- eor w11,w11,w16
- ext v2.16b,v2.16b,v2.16b,#8
- eor w12,w12,w13
- ext v6.16b,v6.16b,v6.16b,#8
- eor w9,w9,w14
- ext v10.16b,v10.16b,v10.16b,#8
- ror w10,w10,#25
- ext v14.16b,v14.16b,v14.16b,#8
- ror w11,w11,#25
- ext v18.16b,v18.16b,v18.16b,#8
- ror w12,w12,#25
- ext v22.16b,v22.16b,v22.16b,#8
- ror w9,w9,#25
- ext v3.16b,v3.16b,v3.16b,#4
- ext v7.16b,v7.16b,v7.16b,#4
- ext v11.16b,v11.16b,v11.16b,#4
- ext v15.16b,v15.16b,v15.16b,#4
- ext v19.16b,v19.16b,v19.16b,#4
- ext v23.16b,v23.16b,v23.16b,#4
- ext v1.16b,v1.16b,v1.16b,#12
- ext v5.16b,v5.16b,v5.16b,#12
- ext v9.16b,v9.16b,v9.16b,#12
- ext v13.16b,v13.16b,v13.16b,#12
- ext v17.16b,v17.16b,v17.16b,#12
- ext v21.16b,v21.16b,v21.16b,#12
- cbnz x4,Loop_lower_neon
-
- add w5,w5,w22 // accumulate key block
- ldp q24,q25,[sp,#0]
- add x6,x6,x22,lsr#32
- ldp q26,q27,[sp,#32]
- add w7,w7,w23
- ldp q28,q29,[sp,#64]
- add x8,x8,x23,lsr#32
- add v0.4s,v0.4s,v24.4s
- add w9,w9,w24
- add v4.4s,v4.4s,v24.4s
- add x10,x10,x24,lsr#32
- add v8.4s,v8.4s,v24.4s
- add w11,w11,w25
- add v12.4s,v12.4s,v24.4s
- add x12,x12,x25,lsr#32
- add v16.4s,v16.4s,v24.4s
- add w13,w13,w26
- add v20.4s,v20.4s,v24.4s
- add x14,x14,x26,lsr#32
- add v2.4s,v2.4s,v26.4s
- add w15,w15,w27
- add v6.4s,v6.4s,v26.4s
- add x16,x16,x27,lsr#32
- add v10.4s,v10.4s,v26.4s
- add w17,w17,w28
- add v14.4s,v14.4s,v26.4s
- add x19,x19,x28,lsr#32
- add v18.4s,v18.4s,v26.4s
- add w20,w20,w30
- add v22.4s,v22.4s,v26.4s
- add x21,x21,x30,lsr#32
- add v19.4s,v19.4s,v31.4s // +4
- add x5,x5,x6,lsl#32 // pack
- add v23.4s,v23.4s,v31.4s // +4
- add x7,x7,x8,lsl#32
- add v3.4s,v3.4s,v27.4s
- ldp x6,x8,[x1,#0] // load input
- add v7.4s,v7.4s,v28.4s
- add x9,x9,x10,lsl#32
- add v11.4s,v11.4s,v29.4s
- add x11,x11,x12,lsl#32
- add v15.4s,v15.4s,v30.4s
- ldp x10,x12,[x1,#16]
- add v19.4s,v19.4s,v27.4s
- add x13,x13,x14,lsl#32
- add v23.4s,v23.4s,v28.4s
- add x15,x15,x16,lsl#32
- add v1.4s,v1.4s,v25.4s
- ldp x14,x16,[x1,#32]
- add v5.4s,v5.4s,v25.4s
- add x17,x17,x19,lsl#32
- add v9.4s,v9.4s,v25.4s
- add x20,x20,x21,lsl#32
- add v13.4s,v13.4s,v25.4s
- ldp x19,x21,[x1,#48]
- add v17.4s,v17.4s,v25.4s
- add x1,x1,#64
- add v21.4s,v21.4s,v25.4s
-
-#ifdef __AARCH64EB__
- rev x5,x5
- rev x7,x7
- rev x9,x9
- rev x11,x11
- rev x13,x13
- rev x15,x15
- rev x17,x17
- rev x20,x20
-#endif
- ld1 {v24.16b,v25.16b,v26.16b,v27.16b},[x1],#64
- eor x5,x5,x6
- eor x7,x7,x8
- eor x9,x9,x10
- eor x11,x11,x12
- eor x13,x13,x14
- eor v0.16b,v0.16b,v24.16b
- eor x15,x15,x16
- eor v1.16b,v1.16b,v25.16b
- eor x17,x17,x19
- eor v2.16b,v2.16b,v26.16b
- eor x20,x20,x21
- eor v3.16b,v3.16b,v27.16b
- ld1 {v24.16b,v25.16b,v26.16b,v27.16b},[x1],#64
-
- stp x5,x7,[x0,#0] // store output
- add x28,x28,#7 // increment counter
- stp x9,x11,[x0,#16]
- stp x13,x15,[x0,#32]
- stp x17,x20,[x0,#48]
- add x0,x0,#64
- st1 {v0.16b,v1.16b,v2.16b,v3.16b},[x0],#64
-
- ld1 {v0.16b,v1.16b,v2.16b,v3.16b},[x1],#64
- eor v4.16b,v4.16b,v24.16b
- eor v5.16b,v5.16b,v25.16b
- eor v6.16b,v6.16b,v26.16b
- eor v7.16b,v7.16b,v27.16b
- st1 {v4.16b,v5.16b,v6.16b,v7.16b},[x0],#64
-
- ld1 {v4.16b,v5.16b,v6.16b,v7.16b},[x1],#64
- eor v8.16b,v8.16b,v0.16b
- ldp q24,q25,[sp,#0]
- eor v9.16b,v9.16b,v1.16b
- ldp q26,q27,[sp,#32]
- eor v10.16b,v10.16b,v2.16b
- eor v11.16b,v11.16b,v3.16b
- st1 {v8.16b,v9.16b,v10.16b,v11.16b},[x0],#64
-
- ld1 {v8.16b,v9.16b,v10.16b,v11.16b},[x1],#64
- eor v12.16b,v12.16b,v4.16b
- eor v13.16b,v13.16b,v5.16b
- eor v14.16b,v14.16b,v6.16b
- eor v15.16b,v15.16b,v7.16b
- st1 {v12.16b,v13.16b,v14.16b,v15.16b},[x0],#64
-
- ld1 {v12.16b,v13.16b,v14.16b,v15.16b},[x1],#64
- eor v16.16b,v16.16b,v8.16b
- eor v17.16b,v17.16b,v9.16b
- eor v18.16b,v18.16b,v10.16b
- eor v19.16b,v19.16b,v11.16b
- st1 {v16.16b,v17.16b,v18.16b,v19.16b},[x0],#64
-
- shl v0.4s,v31.4s,#1 // 4 -> 8
- eor v20.16b,v20.16b,v12.16b
- eor v21.16b,v21.16b,v13.16b
- eor v22.16b,v22.16b,v14.16b
- eor v23.16b,v23.16b,v15.16b
- st1 {v20.16b,v21.16b,v22.16b,v23.16b},[x0],#64
-
- add v27.4s,v27.4s,v0.4s // += 8
- add v28.4s,v28.4s,v0.4s
- add v29.4s,v29.4s,v0.4s
- add v30.4s,v30.4s,v0.4s
-
- b.hs Loop_outer_512_neon
-
- adds x2,x2,#512
- ushr v0.4s,v31.4s,#2 // 4 -> 1
-
- ldp d8,d9,[sp,#128+0] // meet ABI requirements
- ldp d10,d11,[sp,#128+16]
- ldp d12,d13,[sp,#128+32]
- ldp d14,d15,[sp,#128+48]
-
- stp q24,q31,[sp,#0] // wipe off-load area
- stp q24,q31,[sp,#32]
- stp q24,q31,[sp,#64]
-
- b.eq Ldone_512_neon
-
- cmp x2,#192
- sub v27.4s,v27.4s,v0.4s // -= 1
- sub v28.4s,v28.4s,v0.4s
- sub v29.4s,v29.4s,v0.4s
- add sp,sp,#128
- b.hs Loop_outer_neon
-
- eor v25.16b,v25.16b,v25.16b
- eor v26.16b,v26.16b,v26.16b
- eor v27.16b,v27.16b,v27.16b
- eor v28.16b,v28.16b,v28.16b
- eor v29.16b,v29.16b,v29.16b
- eor v30.16b,v30.16b,v30.16b
- b Loop_outer
-
-Ldone_512_neon:
- ldp x19,x20,[x29,#16]
- add sp,sp,#128+64
- ldp x21,x22,[x29,#32]
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- ldp x29,x30,[sp],#96
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
diff --git a/win-aarch64/crypto/cipher_extra/chacha20_poly1305_armv8-win.S b/win-aarch64/crypto/cipher_extra/chacha20_poly1305_armv8-win.S
deleted file mode 100644
index 3314f2c5..00000000
--- a/win-aarch64/crypto/cipher_extra/chacha20_poly1305_armv8-win.S
+++ /dev/null
@@ -1,3015 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-.section .rodata
-
-.align 7
-Lchacha20_consts:
-.byte 'e','x','p','a','n','d',' ','3','2','-','b','y','t','e',' ','k'
-Linc:
-.long 1,2,3,4
-Lrol8:
-.byte 3,0,1,2, 7,4,5,6, 11,8,9,10, 15,12,13,14
-Lclamp:
-.quad 0x0FFFFFFC0FFFFFFF, 0x0FFFFFFC0FFFFFFC
-
-.text
-
-.def Lpoly_hash_ad_internal
- .type 32
-.endef
-.align 6
-Lpoly_hash_ad_internal:
-.cfi_startproc
- cbnz x4, Lpoly_hash_intro
- ret
-
-Lpoly_hash_intro:
- cmp x4, #16
- b.lt Lpoly_hash_ad_tail
- ldp x11, x12, [x3], 16
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- sub x4, x4, #16
- b Lpoly_hash_ad_internal
-
-Lpoly_hash_ad_tail:
- cbz x4, Lpoly_hash_ad_ret
-
- eor v20.16b, v20.16b, v20.16b // Use T0 to load the AAD
- sub x4, x4, #1
-
-Lpoly_hash_tail_16_compose:
- ext v20.16b, v20.16b, v20.16b, #15
- ldrb w11, [x3, x4]
- mov v20.b[0], w11
- subs x4, x4, #1
- b.ge Lpoly_hash_tail_16_compose
- mov x11, v20.d[0]
- mov x12, v20.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
-
-Lpoly_hash_ad_ret:
- ret
-.cfi_endproc
-
-
-/////////////////////////////////
-//
-// void chacha20_poly1305_seal(uint8_t *pt, uint8_t *ct, size_t len_in, uint8_t *ad, size_t len_ad, union open_data *seal_data);
-//
-.globl chacha20_poly1305_seal
-
-.def chacha20_poly1305_seal
- .type 32
-.endef
-.align 6
-chacha20_poly1305_seal:
- AARCH64_SIGN_LINK_REGISTER
-.cfi_startproc
- stp x29, x30, [sp, #-80]!
-.cfi_def_cfa_offset 80
-.cfi_offset w30, -72
-.cfi_offset w29, -80
- mov x29, sp
- // We probably could do .cfi_def_cfa w29, 80 at this point, but since
- // we don't actually use the frame pointer like that, it's probably not
- // worth bothering.
- stp d8, d9, [sp, #16]
- stp d10, d11, [sp, #32]
- stp d12, d13, [sp, #48]
- stp d14, d15, [sp, #64]
-.cfi_offset b15, -8
-.cfi_offset b14, -16
-.cfi_offset b13, -24
-.cfi_offset b12, -32
-.cfi_offset b11, -40
-.cfi_offset b10, -48
-.cfi_offset b9, -56
-.cfi_offset b8, -64
-
- adrp x11, Lchacha20_consts
- add x11, x11, :lo12:Lchacha20_consts
-
- ld1 {v24.16b - v27.16b}, [x11] // Load the CONSTS, INC, ROL8 and CLAMP values
- ld1 {v28.16b - v30.16b}, [x5]
-
- mov x15, #1 // Prepare the Poly1305 state
- mov x8, #0
- mov x9, #0
- mov x10, #0
-
- ldr x12, [x5, #56] // The total cipher text length includes extra_in_len
- add x12, x12, x2
- mov v31.d[0], x4 // Store the input and aad lengths
- mov v31.d[1], x12
-
- cmp x2, #128
- b.le Lseal_128 // Optimization for smaller buffers
-
- // Initially we prepare 5 ChaCha20 blocks. Four to encrypt up to 4 blocks (256 bytes) of plaintext,
- // and one for the Poly1305 R and S keys. The first four blocks (A0-A3..D0-D3) are computed vertically,
- // the fifth block (A4-D4) horizontally.
- ld4r {v0.4s,v1.4s,v2.4s,v3.4s}, [x11]
- mov v4.16b, v24.16b
-
- ld4r {v5.4s,v6.4s,v7.4s,v8.4s}, [x5], #16
- mov v9.16b, v28.16b
-
- ld4r {v10.4s,v11.4s,v12.4s,v13.4s}, [x5], #16
- mov v14.16b, v29.16b
-
- ld4r {v15.4s,v16.4s,v17.4s,v18.4s}, [x5]
- add v15.4s, v15.4s, v25.4s
- mov v19.16b, v30.16b
-
- sub x5, x5, #32
-
- mov x6, #10
-
-.align 5
-Lseal_init_rounds:
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- add v3.4s, v3.4s, v8.4s
- add v4.4s, v4.4s, v9.4s
-
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- eor v18.16b, v18.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
- rev32 v18.8h, v18.8h
- rev32 v19.8h, v19.8h
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- add v13.4s, v13.4s, v18.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v5.16b, v5.16b, v10.16b
- eor v6.16b, v6.16b, v11.16b
- eor v7.16b, v7.16b, v12.16b
- eor v8.16b, v8.16b, v13.16b
- eor v9.16b, v9.16b, v14.16b
-
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- ushr v5.4s, v6.4s, #20
- sli v5.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
- ushr v7.4s, v8.4s, #20
- sli v7.4s, v8.4s, #12
- ushr v8.4s, v9.4s, #20
- sli v8.4s, v9.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- add v3.4s, v3.4s, v7.4s
- add v4.4s, v4.4s, v8.4s
-
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- eor v18.16b, v18.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
- tbl v18.16b, {v18.16b}, v26.16b
- tbl v19.16b, {v19.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- add v13.4s, v13.4s, v18.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v20.16b, v20.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v6.16b, v6.16b, v12.16b
- eor v7.16b, v7.16b, v13.16b
- eor v8.16b, v8.16b, v14.16b
-
- ushr v9.4s, v8.4s, #25
- sli v9.4s, v8.4s, #7
- ushr v8.4s, v7.4s, #25
- sli v8.4s, v7.4s, #7
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v5.4s, #25
- sli v6.4s, v5.4s, #7
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
-
- ext v9.16b, v9.16b, v9.16b, #4
- ext v14.16b, v14.16b, v14.16b, #8
- ext v19.16b, v19.16b, v19.16b, #12
- add v0.4s, v0.4s, v6.4s
- add v1.4s, v1.4s, v7.4s
- add v2.4s, v2.4s, v8.4s
- add v3.4s, v3.4s, v5.4s
- add v4.4s, v4.4s, v9.4s
-
- eor v18.16b, v18.16b, v0.16b
- eor v15.16b, v15.16b, v1.16b
- eor v16.16b, v16.16b, v2.16b
- eor v17.16b, v17.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- rev32 v18.8h, v18.8h
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
- rev32 v19.8h, v19.8h
-
- add v12.4s, v12.4s, v18.4s
- add v13.4s, v13.4s, v15.4s
- add v10.4s, v10.4s, v16.4s
- add v11.4s, v11.4s, v17.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v6.16b, v6.16b, v12.16b
- eor v7.16b, v7.16b, v13.16b
- eor v8.16b, v8.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v9.16b, v9.16b, v14.16b
-
- ushr v20.4s, v6.4s, #20
- sli v20.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
- ushr v7.4s, v8.4s, #20
- sli v7.4s, v8.4s, #12
- ushr v8.4s, v5.4s, #20
- sli v8.4s, v5.4s, #12
- ushr v5.4s, v9.4s, #20
- sli v5.4s, v9.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- add v3.4s, v3.4s, v8.4s
- add v4.4s, v4.4s, v5.4s
-
- eor v18.16b, v18.16b, v0.16b
- eor v15.16b, v15.16b, v1.16b
- eor v16.16b, v16.16b, v2.16b
- eor v17.16b, v17.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- tbl v18.16b, {v18.16b}, v26.16b
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
- tbl v19.16b, {v19.16b}, v26.16b
-
- add v12.4s, v12.4s, v18.4s
- add v13.4s, v13.4s, v15.4s
- add v10.4s, v10.4s, v16.4s
- add v11.4s, v11.4s, v17.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v20.16b, v20.16b, v12.16b
- eor v6.16b, v6.16b, v13.16b
- eor v7.16b, v7.16b, v10.16b
- eor v8.16b, v8.16b, v11.16b
- eor v5.16b, v5.16b, v14.16b
-
- ushr v9.4s, v5.4s, #25
- sli v9.4s, v5.4s, #7
- ushr v5.4s, v8.4s, #25
- sli v5.4s, v8.4s, #7
- ushr v8.4s, v7.4s, #25
- sli v8.4s, v7.4s, #7
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v20.4s, #25
- sli v6.4s, v20.4s, #7
-
- ext v9.16b, v9.16b, v9.16b, #12
- ext v14.16b, v14.16b, v14.16b, #8
- ext v19.16b, v19.16b, v19.16b, #4
- subs x6, x6, #1
- b.hi Lseal_init_rounds
-
- add v15.4s, v15.4s, v25.4s
- mov x11, #4
- dup v20.4s, w11
- add v25.4s, v25.4s, v20.4s
-
- zip1 v20.4s, v0.4s, v1.4s
- zip2 v21.4s, v0.4s, v1.4s
- zip1 v22.4s, v2.4s, v3.4s
- zip2 v23.4s, v2.4s, v3.4s
-
- zip1 v0.2d, v20.2d, v22.2d
- zip2 v1.2d, v20.2d, v22.2d
- zip1 v2.2d, v21.2d, v23.2d
- zip2 v3.2d, v21.2d, v23.2d
-
- zip1 v20.4s, v5.4s, v6.4s
- zip2 v21.4s, v5.4s, v6.4s
- zip1 v22.4s, v7.4s, v8.4s
- zip2 v23.4s, v7.4s, v8.4s
-
- zip1 v5.2d, v20.2d, v22.2d
- zip2 v6.2d, v20.2d, v22.2d
- zip1 v7.2d, v21.2d, v23.2d
- zip2 v8.2d, v21.2d, v23.2d
-
- zip1 v20.4s, v10.4s, v11.4s
- zip2 v21.4s, v10.4s, v11.4s
- zip1 v22.4s, v12.4s, v13.4s
- zip2 v23.4s, v12.4s, v13.4s
-
- zip1 v10.2d, v20.2d, v22.2d
- zip2 v11.2d, v20.2d, v22.2d
- zip1 v12.2d, v21.2d, v23.2d
- zip2 v13.2d, v21.2d, v23.2d
-
- zip1 v20.4s, v15.4s, v16.4s
- zip2 v21.4s, v15.4s, v16.4s
- zip1 v22.4s, v17.4s, v18.4s
- zip2 v23.4s, v17.4s, v18.4s
-
- zip1 v15.2d, v20.2d, v22.2d
- zip2 v16.2d, v20.2d, v22.2d
- zip1 v17.2d, v21.2d, v23.2d
- zip2 v18.2d, v21.2d, v23.2d
-
- add v4.4s, v4.4s, v24.4s
- add v9.4s, v9.4s, v28.4s
- and v4.16b, v4.16b, v27.16b
-
- add v0.4s, v0.4s, v24.4s
- add v5.4s, v5.4s, v28.4s
- add v10.4s, v10.4s, v29.4s
- add v15.4s, v15.4s, v30.4s
-
- add v1.4s, v1.4s, v24.4s
- add v6.4s, v6.4s, v28.4s
- add v11.4s, v11.4s, v29.4s
- add v16.4s, v16.4s, v30.4s
-
- add v2.4s, v2.4s, v24.4s
- add v7.4s, v7.4s, v28.4s
- add v12.4s, v12.4s, v29.4s
- add v17.4s, v17.4s, v30.4s
-
- add v3.4s, v3.4s, v24.4s
- add v8.4s, v8.4s, v28.4s
- add v13.4s, v13.4s, v29.4s
- add v18.4s, v18.4s, v30.4s
-
- mov x16, v4.d[0] // Move the R key to GPRs
- mov x17, v4.d[1]
- mov v27.16b, v9.16b // Store the S key
-
- bl Lpoly_hash_ad_internal
-
- mov x3, x0
- cmp x2, #256
- b.le Lseal_tail
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v0.16b
- eor v21.16b, v21.16b, v5.16b
- eor v22.16b, v22.16b, v10.16b
- eor v23.16b, v23.16b, v15.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v1.16b
- eor v21.16b, v21.16b, v6.16b
- eor v22.16b, v22.16b, v11.16b
- eor v23.16b, v23.16b, v16.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v2.16b
- eor v21.16b, v21.16b, v7.16b
- eor v22.16b, v22.16b, v12.16b
- eor v23.16b, v23.16b, v17.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v3.16b
- eor v21.16b, v21.16b, v8.16b
- eor v22.16b, v22.16b, v13.16b
- eor v23.16b, v23.16b, v18.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- sub x2, x2, #256
-
- mov x6, #4 // In the first run of the loop we need to hash 256 bytes, therefore we hash one block for the first 4 rounds
- mov x7, #6 // and two blocks for the remaining 6, for a total of (1 * 4 + 2 * 6) * 16 = 256
-
-Lseal_main_loop:
- adrp x11, Lchacha20_consts
- add x11, x11, :lo12:Lchacha20_consts
-
- ld4r {v0.4s,v1.4s,v2.4s,v3.4s}, [x11]
- mov v4.16b, v24.16b
-
- ld4r {v5.4s,v6.4s,v7.4s,v8.4s}, [x5], #16
- mov v9.16b, v28.16b
-
- ld4r {v10.4s,v11.4s,v12.4s,v13.4s}, [x5], #16
- mov v14.16b, v29.16b
-
- ld4r {v15.4s,v16.4s,v17.4s,v18.4s}, [x5]
- add v15.4s, v15.4s, v25.4s
- mov v19.16b, v30.16b
-
- eor v20.16b, v20.16b, v20.16b //zero
- not v21.16b, v20.16b // -1
- sub v21.4s, v25.4s, v21.4s // Add +1
- ext v20.16b, v21.16b, v20.16b, #12 // Get the last element (counter)
- add v19.4s, v19.4s, v20.4s
-
- sub x5, x5, #32
-.align 5
-Lseal_main_loop_rounds:
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- add v3.4s, v3.4s, v8.4s
- add v4.4s, v4.4s, v9.4s
-
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- eor v18.16b, v18.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
- rev32 v18.8h, v18.8h
- rev32 v19.8h, v19.8h
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- add v13.4s, v13.4s, v18.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v5.16b, v5.16b, v10.16b
- eor v6.16b, v6.16b, v11.16b
- eor v7.16b, v7.16b, v12.16b
- eor v8.16b, v8.16b, v13.16b
- eor v9.16b, v9.16b, v14.16b
-
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- ushr v5.4s, v6.4s, #20
- sli v5.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
- ushr v7.4s, v8.4s, #20
- sli v7.4s, v8.4s, #12
- ushr v8.4s, v9.4s, #20
- sli v8.4s, v9.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- add v3.4s, v3.4s, v7.4s
- add v4.4s, v4.4s, v8.4s
-
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- eor v18.16b, v18.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
- tbl v18.16b, {v18.16b}, v26.16b
- tbl v19.16b, {v19.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- add v13.4s, v13.4s, v18.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v20.16b, v20.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v6.16b, v6.16b, v12.16b
- eor v7.16b, v7.16b, v13.16b
- eor v8.16b, v8.16b, v14.16b
-
- ushr v9.4s, v8.4s, #25
- sli v9.4s, v8.4s, #7
- ushr v8.4s, v7.4s, #25
- sli v8.4s, v7.4s, #7
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v5.4s, #25
- sli v6.4s, v5.4s, #7
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
-
- ext v9.16b, v9.16b, v9.16b, #4
- ext v14.16b, v14.16b, v14.16b, #8
- ext v19.16b, v19.16b, v19.16b, #12
- ldp x11, x12, [x3], 16
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- add v0.4s, v0.4s, v6.4s
- add v1.4s, v1.4s, v7.4s
- add v2.4s, v2.4s, v8.4s
- add v3.4s, v3.4s, v5.4s
- add v4.4s, v4.4s, v9.4s
-
- eor v18.16b, v18.16b, v0.16b
- eor v15.16b, v15.16b, v1.16b
- eor v16.16b, v16.16b, v2.16b
- eor v17.16b, v17.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- rev32 v18.8h, v18.8h
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
- rev32 v19.8h, v19.8h
-
- add v12.4s, v12.4s, v18.4s
- add v13.4s, v13.4s, v15.4s
- add v10.4s, v10.4s, v16.4s
- add v11.4s, v11.4s, v17.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v6.16b, v6.16b, v12.16b
- eor v7.16b, v7.16b, v13.16b
- eor v8.16b, v8.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v9.16b, v9.16b, v14.16b
-
- ushr v20.4s, v6.4s, #20
- sli v20.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
- ushr v7.4s, v8.4s, #20
- sli v7.4s, v8.4s, #12
- ushr v8.4s, v5.4s, #20
- sli v8.4s, v5.4s, #12
- ushr v5.4s, v9.4s, #20
- sli v5.4s, v9.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- add v3.4s, v3.4s, v8.4s
- add v4.4s, v4.4s, v5.4s
-
- eor v18.16b, v18.16b, v0.16b
- eor v15.16b, v15.16b, v1.16b
- eor v16.16b, v16.16b, v2.16b
- eor v17.16b, v17.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- tbl v18.16b, {v18.16b}, v26.16b
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
- tbl v19.16b, {v19.16b}, v26.16b
-
- add v12.4s, v12.4s, v18.4s
- add v13.4s, v13.4s, v15.4s
- add v10.4s, v10.4s, v16.4s
- add v11.4s, v11.4s, v17.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v20.16b, v20.16b, v12.16b
- eor v6.16b, v6.16b, v13.16b
- eor v7.16b, v7.16b, v10.16b
- eor v8.16b, v8.16b, v11.16b
- eor v5.16b, v5.16b, v14.16b
-
- ushr v9.4s, v5.4s, #25
- sli v9.4s, v5.4s, #7
- ushr v5.4s, v8.4s, #25
- sli v5.4s, v8.4s, #7
- ushr v8.4s, v7.4s, #25
- sli v8.4s, v7.4s, #7
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v20.4s, #25
- sli v6.4s, v20.4s, #7
-
- ext v9.16b, v9.16b, v9.16b, #12
- ext v14.16b, v14.16b, v14.16b, #8
- ext v19.16b, v19.16b, v19.16b, #4
- subs x6, x6, #1
- b.ge Lseal_main_loop_rounds
- ldp x11, x12, [x3], 16
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- subs x7, x7, #1
- b.gt Lseal_main_loop_rounds
-
- eor v20.16b, v20.16b, v20.16b //zero
- not v21.16b, v20.16b // -1
- sub v21.4s, v25.4s, v21.4s // Add +1
- ext v20.16b, v21.16b, v20.16b, #12 // Get the last element (counter)
- add v19.4s, v19.4s, v20.4s
-
- add v15.4s, v15.4s, v25.4s
- mov x11, #5
- dup v20.4s, w11
- add v25.4s, v25.4s, v20.4s
-
- zip1 v20.4s, v0.4s, v1.4s
- zip2 v21.4s, v0.4s, v1.4s
- zip1 v22.4s, v2.4s, v3.4s
- zip2 v23.4s, v2.4s, v3.4s
-
- zip1 v0.2d, v20.2d, v22.2d
- zip2 v1.2d, v20.2d, v22.2d
- zip1 v2.2d, v21.2d, v23.2d
- zip2 v3.2d, v21.2d, v23.2d
-
- zip1 v20.4s, v5.4s, v6.4s
- zip2 v21.4s, v5.4s, v6.4s
- zip1 v22.4s, v7.4s, v8.4s
- zip2 v23.4s, v7.4s, v8.4s
-
- zip1 v5.2d, v20.2d, v22.2d
- zip2 v6.2d, v20.2d, v22.2d
- zip1 v7.2d, v21.2d, v23.2d
- zip2 v8.2d, v21.2d, v23.2d
-
- zip1 v20.4s, v10.4s, v11.4s
- zip2 v21.4s, v10.4s, v11.4s
- zip1 v22.4s, v12.4s, v13.4s
- zip2 v23.4s, v12.4s, v13.4s
-
- zip1 v10.2d, v20.2d, v22.2d
- zip2 v11.2d, v20.2d, v22.2d
- zip1 v12.2d, v21.2d, v23.2d
- zip2 v13.2d, v21.2d, v23.2d
-
- zip1 v20.4s, v15.4s, v16.4s
- zip2 v21.4s, v15.4s, v16.4s
- zip1 v22.4s, v17.4s, v18.4s
- zip2 v23.4s, v17.4s, v18.4s
-
- zip1 v15.2d, v20.2d, v22.2d
- zip2 v16.2d, v20.2d, v22.2d
- zip1 v17.2d, v21.2d, v23.2d
- zip2 v18.2d, v21.2d, v23.2d
-
- add v0.4s, v0.4s, v24.4s
- add v5.4s, v5.4s, v28.4s
- add v10.4s, v10.4s, v29.4s
- add v15.4s, v15.4s, v30.4s
-
- add v1.4s, v1.4s, v24.4s
- add v6.4s, v6.4s, v28.4s
- add v11.4s, v11.4s, v29.4s
- add v16.4s, v16.4s, v30.4s
-
- add v2.4s, v2.4s, v24.4s
- add v7.4s, v7.4s, v28.4s
- add v12.4s, v12.4s, v29.4s
- add v17.4s, v17.4s, v30.4s
-
- add v3.4s, v3.4s, v24.4s
- add v8.4s, v8.4s, v28.4s
- add v13.4s, v13.4s, v29.4s
- add v18.4s, v18.4s, v30.4s
-
- add v4.4s, v4.4s, v24.4s
- add v9.4s, v9.4s, v28.4s
- add v14.4s, v14.4s, v29.4s
- add v19.4s, v19.4s, v30.4s
-
- cmp x2, #320
- b.le Lseal_tail
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v0.16b
- eor v21.16b, v21.16b, v5.16b
- eor v22.16b, v22.16b, v10.16b
- eor v23.16b, v23.16b, v15.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v1.16b
- eor v21.16b, v21.16b, v6.16b
- eor v22.16b, v22.16b, v11.16b
- eor v23.16b, v23.16b, v16.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v2.16b
- eor v21.16b, v21.16b, v7.16b
- eor v22.16b, v22.16b, v12.16b
- eor v23.16b, v23.16b, v17.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v3.16b
- eor v21.16b, v21.16b, v8.16b
- eor v22.16b, v22.16b, v13.16b
- eor v23.16b, v23.16b, v18.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v4.16b
- eor v21.16b, v21.16b, v9.16b
- eor v22.16b, v22.16b, v14.16b
- eor v23.16b, v23.16b, v19.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- sub x2, x2, #320
-
- mov x6, #0
- mov x7, #10 // For the remainder of the loop we always hash and encrypt 320 bytes per iteration
-
- b Lseal_main_loop
-
-Lseal_tail:
- // This part of the function handles the storage and authentication of the last [0,320) bytes
- // We assume A0-A4 ... D0-D4 hold at least inl (320 max) bytes of the stream data.
- cmp x2, #64
- b.lt Lseal_tail_64
-
- // Store and authenticate 64B blocks per iteration
- ld1 {v20.16b - v23.16b}, [x1], #64
-
- eor v20.16b, v20.16b, v0.16b
- eor v21.16b, v21.16b, v5.16b
- eor v22.16b, v22.16b, v10.16b
- eor v23.16b, v23.16b, v15.16b
- mov x11, v20.d[0]
- mov x12, v20.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- mov x11, v21.d[0]
- mov x12, v21.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- mov x11, v22.d[0]
- mov x12, v22.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- mov x11, v23.d[0]
- mov x12, v23.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- st1 {v20.16b - v23.16b}, [x0], #64
- sub x2, x2, #64
-
- // Shift the state left by 64 bytes for the next iteration of the loop
- mov v0.16b, v1.16b
- mov v5.16b, v6.16b
- mov v10.16b, v11.16b
- mov v15.16b, v16.16b
-
- mov v1.16b, v2.16b
- mov v6.16b, v7.16b
- mov v11.16b, v12.16b
- mov v16.16b, v17.16b
-
- mov v2.16b, v3.16b
- mov v7.16b, v8.16b
- mov v12.16b, v13.16b
- mov v17.16b, v18.16b
-
- mov v3.16b, v4.16b
- mov v8.16b, v9.16b
- mov v13.16b, v14.16b
- mov v18.16b, v19.16b
-
- b Lseal_tail
-
-Lseal_tail_64:
-	ldp	x3, x4, [x5, #48]		// extra_in_ptr and extra_in_len
-
- // Here we handle the last [0,64) bytes of plaintext
- cmp x2, #16
- b.lt Lseal_tail_16
- // Each iteration encrypt and authenticate a 16B block
- ld1 {v20.16b}, [x1], #16
- eor v20.16b, v20.16b, v0.16b
- mov x11, v20.d[0]
- mov x12, v20.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- st1 {v20.16b}, [x0], #16
-
- sub x2, x2, #16
-
- // Shift the state left by 16 bytes for the next iteration of the loop
- mov v0.16b, v5.16b
- mov v5.16b, v10.16b
- mov v10.16b, v15.16b
-
- b Lseal_tail_64
-
-Lseal_tail_16:
- // Here we handle the last [0,16) bytes of ciphertext that require a padded block
- cbz x2, Lseal_hash_extra
-
- eor v20.16b, v20.16b, v20.16b // Use T0 to load the plaintext/extra in
- eor v21.16b, v21.16b, v21.16b // Use T1 to generate an AND mask that will only mask the ciphertext bytes
- not v22.16b, v20.16b
-
- mov x6, x2
- add x1, x1, x2
-
- cbz x4, Lseal_tail_16_compose // No extra data to pad with, zero padding
-
- mov x7, #16 // We need to load some extra_in first for padding
- sub x7, x7, x2
- cmp x4, x7
- csel x7, x4, x7, lt // Load the minimum of extra_in_len and the amount needed to fill the register
- mov x12, x7
- add x3, x3, x7
- sub x4, x4, x7
-
-Lseal_tail16_compose_extra_in:
- ext v20.16b, v20.16b, v20.16b, #15
- ldrb w11, [x3, #-1]!
- mov v20.b[0], w11
- subs x7, x7, #1
- b.gt Lseal_tail16_compose_extra_in
-
- add x3, x3, x12
-
-Lseal_tail_16_compose:
- ext v20.16b, v20.16b, v20.16b, #15
- ldrb w11, [x1, #-1]!
- mov v20.b[0], w11
- ext v21.16b, v22.16b, v21.16b, #15
- subs x2, x2, #1
- b.gt Lseal_tail_16_compose
-
- and v0.16b, v0.16b, v21.16b
- eor v20.16b, v20.16b, v0.16b
- mov v21.16b, v20.16b
-
-Lseal_tail_16_store:
- umov w11, v20.b[0]
- strb w11, [x0], #1
- ext v20.16b, v20.16b, v20.16b, #1
- subs x6, x6, #1
- b.gt Lseal_tail_16_store
-
- // Hash in the final ct block concatenated with extra_in
- mov x11, v21.d[0]
- mov x12, v21.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
-
-Lseal_hash_extra:
- cbz x4, Lseal_finalize
-
-Lseal_hash_extra_loop:
- cmp x4, #16
- b.lt Lseal_hash_extra_tail
- ld1 {v20.16b}, [x3], #16
- mov x11, v20.d[0]
- mov x12, v20.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- sub x4, x4, #16
- b Lseal_hash_extra_loop
-
-Lseal_hash_extra_tail:
- cbz x4, Lseal_finalize
- eor v20.16b, v20.16b, v20.16b // Use T0 to load the remaining extra ciphertext
- add x3, x3, x4
-
-Lseal_hash_extra_load:
- ext v20.16b, v20.16b, v20.16b, #15
- ldrb w11, [x3, #-1]!
- mov v20.b[0], w11
- subs x4, x4, #1
- b.gt Lseal_hash_extra_load
-
-	// Hash in the final padded extra_in block
- mov x11, v20.d[0]
- mov x12, v20.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
-
-Lseal_finalize:
- mov x11, v31.d[0]
- mov x12, v31.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- // Final reduction step
- sub x12, xzr, x15
- orr x13, xzr, #3
- subs x11, x8, #-5
- sbcs x12, x9, x12
- sbcs x13, x10, x13
- csel x8, x11, x8, cs
- csel x9, x12, x9, cs
- csel x10, x13, x10, cs
- mov x11, v27.d[0]
- mov x12, v27.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
-
- stp x8, x9, [x5]
-
- ldp d8, d9, [sp, #16]
- ldp d10, d11, [sp, #32]
- ldp d12, d13, [sp, #48]
- ldp d14, d15, [sp, #64]
-.cfi_restore b15
-.cfi_restore b14
-.cfi_restore b13
-.cfi_restore b12
-.cfi_restore b11
-.cfi_restore b10
-.cfi_restore b9
-.cfi_restore b8
- ldp x29, x30, [sp], 80
-.cfi_restore w29
-.cfi_restore w30
-.cfi_def_cfa_offset 0
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-Lseal_128:
- // On some architectures preparing 5 blocks for small buffers is wasteful
- eor v25.16b, v25.16b, v25.16b
- mov x11, #1
- mov v25.s[0], w11
- mov v0.16b, v24.16b
- mov v1.16b, v24.16b
- mov v2.16b, v24.16b
- mov v5.16b, v28.16b
- mov v6.16b, v28.16b
- mov v7.16b, v28.16b
- mov v10.16b, v29.16b
- mov v11.16b, v29.16b
- mov v12.16b, v29.16b
- mov v17.16b, v30.16b
- add v15.4s, v17.4s, v25.4s
- add v16.4s, v15.4s, v25.4s
-
- mov x6, #10
-
-Lseal_128_rounds:
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v5.16b, v5.16b, v10.16b
- eor v6.16b, v6.16b, v11.16b
- eor v7.16b, v7.16b, v12.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- ushr v5.4s, v6.4s, #20
- sli v5.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v20.16b, v20.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v6.16b, v6.16b, v12.16b
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v5.4s, #25
- sli v6.4s, v5.4s, #7
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
-
- ext v5.16b, v5.16b, v5.16b, #4
- ext v6.16b, v6.16b, v6.16b, #4
- ext v7.16b, v7.16b, v7.16b, #4
-
- ext v10.16b, v10.16b, v10.16b, #8
- ext v11.16b, v11.16b, v11.16b, #8
- ext v12.16b, v12.16b, v12.16b, #8
-
- ext v15.16b, v15.16b, v15.16b, #12
- ext v16.16b, v16.16b, v16.16b, #12
- ext v17.16b, v17.16b, v17.16b, #12
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v5.16b, v5.16b, v10.16b
- eor v6.16b, v6.16b, v11.16b
- eor v7.16b, v7.16b, v12.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- ushr v5.4s, v6.4s, #20
- sli v5.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v20.16b, v20.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v6.16b, v6.16b, v12.16b
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v5.4s, #25
- sli v6.4s, v5.4s, #7
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
-
- ext v5.16b, v5.16b, v5.16b, #12
- ext v6.16b, v6.16b, v6.16b, #12
- ext v7.16b, v7.16b, v7.16b, #12
-
- ext v10.16b, v10.16b, v10.16b, #8
- ext v11.16b, v11.16b, v11.16b, #8
- ext v12.16b, v12.16b, v12.16b, #8
-
- ext v15.16b, v15.16b, v15.16b, #4
- ext v16.16b, v16.16b, v16.16b, #4
- ext v17.16b, v17.16b, v17.16b, #4
- subs x6, x6, #1
- b.hi Lseal_128_rounds
-
- add v0.4s, v0.4s, v24.4s
- add v1.4s, v1.4s, v24.4s
- add v2.4s, v2.4s, v24.4s
-
- add v5.4s, v5.4s, v28.4s
- add v6.4s, v6.4s, v28.4s
- add v7.4s, v7.4s, v28.4s
-
- // Only the first 32 bytes of the third block (counter = 0) are needed,
- // so skip updating v12 and v17.
- add v10.4s, v10.4s, v29.4s
- add v11.4s, v11.4s, v29.4s
-
- add v30.4s, v30.4s, v25.4s
- add v15.4s, v15.4s, v30.4s
- add v30.4s, v30.4s, v25.4s
- add v16.4s, v16.4s, v30.4s
-
- and v2.16b, v2.16b, v27.16b
- mov x16, v2.d[0] // Move the R key to GPRs
- mov x17, v2.d[1]
- mov v27.16b, v7.16b // Store the S key
-
- bl Lpoly_hash_ad_internal
- b Lseal_tail
-.cfi_endproc
-
-
-/////////////////////////////////
-//
-// void chacha20_poly1305_open(uint8_t *pt, uint8_t *ct, size_t len_in, uint8_t *ad, size_t len_ad, union open_data *aead_data);
-//
-.globl chacha20_poly1305_open
-
-.def chacha20_poly1305_open
- .type 32
-.endef
-.align 6
-chacha20_poly1305_open:
- AARCH64_SIGN_LINK_REGISTER
-.cfi_startproc
- stp x29, x30, [sp, #-80]!
-.cfi_def_cfa_offset 80
-.cfi_offset w30, -72
-.cfi_offset w29, -80
- mov x29, sp
- // We probably could do .cfi_def_cfa w29, 80 at this point, but since
- // we don't actually use the frame pointer like that, it's probably not
- // worth bothering.
- stp d8, d9, [sp, #16]
- stp d10, d11, [sp, #32]
- stp d12, d13, [sp, #48]
- stp d14, d15, [sp, #64]
-.cfi_offset b15, -8
-.cfi_offset b14, -16
-.cfi_offset b13, -24
-.cfi_offset b12, -32
-.cfi_offset b11, -40
-.cfi_offset b10, -48
-.cfi_offset b9, -56
-.cfi_offset b8, -64
-
- adrp x11, Lchacha20_consts
- add x11, x11, :lo12:Lchacha20_consts
-
- ld1 {v24.16b - v27.16b}, [x11] // Load the CONSTS, INC, ROL8 and CLAMP values
- ld1 {v28.16b - v30.16b}, [x5]
-
- mov x15, #1 // Prepare the Poly1305 state
- mov x8, #0
- mov x9, #0
- mov x10, #0
-
- mov v31.d[0], x4 // Store the input and aad lengths
- mov v31.d[1], x2
-
- cmp x2, #128
- b.le Lopen_128 // Optimization for smaller buffers
-
- // Initially we prepare a single ChaCha20 block for the Poly1305 R and S keys
- mov v0.16b, v24.16b
- mov v5.16b, v28.16b
- mov v10.16b, v29.16b
- mov v15.16b, v30.16b
-
- mov x6, #10
-
-.align 5
-Lopen_init_rounds:
- add v0.4s, v0.4s, v5.4s
- eor v15.16b, v15.16b, v0.16b
- rev32 v15.8h, v15.8h
-
- add v10.4s, v10.4s, v15.4s
- eor v5.16b, v5.16b, v10.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- add v0.4s, v0.4s, v20.4s
- eor v15.16b, v15.16b, v0.16b
- tbl v15.16b, {v15.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- eor v20.16b, v20.16b, v10.16b
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
- ext v5.16b, v5.16b, v5.16b, #4
- ext v10.16b, v10.16b, v10.16b, #8
- ext v15.16b, v15.16b, v15.16b, #12
- add v0.4s, v0.4s, v5.4s
- eor v15.16b, v15.16b, v0.16b
- rev32 v15.8h, v15.8h
-
- add v10.4s, v10.4s, v15.4s
- eor v5.16b, v5.16b, v10.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- add v0.4s, v0.4s, v20.4s
- eor v15.16b, v15.16b, v0.16b
- tbl v15.16b, {v15.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- eor v20.16b, v20.16b, v10.16b
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
- ext v5.16b, v5.16b, v5.16b, #12
- ext v10.16b, v10.16b, v10.16b, #8
- ext v15.16b, v15.16b, v15.16b, #4
- subs x6, x6, #1
- b.hi Lopen_init_rounds
-
- add v0.4s, v0.4s, v24.4s
- add v5.4s, v5.4s, v28.4s
-
- and v0.16b, v0.16b, v27.16b
- mov x16, v0.d[0] // Move the R key to GPRs
- mov x17, v0.d[1]
- mov v27.16b, v5.16b // Store the S key
-
- bl Lpoly_hash_ad_internal
-
-Lopen_ad_done:
- mov x3, x1
-
-// Each iteration of the loop hashes 320 bytes and prepares the stream for 320 bytes
-Lopen_main_loop:
-
- cmp x2, #192
- b.lt Lopen_tail
-
- adrp x11, Lchacha20_consts
- add x11, x11, :lo12:Lchacha20_consts
-
- ld4r {v0.4s,v1.4s,v2.4s,v3.4s}, [x11]
- mov v4.16b, v24.16b
-
- ld4r {v5.4s,v6.4s,v7.4s,v8.4s}, [x5], #16
- mov v9.16b, v28.16b
-
- ld4r {v10.4s,v11.4s,v12.4s,v13.4s}, [x5], #16
- mov v14.16b, v29.16b
-
- ld4r {v15.4s,v16.4s,v17.4s,v18.4s}, [x5]
- sub x5, x5, #32
- add v15.4s, v15.4s, v25.4s
- mov v19.16b, v30.16b
-
- eor v20.16b, v20.16b, v20.16b //zero
- not v21.16b, v20.16b // -1
- sub v21.4s, v25.4s, v21.4s // Add +1
- ext v20.16b, v21.16b, v20.16b, #12 // Get the last element (counter)
- add v19.4s, v19.4s, v20.4s
-
- lsr x4, x2, #4 // How many whole blocks we have to hash, will always be at least 12
- sub x4, x4, #10
-
- mov x7, #10
- subs x6, x7, x4
- subs x6, x7, x4 // itr1 can be negative if we have more than 320 bytes to hash
- csel x7, x7, x4, le // if itr1 is zero or less, itr2 should be 10 to indicate all 10 rounds are full
-
- cbz x7, Lopen_main_loop_rounds_short
-
-.align 5
-Lopen_main_loop_rounds:
- ldp x11, x12, [x3], 16
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
-Lopen_main_loop_rounds_short:
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- add v3.4s, v3.4s, v8.4s
- add v4.4s, v4.4s, v9.4s
-
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- eor v18.16b, v18.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
- rev32 v18.8h, v18.8h
- rev32 v19.8h, v19.8h
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- add v13.4s, v13.4s, v18.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v5.16b, v5.16b, v10.16b
- eor v6.16b, v6.16b, v11.16b
- eor v7.16b, v7.16b, v12.16b
- eor v8.16b, v8.16b, v13.16b
- eor v9.16b, v9.16b, v14.16b
-
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- ushr v5.4s, v6.4s, #20
- sli v5.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
- ushr v7.4s, v8.4s, #20
- sli v7.4s, v8.4s, #12
- ushr v8.4s, v9.4s, #20
- sli v8.4s, v9.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- add v3.4s, v3.4s, v7.4s
- add v4.4s, v4.4s, v8.4s
-
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- eor v18.16b, v18.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
- tbl v18.16b, {v18.16b}, v26.16b
- tbl v19.16b, {v19.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- add v13.4s, v13.4s, v18.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v20.16b, v20.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v6.16b, v6.16b, v12.16b
- eor v7.16b, v7.16b, v13.16b
- eor v8.16b, v8.16b, v14.16b
-
- ushr v9.4s, v8.4s, #25
- sli v9.4s, v8.4s, #7
- ushr v8.4s, v7.4s, #25
- sli v8.4s, v7.4s, #7
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v5.4s, #25
- sli v6.4s, v5.4s, #7
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
-
- ext v9.16b, v9.16b, v9.16b, #4
- ext v14.16b, v14.16b, v14.16b, #8
- ext v19.16b, v19.16b, v19.16b, #12
- ldp x11, x12, [x3], 16
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- add v0.4s, v0.4s, v6.4s
- add v1.4s, v1.4s, v7.4s
- add v2.4s, v2.4s, v8.4s
- add v3.4s, v3.4s, v5.4s
- add v4.4s, v4.4s, v9.4s
-
- eor v18.16b, v18.16b, v0.16b
- eor v15.16b, v15.16b, v1.16b
- eor v16.16b, v16.16b, v2.16b
- eor v17.16b, v17.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- rev32 v18.8h, v18.8h
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
- rev32 v19.8h, v19.8h
-
- add v12.4s, v12.4s, v18.4s
- add v13.4s, v13.4s, v15.4s
- add v10.4s, v10.4s, v16.4s
- add v11.4s, v11.4s, v17.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v6.16b, v6.16b, v12.16b
- eor v7.16b, v7.16b, v13.16b
- eor v8.16b, v8.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v9.16b, v9.16b, v14.16b
-
- ushr v20.4s, v6.4s, #20
- sli v20.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
- ushr v7.4s, v8.4s, #20
- sli v7.4s, v8.4s, #12
- ushr v8.4s, v5.4s, #20
- sli v8.4s, v5.4s, #12
- ushr v5.4s, v9.4s, #20
- sli v5.4s, v9.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- add v3.4s, v3.4s, v8.4s
- add v4.4s, v4.4s, v5.4s
-
- eor v18.16b, v18.16b, v0.16b
- eor v15.16b, v15.16b, v1.16b
- eor v16.16b, v16.16b, v2.16b
- eor v17.16b, v17.16b, v3.16b
- eor v19.16b, v19.16b, v4.16b
-
- tbl v18.16b, {v18.16b}, v26.16b
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
- tbl v19.16b, {v19.16b}, v26.16b
-
- add v12.4s, v12.4s, v18.4s
- add v13.4s, v13.4s, v15.4s
- add v10.4s, v10.4s, v16.4s
- add v11.4s, v11.4s, v17.4s
- add v14.4s, v14.4s, v19.4s
-
- eor v20.16b, v20.16b, v12.16b
- eor v6.16b, v6.16b, v13.16b
- eor v7.16b, v7.16b, v10.16b
- eor v8.16b, v8.16b, v11.16b
- eor v5.16b, v5.16b, v14.16b
-
- ushr v9.4s, v5.4s, #25
- sli v9.4s, v5.4s, #7
- ushr v5.4s, v8.4s, #25
- sli v5.4s, v8.4s, #7
- ushr v8.4s, v7.4s, #25
- sli v8.4s, v7.4s, #7
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v20.4s, #25
- sli v6.4s, v20.4s, #7
-
- ext v9.16b, v9.16b, v9.16b, #12
- ext v14.16b, v14.16b, v14.16b, #8
- ext v19.16b, v19.16b, v19.16b, #4
- subs x7, x7, #1
- b.gt Lopen_main_loop_rounds
- subs x6, x6, #1
- b.ge Lopen_main_loop_rounds_short
-
- eor v20.16b, v20.16b, v20.16b //zero
- not v21.16b, v20.16b // -1
- sub v21.4s, v25.4s, v21.4s // Add +1
- ext v20.16b, v21.16b, v20.16b, #12 // Get the last element (counter)
- add v19.4s, v19.4s, v20.4s
-
- add v15.4s, v15.4s, v25.4s
- mov x11, #5
- dup v20.4s, w11
- add v25.4s, v25.4s, v20.4s
-
- zip1 v20.4s, v0.4s, v1.4s
- zip2 v21.4s, v0.4s, v1.4s
- zip1 v22.4s, v2.4s, v3.4s
- zip2 v23.4s, v2.4s, v3.4s
-
- zip1 v0.2d, v20.2d, v22.2d
- zip2 v1.2d, v20.2d, v22.2d
- zip1 v2.2d, v21.2d, v23.2d
- zip2 v3.2d, v21.2d, v23.2d
-
- zip1 v20.4s, v5.4s, v6.4s
- zip2 v21.4s, v5.4s, v6.4s
- zip1 v22.4s, v7.4s, v8.4s
- zip2 v23.4s, v7.4s, v8.4s
-
- zip1 v5.2d, v20.2d, v22.2d
- zip2 v6.2d, v20.2d, v22.2d
- zip1 v7.2d, v21.2d, v23.2d
- zip2 v8.2d, v21.2d, v23.2d
-
- zip1 v20.4s, v10.4s, v11.4s
- zip2 v21.4s, v10.4s, v11.4s
- zip1 v22.4s, v12.4s, v13.4s
- zip2 v23.4s, v12.4s, v13.4s
-
- zip1 v10.2d, v20.2d, v22.2d
- zip2 v11.2d, v20.2d, v22.2d
- zip1 v12.2d, v21.2d, v23.2d
- zip2 v13.2d, v21.2d, v23.2d
-
- zip1 v20.4s, v15.4s, v16.4s
- zip2 v21.4s, v15.4s, v16.4s
- zip1 v22.4s, v17.4s, v18.4s
- zip2 v23.4s, v17.4s, v18.4s
-
- zip1 v15.2d, v20.2d, v22.2d
- zip2 v16.2d, v20.2d, v22.2d
- zip1 v17.2d, v21.2d, v23.2d
- zip2 v18.2d, v21.2d, v23.2d
-
- add v0.4s, v0.4s, v24.4s
- add v5.4s, v5.4s, v28.4s
- add v10.4s, v10.4s, v29.4s
- add v15.4s, v15.4s, v30.4s
-
- add v1.4s, v1.4s, v24.4s
- add v6.4s, v6.4s, v28.4s
- add v11.4s, v11.4s, v29.4s
- add v16.4s, v16.4s, v30.4s
-
- add v2.4s, v2.4s, v24.4s
- add v7.4s, v7.4s, v28.4s
- add v12.4s, v12.4s, v29.4s
- add v17.4s, v17.4s, v30.4s
-
- add v3.4s, v3.4s, v24.4s
- add v8.4s, v8.4s, v28.4s
- add v13.4s, v13.4s, v29.4s
- add v18.4s, v18.4s, v30.4s
-
- add v4.4s, v4.4s, v24.4s
- add v9.4s, v9.4s, v28.4s
- add v14.4s, v14.4s, v29.4s
- add v19.4s, v19.4s, v30.4s
-
- // We can always safely store 192 bytes
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v0.16b
- eor v21.16b, v21.16b, v5.16b
- eor v22.16b, v22.16b, v10.16b
- eor v23.16b, v23.16b, v15.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v1.16b
- eor v21.16b, v21.16b, v6.16b
- eor v22.16b, v22.16b, v11.16b
- eor v23.16b, v23.16b, v16.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v2.16b
- eor v21.16b, v21.16b, v7.16b
- eor v22.16b, v22.16b, v12.16b
- eor v23.16b, v23.16b, v17.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- sub x2, x2, #192
-
- mov v0.16b, v3.16b
- mov v5.16b, v8.16b
- mov v10.16b, v13.16b
- mov v15.16b, v18.16b
-
- cmp x2, #64
- b.lt Lopen_tail_64_store
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v3.16b
- eor v21.16b, v21.16b, v8.16b
- eor v22.16b, v22.16b, v13.16b
- eor v23.16b, v23.16b, v18.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- sub x2, x2, #64
-
- mov v0.16b, v4.16b
- mov v5.16b, v9.16b
- mov v10.16b, v14.16b
- mov v15.16b, v19.16b
-
- cmp x2, #64
- b.lt Lopen_tail_64_store
-
- ld1 {v20.16b - v23.16b}, [x1], #64
- eor v20.16b, v20.16b, v4.16b
- eor v21.16b, v21.16b, v9.16b
- eor v22.16b, v22.16b, v14.16b
- eor v23.16b, v23.16b, v19.16b
- st1 {v20.16b - v23.16b}, [x0], #64
-
- sub x2, x2, #64
- b Lopen_main_loop
-
-Lopen_tail:
-
- cbz x2, Lopen_finalize
-
- lsr x4, x2, #4 // How many whole blocks we have to hash
-
- cmp x2, #64
- b.le Lopen_tail_64
- cmp x2, #128
- b.le Lopen_tail_128
-
-Lopen_tail_192:
- // We need three more blocks
- mov v0.16b, v24.16b
- mov v1.16b, v24.16b
- mov v2.16b, v24.16b
- mov v5.16b, v28.16b
- mov v6.16b, v28.16b
- mov v7.16b, v28.16b
- mov v10.16b, v29.16b
- mov v11.16b, v29.16b
- mov v12.16b, v29.16b
- mov v15.16b, v30.16b
- mov v16.16b, v30.16b
- mov v17.16b, v30.16b
- eor v23.16b, v23.16b, v23.16b
- eor v21.16b, v21.16b, v21.16b
- ins v23.s[0], v25.s[0]
- ins v21.d[0], x15
-
- add v22.4s, v23.4s, v21.4s
- add v21.4s, v22.4s, v21.4s
-
- add v15.4s, v15.4s, v21.4s
- add v16.4s, v16.4s, v23.4s
- add v17.4s, v17.4s, v22.4s
-
- mov x7, #10
- subs x6, x7, x4 // itr1 can be negative if we have more than 160 bytes to hash
- csel x7, x7, x4, le // if itr1 is zero or less, itr2 should be 10, to indicate that all 10 rounds hash data
- sub x4, x4, x7
-
- cbz x7, Lopen_tail_192_rounds_no_hash
-
-Lopen_tail_192_rounds:
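- // Poly1305 block: add 16 bytes of input to the accumulator [x10:x9:x8],
- // multiply by r = [x17:x16] and partially reduce mod 2^130 - 5.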
- ldp x11, x12, [x3], 16
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
-Lopen_tail_192_rounds_no_hash:
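- // One ChaCha20 double round (column round, then diagonal round) over three blocks.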
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v5.16b, v5.16b, v10.16b
- eor v6.16b, v6.16b, v11.16b
- eor v7.16b, v7.16b, v12.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- ushr v5.4s, v6.4s, #20
- sli v5.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v20.16b, v20.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v6.16b, v6.16b, v12.16b
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v5.4s, #25
- sli v6.4s, v5.4s, #7
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
-
- ext v5.16b, v5.16b, v5.16b, #4
- ext v6.16b, v6.16b, v6.16b, #4
- ext v7.16b, v7.16b, v7.16b, #4
-
- ext v10.16b, v10.16b, v10.16b, #8
- ext v11.16b, v11.16b, v11.16b, #8
- ext v12.16b, v12.16b, v12.16b, #8
-
- ext v15.16b, v15.16b, v15.16b, #12
- ext v16.16b, v16.16b, v16.16b, #12
- ext v17.16b, v17.16b, v17.16b, #12
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v5.16b, v5.16b, v10.16b
- eor v6.16b, v6.16b, v11.16b
- eor v7.16b, v7.16b, v12.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- ushr v5.4s, v6.4s, #20
- sli v5.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v20.16b, v20.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v6.16b, v6.16b, v12.16b
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v5.4s, #25
- sli v6.4s, v5.4s, #7
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
-
- ext v5.16b, v5.16b, v5.16b, #12
- ext v6.16b, v6.16b, v6.16b, #12
- ext v7.16b, v7.16b, v7.16b, #12
-
- ext v10.16b, v10.16b, v10.16b, #8
- ext v11.16b, v11.16b, v11.16b, #8
- ext v12.16b, v12.16b, v12.16b, #8
-
- ext v15.16b, v15.16b, v15.16b, #4
- ext v16.16b, v16.16b, v16.16b, #4
- ext v17.16b, v17.16b, v17.16b, #4
- subs x7, x7, #1
- b.gt Lopen_tail_192_rounds
- subs x6, x6, #1
- b.ge Lopen_tail_192_rounds_no_hash
-
- // We hashed at most 160 bytes, so up to 32 bytes may still be left
-Lopen_tail_192_hash:
- cbz x4, Lopen_tail_192_hash_done
- ldp x11, x12, [x3], 16
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- sub x4, x4, #1
- b Lopen_tail_192_hash
-
-Lopen_tail_192_hash_done:
-
- add v0.4s, v0.4s, v24.4s
- add v1.4s, v1.4s, v24.4s
- add v2.4s, v2.4s, v24.4s
- add v5.4s, v5.4s, v28.4s
- add v6.4s, v6.4s, v28.4s
- add v7.4s, v7.4s, v28.4s
- add v10.4s, v10.4s, v29.4s
- add v11.4s, v11.4s, v29.4s
- add v12.4s, v12.4s, v29.4s
- add v15.4s, v15.4s, v30.4s
- add v16.4s, v16.4s, v30.4s
- add v17.4s, v17.4s, v30.4s
-
- add v15.4s, v15.4s, v21.4s
- add v16.4s, v16.4s, v23.4s
- add v17.4s, v17.4s, v22.4s
-
- ld1 {v20.16b - v23.16b}, [x1], #64
-
- eor v20.16b, v20.16b, v1.16b
- eor v21.16b, v21.16b, v6.16b
- eor v22.16b, v22.16b, v11.16b
- eor v23.16b, v23.16b, v16.16b
-
- st1 {v20.16b - v23.16b}, [x0], #64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
-
- eor v20.16b, v20.16b, v2.16b
- eor v21.16b, v21.16b, v7.16b
- eor v22.16b, v22.16b, v12.16b
- eor v23.16b, v23.16b, v17.16b
-
- st1 {v20.16b - v23.16b}, [x0], #64
-
- sub x2, x2, #128
- b Lopen_tail_64_store
-
-Lopen_tail_128:
- // We need two more blocks
- mov v0.16b, v24.16b
- mov v1.16b, v24.16b
- mov v5.16b, v28.16b
- mov v6.16b, v28.16b
- mov v10.16b, v29.16b
- mov v11.16b, v29.16b
- mov v15.16b, v30.16b
- mov v16.16b, v30.16b
- eor v23.16b, v23.16b, v23.16b
- eor v22.16b, v22.16b, v22.16b
- ins v23.s[0], v25.s[0]
- ins v22.d[0], x15
- add v22.4s, v22.4s, v23.4s
-
- add v15.4s, v15.4s, v22.4s
- add v16.4s, v16.4s, v23.4s
-
- mov x6, #10
- sub x6, x6, x4
-
-Lopen_tail_128_rounds:
- add v0.4s, v0.4s, v5.4s
- eor v15.16b, v15.16b, v0.16b
- rev32 v15.8h, v15.8h
-
- add v10.4s, v10.4s, v15.4s
- eor v5.16b, v5.16b, v10.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- add v0.4s, v0.4s, v20.4s
- eor v15.16b, v15.16b, v0.16b
- tbl v15.16b, {v15.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- eor v20.16b, v20.16b, v10.16b
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
- ext v5.16b, v5.16b, v5.16b, #4
- ext v10.16b, v10.16b, v10.16b, #8
- ext v15.16b, v15.16b, v15.16b, #12
- add v1.4s, v1.4s, v6.4s
- eor v16.16b, v16.16b, v1.16b
- rev32 v16.8h, v16.8h
-
- add v11.4s, v11.4s, v16.4s
- eor v6.16b, v6.16b, v11.16b
- ushr v20.4s, v6.4s, #20
- sli v20.4s, v6.4s, #12
- add v1.4s, v1.4s, v20.4s
- eor v16.16b, v16.16b, v1.16b
- tbl v16.16b, {v16.16b}, v26.16b
-
- add v11.4s, v11.4s, v16.4s
- eor v20.16b, v20.16b, v11.16b
- ushr v6.4s, v20.4s, #25
- sli v6.4s, v20.4s, #7
- ext v6.16b, v6.16b, v6.16b, #4
- ext v11.16b, v11.16b, v11.16b, #8
- ext v16.16b, v16.16b, v16.16b, #12
- add v0.4s, v0.4s, v5.4s
- eor v15.16b, v15.16b, v0.16b
- rev32 v15.8h, v15.8h
-
- add v10.4s, v10.4s, v15.4s
- eor v5.16b, v5.16b, v10.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- add v0.4s, v0.4s, v20.4s
- eor v15.16b, v15.16b, v0.16b
- tbl v15.16b, {v15.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- eor v20.16b, v20.16b, v10.16b
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
- ext v5.16b, v5.16b, v5.16b, #12
- ext v10.16b, v10.16b, v10.16b, #8
- ext v15.16b, v15.16b, v15.16b, #4
- add v1.4s, v1.4s, v6.4s
- eor v16.16b, v16.16b, v1.16b
- rev32 v16.8h, v16.8h
-
- add v11.4s, v11.4s, v16.4s
- eor v6.16b, v6.16b, v11.16b
- ushr v20.4s, v6.4s, #20
- sli v20.4s, v6.4s, #12
- add v1.4s, v1.4s, v20.4s
- eor v16.16b, v16.16b, v1.16b
- tbl v16.16b, {v16.16b}, v26.16b
-
- add v11.4s, v11.4s, v16.4s
- eor v20.16b, v20.16b, v11.16b
- ushr v6.4s, v20.4s, #25
- sli v6.4s, v20.4s, #7
- ext v6.16b, v6.16b, v6.16b, #12
- ext v11.16b, v11.16b, v11.16b, #8
- ext v16.16b, v16.16b, v16.16b, #4
- subs x6, x6, #1
- b.gt Lopen_tail_128_rounds
- cbz x4, Lopen_tail_128_rounds_done
- subs x4, x4, #1
- ldp x11, x12, [x3], 16
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- b Lopen_tail_128_rounds
-
-Lopen_tail_128_rounds_done:
- add v0.4s, v0.4s, v24.4s
- add v1.4s, v1.4s, v24.4s
- add v5.4s, v5.4s, v28.4s
- add v6.4s, v6.4s, v28.4s
- add v10.4s, v10.4s, v29.4s
- add v11.4s, v11.4s, v29.4s
- add v15.4s, v15.4s, v30.4s
- add v16.4s, v16.4s, v30.4s
- add v15.4s, v15.4s, v22.4s
- add v16.4s, v16.4s, v23.4s
-
- ld1 {v20.16b - v23.16b}, [x1], #64
-
- eor v20.16b, v20.16b, v1.16b
- eor v21.16b, v21.16b, v6.16b
- eor v22.16b, v22.16b, v11.16b
- eor v23.16b, v23.16b, v16.16b
-
- st1 {v20.16b - v23.16b}, [x0], #64
- sub x2, x2, #64
-
- b Lopen_tail_64_store
-
-Lopen_tail_64:
- // We just need a single block
- mov v0.16b, v24.16b
- mov v5.16b, v28.16b
- mov v10.16b, v29.16b
- mov v15.16b, v30.16b
- eor v23.16b, v23.16b, v23.16b
- ins v23.s[0], v25.s[0]
- add v15.4s, v15.4s, v23.4s
-
- mov x6, #10
- sub x6, x6, x4
-
-Lopen_tail_64_rounds:
- add v0.4s, v0.4s, v5.4s
- eor v15.16b, v15.16b, v0.16b
- rev32 v15.8h, v15.8h
-
- add v10.4s, v10.4s, v15.4s
- eor v5.16b, v5.16b, v10.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- add v0.4s, v0.4s, v20.4s
- eor v15.16b, v15.16b, v0.16b
- tbl v15.16b, {v15.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- eor v20.16b, v20.16b, v10.16b
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
- ext v5.16b, v5.16b, v5.16b, #4
- ext v10.16b, v10.16b, v10.16b, #8
- ext v15.16b, v15.16b, v15.16b, #12
- add v0.4s, v0.4s, v5.4s
- eor v15.16b, v15.16b, v0.16b
- rev32 v15.8h, v15.8h
-
- add v10.4s, v10.4s, v15.4s
- eor v5.16b, v5.16b, v10.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- add v0.4s, v0.4s, v20.4s
- eor v15.16b, v15.16b, v0.16b
- tbl v15.16b, {v15.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- eor v20.16b, v20.16b, v10.16b
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
- ext v5.16b, v5.16b, v5.16b, #12
- ext v10.16b, v10.16b, v10.16b, #8
- ext v15.16b, v15.16b, v15.16b, #4
- subs x6, x6, #1
- b.gt Lopen_tail_64_rounds
- cbz x4, Lopen_tail_64_rounds_done
- subs x4, x4, #1
- ldp x11, x12, [x3], 16
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- b Lopen_tail_64_rounds
-
-Lopen_tail_64_rounds_done:
- add v0.4s, v0.4s, v24.4s
- add v5.4s, v5.4s, v28.4s
- add v10.4s, v10.4s, v29.4s
- add v15.4s, v15.4s, v30.4s
- add v15.4s, v15.4s, v23.4s
-
-Lopen_tail_64_store:
- cmp x2, #16
- b.lt Lopen_tail_16
-
- ld1 {v20.16b}, [x1], #16
- eor v20.16b, v20.16b, v0.16b
- st1 {v20.16b}, [x0], #16
- mov v0.16b, v5.16b
- mov v5.16b, v10.16b
- mov v10.16b, v15.16b
- sub x2, x2, #16
- b Lopen_tail_64_store
-
-Lopen_tail_16:
- // Here we handle the last [0,16) bytes that require a padded block
- cbz x2, Lopen_finalize
-
- eor v20.16b, v20.16b, v20.16b // Use T0 to load the ciphertext
- eor v21.16b, v21.16b, v21.16b // Use T1 to generate an AND mask
- not v22.16b, v20.16b
-
- add x7, x1, x2
- mov x6, x2
-
-Lopen_tail_16_compose:
- ext v20.16b, v20.16b, v20.16b, #15
- ldrb w11, [x7, #-1]!
- mov v20.b[0], w11
- ext v21.16b, v22.16b, v21.16b, #15
- subs x2, x2, #1
- b.gt Lopen_tail_16_compose
-
- and v20.16b, v20.16b, v21.16b
- // Hash in the final padded block
- mov x11, v20.d[0]
- mov x12, v20.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- eor v20.16b, v20.16b, v0.16b
-
-Lopen_tail_16_store:
- umov w11, v20.b[0]
- strb w11, [x0], #1
- ext v20.16b, v20.16b, v20.16b, #1
- subs x6, x6, #1
- b.gt Lopen_tail_16_store
-
-Lopen_finalize:
- mov x11, v31.d[0]
- mov x12, v31.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- // Final reduction step
- sub x12, xzr, x15
- orr x13, xzr, #3
- subs x11, x8, #-5
- sbcs x12, x9, x12
- sbcs x13, x10, x13
- csel x8, x11, x8, cs
- csel x9, x12, x9, cs
- csel x10, x13, x10, cs
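- // Add the s half of the one-time key (kept in v27) and store the 16-byte Poly1305 tag at x5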
- mov x11, v27.d[0]
- mov x12, v27.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
-
- stp x8, x9, [x5]
-
- ldp d8, d9, [sp, #16]
- ldp d10, d11, [sp, #32]
- ldp d12, d13, [sp, #48]
- ldp d14, d15, [sp, #64]
-.cfi_restore b15
-.cfi_restore b14
-.cfi_restore b13
-.cfi_restore b12
-.cfi_restore b11
-.cfi_restore b10
-.cfi_restore b9
-.cfi_restore b8
- ldp x29, x30, [sp], 80
-.cfi_restore w29
-.cfi_restore w30
-.cfi_def_cfa_offset 0
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-Lopen_128:
- // On some architectures preparing 5 blocks for small buffers is wasteful
- eor v25.16b, v25.16b, v25.16b
- mov x11, #1
- mov v25.s[0], w11
- mov v0.16b, v24.16b
- mov v1.16b, v24.16b
- mov v2.16b, v24.16b
- mov v5.16b, v28.16b
- mov v6.16b, v28.16b
- mov v7.16b, v28.16b
- mov v10.16b, v29.16b
- mov v11.16b, v29.16b
- mov v12.16b, v29.16b
- mov v17.16b, v30.16b
- add v15.4s, v17.4s, v25.4s
- add v16.4s, v15.4s, v25.4s
-
- mov x6, #10
-
-Lopen_128_rounds:
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v5.16b, v5.16b, v10.16b
- eor v6.16b, v6.16b, v11.16b
- eor v7.16b, v7.16b, v12.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- ushr v5.4s, v6.4s, #20
- sli v5.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v20.16b, v20.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v6.16b, v6.16b, v12.16b
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v5.4s, #25
- sli v6.4s, v5.4s, #7
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
-
- ext v5.16b, v5.16b, v5.16b, #4
- ext v6.16b, v6.16b, v6.16b, #4
- ext v7.16b, v7.16b, v7.16b, #4
-
- ext v10.16b, v10.16b, v10.16b, #8
- ext v11.16b, v11.16b, v11.16b, #8
- ext v12.16b, v12.16b, v12.16b, #8
-
- ext v15.16b, v15.16b, v15.16b, #12
- ext v16.16b, v16.16b, v16.16b, #12
- ext v17.16b, v17.16b, v17.16b, #12
- add v0.4s, v0.4s, v5.4s
- add v1.4s, v1.4s, v6.4s
- add v2.4s, v2.4s, v7.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- rev32 v15.8h, v15.8h
- rev32 v16.8h, v16.8h
- rev32 v17.8h, v17.8h
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v5.16b, v5.16b, v10.16b
- eor v6.16b, v6.16b, v11.16b
- eor v7.16b, v7.16b, v12.16b
- ushr v20.4s, v5.4s, #20
- sli v20.4s, v5.4s, #12
- ushr v5.4s, v6.4s, #20
- sli v5.4s, v6.4s, #12
- ushr v6.4s, v7.4s, #20
- sli v6.4s, v7.4s, #12
-
- add v0.4s, v0.4s, v20.4s
- add v1.4s, v1.4s, v5.4s
- add v2.4s, v2.4s, v6.4s
- eor v15.16b, v15.16b, v0.16b
- eor v16.16b, v16.16b, v1.16b
- eor v17.16b, v17.16b, v2.16b
- tbl v15.16b, {v15.16b}, v26.16b
- tbl v16.16b, {v16.16b}, v26.16b
- tbl v17.16b, {v17.16b}, v26.16b
-
- add v10.4s, v10.4s, v15.4s
- add v11.4s, v11.4s, v16.4s
- add v12.4s, v12.4s, v17.4s
- eor v20.16b, v20.16b, v10.16b
- eor v5.16b, v5.16b, v11.16b
- eor v6.16b, v6.16b, v12.16b
- ushr v7.4s, v6.4s, #25
- sli v7.4s, v6.4s, #7
- ushr v6.4s, v5.4s, #25
- sli v6.4s, v5.4s, #7
- ushr v5.4s, v20.4s, #25
- sli v5.4s, v20.4s, #7
-
- ext v5.16b, v5.16b, v5.16b, #12
- ext v6.16b, v6.16b, v6.16b, #12
- ext v7.16b, v7.16b, v7.16b, #12
-
- ext v10.16b, v10.16b, v10.16b, #8
- ext v11.16b, v11.16b, v11.16b, #8
- ext v12.16b, v12.16b, v12.16b, #8
-
- ext v15.16b, v15.16b, v15.16b, #4
- ext v16.16b, v16.16b, v16.16b, #4
- ext v17.16b, v17.16b, v17.16b, #4
- subs x6, x6, #1
- b.hi Lopen_128_rounds
-
- add v0.4s, v0.4s, v24.4s
- add v1.4s, v1.4s, v24.4s
- add v2.4s, v2.4s, v24.4s
-
- add v5.4s, v5.4s, v28.4s
- add v6.4s, v6.4s, v28.4s
- add v7.4s, v7.4s, v28.4s
-
- add v10.4s, v10.4s, v29.4s
- add v11.4s, v11.4s, v29.4s
-
- add v30.4s, v30.4s, v25.4s
- add v15.4s, v15.4s, v30.4s
- add v30.4s, v30.4s, v25.4s
- add v16.4s, v16.4s, v30.4s
-
- and v2.16b, v2.16b, v27.16b
- mov x16, v2.d[0] // Move the R key to GPRs
- mov x17, v2.d[1]
- mov v27.16b, v7.16b // Store the S key
-
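- // Hash the additional data into the Poly1305 state before processing the ciphertext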
- bl Lpoly_hash_ad_internal
-
-Lopen_128_store:
- cmp x2, #64
- b.lt Lopen_128_store_64
-
- ld1 {v20.16b - v23.16b}, [x1], #64
-
- mov x11, v20.d[0]
- mov x12, v20.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- mov x11, v21.d[0]
- mov x12, v21.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- mov x11, v22.d[0]
- mov x12, v22.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- mov x11, v23.d[0]
- mov x12, v23.d[1]
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
-
- eor v20.16b, v20.16b, v0.16b
- eor v21.16b, v21.16b, v5.16b
- eor v22.16b, v22.16b, v10.16b
- eor v23.16b, v23.16b, v15.16b
-
- st1 {v20.16b - v23.16b}, [x0], #64
-
- sub x2, x2, #64
-
- mov v0.16b, v1.16b
- mov v5.16b, v6.16b
- mov v10.16b, v11.16b
- mov v15.16b, v16.16b
-
-Lopen_128_store_64:
-
- lsr x4, x2, #4
- mov x3, x1
-
-Lopen_128_hash_64:
- cbz x4, Lopen_tail_64_store
- ldp x11, x12, [x3], 16
- adds x8, x8, x11
- adcs x9, x9, x12
- adc x10, x10, x15
- mul x11, x8, x16 // [t2:t1:t0] = [acc2:acc1:acc0] * r0
- umulh x12, x8, x16
- mul x13, x9, x16
- umulh x14, x9, x16
- adds x12, x12, x13
- mul x13, x10, x16
- adc x13, x13, x14
- mul x14, x8, x17 // [t3:t2:t1:t0] = [acc2:acc1:acc0] * [r1:r0]
- umulh x8, x8, x17
- adds x12, x12, x14
- mul x14, x9, x17
- umulh x9, x9, x17
- adcs x14, x14, x8
- mul x10, x10, x17
- adc x10, x10, x9
- adds x13, x13, x14
- adc x14, x10, xzr
- and x10, x13, #3 // At this point acc2 is 2 bits at most (value of 3)
- and x8, x13, #-4
- extr x13, x14, x13, #2
- adds x8, x8, x11
- lsr x11, x14, #2
- adc x9, x14, x11 // No carry out since t0 is 61 bits and t3 is 63 bits
- adds x8, x8, x13
- adcs x9, x9, x12
- adc x10, x10, xzr // At this point acc2 has the value of 4 at most
- sub x4, x4, #1
- b Lopen_128_hash_64
-.cfi_endproc
-
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
diff --git a/win-aarch64/crypto/fipsmodule/aesv8-armv8-win.S b/win-aarch64/crypto/fipsmodule/aesv8-armv8-win.S
deleted file mode 100644
index a3ab33af..00000000
--- a/win-aarch64/crypto/fipsmodule/aesv8-armv8-win.S
+++ /dev/null
@@ -1,803 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-
-#if __ARM_MAX_ARCH__>=7
-.text
-.arch armv8-a+crypto
-.section .rodata
-.align 5
-Lrcon:
-.long 0x01,0x01,0x01,0x01
-.long 0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d // rotate-n-splat
-.long 0x1b,0x1b,0x1b,0x1b
-
-.text
-
-.globl aes_hw_set_encrypt_key
-
-.def aes_hw_set_encrypt_key
- .type 32
-.endef
-.align 5
-aes_hw_set_encrypt_key:
-Lenc_key:
- // Armv8.3-A PAuth: even though x30 is pushed to stack it is not popped later.
- AARCH64_VALID_CALL_TARGET
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
- mov x3,#-1
- cmp x0,#0
- b.eq Lenc_key_abort
- cmp x2,#0
- b.eq Lenc_key_abort
- mov x3,#-2
- cmp w1,#128
- b.lt Lenc_key_abort
- cmp w1,#256
- b.gt Lenc_key_abort
- tst w1,#0x3f
- b.ne Lenc_key_abort
-
- adrp x3,Lrcon
- add x3,x3,:lo12:Lrcon
- cmp w1,#192
-
- eor v0.16b,v0.16b,v0.16b
- ld1 {v3.16b},[x0],#16
- mov w1,#8 // reuse w1
- ld1 {v1.4s,v2.4s},[x3],#32
-
- b.lt Loop128
- b.eq L192
- b L256
-
-.align 4
-Loop128:
- tbl v6.16b,{v3.16b},v2.16b
- ext v5.16b,v0.16b,v3.16b,#12
- st1 {v3.4s},[x2],#16
- aese v6.16b,v0.16b
- subs w1,w1,#1
-
- eor v3.16b,v3.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v3.16b,v3.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v6.16b,v6.16b,v1.16b
- eor v3.16b,v3.16b,v5.16b
- shl v1.16b,v1.16b,#1
- eor v3.16b,v3.16b,v6.16b
- b.ne Loop128
-
- ld1 {v1.4s},[x3]
-
- tbl v6.16b,{v3.16b},v2.16b
- ext v5.16b,v0.16b,v3.16b,#12
- st1 {v3.4s},[x2],#16
- aese v6.16b,v0.16b
-
- eor v3.16b,v3.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v3.16b,v3.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v6.16b,v6.16b,v1.16b
- eor v3.16b,v3.16b,v5.16b
- shl v1.16b,v1.16b,#1
- eor v3.16b,v3.16b,v6.16b
-
- tbl v6.16b,{v3.16b},v2.16b
- ext v5.16b,v0.16b,v3.16b,#12
- st1 {v3.4s},[x2],#16
- aese v6.16b,v0.16b
-
- eor v3.16b,v3.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v3.16b,v3.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v6.16b,v6.16b,v1.16b
- eor v3.16b,v3.16b,v5.16b
- eor v3.16b,v3.16b,v6.16b
- st1 {v3.4s},[x2]
- add x2,x2,#0x50
-
- mov w12,#10
- b Ldone
-
-.align 4
-L192:
- ld1 {v4.8b},[x0],#8
- movi v6.16b,#8 // borrow v6.16b
- st1 {v3.4s},[x2],#16
- sub v2.16b,v2.16b,v6.16b // adjust the mask
-
-Loop192:
- tbl v6.16b,{v4.16b},v2.16b
- ext v5.16b,v0.16b,v3.16b,#12
- st1 {v4.8b},[x2],#8
- aese v6.16b,v0.16b
- subs w1,w1,#1
-
- eor v3.16b,v3.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v3.16b,v3.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v3.16b,v3.16b,v5.16b
-
- dup v5.4s,v3.s[3]
- eor v5.16b,v5.16b,v4.16b
- eor v6.16b,v6.16b,v1.16b
- ext v4.16b,v0.16b,v4.16b,#12
- shl v1.16b,v1.16b,#1
- eor v4.16b,v4.16b,v5.16b
- eor v3.16b,v3.16b,v6.16b
- eor v4.16b,v4.16b,v6.16b
- st1 {v3.4s},[x2],#16
- b.ne Loop192
-
- mov w12,#12
- add x2,x2,#0x20
- b Ldone
-
-.align 4
-L256:
- ld1 {v4.16b},[x0]
- mov w1,#7
- mov w12,#14
- st1 {v3.4s},[x2],#16
-
-Loop256:
- tbl v6.16b,{v4.16b},v2.16b
- ext v5.16b,v0.16b,v3.16b,#12
- st1 {v4.4s},[x2],#16
- aese v6.16b,v0.16b
- subs w1,w1,#1
-
- eor v3.16b,v3.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v3.16b,v3.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v6.16b,v6.16b,v1.16b
- eor v3.16b,v3.16b,v5.16b
- shl v1.16b,v1.16b,#1
- eor v3.16b,v3.16b,v6.16b
- st1 {v3.4s},[x2],#16
- b.eq Ldone
-
- dup v6.4s,v3.s[3] // just splat
- ext v5.16b,v0.16b,v4.16b,#12
- aese v6.16b,v0.16b
-
- eor v4.16b,v4.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v4.16b,v4.16b,v5.16b
- ext v5.16b,v0.16b,v5.16b,#12
- eor v4.16b,v4.16b,v5.16b
-
- eor v4.16b,v4.16b,v6.16b
- b Loop256
-
-Ldone:
- str w12,[x2]
- mov x3,#0
-
-Lenc_key_abort:
- mov x0,x3 // return value
- ldr x29,[sp],#16
- ret
-
-
-.globl aes_hw_set_decrypt_key
-
-.def aes_hw_set_decrypt_key
- .type 32
-.endef
-.align 5
-aes_hw_set_decrypt_key:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
- bl Lenc_key
-
- cmp x0,#0
- b.ne Ldec_key_abort
-
- sub x2,x2,#240 // restore original x2
- mov x4,#-16
- add x0,x2,x12,lsl#4 // end of key schedule
-
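- // Convert the encryption key schedule in place: swap the round keys end to
- // end and apply InvMixColumns to all but the first and last round keys.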
- ld1 {v0.4s},[x2]
- ld1 {v1.4s},[x0]
- st1 {v0.4s},[x0],x4
- st1 {v1.4s},[x2],#16
-
-Loop_imc:
- ld1 {v0.4s},[x2]
- ld1 {v1.4s},[x0]
- aesimc v0.16b,v0.16b
- aesimc v1.16b,v1.16b
- st1 {v0.4s},[x0],x4
- st1 {v1.4s},[x2],#16
- cmp x0,x2
- b.hi Loop_imc
-
- ld1 {v0.4s},[x2]
- aesimc v0.16b,v0.16b
- st1 {v0.4s},[x0]
-
- eor x0,x0,x0 // return value
-Ldec_key_abort:
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-.globl aes_hw_encrypt
-
-.def aes_hw_encrypt
- .type 32
-.endef
-.align 5
-aes_hw_encrypt:
- AARCH64_VALID_CALL_TARGET
- ldr w3,[x2,#240]
- ld1 {v0.4s},[x2],#16
- ld1 {v2.16b},[x0]
- sub w3,w3,#2
- ld1 {v1.4s},[x2],#16
-
-Loop_enc:
- aese v2.16b,v0.16b
- aesmc v2.16b,v2.16b
- ld1 {v0.4s},[x2],#16
- subs w3,w3,#2
- aese v2.16b,v1.16b
- aesmc v2.16b,v2.16b
- ld1 {v1.4s},[x2],#16
- b.gt Loop_enc
-
- aese v2.16b,v0.16b
- aesmc v2.16b,v2.16b
- ld1 {v0.4s},[x2]
- aese v2.16b,v1.16b
- eor v2.16b,v2.16b,v0.16b
-
- st1 {v2.16b},[x1]
- ret
-
-.globl aes_hw_decrypt
-
-.def aes_hw_decrypt
- .type 32
-.endef
-.align 5
-aes_hw_decrypt:
- AARCH64_VALID_CALL_TARGET
- ldr w3,[x2,#240]
- ld1 {v0.4s},[x2],#16
- ld1 {v2.16b},[x0]
- sub w3,w3,#2
- ld1 {v1.4s},[x2],#16
-
-Loop_dec:
- aesd v2.16b,v0.16b
- aesimc v2.16b,v2.16b
- ld1 {v0.4s},[x2],#16
- subs w3,w3,#2
- aesd v2.16b,v1.16b
- aesimc v2.16b,v2.16b
- ld1 {v1.4s},[x2],#16
- b.gt Loop_dec
-
- aesd v2.16b,v0.16b
- aesimc v2.16b,v2.16b
- ld1 {v0.4s},[x2]
- aesd v2.16b,v1.16b
- eor v2.16b,v2.16b,v0.16b
-
- st1 {v2.16b},[x1]
- ret
-
-.globl aes_hw_cbc_encrypt
-
-.def aes_hw_cbc_encrypt
- .type 32
-.endef
-.align 5
-aes_hw_cbc_encrypt:
- // Armv8.3-A PAuth: even though x30 is pushed to stack it is not popped later.
- AARCH64_VALID_CALL_TARGET
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
- subs x2,x2,#16
- mov x8,#16
- b.lo Lcbc_abort
- csel x8,xzr,x8,eq
-
- cmp w5,#0 // en- or decrypting?
- ldr w5,[x3,#240]
- and x2,x2,#-16
- ld1 {v6.16b},[x4]
- ld1 {v0.16b},[x0],x8
-
- ld1 {v16.4s,v17.4s},[x3] // load key schedule...
- sub w5,w5,#6
- add x7,x3,x5,lsl#4 // pointer to last 7 round keys
- sub w5,w5,#2
- ld1 {v18.4s,v19.4s},[x7],#32
- ld1 {v20.4s,v21.4s},[x7],#32
- ld1 {v22.4s,v23.4s},[x7],#32
- ld1 {v7.4s},[x7]
-
- add x7,x3,#32
- mov w6,w5
- b.eq Lcbc_dec
-
- cmp w5,#2
- eor v0.16b,v0.16b,v6.16b
- eor v5.16b,v16.16b,v7.16b
- b.eq Lcbc_enc128
-
- ld1 {v2.4s,v3.4s},[x7]
- add x7,x3,#16
- add x6,x3,#16*4
- add x12,x3,#16*5
- aese v0.16b,v16.16b
- aesmc v0.16b,v0.16b
- add x14,x3,#16*6
- add x3,x3,#16*7
- b Lenter_cbc_enc
-
-.align 4
-Loop_cbc_enc:
- aese v0.16b,v16.16b
- aesmc v0.16b,v0.16b
- st1 {v6.16b},[x1],#16
-Lenter_cbc_enc:
- aese v0.16b,v17.16b
- aesmc v0.16b,v0.16b
- aese v0.16b,v2.16b
- aesmc v0.16b,v0.16b
- ld1 {v16.4s},[x6]
- cmp w5,#4
- aese v0.16b,v3.16b
- aesmc v0.16b,v0.16b
- ld1 {v17.4s},[x12]
- b.eq Lcbc_enc192
-
- aese v0.16b,v16.16b
- aesmc v0.16b,v0.16b
- ld1 {v16.4s},[x14]
- aese v0.16b,v17.16b
- aesmc v0.16b,v0.16b
- ld1 {v17.4s},[x3]
- nop
-
-Lcbc_enc192:
- aese v0.16b,v16.16b
- aesmc v0.16b,v0.16b
- subs x2,x2,#16
- aese v0.16b,v17.16b
- aesmc v0.16b,v0.16b
- csel x8,xzr,x8,eq
- aese v0.16b,v18.16b
- aesmc v0.16b,v0.16b
- aese v0.16b,v19.16b
- aesmc v0.16b,v0.16b
- ld1 {v16.16b},[x0],x8
- aese v0.16b,v20.16b
- aesmc v0.16b,v0.16b
- eor v16.16b,v16.16b,v5.16b
- aese v0.16b,v21.16b
- aesmc v0.16b,v0.16b
- ld1 {v17.4s},[x7] // re-pre-load rndkey[1]
- aese v0.16b,v22.16b
- aesmc v0.16b,v0.16b
- aese v0.16b,v23.16b
- eor v6.16b,v0.16b,v7.16b
- b.hs Loop_cbc_enc
-
- st1 {v6.16b},[x1],#16
- b Lcbc_done
-
-.align 5
-Lcbc_enc128:
- ld1 {v2.4s,v3.4s},[x7]
- aese v0.16b,v16.16b
- aesmc v0.16b,v0.16b
- b Lenter_cbc_enc128
-Loop_cbc_enc128:
- aese v0.16b,v16.16b
- aesmc v0.16b,v0.16b
- st1 {v6.16b},[x1],#16
-Lenter_cbc_enc128:
- aese v0.16b,v17.16b
- aesmc v0.16b,v0.16b
- subs x2,x2,#16
- aese v0.16b,v2.16b
- aesmc v0.16b,v0.16b
- csel x8,xzr,x8,eq
- aese v0.16b,v3.16b
- aesmc v0.16b,v0.16b
- aese v0.16b,v18.16b
- aesmc v0.16b,v0.16b
- aese v0.16b,v19.16b
- aesmc v0.16b,v0.16b
- ld1 {v16.16b},[x0],x8
- aese v0.16b,v20.16b
- aesmc v0.16b,v0.16b
- aese v0.16b,v21.16b
- aesmc v0.16b,v0.16b
- aese v0.16b,v22.16b
- aesmc v0.16b,v0.16b
- eor v16.16b,v16.16b,v5.16b
- aese v0.16b,v23.16b
- eor v6.16b,v0.16b,v7.16b
- b.hs Loop_cbc_enc128
-
- st1 {v6.16b},[x1],#16
- b Lcbc_done
-.align 5
-Lcbc_dec:
- ld1 {v18.16b},[x0],#16
- subs x2,x2,#32 // bias
- add w6,w5,#2
- orr v3.16b,v0.16b,v0.16b
- orr v1.16b,v0.16b,v0.16b
- orr v19.16b,v18.16b,v18.16b
- b.lo Lcbc_dec_tail
-
- orr v1.16b,v18.16b,v18.16b
- ld1 {v18.16b},[x0],#16
- orr v2.16b,v0.16b,v0.16b
- orr v3.16b,v1.16b,v1.16b
- orr v19.16b,v18.16b,v18.16b
-
-Loop3x_cbc_dec:
- aesd v0.16b,v16.16b
- aesimc v0.16b,v0.16b
- aesd v1.16b,v16.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v16.16b
- aesimc v18.16b,v18.16b
- ld1 {v16.4s},[x7],#16
- subs w6,w6,#2
- aesd v0.16b,v17.16b
- aesimc v0.16b,v0.16b
- aesd v1.16b,v17.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v17.16b
- aesimc v18.16b,v18.16b
- ld1 {v17.4s},[x7],#16
- b.gt Loop3x_cbc_dec
-
- aesd v0.16b,v16.16b
- aesimc v0.16b,v0.16b
- aesd v1.16b,v16.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v16.16b
- aesimc v18.16b,v18.16b
- eor v4.16b,v6.16b,v7.16b
- subs x2,x2,#0x30
- eor v5.16b,v2.16b,v7.16b
- csel x6,x2,x6,lo // x6 (w6) is zero at this point
- aesd v0.16b,v17.16b
- aesimc v0.16b,v0.16b
- aesd v1.16b,v17.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v17.16b
- aesimc v18.16b,v18.16b
- eor v17.16b,v3.16b,v7.16b
- add x0,x0,x6 // x0 is adjusted in such a way that
- // at exit from the loop v1.16b-v18.16b
- // are loaded with the last "words"
- orr v6.16b,v19.16b,v19.16b
- mov x7,x3
- aesd v0.16b,v20.16b
- aesimc v0.16b,v0.16b
- aesd v1.16b,v20.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v20.16b
- aesimc v18.16b,v18.16b
- ld1 {v2.16b},[x0],#16
- aesd v0.16b,v21.16b
- aesimc v0.16b,v0.16b
- aesd v1.16b,v21.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v21.16b
- aesimc v18.16b,v18.16b
- ld1 {v3.16b},[x0],#16
- aesd v0.16b,v22.16b
- aesimc v0.16b,v0.16b
- aesd v1.16b,v22.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v22.16b
- aesimc v18.16b,v18.16b
- ld1 {v19.16b},[x0],#16
- aesd v0.16b,v23.16b
- aesd v1.16b,v23.16b
- aesd v18.16b,v23.16b
- ld1 {v16.4s},[x7],#16 // re-pre-load rndkey[0]
- add w6,w5,#2
- eor v4.16b,v4.16b,v0.16b
- eor v5.16b,v5.16b,v1.16b
- eor v18.16b,v18.16b,v17.16b
- ld1 {v17.4s},[x7],#16 // re-pre-load rndkey[1]
- st1 {v4.16b},[x1],#16
- orr v0.16b,v2.16b,v2.16b
- st1 {v5.16b},[x1],#16
- orr v1.16b,v3.16b,v3.16b
- st1 {v18.16b},[x1],#16
- orr v18.16b,v19.16b,v19.16b
- b.hs Loop3x_cbc_dec
-
- cmn x2,#0x30
- b.eq Lcbc_done
- nop
-
-Lcbc_dec_tail:
- aesd v1.16b,v16.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v16.16b
- aesimc v18.16b,v18.16b
- ld1 {v16.4s},[x7],#16
- subs w6,w6,#2
- aesd v1.16b,v17.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v17.16b
- aesimc v18.16b,v18.16b
- ld1 {v17.4s},[x7],#16
- b.gt Lcbc_dec_tail
-
- aesd v1.16b,v16.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v16.16b
- aesimc v18.16b,v18.16b
- aesd v1.16b,v17.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v17.16b
- aesimc v18.16b,v18.16b
- aesd v1.16b,v20.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v20.16b
- aesimc v18.16b,v18.16b
- cmn x2,#0x20
- aesd v1.16b,v21.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v21.16b
- aesimc v18.16b,v18.16b
- eor v5.16b,v6.16b,v7.16b
- aesd v1.16b,v22.16b
- aesimc v1.16b,v1.16b
- aesd v18.16b,v22.16b
- aesimc v18.16b,v18.16b
- eor v17.16b,v3.16b,v7.16b
- aesd v1.16b,v23.16b
- aesd v18.16b,v23.16b
- b.eq Lcbc_dec_one
- eor v5.16b,v5.16b,v1.16b
- eor v17.16b,v17.16b,v18.16b
- orr v6.16b,v19.16b,v19.16b
- st1 {v5.16b},[x1],#16
- st1 {v17.16b},[x1],#16
- b Lcbc_done
-
-Lcbc_dec_one:
- eor v5.16b,v5.16b,v18.16b
- orr v6.16b,v19.16b,v19.16b
- st1 {v5.16b},[x1],#16
-
-Lcbc_done:
- st1 {v6.16b},[x4]
-Lcbc_abort:
- ldr x29,[sp],#16
- ret
-
-.globl aes_hw_ctr32_encrypt_blocks
-
-.def aes_hw_ctr32_encrypt_blocks
- .type 32
-.endef
-.align 5
-aes_hw_ctr32_encrypt_blocks:
- // Armv8.3-A PAuth: even though x30 is pushed to stack it is not popped later.
- AARCH64_VALID_CALL_TARGET
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
- ldr w5,[x3,#240]
-
- ldr w8, [x4, #12]
- ld1 {v0.4s},[x4]
-
- ld1 {v16.4s,v17.4s},[x3] // load key schedule...
- sub w5,w5,#4
- mov x12,#16
- cmp x2,#2
- add x7,x3,x5,lsl#4 // pointer to last 5 round keys
- sub w5,w5,#2
- ld1 {v20.4s,v21.4s},[x7],#32
- ld1 {v22.4s,v23.4s},[x7],#32
- ld1 {v7.4s},[x7]
- add x7,x3,#32
- mov w6,w5
- csel x12,xzr,x12,lo
-
- // ARM Cortex-A57 and Cortex-A72 cores running in 32-bit mode are
- // affected by silicon errata #1742098 [0] and #1655431 [1],
- // respectively, where the second instruction of an aese/aesmc
- // instruction pair may execute twice if an interrupt is taken right
- // after the first instruction consumes an input register of which a
- // single 32-bit lane has been updated the last time it was modified.
- //
- // This function uses a counter in one 32-bit lane. The mov lines
- // could write to v1.16b and v18.16b directly, but that trips these bugs.
- // We write to v6.16b and copy to the final register as a workaround.
- //
- // [0] ARM-EPM-049219 v23 Cortex-A57 MPCore Software Developers Errata Notice
- // [1] ARM-EPM-012079 v11.0 Cortex-A72 MPCore Software Developers Errata Notice
-#ifndef __AARCH64EB__
- rev w8, w8
-#endif
- add w10, w8, #1
- orr v6.16b,v0.16b,v0.16b
- rev w10, w10
- mov v6.s[3],w10
- add w8, w8, #2
- orr v1.16b,v6.16b,v6.16b
- b.ls Lctr32_tail
- rev w12, w8
- mov v6.s[3],w12
- sub x2,x2,#3 // bias
- orr v18.16b,v6.16b,v6.16b
- b Loop3x_ctr32
-
-.align 4
-Loop3x_ctr32:
- aese v0.16b,v16.16b
- aesmc v0.16b,v0.16b
- aese v1.16b,v16.16b
- aesmc v1.16b,v1.16b
- aese v18.16b,v16.16b
- aesmc v18.16b,v18.16b
- ld1 {v16.4s},[x7],#16
- subs w6,w6,#2
- aese v0.16b,v17.16b
- aesmc v0.16b,v0.16b
- aese v1.16b,v17.16b
- aesmc v1.16b,v1.16b
- aese v18.16b,v17.16b
- aesmc v18.16b,v18.16b
- ld1 {v17.4s},[x7],#16
- b.gt Loop3x_ctr32
-
- aese v0.16b,v16.16b
- aesmc v4.16b,v0.16b
- aese v1.16b,v16.16b
- aesmc v5.16b,v1.16b
- ld1 {v2.16b},[x0],#16
- add w9,w8,#1
- aese v18.16b,v16.16b
- aesmc v18.16b,v18.16b
- ld1 {v3.16b},[x0],#16
- rev w9,w9
- aese v4.16b,v17.16b
- aesmc v4.16b,v4.16b
- aese v5.16b,v17.16b
- aesmc v5.16b,v5.16b
- ld1 {v19.16b},[x0],#16
- mov x7,x3
- aese v18.16b,v17.16b
- aesmc v17.16b,v18.16b
- aese v4.16b,v20.16b
- aesmc v4.16b,v4.16b
- aese v5.16b,v20.16b
- aesmc v5.16b,v5.16b
- eor v2.16b,v2.16b,v7.16b
- add w10,w8,#2
- aese v17.16b,v20.16b
- aesmc v17.16b,v17.16b
- eor v3.16b,v3.16b,v7.16b
- add w8,w8,#3
- aese v4.16b,v21.16b
- aesmc v4.16b,v4.16b
- aese v5.16b,v21.16b
- aesmc v5.16b,v5.16b
- // Note the logic to update v0.16b, v1.16b, and v18.16b is written to work
- // around a bug in ARM Cortex-A57 and Cortex-A72 cores running in
- // 32-bit mode. See the comment above.
- eor v19.16b,v19.16b,v7.16b
- mov v6.s[3], w9
- aese v17.16b,v21.16b
- aesmc v17.16b,v17.16b
- orr v0.16b,v6.16b,v6.16b
- rev w10,w10
- aese v4.16b,v22.16b
- aesmc v4.16b,v4.16b
- mov v6.s[3], w10
- rev w12,w8
- aese v5.16b,v22.16b
- aesmc v5.16b,v5.16b
- orr v1.16b,v6.16b,v6.16b
- mov v6.s[3], w12
- aese v17.16b,v22.16b
- aesmc v17.16b,v17.16b
- orr v18.16b,v6.16b,v6.16b
- subs x2,x2,#3
- aese v4.16b,v23.16b
- aese v5.16b,v23.16b
- aese v17.16b,v23.16b
-
- eor v2.16b,v2.16b,v4.16b
- ld1 {v16.4s},[x7],#16 // re-pre-load rndkey[0]
- st1 {v2.16b},[x1],#16
- eor v3.16b,v3.16b,v5.16b
- mov w6,w5
- st1 {v3.16b},[x1],#16
- eor v19.16b,v19.16b,v17.16b
- ld1 {v17.4s},[x7],#16 // re-pre-load rndkey[1]
- st1 {v19.16b},[x1],#16
- b.hs Loop3x_ctr32
-
- adds x2,x2,#3
- b.eq Lctr32_done
- cmp x2,#1
- mov x12,#16
- csel x12,xzr,x12,eq
-
-Lctr32_tail:
- aese v0.16b,v16.16b
- aesmc v0.16b,v0.16b
- aese v1.16b,v16.16b
- aesmc v1.16b,v1.16b
- ld1 {v16.4s},[x7],#16
- subs w6,w6,#2
- aese v0.16b,v17.16b
- aesmc v0.16b,v0.16b
- aese v1.16b,v17.16b
- aesmc v1.16b,v1.16b
- ld1 {v17.4s},[x7],#16
- b.gt Lctr32_tail
-
- aese v0.16b,v16.16b
- aesmc v0.16b,v0.16b
- aese v1.16b,v16.16b
- aesmc v1.16b,v1.16b
- aese v0.16b,v17.16b
- aesmc v0.16b,v0.16b
- aese v1.16b,v17.16b
- aesmc v1.16b,v1.16b
- ld1 {v2.16b},[x0],x12
- aese v0.16b,v20.16b
- aesmc v0.16b,v0.16b
- aese v1.16b,v20.16b
- aesmc v1.16b,v1.16b
- ld1 {v3.16b},[x0]
- aese v0.16b,v21.16b
- aesmc v0.16b,v0.16b
- aese v1.16b,v21.16b
- aesmc v1.16b,v1.16b
- eor v2.16b,v2.16b,v7.16b
- aese v0.16b,v22.16b
- aesmc v0.16b,v0.16b
- aese v1.16b,v22.16b
- aesmc v1.16b,v1.16b
- eor v3.16b,v3.16b,v7.16b
- aese v0.16b,v23.16b
- aese v1.16b,v23.16b
-
- cmp x2,#1
- eor v2.16b,v2.16b,v0.16b
- eor v3.16b,v3.16b,v1.16b
- st1 {v2.16b},[x1],#16
- b.eq Lctr32_done
- st1 {v3.16b},[x1]
-
-Lctr32_done:
- ldr x29,[sp],#16
- ret
-
-#endif
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
diff --git a/win-aarch64/crypto/fipsmodule/aesv8-gcm-armv8-win.S b/win-aarch64/crypto/fipsmodule/aesv8-gcm-armv8-win.S
deleted file mode 100644
index 12337969..00000000
--- a/win-aarch64/crypto/fipsmodule/aesv8-gcm-armv8-win.S
+++ /dev/null
@@ -1,1559 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-#if __ARM_MAX_ARCH__ >= 8
-
-.arch armv8-a+crypto
-.text
-.globl aes_gcm_enc_kernel
-
-.def aes_gcm_enc_kernel
- .type 32
-.endef
-.align 4
-aes_gcm_enc_kernel:
- AARCH64_SIGN_LINK_REGISTER
- stp x29, x30, [sp, #-128]!
- mov x29, sp
- stp x19, x20, [sp, #16]
- mov x16, x4
- mov x8, x5
- stp x21, x22, [sp, #32]
- stp x23, x24, [sp, #48]
- stp d8, d9, [sp, #64]
- stp d10, d11, [sp, #80]
- stp d12, d13, [sp, #96]
- stp d14, d15, [sp, #112]
- ldr w17, [x8, #240]
- add x19, x8, x17, lsl #4 // borrow input_l1 for last key
- ldp x13, x14, [x19] // load round N keys
- ldr q31, [x19, #-16] // load round N-1 keys
- add x4, x0, x1, lsr #3 // end_input_ptr
- lsr x5, x1, #3 // byte_len
- mov x15, x5
- ldp x10, x11, [x16] // ctr96_b64, ctr96_t32
- ld1 { v0.16b}, [x16] // special case: vector-load the initial counter so we can start the first AES block as quickly as possible
- sub x5, x5, #1 // byte_len - 1
- ldr q18, [x8, #0] // load rk0
- and x5, x5, #0xffffffffffffffc0 // number of bytes to be processed in main loop (at least 1 byte must be handled by tail)
- ldr q25, [x8, #112] // load rk7
- add x5, x5, x0
- lsr x12, x11, #32
- fmov d2, x10 // CTR block 2
- orr w11, w11, w11
- rev w12, w12 // rev_ctr32
- fmov d1, x10 // CTR block 1
- aese v0.16b, v18.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 0
- add w12, w12, #1 // increment rev_ctr32
- rev w9, w12 // CTR block 1
- fmov d3, x10 // CTR block 3
- orr x9, x11, x9, lsl #32 // CTR block 1
- add w12, w12, #1 // CTR block 1
- ldr q19, [x8, #16] // load rk1
- fmov v1.d[1], x9 // CTR block 1
- rev w9, w12 // CTR block 2
- add w12, w12, #1 // CTR block 2
- orr x9, x11, x9, lsl #32 // CTR block 2
- ldr q20, [x8, #32] // load rk2
- fmov v2.d[1], x9 // CTR block 2
- rev w9, w12 // CTR block 3
- aese v0.16b, v19.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 1
- orr x9, x11, x9, lsl #32 // CTR block 3
- fmov v3.d[1], x9 // CTR block 3
- aese v1.16b, v18.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 0
- ldr q21, [x8, #48] // load rk3
- aese v0.16b, v20.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 2
- ldr q24, [x8, #96] // load rk6
- aese v2.16b, v18.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 0
- ldr q23, [x8, #80] // load rk5
- aese v1.16b, v19.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 1
- ldr q14, [x6, #48] // load h3l | h3h
- ext v14.16b, v14.16b, v14.16b, #8
- aese v3.16b, v18.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 0
- aese v2.16b, v19.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 1
- ldr q22, [x8, #64] // load rk4
- aese v1.16b, v20.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 2
- ldr q13, [x6, #32] // load h2l | h2h
- ext v13.16b, v13.16b, v13.16b, #8
- aese v3.16b, v19.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 1
- ldr q30, [x8, #192] // load rk12
- aese v2.16b, v20.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 2
- ldr q15, [x6, #80] // load h4l | h4h
- ext v15.16b, v15.16b, v15.16b, #8
- aese v1.16b, v21.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 3
- ldr q29, [x8, #176] // load rk11
- aese v3.16b, v20.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 2
- ldr q26, [x8, #128] // load rk8
- aese v2.16b, v21.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 3
- add w12, w12, #1 // CTR block 3
- aese v0.16b, v21.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 3
- aese v3.16b, v21.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 3
- ld1 { v11.16b}, [x3]
- ext v11.16b, v11.16b, v11.16b, #8
- rev64 v11.16b, v11.16b
- aese v2.16b, v22.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 4
- aese v0.16b, v22.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 4
- aese v1.16b, v22.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 4
- aese v3.16b, v22.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 4
- cmp x17, #12 // set up flags for AES-128/192/256 check
- aese v0.16b, v23.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 5
- aese v1.16b, v23.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 5
- aese v3.16b, v23.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 5
- aese v2.16b, v23.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 5
- aese v1.16b, v24.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 6
- trn2 v17.2d, v14.2d, v15.2d // h4l | h3l
- aese v3.16b, v24.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 6
- ldr q27, [x8, #144] // load rk9
- aese v0.16b, v24.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 6
- ldr q12, [x6] // load h1l | h1h
- ext v12.16b, v12.16b, v12.16b, #8
- aese v2.16b, v24.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 6
- ldr q28, [x8, #160] // load rk10
- aese v1.16b, v25.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 7
- trn1 v9.2d, v14.2d, v15.2d // h4h | h3h
- aese v0.16b, v25.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 7
- aese v2.16b, v25.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 7
- aese v3.16b, v25.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 7
- trn2 v16.2d, v12.2d, v13.2d // h2l | h1l
- aese v1.16b, v26.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 8
- aese v2.16b, v26.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 8
- aese v3.16b, v26.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 8
- aese v0.16b, v26.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 8
- b.lt Lenc_finish_first_blocks // branch if AES-128
-
- aese v1.16b, v27.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 9
- aese v2.16b, v27.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 9
- aese v3.16b, v27.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 9
- aese v0.16b, v27.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 9
- aese v1.16b, v28.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 10
- aese v2.16b, v28.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 10
- aese v3.16b, v28.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 10
- aese v0.16b, v28.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 10
- b.eq Lenc_finish_first_blocks // branch if AES-192
-
- aese v1.16b, v29.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 11
- aese v2.16b, v29.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 11
- aese v0.16b, v29.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 11
- aese v3.16b, v29.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 11
- aese v1.16b, v30.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 12
- aese v2.16b, v30.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 12
- aese v0.16b, v30.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 12
- aese v3.16b, v30.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 12
-
-Lenc_finish_first_blocks:
- cmp x0, x5 // check if we have <= 4 blocks
- eor v17.16b, v17.16b, v9.16b // h4k | h3k
- aese v2.16b, v31.16b // AES block 2 - round N-1
- trn1 v8.2d, v12.2d, v13.2d // h2h | h1h
- aese v1.16b, v31.16b // AES block 1 - round N-1
- aese v0.16b, v31.16b // AES block 0 - round N-1
- aese v3.16b, v31.16b // AES block 3 - round N-1
- eor v16.16b, v16.16b, v8.16b // h2k | h1k
- b.ge Lenc_tail // handle tail
-
- ldp x19, x20, [x0, #16] // AES block 1 - load plaintext
- rev w9, w12 // CTR block 4
- ldp x6, x7, [x0, #0] // AES block 0 - load plaintext
- ldp x23, x24, [x0, #48] // AES block 3 - load plaintext
- ldp x21, x22, [x0, #32] // AES block 2 - load plaintext
- add x0, x0, #64 // AES input_ptr update
- eor x19, x19, x13 // AES block 1 - round N low
- eor x20, x20, x14 // AES block 1 - round N high
- fmov d5, x19 // AES block 1 - mov low
- eor x6, x6, x13 // AES block 0 - round N low
- eor x7, x7, x14 // AES block 0 - round N high
- eor x24, x24, x14 // AES block 3 - round N high
- fmov d4, x6 // AES block 0 - mov low
- cmp x0, x5 // check if we have <= 8 blocks
- fmov v4.d[1], x7 // AES block 0 - mov high
- eor x23, x23, x13 // AES block 3 - round N low
- eor x21, x21, x13 // AES block 2 - round N low
- fmov v5.d[1], x20 // AES block 1 - mov high
- fmov d6, x21 // AES block 2 - mov low
- add w12, w12, #1 // CTR block 4
- orr x9, x11, x9, lsl #32 // CTR block 4
- fmov d7, x23 // AES block 3 - mov low
- eor x22, x22, x14 // AES block 2 - round N high
- fmov v6.d[1], x22 // AES block 2 - mov high
- eor v4.16b, v4.16b, v0.16b // AES block 0 - result
- fmov d0, x10 // CTR block 4
- fmov v0.d[1], x9 // CTR block 4
- rev w9, w12 // CTR block 5
- add w12, w12, #1 // CTR block 5
- eor v5.16b, v5.16b, v1.16b // AES block 1 - result
- fmov d1, x10 // CTR block 5
- orr x9, x11, x9, lsl #32 // CTR block 5
- fmov v1.d[1], x9 // CTR block 5
- rev w9, w12 // CTR block 6
- st1 { v4.16b}, [x2], #16 // AES block 0 - store result
- fmov v7.d[1], x24 // AES block 3 - mov high
- orr x9, x11, x9, lsl #32 // CTR block 6
- eor v6.16b, v6.16b, v2.16b // AES block 2 - result
- st1 { v5.16b}, [x2], #16 // AES block 1 - store result
- add w12, w12, #1 // CTR block 6
- fmov d2, x10 // CTR block 6
- fmov v2.d[1], x9 // CTR block 6
- st1 { v6.16b}, [x2], #16 // AES block 2 - store result
- rev w9, w12 // CTR block 7
- orr x9, x11, x9, lsl #32 // CTR block 7
- eor v7.16b, v7.16b, v3.16b // AES block 3 - result
- st1 { v7.16b}, [x2], #16 // AES block 3 - store result
- b.ge Lenc_prepretail // do prepretail
-
-Lenc_main_loop: // main loop start
- aese v0.16b, v18.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 0
- rev64 v4.16b, v4.16b // GHASH block 4k (only t0 is free)
- aese v1.16b, v18.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 0
- fmov d3, x10 // CTR block 4k+3
- aese v2.16b, v18.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 0
- ext v11.16b, v11.16b, v11.16b, #8 // PRE 0
- aese v0.16b, v19.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 1
- fmov v3.d[1], x9 // CTR block 4k+3
- aese v1.16b, v19.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 1
- ldp x23, x24, [x0, #48] // AES block 4k+7 - load plaintext
- aese v2.16b, v19.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 1
- ldp x21, x22, [x0, #32] // AES block 4k+6 - load plaintext
- aese v0.16b, v20.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 2
- eor v4.16b, v4.16b, v11.16b // PRE 1
- aese v1.16b, v20.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 2
- aese v3.16b, v18.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 0
- eor x23, x23, x13 // AES block 4k+7 - round N low
- aese v0.16b, v21.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 3
- mov d10, v17.d[1] // GHASH block 4k - mid
- pmull2 v9.1q, v4.2d, v15.2d // GHASH block 4k - high
- eor x22, x22, x14 // AES block 4k+6 - round N high
- mov d8, v4.d[1] // GHASH block 4k - mid
- aese v3.16b, v19.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 1
- rev64 v5.16b, v5.16b // GHASH block 4k+1 (t0 and t1 free)
- aese v0.16b, v22.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 4
- pmull v11.1q, v4.1d, v15.1d // GHASH block 4k - low
- eor v8.8b, v8.8b, v4.8b // GHASH block 4k - mid
- aese v2.16b, v20.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 2
- aese v0.16b, v23.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 5
- rev64 v7.16b, v7.16b // GHASH block 4k+3 (t0, t1, t2 and t3 free)
- pmull2 v4.1q, v5.2d, v14.2d // GHASH block 4k+1 - high
- pmull v10.1q, v8.1d, v10.1d // GHASH block 4k - mid
- rev64 v6.16b, v6.16b // GHASH block 4k+2 (t0, t1, and t2 free)
- pmull v8.1q, v5.1d, v14.1d // GHASH block 4k+1 - low
- eor v9.16b, v9.16b, v4.16b // GHASH block 4k+1 - high
- mov d4, v5.d[1] // GHASH block 4k+1 - mid
- aese v1.16b, v21.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 3
- aese v3.16b, v20.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 2
- eor v11.16b, v11.16b, v8.16b // GHASH block 4k+1 - low
- aese v2.16b, v21.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 3
- aese v1.16b, v22.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 4
- mov d8, v6.d[1] // GHASH block 4k+2 - mid
- aese v3.16b, v21.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 3
- eor v4.8b, v4.8b, v5.8b // GHASH block 4k+1 - mid
- aese v2.16b, v22.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 4
- aese v0.16b, v24.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 6
- eor v8.8b, v8.8b, v6.8b // GHASH block 4k+2 - mid
- aese v3.16b, v22.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 4
- pmull v4.1q, v4.1d, v17.1d // GHASH block 4k+1 - mid
- aese v0.16b, v25.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 7
- aese v3.16b, v23.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 5
- ins v8.d[1], v8.d[0] // GHASH block 4k+2 - mid
- aese v1.16b, v23.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 5
- aese v0.16b, v26.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 8
- aese v2.16b, v23.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 5
- aese v1.16b, v24.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 6
- eor v10.16b, v10.16b, v4.16b // GHASH block 4k+1 - mid
- pmull2 v4.1q, v6.2d, v13.2d // GHASH block 4k+2 - high
- pmull v5.1q, v6.1d, v13.1d // GHASH block 4k+2 - low
- aese v1.16b, v25.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 7
- pmull v6.1q, v7.1d, v12.1d // GHASH block 4k+3 - low
- eor v9.16b, v9.16b, v4.16b // GHASH block 4k+2 - high
- aese v3.16b, v24.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 6
- ldp x19, x20, [x0, #16] // AES block 4k+5 - load plaintext
- aese v1.16b, v26.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 8
- mov d4, v7.d[1] // GHASH block 4k+3 - mid
- aese v2.16b, v24.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 6
- eor v11.16b, v11.16b, v5.16b // GHASH block 4k+2 - low
- pmull2 v8.1q, v8.2d, v16.2d // GHASH block 4k+2 - mid
- pmull2 v5.1q, v7.2d, v12.2d // GHASH block 4k+3 - high
- eor v4.8b, v4.8b, v7.8b // GHASH block 4k+3 - mid
- aese v2.16b, v25.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 7
- eor x19, x19, x13 // AES block 4k+5 - round N low
- aese v2.16b, v26.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 8
- eor v10.16b, v10.16b, v8.16b // GHASH block 4k+2 - mid
- aese v3.16b, v25.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 7
- eor x21, x21, x13 // AES block 4k+6 - round N low
- aese v3.16b, v26.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 8
- movi v8.8b, #0xc2
- pmull v4.1q, v4.1d, v16.1d // GHASH block 4k+3 - mid
- eor v9.16b, v9.16b, v5.16b // GHASH block 4k+3 - high
- cmp x17, #12 // setup flags for AES-128/192/256 check
- fmov d5, x19 // AES block 4k+5 - mov low
- ldp x6, x7, [x0, #0] // AES block 4k+4 - load plaintext
- b.lt Lenc_main_loop_continue // branch if AES-128
-
- aese v1.16b, v27.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 9
- aese v0.16b, v27.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 9
- aese v2.16b, v27.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 9
- aese v3.16b, v27.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 9
- aese v0.16b, v28.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 10
- aese v1.16b, v28.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 10
- aese v2.16b, v28.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 10
- aese v3.16b, v28.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 10
- b.eq Lenc_main_loop_continue // branch if AES-192
-
- aese v0.16b, v29.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 11
- aese v1.16b, v29.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 11
- aese v2.16b, v29.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 11
- aese v3.16b, v29.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 11
- aese v1.16b, v30.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 12
- aese v0.16b, v30.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 12
- aese v2.16b, v30.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 12
- aese v3.16b, v30.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 12
-
-Lenc_main_loop_continue:
- shl d8, d8, #56 // mod_constant
- eor v11.16b, v11.16b, v6.16b // GHASH block 4k+3 - low
- eor v10.16b, v10.16b, v4.16b // GHASH block 4k+3 - mid
- add w12, w12, #1 // CTR block 4k+3
- eor v4.16b, v11.16b, v9.16b // MODULO - karatsuba tidy up
- add x0, x0, #64 // AES input_ptr update
- pmull v7.1q, v9.1d, v8.1d // MODULO - top 64b align with mid
- rev w9, w12 // CTR block 4k+8
- ext v9.16b, v9.16b, v9.16b, #8 // MODULO - other top alignment
- eor x6, x6, x13 // AES block 4k+4 - round N low
- eor v10.16b, v10.16b, v4.16b // MODULO - karatsuba tidy up
- eor x7, x7, x14 // AES block 4k+4 - round N high
- fmov d4, x6 // AES block 4k+4 - mov low
- orr x9, x11, x9, lsl #32 // CTR block 4k+8
- eor v7.16b, v9.16b, v7.16b // MODULO - fold into mid
- eor x20, x20, x14 // AES block 4k+5 - round N high
- eor x24, x24, x14 // AES block 4k+7 - round N high
- add w12, w12, #1 // CTR block 4k+8
- aese v0.16b, v31.16b // AES block 4k+4 - round N-1
- fmov v4.d[1], x7 // AES block 4k+4 - mov high
- eor v10.16b, v10.16b, v7.16b // MODULO - fold into mid
- fmov d7, x23 // AES block 4k+7 - mov low
- aese v1.16b, v31.16b // AES block 4k+5 - round N-1
- fmov v5.d[1], x20 // AES block 4k+5 - mov high
- fmov d6, x21 // AES block 4k+6 - mov low
- cmp x0, x5 // LOOP CONTROL
- fmov v6.d[1], x22 // AES block 4k+6 - mov high
- pmull v9.1q, v10.1d, v8.1d // MODULO - mid 64b align with low
- eor v4.16b, v4.16b, v0.16b // AES block 4k+4 - result
- fmov d0, x10 // CTR block 4k+8
- fmov v0.d[1], x9 // CTR block 4k+8
- rev w9, w12 // CTR block 4k+9
- add w12, w12, #1 // CTR block 4k+9
- eor v5.16b, v5.16b, v1.16b // AES block 4k+5 - result
- fmov d1, x10 // CTR block 4k+9
- orr x9, x11, x9, lsl #32 // CTR block 4k+9
- fmov v1.d[1], x9 // CTR block 4k+9
- aese v2.16b, v31.16b // AES block 4k+6 - round N-1
- rev w9, w12 // CTR block 4k+10
- st1 { v4.16b}, [x2], #16 // AES block 4k+4 - store result
- orr x9, x11, x9, lsl #32 // CTR block 4k+10
- eor v11.16b, v11.16b, v9.16b // MODULO - fold into low
- fmov v7.d[1], x24 // AES block 4k+7 - mov high
- ext v10.16b, v10.16b, v10.16b, #8 // MODULO - other mid alignment
- st1 { v5.16b}, [x2], #16 // AES block 4k+5 - store result
- add w12, w12, #1 // CTR block 4k+10
- aese v3.16b, v31.16b // AES block 4k+7 - round N-1
- eor v6.16b, v6.16b, v2.16b // AES block 4k+6 - result
- fmov d2, x10 // CTR block 4k+10
- st1 { v6.16b}, [x2], #16 // AES block 4k+6 - store result
- fmov v2.d[1], x9 // CTR block 4k+10
- rev w9, w12 // CTR block 4k+11
- eor v11.16b, v11.16b, v10.16b // MODULO - fold into low
- orr x9, x11, x9, lsl #32 // CTR block 4k+11
- eor v7.16b, v7.16b, v3.16b // AES block 4k+7 - result
- st1 { v7.16b}, [x2], #16 // AES block 4k+7 - store result
- b.lt Lenc_main_loop
-
-Lenc_prepretail: // PREPRETAIL
- aese v1.16b, v18.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 0
- rev64 v6.16b, v6.16b // GHASH block 4k+2 (t0, t1, and t2 free)
- aese v2.16b, v18.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 0
- fmov d3, x10 // CTR block 4k+3
- aese v0.16b, v18.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 0
- rev64 v4.16b, v4.16b // GHASH block 4k (only t0 is free)
- fmov v3.d[1], x9 // CTR block 4k+3
- ext v11.16b, v11.16b, v11.16b, #8 // PRE 0
- aese v2.16b, v19.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 1
- aese v0.16b, v19.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 1
- eor v4.16b, v4.16b, v11.16b // PRE 1
- rev64 v5.16b, v5.16b // GHASH block 4k+1 (t0 and t1 free)
- aese v2.16b, v20.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 2
- aese v3.16b, v18.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 0
- mov d10, v17.d[1] // GHASH block 4k - mid
- aese v1.16b, v19.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 1
- pmull v11.1q, v4.1d, v15.1d // GHASH block 4k - low
- mov d8, v4.d[1] // GHASH block 4k - mid
- pmull2 v9.1q, v4.2d, v15.2d // GHASH block 4k - high
- aese v2.16b, v21.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 3
- aese v1.16b, v20.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 2
- eor v8.8b, v8.8b, v4.8b // GHASH block 4k - mid
- aese v0.16b, v20.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 2
- aese v3.16b, v19.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 1
- aese v1.16b, v21.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 3
- pmull v10.1q, v8.1d, v10.1d // GHASH block 4k - mid
- pmull2 v4.1q, v5.2d, v14.2d // GHASH block 4k+1 - high
- pmull v8.1q, v5.1d, v14.1d // GHASH block 4k+1 - low
- aese v3.16b, v20.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 2
- eor v9.16b, v9.16b, v4.16b // GHASH block 4k+1 - high
- mov d4, v5.d[1] // GHASH block 4k+1 - mid
- aese v0.16b, v21.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 3
- eor v11.16b, v11.16b, v8.16b // GHASH block 4k+1 - low
- aese v3.16b, v21.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 3
- eor v4.8b, v4.8b, v5.8b // GHASH block 4k+1 - mid
- mov d8, v6.d[1] // GHASH block 4k+2 - mid
- aese v0.16b, v22.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 4
- rev64 v7.16b, v7.16b // GHASH block 4k+3 (t0, t1, t2 and t3 free)
- aese v3.16b, v22.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 4
- pmull v4.1q, v4.1d, v17.1d // GHASH block 4k+1 - mid
- eor v8.8b, v8.8b, v6.8b // GHASH block 4k+2 - mid
- add w12, w12, #1 // CTR block 4k+3
- pmull v5.1q, v6.1d, v13.1d // GHASH block 4k+2 - low
- aese v3.16b, v23.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 5
- aese v2.16b, v22.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 4
- eor v10.16b, v10.16b, v4.16b // GHASH block 4k+1 - mid
- pmull2 v4.1q, v6.2d, v13.2d // GHASH block 4k+2 - high
- eor v11.16b, v11.16b, v5.16b // GHASH block 4k+2 - low
- ins v8.d[1], v8.d[0] // GHASH block 4k+2 - mid
- aese v2.16b, v23.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 5
- eor v9.16b, v9.16b, v4.16b // GHASH block 4k+2 - high
- mov d4, v7.d[1] // GHASH block 4k+3 - mid
- aese v1.16b, v22.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 4
- pmull2 v8.1q, v8.2d, v16.2d // GHASH block 4k+2 - mid
- eor v4.8b, v4.8b, v7.8b // GHASH block 4k+3 - mid
- pmull2 v5.1q, v7.2d, v12.2d // GHASH block 4k+3 - high
- aese v1.16b, v23.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 5
- pmull v4.1q, v4.1d, v16.1d // GHASH block 4k+3 - mid
- eor v10.16b, v10.16b, v8.16b // GHASH block 4k+2 - mid
- aese v0.16b, v23.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 5
- aese v1.16b, v24.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 6
- aese v2.16b, v24.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 6
- aese v0.16b, v24.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 6
- movi v8.8b, #0xc2
- aese v3.16b, v24.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 6
- aese v1.16b, v25.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 7
- eor v9.16b, v9.16b, v5.16b // GHASH block 4k+3 - high
- aese v0.16b, v25.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 7
- aese v3.16b, v25.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 7
- shl d8, d8, #56 // mod_constant
- aese v1.16b, v26.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 8
- eor v10.16b, v10.16b, v4.16b // GHASH block 4k+3 - mid
- pmull v6.1q, v7.1d, v12.1d // GHASH block 4k+3 - low
- aese v3.16b, v26.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 8
- cmp x17, #12 // setup flags for AES-128/192/256 check
- aese v0.16b, v26.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 8
- eor v11.16b, v11.16b, v6.16b // GHASH block 4k+3 - low
- aese v2.16b, v25.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 7
- eor v10.16b, v10.16b, v9.16b // karatsuba tidy up
- aese v2.16b, v26.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 8
- pmull v4.1q, v9.1d, v8.1d
- ext v9.16b, v9.16b, v9.16b, #8
- eor v10.16b, v10.16b, v11.16b
- b.lt Lenc_finish_prepretail // branch if AES-128
-
- aese v1.16b, v27.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 9
- aese v3.16b, v27.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 9
- aese v0.16b, v27.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 9
- aese v2.16b, v27.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 9
- aese v3.16b, v28.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 10
- aese v1.16b, v28.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 10
- aese v0.16b, v28.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 10
- aese v2.16b, v28.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 10
- b.eq Lenc_finish_prepretail // branch if AES-192
-
- aese v1.16b, v29.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 11
- aese v0.16b, v29.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 11
- aese v3.16b, v29.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 11
- aese v2.16b, v29.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 11
- aese v1.16b, v30.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 12
- aese v0.16b, v30.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 12
- aese v3.16b, v30.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 12
- aese v2.16b, v30.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 12
-
-Lenc_finish_prepretail:
- eor v10.16b, v10.16b, v4.16b
- eor v10.16b, v10.16b, v9.16b
- pmull v4.1q, v10.1d, v8.1d
- ext v10.16b, v10.16b, v10.16b, #8
- aese v1.16b, v31.16b // AES block 4k+5 - round N-1
- eor v11.16b, v11.16b, v4.16b
- aese v3.16b, v31.16b // AES block 4k+7 - round N-1
- aese v0.16b, v31.16b // AES block 4k+4 - round N-1
- aese v2.16b, v31.16b // AES block 4k+6 - round N-1
- eor v11.16b, v11.16b, v10.16b
-
-Lenc_tail: // TAIL
- ext v8.16b, v11.16b, v11.16b, #8 // prepare final partial tag
- sub x5, x4, x0 // main_end_input_ptr is number of bytes left to process
- ldp x6, x7, [x0], #16 // AES block 4k+4 - load plaintext
- eor x6, x6, x13 // AES block 4k+4 - round N low
- eor x7, x7, x14 // AES block 4k+4 - round N high
- cmp x5, #48
- fmov d4, x6 // AES block 4k+4 - mov low
- fmov v4.d[1], x7 // AES block 4k+4 - mov high
- eor v5.16b, v4.16b, v0.16b // AES block 4k+4 - result
- b.gt Lenc_blocks_more_than_3
- cmp x5, #32
- mov v3.16b, v2.16b
- movi v11.8b, #0
- movi v9.8b, #0
- sub w12, w12, #1
- mov v2.16b, v1.16b
- movi v10.8b, #0
- b.gt Lenc_blocks_more_than_2
- mov v3.16b, v1.16b
- sub w12, w12, #1
- cmp x5, #16
- b.gt Lenc_blocks_more_than_1
- sub w12, w12, #1
- b Lenc_blocks_less_than_1
-Lenc_blocks_more_than_3: // blocks left > 3
- st1 { v5.16b}, [x2], #16 // AES final-3 block - store result
- ldp x6, x7, [x0], #16 // AES final-2 block - load input low & high
- rev64 v4.16b, v5.16b // GHASH final-3 block
- eor x6, x6, x13 // AES final-2 block - round N low
- eor v4.16b, v4.16b, v8.16b // feed in partial tag
- eor x7, x7, x14 // AES final-2 block - round N high
- mov d22, v4.d[1] // GHASH final-3 block - mid
- fmov d5, x6 // AES final-2 block - mov low
- fmov v5.d[1], x7 // AES final-2 block - mov high
- eor v22.8b, v22.8b, v4.8b // GHASH final-3 block - mid
- movi v8.8b, #0 // suppress further partial tag feed in
- mov d10, v17.d[1] // GHASH final-3 block - mid
- pmull v11.1q, v4.1d, v15.1d // GHASH final-3 block - low
- pmull2 v9.1q, v4.2d, v15.2d // GHASH final-3 block - high
- pmull v10.1q, v22.1d, v10.1d // GHASH final-3 block - mid
- eor v5.16b, v5.16b, v1.16b // AES final-2 block - result
-Lenc_blocks_more_than_2: // blocks left > 2
- st1 { v5.16b}, [x2], #16 // AES final-2 block - store result
- ldp x6, x7, [x0], #16 // AES final-1 block - load input low & high
- rev64 v4.16b, v5.16b // GHASH final-2 block
- eor x6, x6, x13 // AES final-1 block - round N low
- eor v4.16b, v4.16b, v8.16b // feed in partial tag
- fmov d5, x6 // AES final-1 block - mov low
- eor x7, x7, x14 // AES final-1 block - round N high
- fmov v5.d[1], x7 // AES final-1 block - mov high
- movi v8.8b, #0 // suppress further partial tag feed in
- pmull2 v20.1q, v4.2d, v14.2d // GHASH final-2 block - high
- mov d22, v4.d[1] // GHASH final-2 block - mid
- pmull v21.1q, v4.1d, v14.1d // GHASH final-2 block - low
- eor v22.8b, v22.8b, v4.8b // GHASH final-2 block - mid
- eor v5.16b, v5.16b, v2.16b // AES final-1 block - result
- eor v9.16b, v9.16b, v20.16b // GHASH final-2 block - high
- pmull v22.1q, v22.1d, v17.1d // GHASH final-2 block - mid
- eor v11.16b, v11.16b, v21.16b // GHASH final-2 block - low
- eor v10.16b, v10.16b, v22.16b // GHASH final-2 block - mid
-Lenc_blocks_more_than_1: // blocks left > 1
- st1 { v5.16b}, [x2], #16 // AES final-1 block - store result
- rev64 v4.16b, v5.16b // GHASH final-1 block
- ldp x6, x7, [x0], #16 // AES final block - load input low & high
- eor v4.16b, v4.16b, v8.16b // feed in partial tag
- movi v8.8b, #0 // suppress further partial tag feed in
- eor x6, x6, x13 // AES final block - round N low
- mov d22, v4.d[1] // GHASH final-1 block - mid
- pmull2 v20.1q, v4.2d, v13.2d // GHASH final-1 block - high
- eor x7, x7, x14 // AES final block - round N high
- eor v22.8b, v22.8b, v4.8b // GHASH final-1 block - mid
- eor v9.16b, v9.16b, v20.16b // GHASH final-1 block - high
- ins v22.d[1], v22.d[0] // GHASH final-1 block - mid
- fmov d5, x6 // AES final block - mov low
- fmov v5.d[1], x7 // AES final block - mov high
- pmull2 v22.1q, v22.2d, v16.2d // GHASH final-1 block - mid
- pmull v21.1q, v4.1d, v13.1d // GHASH final-1 block - low
- eor v5.16b, v5.16b, v3.16b // AES final block - result
- eor v10.16b, v10.16b, v22.16b // GHASH final-1 block - mid
- eor v11.16b, v11.16b, v21.16b // GHASH final-1 block - low
-Lenc_blocks_less_than_1: // blocks left <= 1
- and x1, x1, #127 // bit_length %= 128
- mvn x13, xzr // rkN_l = 0xffffffffffffffff
- sub x1, x1, #128 // bit_length -= 128
- neg x1, x1 // bit_length = 128 - #bits in input (in range [1,128])
- ld1 { v18.16b}, [x2] // load existing bytes where the possibly partial last block is to be stored
- mvn x14, xzr // rkN_h = 0xffffffffffffffff
- and x1, x1, #127 // bit_length %= 128
- lsr x14, x14, x1 // rkN_h is mask for top 64b of last block
- cmp x1, #64
- csel x6, x13, x14, lt
- csel x7, x14, xzr, lt
- fmov d0, x6 // ctr0b is mask for last block
- fmov v0.d[1], x7
- and v5.16b, v5.16b, v0.16b // possibly partial last block has zeroes in highest bits
- rev64 v4.16b, v5.16b // GHASH final block
- eor v4.16b, v4.16b, v8.16b // feed in partial tag
- bif v5.16b, v18.16b, v0.16b // insert existing bytes in top end of result before storing
- pmull2 v20.1q, v4.2d, v12.2d // GHASH final block - high
- mov d8, v4.d[1] // GHASH final block - mid
- rev w9, w12
- pmull v21.1q, v4.1d, v12.1d // GHASH final block - low
- eor v9.16b, v9.16b, v20.16b // GHASH final block - high
- eor v8.8b, v8.8b, v4.8b // GHASH final block - mid
- pmull v8.1q, v8.1d, v16.1d // GHASH final block - mid
- eor v11.16b, v11.16b, v21.16b // GHASH final block - low
- eor v10.16b, v10.16b, v8.16b // GHASH final block - mid
- movi v8.8b, #0xc2
- eor v4.16b, v11.16b, v9.16b // MODULO - karatsuba tidy up
- shl d8, d8, #56 // mod_constant
- eor v10.16b, v10.16b, v4.16b // MODULO - karatsuba tidy up
- pmull v7.1q, v9.1d, v8.1d // MODULO - top 64b align with mid
- ext v9.16b, v9.16b, v9.16b, #8 // MODULO - other top alignment
- eor v10.16b, v10.16b, v7.16b // MODULO - fold into mid
- eor v10.16b, v10.16b, v9.16b // MODULO - fold into mid
- pmull v9.1q, v10.1d, v8.1d // MODULO - mid 64b align with low
- ext v10.16b, v10.16b, v10.16b, #8 // MODULO - other mid alignment
- str w9, [x16, #12] // store the updated counter
- st1 { v5.16b}, [x2] // store all 16B
- eor v11.16b, v11.16b, v9.16b // MODULO - fold into low
- eor v11.16b, v11.16b, v10.16b // MODULO - fold into low
- ext v11.16b, v11.16b, v11.16b, #8
- rev64 v11.16b, v11.16b
- mov x0, x15
- st1 { v11.16b }, [x3]
- ldp x19, x20, [sp, #16]
- ldp x21, x22, [sp, #32]
- ldp x23, x24, [sp, #48]
- ldp d8, d9, [sp, #64]
- ldp d10, d11, [sp, #80]
- ldp d12, d13, [sp, #96]
- ldp d14, d15, [sp, #112]
- ldp x29, x30, [sp], #128
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-.globl aes_gcm_dec_kernel
-
-.def aes_gcm_dec_kernel
- .type 32
-.endef
-.align 4
-aes_gcm_dec_kernel:
- AARCH64_SIGN_LINK_REGISTER
- stp x29, x30, [sp, #-128]!
- mov x29, sp
- stp x19, x20, [sp, #16]
- mov x16, x4
- mov x8, x5
- stp x21, x22, [sp, #32]
- stp x23, x24, [sp, #48]
- stp d8, d9, [sp, #64]
- stp d10, d11, [sp, #80]
- stp d12, d13, [sp, #96]
- stp d14, d15, [sp, #112]
- ldr w17, [x8, #240]
- add x19, x8, x17, lsl #4 // borrow input_l1 for last key
- ldp x13, x14, [x19] // load round N keys
- ldr q31, [x19, #-16] // load round N-1 keys
- lsr x5, x1, #3 // byte_len
- mov x15, x5
- ldp x10, x11, [x16] // ctr96_b64, ctr96_t32
- ldr q26, [x8, #128] // load rk8
- sub x5, x5, #1 // byte_len - 1
- ldr q25, [x8, #112] // load rk7
- and x5, x5, #0xffffffffffffffc0 // number of bytes to be processed in main loop (at least 1 byte must be handled by tail)
- add x4, x0, x1, lsr #3 // end_input_ptr
- ldr q24, [x8, #96] // load rk6
- lsr x12, x11, #32
- ldr q23, [x8, #80] // load rk5
- orr w11, w11, w11
- ldr q21, [x8, #48] // load rk3
- add x5, x5, x0
- rev w12, w12 // rev_ctr32
- add w12, w12, #1 // increment rev_ctr32
- fmov d3, x10 // CTR block 3
- rev w9, w12 // CTR block 1
- add w12, w12, #1 // CTR block 1
- fmov d1, x10 // CTR block 1
- orr x9, x11, x9, lsl #32 // CTR block 1
- ld1 { v0.16b}, [x16] // special case vector load initial counter so we can start first AES block as quickly as possible
- fmov v1.d[1], x9 // CTR block 1
- rev w9, w12 // CTR block 2
- add w12, w12, #1 // CTR block 2
- fmov d2, x10 // CTR block 2
- orr x9, x11, x9, lsl #32 // CTR block 2
- fmov v2.d[1], x9 // CTR block 2
- rev w9, w12 // CTR block 3
- orr x9, x11, x9, lsl #32 // CTR block 3
- ldr q18, [x8, #0] // load rk0
- fmov v3.d[1], x9 // CTR block 3
- add w12, w12, #1 // CTR block 3
- ldr q22, [x8, #64] // load rk4
- ldr q19, [x8, #16] // load rk1
- aese v0.16b, v18.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 0
- ldr q14, [x6, #48] // load h3l | h3h
- ext v14.16b, v14.16b, v14.16b, #8
- aese v3.16b, v18.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 0
- ldr q15, [x6, #80] // load h4l | h4h
- ext v15.16b, v15.16b, v15.16b, #8
- aese v1.16b, v18.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 0
- ldr q13, [x6, #32] // load h2l | h2h
- ext v13.16b, v13.16b, v13.16b, #8
- aese v2.16b, v18.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 0
- ldr q20, [x8, #32] // load rk2
- aese v0.16b, v19.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 1
- aese v1.16b, v19.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 1
- ld1 { v11.16b}, [x3]
- ext v11.16b, v11.16b, v11.16b, #8
- rev64 v11.16b, v11.16b
- aese v2.16b, v19.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 1
- ldr q27, [x8, #144] // load rk9
- aese v3.16b, v19.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 1
- ldr q30, [x8, #192] // load rk12
- aese v0.16b, v20.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 2
- ldr q12, [x6] // load h1l | h1h
- ext v12.16b, v12.16b, v12.16b, #8
- aese v2.16b, v20.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 2
- ldr q28, [x8, #160] // load rk10
- aese v3.16b, v20.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 2
- aese v0.16b, v21.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 3
- aese v1.16b, v20.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 2
- aese v3.16b, v21.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 3
- aese v0.16b, v22.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 4
- aese v2.16b, v21.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 3
- aese v1.16b, v21.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 3
- aese v3.16b, v22.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 4
- aese v2.16b, v22.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 4
- aese v1.16b, v22.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 4
- aese v3.16b, v23.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 5
- aese v0.16b, v23.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 5
- aese v1.16b, v23.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 5
- aese v2.16b, v23.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 5
- aese v0.16b, v24.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 6
- aese v3.16b, v24.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 6
- cmp x17, #12 // setup flags for AES-128/192/256 check
- aese v1.16b, v24.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 6
- aese v2.16b, v24.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 6
- aese v0.16b, v25.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 7
- aese v1.16b, v25.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 7
- aese v3.16b, v25.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 7
- aese v0.16b, v26.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 8
- aese v2.16b, v25.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 7
- aese v3.16b, v26.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 8
- aese v1.16b, v26.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 8
- ldr q29, [x8, #176] // load rk11
- aese v2.16b, v26.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 8
- b.lt Ldec_finish_first_blocks // branch if AES-128
-
- aese v0.16b, v27.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 9
- aese v1.16b, v27.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 9
- aese v3.16b, v27.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 9
- aese v2.16b, v27.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 9
- aese v0.16b, v28.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 10
- aese v1.16b, v28.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 10
- aese v3.16b, v28.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 10
- aese v2.16b, v28.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 10
- b.eq Ldec_finish_first_blocks // branch if AES-192
-
- aese v0.16b, v29.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 11
- aese v3.16b, v29.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 11
- aese v1.16b, v29.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 11
- aese v2.16b, v29.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 11
- aese v1.16b, v30.16b
- aesmc v1.16b, v1.16b // AES block 1 - round 12
- aese v0.16b, v30.16b
- aesmc v0.16b, v0.16b // AES block 0 - round 12
- aese v2.16b, v30.16b
- aesmc v2.16b, v2.16b // AES block 2 - round 12
- aese v3.16b, v30.16b
- aesmc v3.16b, v3.16b // AES block 3 - round 12
-
-Ldec_finish_first_blocks:
- cmp x0, x5 // check if we have <= 4 blocks
- trn1 v9.2d, v14.2d, v15.2d // h4h | h3h
- trn2 v17.2d, v14.2d, v15.2d // h4l | h3l
- trn1 v8.2d, v12.2d, v13.2d // h2h | h1h
- trn2 v16.2d, v12.2d, v13.2d // h2l | h1l
- eor v17.16b, v17.16b, v9.16b // h4k | h3k
- aese v1.16b, v31.16b // AES block 1 - round N-1
- aese v2.16b, v31.16b // AES block 2 - round N-1
- eor v16.16b, v16.16b, v8.16b // h2k | h1k
- aese v3.16b, v31.16b // AES block 3 - round N-1
- aese v0.16b, v31.16b // AES block 0 - round N-1
- b.ge Ldec_tail // handle tail
-
- ldr q4, [x0, #0] // AES block 0 - load ciphertext
- ldr q5, [x0, #16] // AES block 1 - load ciphertext
- rev w9, w12 // CTR block 4
- eor v0.16b, v4.16b, v0.16b // AES block 0 - result
- eor v1.16b, v5.16b, v1.16b // AES block 1 - result
- rev64 v5.16b, v5.16b // GHASH block 1
- ldr q7, [x0, #48] // AES block 3 - load ciphertext
- mov x7, v0.d[1] // AES block 0 - mov high
- mov x6, v0.d[0] // AES block 0 - mov low
- rev64 v4.16b, v4.16b // GHASH block 0
- add w12, w12, #1 // CTR block 4
- fmov d0, x10 // CTR block 4
- orr x9, x11, x9, lsl #32 // CTR block 4
- fmov v0.d[1], x9 // CTR block 4
- rev w9, w12 // CTR block 5
- add w12, w12, #1 // CTR block 5
- mov x19, v1.d[0] // AES block 1 - mov low
- orr x9, x11, x9, lsl #32 // CTR block 5
- mov x20, v1.d[1] // AES block 1 - mov high
- eor x7, x7, x14 // AES block 0 - round N high
- eor x6, x6, x13 // AES block 0 - round N low
- stp x6, x7, [x2], #16 // AES block 0 - store result
- fmov d1, x10 // CTR block 5
- ldr q6, [x0, #32] // AES block 2 - load ciphertext
- add x0, x0, #64 // AES input_ptr update
- fmov v1.d[1], x9 // CTR block 5
- rev w9, w12 // CTR block 6
- add w12, w12, #1 // CTR block 6
- eor x19, x19, x13 // AES block 1 - round N low
- orr x9, x11, x9, lsl #32 // CTR block 6
- eor x20, x20, x14 // AES block 1 - round N high
- stp x19, x20, [x2], #16 // AES block 1 - store result
- eor v2.16b, v6.16b, v2.16b // AES block 2 - result
- cmp x0, x5 // check if we have <= 8 blocks
- b.ge Ldec_prepretail // do prepretail
-
-Ldec_main_loop: // main loop start
- mov x21, v2.d[0] // AES block 4k+2 - mov low
- ext v11.16b, v11.16b, v11.16b, #8 // PRE 0
- eor v3.16b, v7.16b, v3.16b // AES block 4k+3 - result
- aese v0.16b, v18.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 0
- mov x22, v2.d[1] // AES block 4k+2 - mov high
- aese v1.16b, v18.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 0
- fmov d2, x10 // CTR block 4k+6
- fmov v2.d[1], x9 // CTR block 4k+6
- eor v4.16b, v4.16b, v11.16b // PRE 1
- rev w9, w12 // CTR block 4k+7
- aese v0.16b, v19.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 1
- mov x24, v3.d[1] // AES block 4k+3 - mov high
- aese v1.16b, v19.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 1
- mov x23, v3.d[0] // AES block 4k+3 - mov low
- pmull2 v9.1q, v4.2d, v15.2d // GHASH block 4k - high
- mov d8, v4.d[1] // GHASH block 4k - mid
- fmov d3, x10 // CTR block 4k+7
- aese v0.16b, v20.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 2
- orr x9, x11, x9, lsl #32 // CTR block 4k+7
- aese v2.16b, v18.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 0
- fmov v3.d[1], x9 // CTR block 4k+7
- aese v1.16b, v20.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 2
- eor v8.8b, v8.8b, v4.8b // GHASH block 4k - mid
- aese v0.16b, v21.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 3
- eor x22, x22, x14 // AES block 4k+2 - round N high
- aese v2.16b, v19.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 1
- mov d10, v17.d[1] // GHASH block 4k - mid
- aese v1.16b, v21.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 3
- rev64 v6.16b, v6.16b // GHASH block 4k+2
- aese v3.16b, v18.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 0
- eor x21, x21, x13 // AES block 4k+2 - round N low
- aese v2.16b, v20.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 2
- stp x21, x22, [x2], #16 // AES block 4k+2 - store result
- pmull v11.1q, v4.1d, v15.1d // GHASH block 4k - low
- pmull2 v4.1q, v5.2d, v14.2d // GHASH block 4k+1 - high
- aese v2.16b, v21.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 3
- rev64 v7.16b, v7.16b // GHASH block 4k+3
- pmull v10.1q, v8.1d, v10.1d // GHASH block 4k - mid
- eor x23, x23, x13 // AES block 4k+3 - round N low
- pmull v8.1q, v5.1d, v14.1d // GHASH block 4k+1 - low
- eor x24, x24, x14 // AES block 4k+3 - round N high
- eor v9.16b, v9.16b, v4.16b // GHASH block 4k+1 - high
- aese v2.16b, v22.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 4
- aese v3.16b, v19.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 1
- mov d4, v5.d[1] // GHASH block 4k+1 - mid
- aese v0.16b, v22.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 4
- eor v11.16b, v11.16b, v8.16b // GHASH block 4k+1 - low
- aese v2.16b, v23.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 5
- add w12, w12, #1 // CTR block 4k+7
- aese v3.16b, v20.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 2
- mov d8, v6.d[1] // GHASH block 4k+2 - mid
- aese v1.16b, v22.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 4
- eor v4.8b, v4.8b, v5.8b // GHASH block 4k+1 - mid
- pmull v5.1q, v6.1d, v13.1d // GHASH block 4k+2 - low
- aese v3.16b, v21.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 3
- eor v8.8b, v8.8b, v6.8b // GHASH block 4k+2 - mid
- aese v1.16b, v23.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 5
- aese v0.16b, v23.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 5
- eor v11.16b, v11.16b, v5.16b // GHASH block 4k+2 - low
- pmull v4.1q, v4.1d, v17.1d // GHASH block 4k+1 - mid
- rev w9, w12 // CTR block 4k+8
- aese v1.16b, v24.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 6
- ins v8.d[1], v8.d[0] // GHASH block 4k+2 - mid
- aese v0.16b, v24.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 6
- add w12, w12, #1 // CTR block 4k+8
- aese v3.16b, v22.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 4
- aese v1.16b, v25.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 7
- eor v10.16b, v10.16b, v4.16b // GHASH block 4k+1 - mid
- aese v0.16b, v25.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 7
- pmull2 v4.1q, v6.2d, v13.2d // GHASH block 4k+2 - high
- mov d6, v7.d[1] // GHASH block 4k+3 - mid
- aese v3.16b, v23.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 5
- pmull2 v8.1q, v8.2d, v16.2d // GHASH block 4k+2 - mid
- aese v0.16b, v26.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 8
- eor v9.16b, v9.16b, v4.16b // GHASH block 4k+2 - high
- aese v3.16b, v24.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 6
- pmull v4.1q, v7.1d, v12.1d // GHASH block 4k+3 - low
- orr x9, x11, x9, lsl #32 // CTR block 4k+8
- eor v10.16b, v10.16b, v8.16b // GHASH block 4k+2 - mid
- pmull2 v5.1q, v7.2d, v12.2d // GHASH block 4k+3 - high
- cmp x17, #12 // setup flags for AES-128/192/256 check
- eor v6.8b, v6.8b, v7.8b // GHASH block 4k+3 - mid
- aese v1.16b, v26.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 8
- aese v2.16b, v24.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 6
- eor v9.16b, v9.16b, v5.16b // GHASH block 4k+3 - high
- pmull v6.1q, v6.1d, v16.1d // GHASH block 4k+3 - mid
- movi v8.8b, #0xc2
- aese v2.16b, v25.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 7
- eor v11.16b, v11.16b, v4.16b // GHASH block 4k+3 - low
- aese v3.16b, v25.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 7
- shl d8, d8, #56 // mod_constant
- aese v2.16b, v26.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 8
- eor v10.16b, v10.16b, v6.16b // GHASH block 4k+3 - mid
- aese v3.16b, v26.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 8
- b.lt Ldec_main_loop_continue // branch if AES-128
-
- aese v0.16b, v27.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 9
- aese v2.16b, v27.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 9
- aese v1.16b, v27.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 9
- aese v3.16b, v27.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 9
- aese v0.16b, v28.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 10
- aese v1.16b, v28.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 10
- aese v2.16b, v28.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 10
- aese v3.16b, v28.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 10
- b.eq Ldec_main_loop_continue // branch if AES-192
-
- aese v0.16b, v29.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 11
- aese v1.16b, v29.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 11
- aese v2.16b, v29.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 11
- aese v3.16b, v29.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 11
- aese v0.16b, v30.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 12
- aese v1.16b, v30.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 12
- aese v2.16b, v30.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 12
- aese v3.16b, v30.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 12
-
-Ldec_main_loop_continue:
- pmull v7.1q, v9.1d, v8.1d // MODULO - top 64b align with mid
- eor v6.16b, v11.16b, v9.16b // MODULO - karatsuba tidy up
- ldr q4, [x0, #0] // AES block 4k+4 - load ciphertext
- aese v0.16b, v31.16b // AES block 4k+4 - round N-1
- ext v9.16b, v9.16b, v9.16b, #8 // MODULO - other top alignment
- eor v10.16b, v10.16b, v6.16b // MODULO - karatsuba tidy up
- ldr q5, [x0, #16] // AES block 4k+5 - load ciphertext
- eor v0.16b, v4.16b, v0.16b // AES block 4k+4 - result
- stp x23, x24, [x2], #16 // AES block 4k+3 - store result
- eor v10.16b, v10.16b, v7.16b // MODULO - fold into mid
- ldr q7, [x0, #48] // AES block 4k+7 - load ciphertext
- ldr q6, [x0, #32] // AES block 4k+6 - load ciphertext
- mov x7, v0.d[1] // AES block 4k+4 - mov high
- eor v10.16b, v10.16b, v9.16b // MODULO - fold into mid
- aese v1.16b, v31.16b // AES block 4k+5 - round N-1
- add x0, x0, #64 // AES input_ptr update
- mov x6, v0.d[0] // AES block 4k+4 - mov low
- fmov d0, x10 // CTR block 4k+8
- fmov v0.d[1], x9 // CTR block 4k+8
- pmull v8.1q, v10.1d, v8.1d // MODULO - mid 64b align with low
- eor v1.16b, v5.16b, v1.16b // AES block 4k+5 - result
- rev w9, w12 // CTR block 4k+9
- aese v2.16b, v31.16b // AES block 4k+6 - round N-1
- orr x9, x11, x9, lsl #32 // CTR block 4k+9
- cmp x0, x5 // LOOP CONTROL
- add w12, w12, #1 // CTR block 4k+9
- eor x6, x6, x13 // AES block 4k+4 - round N low
- eor x7, x7, x14 // AES block 4k+4 - round N high
- mov x20, v1.d[1] // AES block 4k+5 - mov high
- eor v2.16b, v6.16b, v2.16b // AES block 4k+6 - result
- eor v11.16b, v11.16b, v8.16b // MODULO - fold into low
- mov x19, v1.d[0] // AES block 4k+5 - mov low
- fmov d1, x10 // CTR block 4k+9
- ext v10.16b, v10.16b, v10.16b, #8 // MODULO - other mid alignment
- fmov v1.d[1], x9 // CTR block 4k+9
- rev w9, w12 // CTR block 4k+10
- add w12, w12, #1 // CTR block 4k+10
- aese v3.16b, v31.16b // AES block 4k+7 - round N-1
- orr x9, x11, x9, lsl #32 // CTR block 4k+10
- rev64 v5.16b, v5.16b // GHASH block 4k+5
- eor x20, x20, x14 // AES block 4k+5 - round N high
- stp x6, x7, [x2], #16 // AES block 4k+4 - store result
- eor x19, x19, x13 // AES block 4k+5 - round N low
- stp x19, x20, [x2], #16 // AES block 4k+5 - store result
- rev64 v4.16b, v4.16b // GHASH block 4k+4
- eor v11.16b, v11.16b, v10.16b // MODULO - fold into low
- b.lt Ldec_main_loop
-
-Ldec_prepretail: // PREPRETAIL
- ext v11.16b, v11.16b, v11.16b, #8 // PRE 0
- mov x21, v2.d[0] // AES block 4k+2 - mov low
- eor v3.16b, v7.16b, v3.16b // AES block 4k+3 - result
- aese v0.16b, v18.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 0
- mov x22, v2.d[1] // AES block 4k+2 - mov high
- aese v1.16b, v18.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 0
- fmov d2, x10 // CTR block 4k+6
- fmov v2.d[1], x9 // CTR block 4k+6
- rev w9, w12 // CTR block 4k+7
- eor v4.16b, v4.16b, v11.16b // PRE 1
- rev64 v6.16b, v6.16b // GHASH block 4k+2
- orr x9, x11, x9, lsl #32 // CTR block 4k+7
- mov x23, v3.d[0] // AES block 4k+3 - mov low
- aese v1.16b, v19.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 1
- mov x24, v3.d[1] // AES block 4k+3 - mov high
- pmull v11.1q, v4.1d, v15.1d // GHASH block 4k - low
- mov d8, v4.d[1] // GHASH block 4k - mid
- fmov d3, x10 // CTR block 4k+7
- pmull2 v9.1q, v4.2d, v15.2d // GHASH block 4k - high
- fmov v3.d[1], x9 // CTR block 4k+7
- aese v2.16b, v18.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 0
- mov d10, v17.d[1] // GHASH block 4k - mid
- aese v0.16b, v19.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 1
- eor v8.8b, v8.8b, v4.8b // GHASH block 4k - mid
- pmull2 v4.1q, v5.2d, v14.2d // GHASH block 4k+1 - high
- aese v2.16b, v19.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 1
- rev64 v7.16b, v7.16b // GHASH block 4k+3
- aese v3.16b, v18.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 0
- pmull v10.1q, v8.1d, v10.1d // GHASH block 4k - mid
- eor v9.16b, v9.16b, v4.16b // GHASH block 4k+1 - high
- pmull v8.1q, v5.1d, v14.1d // GHASH block 4k+1 - low
- aese v3.16b, v19.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 1
- mov d4, v5.d[1] // GHASH block 4k+1 - mid
- aese v0.16b, v20.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 2
- aese v1.16b, v20.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 2
- eor v11.16b, v11.16b, v8.16b // GHASH block 4k+1 - low
- aese v2.16b, v20.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 2
- aese v0.16b, v21.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 3
- mov d8, v6.d[1] // GHASH block 4k+2 - mid
- aese v3.16b, v20.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 2
- eor v4.8b, v4.8b, v5.8b // GHASH block 4k+1 - mid
- pmull v5.1q, v6.1d, v13.1d // GHASH block 4k+2 - low
- aese v0.16b, v22.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 4
- aese v3.16b, v21.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 3
- eor v8.8b, v8.8b, v6.8b // GHASH block 4k+2 - mid
- pmull v4.1q, v4.1d, v17.1d // GHASH block 4k+1 - mid
- aese v0.16b, v23.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 5
- eor v11.16b, v11.16b, v5.16b // GHASH block 4k+2 - low
- aese v3.16b, v22.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 4
- pmull2 v5.1q, v7.2d, v12.2d // GHASH block 4k+3 - high
- eor v10.16b, v10.16b, v4.16b // GHASH block 4k+1 - mid
- pmull2 v4.1q, v6.2d, v13.2d // GHASH block 4k+2 - high
- aese v3.16b, v23.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 5
- ins v8.d[1], v8.d[0] // GHASH block 4k+2 - mid
- aese v2.16b, v21.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 3
- aese v1.16b, v21.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 3
- eor v9.16b, v9.16b, v4.16b // GHASH block 4k+2 - high
- pmull v4.1q, v7.1d, v12.1d // GHASH block 4k+3 - low
- aese v2.16b, v22.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 4
- mov d6, v7.d[1] // GHASH block 4k+3 - mid
- aese v1.16b, v22.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 4
- pmull2 v8.1q, v8.2d, v16.2d // GHASH block 4k+2 - mid
- aese v2.16b, v23.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 5
- eor v6.8b, v6.8b, v7.8b // GHASH block 4k+3 - mid
- aese v1.16b, v23.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 5
- aese v3.16b, v24.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 6
- eor v10.16b, v10.16b, v8.16b // GHASH block 4k+2 - mid
- aese v2.16b, v24.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 6
- aese v0.16b, v24.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 6
- movi v8.8b, #0xc2
- aese v1.16b, v24.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 6
- eor v11.16b, v11.16b, v4.16b // GHASH block 4k+3 - low
- pmull v6.1q, v6.1d, v16.1d // GHASH block 4k+3 - mid
- aese v3.16b, v25.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 7
- cmp x17, #12 // setup flags for AES-128/192/256 check
- eor v9.16b, v9.16b, v5.16b // GHASH block 4k+3 - high
- aese v1.16b, v25.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 7
- aese v0.16b, v25.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 7
- eor v10.16b, v10.16b, v6.16b // GHASH block 4k+3 - mid
- aese v3.16b, v26.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 8
- aese v2.16b, v25.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 7
- eor v6.16b, v11.16b, v9.16b // MODULO - karatsuba tidy up
- aese v1.16b, v26.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 8
- aese v0.16b, v26.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 8
- shl d8, d8, #56 // mod_constant
- aese v2.16b, v26.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 8
- b.lt Ldec_finish_prepretail // branch if AES-128
-
- aese v1.16b, v27.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 9
- aese v2.16b, v27.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 9
- aese v3.16b, v27.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 9
- aese v0.16b, v27.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 9
- aese v2.16b, v28.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 10
- aese v3.16b, v28.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 10
- aese v0.16b, v28.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 10
- aese v1.16b, v28.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 10
- b.eq Ldec_finish_prepretail // branch if AES-192
-
- aese v2.16b, v29.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 11
- aese v0.16b, v29.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 11
- aese v1.16b, v29.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 11
- aese v2.16b, v30.16b
- aesmc v2.16b, v2.16b // AES block 4k+6 - round 12
- aese v3.16b, v29.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 11
- aese v1.16b, v30.16b
- aesmc v1.16b, v1.16b // AES block 4k+5 - round 12
- aese v0.16b, v30.16b
- aesmc v0.16b, v0.16b // AES block 4k+4 - round 12
- aese v3.16b, v30.16b
- aesmc v3.16b, v3.16b // AES block 4k+7 - round 12
-
-Ldec_finish_prepretail:
- eor v10.16b, v10.16b, v6.16b // MODULO - karatsuba tidy up
- pmull v7.1q, v9.1d, v8.1d // MODULO - top 64b align with mid
- ext v9.16b, v9.16b, v9.16b, #8 // MODULO - other top alignment
- eor v10.16b, v10.16b, v7.16b // MODULO - fold into mid
- eor x22, x22, x14 // AES block 4k+2 - round N high
- eor x23, x23, x13 // AES block 4k+3 - round N low
- eor v10.16b, v10.16b, v9.16b // MODULO - fold into mid
- add w12, w12, #1 // CTR block 4k+7
- eor x21, x21, x13 // AES block 4k+2 - round N low
- pmull v8.1q, v10.1d, v8.1d // MODULO - mid 64b align with low
- eor x24, x24, x14 // AES block 4k+3 - round N high
- stp x21, x22, [x2], #16 // AES block 4k+2 - store result
- ext v10.16b, v10.16b, v10.16b, #8 // MODULO - other mid alignment
- stp x23, x24, [x2], #16 // AES block 4k+3 - store result
-
- eor v11.16b, v11.16b, v8.16b // MODULO - fold into low
- aese v1.16b, v31.16b // AES block 4k+5 - round N-1
- aese v0.16b, v31.16b // AES block 4k+4 - round N-1
- aese v3.16b, v31.16b // AES block 4k+7 - round N-1
- aese v2.16b, v31.16b // AES block 4k+6 - round N-1
- eor v11.16b, v11.16b, v10.16b // MODULO - fold into low
-
-Ldec_tail: // TAIL
- sub x5, x4, x0 // main_end_input_ptr is number of bytes left to process
- ld1 { v5.16b}, [x0], #16 // AES block 4k+4 - load ciphertext
- eor v0.16b, v5.16b, v0.16b // AES block 4k+4 - result
- mov x6, v0.d[0] // AES block 4k+4 - mov low
- mov x7, v0.d[1] // AES block 4k+4 - mov high
- ext v8.16b, v11.16b, v11.16b, #8 // prepare final partial tag
- cmp x5, #48
- eor x6, x6, x13 // AES block 4k+4 - round N low
- eor x7, x7, x14 // AES block 4k+4 - round N high
- b.gt Ldec_blocks_more_than_3
- sub w12, w12, #1
- mov v3.16b, v2.16b
- movi v10.8b, #0
- movi v11.8b, #0
- cmp x5, #32
- movi v9.8b, #0
- mov v2.16b, v1.16b
- b.gt Ldec_blocks_more_than_2
- sub w12, w12, #1
- mov v3.16b, v1.16b
- cmp x5, #16
- b.gt Ldec_blocks_more_than_1
- sub w12, w12, #1
- b Ldec_blocks_less_than_1
-Ldec_blocks_more_than_3: // blocks left > 3
- rev64 v4.16b, v5.16b // GHASH final-3 block
- ld1 { v5.16b}, [x0], #16 // AES final-2 block - load ciphertext
- stp x6, x7, [x2], #16 // AES final-3 block - store result
- mov d10, v17.d[1] // GHASH final-3 block - mid
- eor v4.16b, v4.16b, v8.16b // feed in partial tag
- eor v0.16b, v5.16b, v1.16b // AES final-2 block - result
- mov d22, v4.d[1] // GHASH final-3 block - mid
- mov x6, v0.d[0] // AES final-2 block - mov low
- mov x7, v0.d[1] // AES final-2 block - mov high
- eor v22.8b, v22.8b, v4.8b // GHASH final-3 block - mid
- movi v8.8b, #0 // suppress further partial tag feed in
- pmull2 v9.1q, v4.2d, v15.2d // GHASH final-3 block - high
- pmull v10.1q, v22.1d, v10.1d // GHASH final-3 block - mid
- eor x6, x6, x13 // AES final-2 block - round N low
- pmull v11.1q, v4.1d, v15.1d // GHASH final-3 block - low
- eor x7, x7, x14 // AES final-2 block - round N high
-Ldec_blocks_more_than_2: // blocks left > 2
- rev64 v4.16b, v5.16b // GHASH final-2 block
- ld1 { v5.16b}, [x0], #16 // AES final-1 block - load ciphertext
- eor v4.16b, v4.16b, v8.16b // feed in partial tag
- stp x6, x7, [x2], #16 // AES final-2 block - store result
- eor v0.16b, v5.16b, v2.16b // AES final-1 block - result
- mov d22, v4.d[1] // GHASH final-2 block - mid
- pmull v21.1q, v4.1d, v14.1d // GHASH final-2 block - low
- pmull2 v20.1q, v4.2d, v14.2d // GHASH final-2 block - high
- eor v22.8b, v22.8b, v4.8b // GHASH final-2 block - mid
- mov x6, v0.d[0] // AES final-1 block - mov low
- mov x7, v0.d[1] // AES final-1 block - mov high
- eor v11.16b, v11.16b, v21.16b // GHASH final-2 block - low
- movi v8.8b, #0 // suppress further partial tag feed in
- pmull v22.1q, v22.1d, v17.1d // GHASH final-2 block - mid
- eor v9.16b, v9.16b, v20.16b // GHASH final-2 block - high
- eor x6, x6, x13 // AES final-1 block - round N low
- eor v10.16b, v10.16b, v22.16b // GHASH final-2 block - mid
- eor x7, x7, x14 // AES final-1 block - round N high
-Ldec_blocks_more_than_1: // blocks left > 1
- stp x6, x7, [x2], #16 // AES final-1 block - store result
- rev64 v4.16b, v5.16b // GHASH final-1 block
- ld1 { v5.16b}, [x0], #16 // AES final block - load ciphertext
- eor v4.16b, v4.16b, v8.16b // feed in partial tag
- movi v8.8b, #0 // suppress further partial tag feed in
- mov d22, v4.d[1] // GHASH final-1 block - mid
- eor v0.16b, v5.16b, v3.16b // AES final block - result
- pmull2 v20.1q, v4.2d, v13.2d // GHASH final-1 block - high
- eor v22.8b, v22.8b, v4.8b // GHASH final-1 block - mid
- pmull v21.1q, v4.1d, v13.1d // GHASH final-1 block - low
- mov x6, v0.d[0] // AES final block - mov low
- ins v22.d[1], v22.d[0] // GHASH final-1 block - mid
- mov x7, v0.d[1] // AES final block - mov high
- pmull2 v22.1q, v22.2d, v16.2d // GHASH final-1 block - mid
- eor x6, x6, x13 // AES final block - round N low
- eor v11.16b, v11.16b, v21.16b // GHASH final-1 block - low
- eor v9.16b, v9.16b, v20.16b // GHASH final-1 block - high
- eor v10.16b, v10.16b, v22.16b // GHASH final-1 block - mid
- eor x7, x7, x14 // AES final block - round N high
-Ldec_blocks_less_than_1: // blocks left <= 1
- and x1, x1, #127 // bit_length %= 128
- mvn x14, xzr // rkN_h = 0xffffffffffffffff
- sub x1, x1, #128 // bit_length -= 128
- mvn x13, xzr // rkN_l = 0xffffffffffffffff
- ldp x4, x5, [x2] // load existing bytes we need to not overwrite
- neg x1, x1 // bit_length = 128 - #bits in input (in range [1,128])
- and x1, x1, #127 // bit_length %= 128
- lsr x14, x14, x1 // rkN_h is mask for top 64b of last block
- cmp x1, #64
- csel x9, x13, x14, lt
- csel x10, x14, xzr, lt
- fmov d0, x9 // ctr0b is mask for last block
- and x6, x6, x9
- mov v0.d[1], x10
- bic x4, x4, x9 // mask out low existing bytes
- rev w9, w12
- bic x5, x5, x10 // mask out high existing bytes
- orr x6, x6, x4
- and x7, x7, x10
- orr x7, x7, x5
- and v5.16b, v5.16b, v0.16b // possibly partial last block has zeroes in highest bits
- rev64 v4.16b, v5.16b // GHASH final block
- eor v4.16b, v4.16b, v8.16b // feed in partial tag
- pmull v21.1q, v4.1d, v12.1d // GHASH final block - low
- mov d8, v4.d[1] // GHASH final block - mid
- eor v8.8b, v8.8b, v4.8b // GHASH final block - mid
- pmull2 v20.1q, v4.2d, v12.2d // GHASH final block - high
- pmull v8.1q, v8.1d, v16.1d // GHASH final block - mid
- eor v9.16b, v9.16b, v20.16b // GHASH final block - high
- eor v11.16b, v11.16b, v21.16b // GHASH final block - low
- eor v10.16b, v10.16b, v8.16b // GHASH final block - mid
- movi v8.8b, #0xc2
- eor v6.16b, v11.16b, v9.16b // MODULO - karatsuba tidy up
- shl d8, d8, #56 // mod_constant
- eor v10.16b, v10.16b, v6.16b // MODULO - karatsuba tidy up
- pmull v7.1q, v9.1d, v8.1d // MODULO - top 64b align with mid
- ext v9.16b, v9.16b, v9.16b, #8 // MODULO - other top alignment
- eor v10.16b, v10.16b, v7.16b // MODULO - fold into mid
- eor v10.16b, v10.16b, v9.16b // MODULO - fold into mid
- pmull v8.1q, v10.1d, v8.1d // MODULO - mid 64b align with low
- ext v10.16b, v10.16b, v10.16b, #8 // MODULO - other mid alignment
- eor v11.16b, v11.16b, v8.16b // MODULO - fold into low
- stp x6, x7, [x2]
- str w9, [x16, #12] // store the updated counter
- eor v11.16b, v11.16b, v10.16b // MODULO - fold into low
- ext v11.16b, v11.16b, v11.16b, #8
- rev64 v11.16b, v11.16b
- mov x0, x15
- st1 { v11.16b }, [x3]
- ldp x19, x20, [sp, #16]
- ldp x21, x22, [sp, #32]
- ldp x23, x24, [sp, #48]
- ldp d8, d9, [sp, #64]
- ldp d10, d11, [sp, #80]
- ldp d12, d13, [sp, #96]
- ldp d14, d15, [sp, #112]
- ldp x29, x30, [sp], #128
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-#endif
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
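
The tail of the deleted aes_gcm_enc_kernel/aes_gcm_dec_kernel above builds two 64-bit masks from the total input bit length so that bytes past the end of a partial final block are zeroed before the last GHASH update and merged with the bytes already present at the destination. A minimal C sketch of that mask derivation, assuming it mirrors the and/sub/neg/and, lsr, cmp, csel sequence in the deleted tail code (the helper name is hypothetical, not a BoringSSL API):

    #include <stdint.h>

    // Masks for the low and high 64-bit halves of the final 16-byte block,
    // given the total plaintext/ciphertext length in bits. AArch64 LSRV
    // shifts modulo the register width, hence the "& 63" below.
    static void last_block_masks(uint64_t bit_length, uint64_t *lo, uint64_t *hi) {
      uint64_t missing = (128 - (bit_length & 127)) & 127;  // bits absent from the last block
      uint64_t m = ~UINT64_C(0) >> (missing & 63);
      if (missing < 64) {   // low half fully valid, high half partially valid
        *lo = ~UINT64_C(0);
        *hi = m;
      } else {              // high half entirely invalid
        *lo = m;
        *hi = 0;
      }
    }

With missing == 0 (a full final block) both masks come out all-ones, matching the csel fall-through in the assembly.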
diff --git a/win-aarch64/crypto/fipsmodule/armv8-mont-win.S b/win-aarch64/crypto/fipsmodule/armv8-mont-win.S
deleted file mode 100644
index dcce02c9..00000000
--- a/win-aarch64/crypto/fipsmodule/armv8-mont-win.S
+++ /dev/null
@@ -1,1431 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-
-.text
-
-.globl bn_mul_mont
-
-.def bn_mul_mont
- .type 32
-.endef
-.align 5
-bn_mul_mont:
- AARCH64_SIGN_LINK_REGISTER
- tst x5,#7
- b.eq __bn_sqr8x_mont
- tst x5,#3
- b.eq __bn_mul4x_mont
-Lmul_mont:
- stp x29,x30,[sp,#-64]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
-
- ldr x9,[x2],#8 // bp[0]
- sub x22,sp,x5,lsl#3
- ldp x7,x8,[x1],#16 // ap[0..1]
- lsl x5,x5,#3
- ldr x4,[x4] // *n0
- and x22,x22,#-16 // ABI says so
- ldp x13,x14,[x3],#16 // np[0..1]
-
- mul x6,x7,x9 // ap[0]*bp[0]
- sub x21,x5,#16 // j=num-2
- umulh x7,x7,x9
- mul x10,x8,x9 // ap[1]*bp[0]
- umulh x11,x8,x9
-
- mul x15,x6,x4 // "tp[0]"*n0
- mov sp,x22 // alloca
-
- // (*) mul x12,x13,x15 // np[0]*m1
- umulh x13,x13,x15
- mul x16,x14,x15 // np[1]*m1
- // (*) adds x12,x12,x6 // discarded
-	// (*)	On the removal of the first multiplication and addition
-	//	instructions: the result of the first addition is
-	//	guaranteed to be zero, which leaves two computationally
-	//	significant outcomes: it either carries or it does not.
-	//	When does it carry? Following the operations shows that
-	//	the condition for a carry is simply x6 being non-zero,
-	//	so the carry can be computed by adding -1 to x6. That is
-	//	what the next instruction does.
- subs xzr,x6,#1 // (*)
- umulh x17,x14,x15
- adc x13,x13,xzr
- cbz x21,L1st_skip
-
-L1st:
- ldr x8,[x1],#8
- adds x6,x10,x7
- sub x21,x21,#8 // j--
- adc x7,x11,xzr
-
- ldr x14,[x3],#8
- adds x12,x16,x13
- mul x10,x8,x9 // ap[j]*bp[0]
- adc x13,x17,xzr
- umulh x11,x8,x9
-
- adds x12,x12,x6
- mul x16,x14,x15 // np[j]*m1
- adc x13,x13,xzr
- umulh x17,x14,x15
- str x12,[x22],#8 // tp[j-1]
- cbnz x21,L1st
-
-L1st_skip:
- adds x6,x10,x7
- sub x1,x1,x5 // rewind x1
- adc x7,x11,xzr
-
- adds x12,x16,x13
- sub x3,x3,x5 // rewind x3
- adc x13,x17,xzr
-
- adds x12,x12,x6
- sub x20,x5,#8 // i=num-1
- adcs x13,x13,x7
-
- adc x19,xzr,xzr // upmost overflow bit
- stp x12,x13,[x22]
-
-Louter:
- ldr x9,[x2],#8 // bp[i]
- ldp x7,x8,[x1],#16
- ldr x23,[sp] // tp[0]
- add x22,sp,#8
-
- mul x6,x7,x9 // ap[0]*bp[i]
- sub x21,x5,#16 // j=num-2
- umulh x7,x7,x9
- ldp x13,x14,[x3],#16
- mul x10,x8,x9 // ap[1]*bp[i]
- adds x6,x6,x23
- umulh x11,x8,x9
- adc x7,x7,xzr
-
- mul x15,x6,x4
- sub x20,x20,#8 // i--
-
- // (*) mul x12,x13,x15 // np[0]*m1
- umulh x13,x13,x15
- mul x16,x14,x15 // np[1]*m1
- // (*) adds x12,x12,x6
- subs xzr,x6,#1 // (*)
- umulh x17,x14,x15
- cbz x21,Linner_skip
-
-Linner:
- ldr x8,[x1],#8
- adc x13,x13,xzr
- ldr x23,[x22],#8 // tp[j]
- adds x6,x10,x7
- sub x21,x21,#8 // j--
- adc x7,x11,xzr
-
- adds x12,x16,x13
- ldr x14,[x3],#8
- adc x13,x17,xzr
-
- mul x10,x8,x9 // ap[j]*bp[i]
- adds x6,x6,x23
- umulh x11,x8,x9
- adc x7,x7,xzr
-
- mul x16,x14,x15 // np[j]*m1
- adds x12,x12,x6
- umulh x17,x14,x15
- str x12,[x22,#-16] // tp[j-1]
- cbnz x21,Linner
-
-Linner_skip:
- ldr x23,[x22],#8 // tp[j]
- adc x13,x13,xzr
- adds x6,x10,x7
- sub x1,x1,x5 // rewind x1
- adc x7,x11,xzr
-
- adds x12,x16,x13
- sub x3,x3,x5 // rewind x3
- adcs x13,x17,x19
- adc x19,xzr,xzr
-
- adds x6,x6,x23
- adc x7,x7,xzr
-
- adds x12,x12,x6
- adcs x13,x13,x7
- adc x19,x19,xzr // upmost overflow bit
- stp x12,x13,[x22,#-16]
-
- cbnz x20,Louter
-
-	// Final step. We check whether the result is larger than the
-	// modulus and, if it is, subtract the modulus. Since comparison
-	// implies subtraction, we subtract the modulus, check whether it
-	// borrowed, and conditionally copy the original value.
- ldr x23,[sp] // tp[0]
- add x22,sp,#8
- ldr x14,[x3],#8 // np[0]
- subs x21,x5,#8 // j=num-1 and clear borrow
- mov x1,x0
-Lsub:
- sbcs x8,x23,x14 // tp[j]-np[j]
- ldr x23,[x22],#8
- sub x21,x21,#8 // j--
- ldr x14,[x3],#8
- str x8,[x1],#8 // rp[j]=tp[j]-np[j]
- cbnz x21,Lsub
-
- sbcs x8,x23,x14
- sbcs x19,x19,xzr // did it borrow?
- str x8,[x1],#8 // rp[num-1]
-
- ldr x23,[sp] // tp[0]
- add x22,sp,#8
- ldr x8,[x0],#8 // rp[0]
- sub x5,x5,#8 // num--
- nop
-Lcond_copy:
- sub x5,x5,#8 // num--
- csel x14,x23,x8,lo // did it borrow?
- ldr x23,[x22],#8
- ldr x8,[x0],#8
- str xzr,[x22,#-16] // wipe tp
- str x14,[x0,#-16]
- cbnz x5,Lcond_copy
-
- csel x14,x23,x8,lo
- str xzr,[x22,#-8] // wipe tp
- str x14,[x0,#-8]
-
- ldp x19,x20,[x29,#16]
- mov sp,x29
- ldp x21,x22,[x29,#32]
- mov x0,#1
- ldp x23,x24,[x29,#48]
- ldr x29,[sp],#64
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-.def __bn_sqr8x_mont
- .type 32
-.endef
-.align 5
-__bn_sqr8x_mont:
- // Not adding AARCH64_SIGN_LINK_REGISTER here because __bn_sqr8x_mont is jumped to
- // only from bn_mul_mont which has already signed the return address.
- cmp x1,x2
- b.ne __bn_mul4x_mont
-Lsqr8x_mont:
- stp x29,x30,[sp,#-128]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- stp x27,x28,[sp,#80]
- stp x0,x3,[sp,#96] // offload rp and np
-
- ldp x6,x7,[x1,#8*0]
- ldp x8,x9,[x1,#8*2]
- ldp x10,x11,[x1,#8*4]
- ldp x12,x13,[x1,#8*6]
-
- sub x2,sp,x5,lsl#4
- lsl x5,x5,#3
- ldr x4,[x4] // *n0
- mov sp,x2 // alloca
- sub x27,x5,#8*8
- b Lsqr8x_zero_start
-
-Lsqr8x_zero:
- sub x27,x27,#8*8
- stp xzr,xzr,[x2,#8*0]
- stp xzr,xzr,[x2,#8*2]
- stp xzr,xzr,[x2,#8*4]
- stp xzr,xzr,[x2,#8*6]
-Lsqr8x_zero_start:
- stp xzr,xzr,[x2,#8*8]
- stp xzr,xzr,[x2,#8*10]
- stp xzr,xzr,[x2,#8*12]
- stp xzr,xzr,[x2,#8*14]
- add x2,x2,#8*16
- cbnz x27,Lsqr8x_zero
-
- add x3,x1,x5
- add x1,x1,#8*8
- mov x19,xzr
- mov x20,xzr
- mov x21,xzr
- mov x22,xzr
- mov x23,xzr
- mov x24,xzr
- mov x25,xzr
- mov x26,xzr
- mov x2,sp
- str x4,[x29,#112] // offload n0
-
- // Multiply everything but a[i]*a[i]
-.align 4
-Lsqr8x_outer_loop:
- // a[1]a[0] (i)
- // a[2]a[0]
- // a[3]a[0]
- // a[4]a[0]
- // a[5]a[0]
- // a[6]a[0]
- // a[7]a[0]
- // a[2]a[1] (ii)
- // a[3]a[1]
- // a[4]a[1]
- // a[5]a[1]
- // a[6]a[1]
- // a[7]a[1]
- // a[3]a[2] (iii)
- // a[4]a[2]
- // a[5]a[2]
- // a[6]a[2]
- // a[7]a[2]
- // a[4]a[3] (iv)
- // a[5]a[3]
- // a[6]a[3]
- // a[7]a[3]
- // a[5]a[4] (v)
- // a[6]a[4]
- // a[7]a[4]
- // a[6]a[5] (vi)
- // a[7]a[5]
- // a[7]a[6] (vii)
-
- mul x14,x7,x6 // lo(a[1..7]*a[0]) (i)
- mul x15,x8,x6
- mul x16,x9,x6
- mul x17,x10,x6
- adds x20,x20,x14 // t[1]+lo(a[1]*a[0])
- mul x14,x11,x6
- adcs x21,x21,x15
- mul x15,x12,x6
- adcs x22,x22,x16
- mul x16,x13,x6
- adcs x23,x23,x17
- umulh x17,x7,x6 // hi(a[1..7]*a[0])
- adcs x24,x24,x14
- umulh x14,x8,x6
- adcs x25,x25,x15
- umulh x15,x9,x6
- adcs x26,x26,x16
- umulh x16,x10,x6
- stp x19,x20,[x2],#8*2 // t[0..1]
- adc x19,xzr,xzr // t[8]
- adds x21,x21,x17 // t[2]+lo(a[1]*a[0])
- umulh x17,x11,x6
- adcs x22,x22,x14
- umulh x14,x12,x6
- adcs x23,x23,x15
- umulh x15,x13,x6
- adcs x24,x24,x16
- mul x16,x8,x7 // lo(a[2..7]*a[1]) (ii)
- adcs x25,x25,x17
- mul x17,x9,x7
- adcs x26,x26,x14
- mul x14,x10,x7
- adc x19,x19,x15
-
- mul x15,x11,x7
- adds x22,x22,x16
- mul x16,x12,x7
- adcs x23,x23,x17
- mul x17,x13,x7
- adcs x24,x24,x14
- umulh x14,x8,x7 // hi(a[2..7]*a[1])
- adcs x25,x25,x15
- umulh x15,x9,x7
- adcs x26,x26,x16
- umulh x16,x10,x7
- adcs x19,x19,x17
- umulh x17,x11,x7
- stp x21,x22,[x2],#8*2 // t[2..3]
- adc x20,xzr,xzr // t[9]
- adds x23,x23,x14
- umulh x14,x12,x7
- adcs x24,x24,x15
- umulh x15,x13,x7
- adcs x25,x25,x16
- mul x16,x9,x8 // lo(a[3..7]*a[2]) (iii)
- adcs x26,x26,x17
- mul x17,x10,x8
- adcs x19,x19,x14
- mul x14,x11,x8
- adc x20,x20,x15
-
- mul x15,x12,x8
- adds x24,x24,x16
- mul x16,x13,x8
- adcs x25,x25,x17
- umulh x17,x9,x8 // hi(a[3..7]*a[2])
- adcs x26,x26,x14
- umulh x14,x10,x8
- adcs x19,x19,x15
- umulh x15,x11,x8
- adcs x20,x20,x16
- umulh x16,x12,x8
- stp x23,x24,[x2],#8*2 // t[4..5]
- adc x21,xzr,xzr // t[10]
- adds x25,x25,x17
- umulh x17,x13,x8
- adcs x26,x26,x14
- mul x14,x10,x9 // lo(a[4..7]*a[3]) (iv)
- adcs x19,x19,x15
- mul x15,x11,x9
- adcs x20,x20,x16
- mul x16,x12,x9
- adc x21,x21,x17
-
- mul x17,x13,x9
- adds x26,x26,x14
- umulh x14,x10,x9 // hi(a[4..7]*a[3])
- adcs x19,x19,x15
- umulh x15,x11,x9
- adcs x20,x20,x16
- umulh x16,x12,x9
- adcs x21,x21,x17
- umulh x17,x13,x9
- stp x25,x26,[x2],#8*2 // t[6..7]
- adc x22,xzr,xzr // t[11]
- adds x19,x19,x14
- mul x14,x11,x10 // lo(a[5..7]*a[4]) (v)
- adcs x20,x20,x15
- mul x15,x12,x10
- adcs x21,x21,x16
- mul x16,x13,x10
- adc x22,x22,x17
-
- umulh x17,x11,x10 // hi(a[5..7]*a[4])
- adds x20,x20,x14
- umulh x14,x12,x10
- adcs x21,x21,x15
- umulh x15,x13,x10
- adcs x22,x22,x16
- mul x16,x12,x11 // lo(a[6..7]*a[5]) (vi)
- adc x23,xzr,xzr // t[12]
- adds x21,x21,x17
- mul x17,x13,x11
- adcs x22,x22,x14
- umulh x14,x12,x11 // hi(a[6..7]*a[5])
- adc x23,x23,x15
-
- umulh x15,x13,x11
- adds x22,x22,x16
- mul x16,x13,x12 // lo(a[7]*a[6]) (vii)
- adcs x23,x23,x17
- umulh x17,x13,x12 // hi(a[7]*a[6])
- adc x24,xzr,xzr // t[13]
- adds x23,x23,x14
- sub x27,x3,x1 // done yet?
- adc x24,x24,x15
-
- adds x24,x24,x16
- sub x14,x3,x5 // rewinded ap
- adc x25,xzr,xzr // t[14]
- add x25,x25,x17
-
- cbz x27,Lsqr8x_outer_break
-
- mov x4,x6
- ldp x6,x7,[x2,#8*0]
- ldp x8,x9,[x2,#8*2]
- ldp x10,x11,[x2,#8*4]
- ldp x12,x13,[x2,#8*6]
- adds x19,x19,x6
- adcs x20,x20,x7
- ldp x6,x7,[x1,#8*0]
- adcs x21,x21,x8
- adcs x22,x22,x9
- ldp x8,x9,[x1,#8*2]
- adcs x23,x23,x10
- adcs x24,x24,x11
- ldp x10,x11,[x1,#8*4]
- adcs x25,x25,x12
- mov x0,x1
- adcs x26,xzr,x13
- ldp x12,x13,[x1,#8*6]
- add x1,x1,#8*8
- //adc x28,xzr,xzr // moved below
- mov x27,#-8*8
-
- // a[8]a[0]
- // a[9]a[0]
- // a[a]a[0]
- // a[b]a[0]
- // a[c]a[0]
- // a[d]a[0]
- // a[e]a[0]
- // a[f]a[0]
- // a[8]a[1]
- // a[f]a[1]........................
- // a[8]a[2]
- // a[f]a[2]........................
- // a[8]a[3]
- // a[f]a[3]........................
- // a[8]a[4]
- // a[f]a[4]........................
- // a[8]a[5]
- // a[f]a[5]........................
- // a[8]a[6]
- // a[f]a[6]........................
- // a[8]a[7]
- // a[f]a[7]........................
-Lsqr8x_mul:
- mul x14,x6,x4
- adc x28,xzr,xzr // carry bit, modulo-scheduled
- mul x15,x7,x4
- add x27,x27,#8
- mul x16,x8,x4
- mul x17,x9,x4
- adds x19,x19,x14
- mul x14,x10,x4
- adcs x20,x20,x15
- mul x15,x11,x4
- adcs x21,x21,x16
- mul x16,x12,x4
- adcs x22,x22,x17
- mul x17,x13,x4
- adcs x23,x23,x14
- umulh x14,x6,x4
- adcs x24,x24,x15
- umulh x15,x7,x4
- adcs x25,x25,x16
- umulh x16,x8,x4
- adcs x26,x26,x17
- umulh x17,x9,x4
- adc x28,x28,xzr
- str x19,[x2],#8
- adds x19,x20,x14
- umulh x14,x10,x4
- adcs x20,x21,x15
- umulh x15,x11,x4
- adcs x21,x22,x16
- umulh x16,x12,x4
- adcs x22,x23,x17
- umulh x17,x13,x4
- ldr x4,[x0,x27]
- adcs x23,x24,x14
- adcs x24,x25,x15
- adcs x25,x26,x16
- adcs x26,x28,x17
- //adc x28,xzr,xzr // moved above
- cbnz x27,Lsqr8x_mul
- // note that carry flag is guaranteed
- // to be zero at this point
- cmp x1,x3 // done yet?
- b.eq Lsqr8x_break
-
- ldp x6,x7,[x2,#8*0]
- ldp x8,x9,[x2,#8*2]
- ldp x10,x11,[x2,#8*4]
- ldp x12,x13,[x2,#8*6]
- adds x19,x19,x6
- ldr x4,[x0,#-8*8]
- adcs x20,x20,x7
- ldp x6,x7,[x1,#8*0]
- adcs x21,x21,x8
- adcs x22,x22,x9
- ldp x8,x9,[x1,#8*2]
- adcs x23,x23,x10
- adcs x24,x24,x11
- ldp x10,x11,[x1,#8*4]
- adcs x25,x25,x12
- mov x27,#-8*8
- adcs x26,x26,x13
- ldp x12,x13,[x1,#8*6]
- add x1,x1,#8*8
- //adc x28,xzr,xzr // moved above
- b Lsqr8x_mul
-
-.align 4
-Lsqr8x_break:
- ldp x6,x7,[x0,#8*0]
- add x1,x0,#8*8
- ldp x8,x9,[x0,#8*2]
- sub x14,x3,x1 // is it last iteration?
- ldp x10,x11,[x0,#8*4]
- sub x15,x2,x14
- ldp x12,x13,[x0,#8*6]
- cbz x14,Lsqr8x_outer_loop
-
- stp x19,x20,[x2,#8*0]
- ldp x19,x20,[x15,#8*0]
- stp x21,x22,[x2,#8*2]
- ldp x21,x22,[x15,#8*2]
- stp x23,x24,[x2,#8*4]
- ldp x23,x24,[x15,#8*4]
- stp x25,x26,[x2,#8*6]
- mov x2,x15
- ldp x25,x26,[x15,#8*6]
- b Lsqr8x_outer_loop
-
-.align 4
-Lsqr8x_outer_break:
- // Now multiply above result by 2 and add a[n-1]*a[n-1]|...|a[0]*a[0]
- ldp x7,x9,[x14,#8*0] // recall that x14 is &a[0]
- ldp x15,x16,[sp,#8*1]
- ldp x11,x13,[x14,#8*2]
- add x1,x14,#8*4
- ldp x17,x14,[sp,#8*3]
-
- stp x19,x20,[x2,#8*0]
- mul x19,x7,x7
- stp x21,x22,[x2,#8*2]
- umulh x7,x7,x7
- stp x23,x24,[x2,#8*4]
- mul x8,x9,x9
- stp x25,x26,[x2,#8*6]
- mov x2,sp
- umulh x9,x9,x9
- adds x20,x7,x15,lsl#1
- extr x15,x16,x15,#63
- sub x27,x5,#8*4
-
-Lsqr4x_shift_n_add:
- adcs x21,x8,x15
- extr x16,x17,x16,#63
- sub x27,x27,#8*4
- adcs x22,x9,x16
- ldp x15,x16,[x2,#8*5]
- mul x10,x11,x11
- ldp x7,x9,[x1],#8*2
- umulh x11,x11,x11
- mul x12,x13,x13
- umulh x13,x13,x13
- extr x17,x14,x17,#63
- stp x19,x20,[x2,#8*0]
- adcs x23,x10,x17
- extr x14,x15,x14,#63
- stp x21,x22,[x2,#8*2]
- adcs x24,x11,x14
- ldp x17,x14,[x2,#8*7]
- extr x15,x16,x15,#63
- adcs x25,x12,x15
- extr x16,x17,x16,#63
- adcs x26,x13,x16
- ldp x15,x16,[x2,#8*9]
- mul x6,x7,x7
- ldp x11,x13,[x1],#8*2
- umulh x7,x7,x7
- mul x8,x9,x9
- umulh x9,x9,x9
- stp x23,x24,[x2,#8*4]
- extr x17,x14,x17,#63
- stp x25,x26,[x2,#8*6]
- add x2,x2,#8*8
- adcs x19,x6,x17
- extr x14,x15,x14,#63
- adcs x20,x7,x14
- ldp x17,x14,[x2,#8*3]
- extr x15,x16,x15,#63
- cbnz x27,Lsqr4x_shift_n_add
- ldp x1,x4,[x29,#104] // pull np and n0
-
- adcs x21,x8,x15
- extr x16,x17,x16,#63
- adcs x22,x9,x16
- ldp x15,x16,[x2,#8*5]
- mul x10,x11,x11
- umulh x11,x11,x11
- stp x19,x20,[x2,#8*0]
- mul x12,x13,x13
- umulh x13,x13,x13
- stp x21,x22,[x2,#8*2]
- extr x17,x14,x17,#63
- adcs x23,x10,x17
- extr x14,x15,x14,#63
- ldp x19,x20,[sp,#8*0]
- adcs x24,x11,x14
- extr x15,x16,x15,#63
- ldp x6,x7,[x1,#8*0]
- adcs x25,x12,x15
- extr x16,xzr,x16,#63
- ldp x8,x9,[x1,#8*2]
- adc x26,x13,x16
- ldp x10,x11,[x1,#8*4]
-
- // Reduce by 512 bits per iteration
- mul x28,x4,x19 // t[0]*n0
- ldp x12,x13,[x1,#8*6]
- add x3,x1,x5
- ldp x21,x22,[sp,#8*2]
- stp x23,x24,[x2,#8*4]
- ldp x23,x24,[sp,#8*4]
- stp x25,x26,[x2,#8*6]
- ldp x25,x26,[sp,#8*6]
- add x1,x1,#8*8
- mov x30,xzr // initial top-most carry
- mov x2,sp
- mov x27,#8
-
-Lsqr8x_reduction:
- // (*) mul x14,x6,x28 // lo(n[0-7])*lo(t[0]*n0)
- mul x15,x7,x28
- sub x27,x27,#1
- mul x16,x8,x28
- str x28,[x2],#8 // put aside t[0]*n0 for tail processing
- mul x17,x9,x28
- // (*) adds xzr,x19,x14
- subs xzr,x19,#1 // (*)
- mul x14,x10,x28
- adcs x19,x20,x15
- mul x15,x11,x28
- adcs x20,x21,x16
- mul x16,x12,x28
- adcs x21,x22,x17
- mul x17,x13,x28
- adcs x22,x23,x14
- umulh x14,x6,x28 // hi(n[0-7])*lo(t[0]*n0)
- adcs x23,x24,x15
- umulh x15,x7,x28
- adcs x24,x25,x16
- umulh x16,x8,x28
- adcs x25,x26,x17
- umulh x17,x9,x28
- adc x26,xzr,xzr
- adds x19,x19,x14
- umulh x14,x10,x28
- adcs x20,x20,x15
- umulh x15,x11,x28
- adcs x21,x21,x16
- umulh x16,x12,x28
- adcs x22,x22,x17
- umulh x17,x13,x28
- mul x28,x4,x19 // next t[0]*n0
- adcs x23,x23,x14
- adcs x24,x24,x15
- adcs x25,x25,x16
- adc x26,x26,x17
- cbnz x27,Lsqr8x_reduction
-
- ldp x14,x15,[x2,#8*0]
- ldp x16,x17,[x2,#8*2]
- mov x0,x2
- sub x27,x3,x1 // done yet?
- adds x19,x19,x14
- adcs x20,x20,x15
- ldp x14,x15,[x2,#8*4]
- adcs x21,x21,x16
- adcs x22,x22,x17
- ldp x16,x17,[x2,#8*6]
- adcs x23,x23,x14
- adcs x24,x24,x15
- adcs x25,x25,x16
- adcs x26,x26,x17
- //adc x28,xzr,xzr // moved below
- cbz x27,Lsqr8x8_post_condition
-
- ldr x4,[x2,#-8*8]
- ldp x6,x7,[x1,#8*0]
- ldp x8,x9,[x1,#8*2]
- ldp x10,x11,[x1,#8*4]
- mov x27,#-8*8
- ldp x12,x13,[x1,#8*6]
- add x1,x1,#8*8
-
-Lsqr8x_tail:
- mul x14,x6,x4
- adc x28,xzr,xzr // carry bit, modulo-scheduled
- mul x15,x7,x4
- add x27,x27,#8
- mul x16,x8,x4
- mul x17,x9,x4
- adds x19,x19,x14
- mul x14,x10,x4
- adcs x20,x20,x15
- mul x15,x11,x4
- adcs x21,x21,x16
- mul x16,x12,x4
- adcs x22,x22,x17
- mul x17,x13,x4
- adcs x23,x23,x14
- umulh x14,x6,x4
- adcs x24,x24,x15
- umulh x15,x7,x4
- adcs x25,x25,x16
- umulh x16,x8,x4
- adcs x26,x26,x17
- umulh x17,x9,x4
- adc x28,x28,xzr
- str x19,[x2],#8
- adds x19,x20,x14
- umulh x14,x10,x4
- adcs x20,x21,x15
- umulh x15,x11,x4
- adcs x21,x22,x16
- umulh x16,x12,x4
- adcs x22,x23,x17
- umulh x17,x13,x4
- ldr x4,[x0,x27]
- adcs x23,x24,x14
- adcs x24,x25,x15
- adcs x25,x26,x16
- adcs x26,x28,x17
- //adc x28,xzr,xzr // moved above
- cbnz x27,Lsqr8x_tail
- // note that carry flag is guaranteed
- // to be zero at this point
- ldp x6,x7,[x2,#8*0]
- sub x27,x3,x1 // done yet?
- sub x16,x3,x5 // rewinded np
- ldp x8,x9,[x2,#8*2]
- ldp x10,x11,[x2,#8*4]
- ldp x12,x13,[x2,#8*6]
- cbz x27,Lsqr8x_tail_break
-
- ldr x4,[x0,#-8*8]
- adds x19,x19,x6
- adcs x20,x20,x7
- ldp x6,x7,[x1,#8*0]
- adcs x21,x21,x8
- adcs x22,x22,x9
- ldp x8,x9,[x1,#8*2]
- adcs x23,x23,x10
- adcs x24,x24,x11
- ldp x10,x11,[x1,#8*4]
- adcs x25,x25,x12
- mov x27,#-8*8
- adcs x26,x26,x13
- ldp x12,x13,[x1,#8*6]
- add x1,x1,#8*8
- //adc x28,xzr,xzr // moved above
- b Lsqr8x_tail
-
-.align 4
-Lsqr8x_tail_break:
- ldr x4,[x29,#112] // pull n0
- add x27,x2,#8*8 // end of current t[num] window
-
- subs xzr,x30,#1 // "move" top-most carry to carry bit
- adcs x14,x19,x6
- adcs x15,x20,x7
- ldp x19,x20,[x0,#8*0]
- adcs x21,x21,x8
- ldp x6,x7,[x16,#8*0] // recall that x16 is &n[0]
- adcs x22,x22,x9
- ldp x8,x9,[x16,#8*2]
- adcs x23,x23,x10
- adcs x24,x24,x11
- ldp x10,x11,[x16,#8*4]
- adcs x25,x25,x12
- adcs x26,x26,x13
- ldp x12,x13,[x16,#8*6]
- add x1,x16,#8*8
- adc x30,xzr,xzr // top-most carry
- mul x28,x4,x19
- stp x14,x15,[x2,#8*0]
- stp x21,x22,[x2,#8*2]
- ldp x21,x22,[x0,#8*2]
- stp x23,x24,[x2,#8*4]
- ldp x23,x24,[x0,#8*4]
- cmp x27,x29 // did we hit the bottom?
- stp x25,x26,[x2,#8*6]
- mov x2,x0 // slide the window
- ldp x25,x26,[x0,#8*6]
- mov x27,#8
- b.ne Lsqr8x_reduction
-
- // Final step. We see if result is larger than modulus, and
- // if it is, subtract the modulus. But comparison implies
- // subtraction. So we subtract modulus, see if it borrowed,
- // and conditionally copy original value.
- ldr x0,[x29,#96] // pull rp
- add x2,x2,#8*8
- subs x14,x19,x6
- sbcs x15,x20,x7
- sub x27,x5,#8*8
- mov x3,x0 // x0 copy
-
-Lsqr8x_sub:
- sbcs x16,x21,x8
- ldp x6,x7,[x1,#8*0]
- sbcs x17,x22,x9
- stp x14,x15,[x0,#8*0]
- sbcs x14,x23,x10
- ldp x8,x9,[x1,#8*2]
- sbcs x15,x24,x11
- stp x16,x17,[x0,#8*2]
- sbcs x16,x25,x12
- ldp x10,x11,[x1,#8*4]
- sbcs x17,x26,x13
- ldp x12,x13,[x1,#8*6]
- add x1,x1,#8*8
- ldp x19,x20,[x2,#8*0]
- sub x27,x27,#8*8
- ldp x21,x22,[x2,#8*2]
- ldp x23,x24,[x2,#8*4]
- ldp x25,x26,[x2,#8*6]
- add x2,x2,#8*8
- stp x14,x15,[x0,#8*4]
- sbcs x14,x19,x6
- stp x16,x17,[x0,#8*6]
- add x0,x0,#8*8
- sbcs x15,x20,x7
- cbnz x27,Lsqr8x_sub
-
- sbcs x16,x21,x8
- mov x2,sp
- add x1,sp,x5
- ldp x6,x7,[x3,#8*0]
- sbcs x17,x22,x9
- stp x14,x15,[x0,#8*0]
- sbcs x14,x23,x10
- ldp x8,x9,[x3,#8*2]
- sbcs x15,x24,x11
- stp x16,x17,[x0,#8*2]
- sbcs x16,x25,x12
- ldp x19,x20,[x1,#8*0]
- sbcs x17,x26,x13
- ldp x21,x22,[x1,#8*2]
- sbcs xzr,x30,xzr // did it borrow?
- ldr x30,[x29,#8] // pull return address
- stp x14,x15,[x0,#8*4]
- stp x16,x17,[x0,#8*6]
-
- sub x27,x5,#8*4
-Lsqr4x_cond_copy:
- sub x27,x27,#8*4
- csel x14,x19,x6,lo
- stp xzr,xzr,[x2,#8*0]
- csel x15,x20,x7,lo
- ldp x6,x7,[x3,#8*4]
- ldp x19,x20,[x1,#8*4]
- csel x16,x21,x8,lo
- stp xzr,xzr,[x2,#8*2]
- add x2,x2,#8*4
- csel x17,x22,x9,lo
- ldp x8,x9,[x3,#8*6]
- ldp x21,x22,[x1,#8*6]
- add x1,x1,#8*4
- stp x14,x15,[x3,#8*0]
- stp x16,x17,[x3,#8*2]
- add x3,x3,#8*4
- stp xzr,xzr,[x1,#8*0]
- stp xzr,xzr,[x1,#8*2]
- cbnz x27,Lsqr4x_cond_copy
-
- csel x14,x19,x6,lo
- stp xzr,xzr,[x2,#8*0]
- csel x15,x20,x7,lo
- stp xzr,xzr,[x2,#8*2]
- csel x16,x21,x8,lo
- csel x17,x22,x9,lo
- stp x14,x15,[x3,#8*0]
- stp x16,x17,[x3,#8*2]
-
- b Lsqr8x_done
-
-.align 4
-Lsqr8x8_post_condition:
- adc x28,xzr,xzr
- ldr x30,[x29,#8] // pull return address
- // x19-7,x28 hold result, x6-7 hold modulus
- subs x6,x19,x6
- ldr x1,[x29,#96] // pull rp
- sbcs x7,x20,x7
- stp xzr,xzr,[sp,#8*0]
- sbcs x8,x21,x8
- stp xzr,xzr,[sp,#8*2]
- sbcs x9,x22,x9
- stp xzr,xzr,[sp,#8*4]
- sbcs x10,x23,x10
- stp xzr,xzr,[sp,#8*6]
- sbcs x11,x24,x11
- stp xzr,xzr,[sp,#8*8]
- sbcs x12,x25,x12
- stp xzr,xzr,[sp,#8*10]
- sbcs x13,x26,x13
- stp xzr,xzr,[sp,#8*12]
- sbcs x28,x28,xzr // did it borrow?
- stp xzr,xzr,[sp,#8*14]
-
- // x6-7 hold result-modulus
- csel x6,x19,x6,lo
- csel x7,x20,x7,lo
- csel x8,x21,x8,lo
- csel x9,x22,x9,lo
- stp x6,x7,[x1,#8*0]
- csel x10,x23,x10,lo
- csel x11,x24,x11,lo
- stp x8,x9,[x1,#8*2]
- csel x12,x25,x12,lo
- csel x13,x26,x13,lo
- stp x10,x11,[x1,#8*4]
- stp x12,x13,[x1,#8*6]
-
-Lsqr8x_done:
- ldp x19,x20,[x29,#16]
- mov sp,x29
- ldp x21,x22,[x29,#32]
- mov x0,#1
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- ldr x29,[sp],#128
- // x30 is popped earlier
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
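
Note: __bn_sqr8x_mont above first accumulates only the cross products a[i]*a[j] with i < j, then doubles that running total and adds the diagonal squares (the "multiply above result by 2 and add a[n-1]*a[n-1]|...|a[0]*a[0]" step). The identity it relies on, checked on a toy three-limb example with 8-bit limbs so everything fits in 64 bits (illustration only):

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        /* a = a[2]*2^16 + a[1]*2^8 + a[0], with toy 8-bit limbs */
        uint64_t a[3] = {0x7f, 0xa5, 0x3c};
        uint64_t n = (a[2] << 16) | (a[1] << 8) | a[0];

        uint64_t cross = 0, diag = 0;
        for (int i = 0; i < 3; i++) {
            diag += (a[i] * a[i]) << (16 * i);           /* a[i]^2 * 2^(2*8*i) */
            for (int j = i + 1; j < 3; j++)
                cross += (a[i] * a[j]) << (8 * (i + j)); /* a[i]*a[j] * 2^(8*(i+j)) */
        }
        assert(n * n == 2 * cross + diag);
        return 0;
    }
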
-.def __bn_mul4x_mont
- .type 32
-.endef
-.align 5
-__bn_mul4x_mont:
- // Not adding AARCH64_SIGN_LINK_REGISTER here because __bn_mul4x_mont is jumped to
- // only from bn_mul_mont or __bn_mul8x_mont which have already signed the
- // return address.
- stp x29,x30,[sp,#-128]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- stp x27,x28,[sp,#80]
-
- sub x26,sp,x5,lsl#3
- lsl x5,x5,#3
- ldr x4,[x4] // *n0
- sub sp,x26,#8*4 // alloca
-
- add x10,x2,x5
- add x27,x1,x5
- stp x0,x10,[x29,#96] // offload rp and &b[num]
-
- ldr x24,[x2,#8*0] // b[0]
- ldp x6,x7,[x1,#8*0] // a[0..3]
- ldp x8,x9,[x1,#8*2]
- add x1,x1,#8*4
- mov x19,xzr
- mov x20,xzr
- mov x21,xzr
- mov x22,xzr
- ldp x14,x15,[x3,#8*0] // n[0..3]
- ldp x16,x17,[x3,#8*2]
- adds x3,x3,#8*4 // clear carry bit
- mov x0,xzr
- mov x28,#0
- mov x26,sp
-
-Loop_mul4x_1st_reduction:
- mul x10,x6,x24 // lo(a[0..3]*b[0])
- adc x0,x0,xzr // modulo-scheduled
- mul x11,x7,x24
- add x28,x28,#8
- mul x12,x8,x24
- and x28,x28,#31
- mul x13,x9,x24
- adds x19,x19,x10
- umulh x10,x6,x24 // hi(a[0..3]*b[0])
- adcs x20,x20,x11
- mul x25,x19,x4 // t[0]*n0
- adcs x21,x21,x12
- umulh x11,x7,x24
- adcs x22,x22,x13
- umulh x12,x8,x24
- adc x23,xzr,xzr
- umulh x13,x9,x24
- ldr x24,[x2,x28] // next b[i] (or b[0])
- adds x20,x20,x10
- // (*) mul x10,x14,x25 // lo(n[0..3]*t[0]*n0)
- str x25,[x26],#8 // put aside t[0]*n0 for tail processing
- adcs x21,x21,x11
- mul x11,x15,x25
- adcs x22,x22,x12
- mul x12,x16,x25
- adc x23,x23,x13 // can't overflow
- mul x13,x17,x25
- // (*) adds xzr,x19,x10
- subs xzr,x19,#1 // (*)
- umulh x10,x14,x25 // hi(n[0..3]*t[0]*n0)
- adcs x19,x20,x11
- umulh x11,x15,x25
- adcs x20,x21,x12
- umulh x12,x16,x25
- adcs x21,x22,x13
- umulh x13,x17,x25
- adcs x22,x23,x0
- adc x0,xzr,xzr
- adds x19,x19,x10
- sub x10,x27,x1
- adcs x20,x20,x11
- adcs x21,x21,x12
- adcs x22,x22,x13
- //adc x0,x0,xzr
- cbnz x28,Loop_mul4x_1st_reduction
-
- cbz x10,Lmul4x4_post_condition
-
- ldp x6,x7,[x1,#8*0] // a[4..7]
- ldp x8,x9,[x1,#8*2]
- add x1,x1,#8*4
- ldr x25,[sp] // a[0]*n0
- ldp x14,x15,[x3,#8*0] // n[4..7]
- ldp x16,x17,[x3,#8*2]
- add x3,x3,#8*4
-
-Loop_mul4x_1st_tail:
- mul x10,x6,x24 // lo(a[4..7]*b[i])
- adc x0,x0,xzr // modulo-scheduled
- mul x11,x7,x24
- add x28,x28,#8
- mul x12,x8,x24
- and x28,x28,#31
- mul x13,x9,x24
- adds x19,x19,x10
- umulh x10,x6,x24 // hi(a[4..7]*b[i])
- adcs x20,x20,x11
- umulh x11,x7,x24
- adcs x21,x21,x12
- umulh x12,x8,x24
- adcs x22,x22,x13
- umulh x13,x9,x24
- adc x23,xzr,xzr
- ldr x24,[x2,x28] // next b[i] (or b[0])
- adds x20,x20,x10
- mul x10,x14,x25 // lo(n[4..7]*a[0]*n0)
- adcs x21,x21,x11
- mul x11,x15,x25
- adcs x22,x22,x12
- mul x12,x16,x25
- adc x23,x23,x13 // can't overflow
- mul x13,x17,x25
- adds x19,x19,x10
- umulh x10,x14,x25 // hi(n[4..7]*a[0]*n0)
- adcs x20,x20,x11
- umulh x11,x15,x25
- adcs x21,x21,x12
- umulh x12,x16,x25
- adcs x22,x22,x13
- adcs x23,x23,x0
- umulh x13,x17,x25
- adc x0,xzr,xzr
- ldr x25,[sp,x28] // next t[0]*n0
- str x19,[x26],#8 // result!!!
- adds x19,x20,x10
- sub x10,x27,x1 // done yet?
- adcs x20,x21,x11
- adcs x21,x22,x12
- adcs x22,x23,x13
- //adc x0,x0,xzr
- cbnz x28,Loop_mul4x_1st_tail
-
- sub x11,x27,x5 // rewinded x1
- cbz x10,Lmul4x_proceed
-
- ldp x6,x7,[x1,#8*0]
- ldp x8,x9,[x1,#8*2]
- add x1,x1,#8*4
- ldp x14,x15,[x3,#8*0]
- ldp x16,x17,[x3,#8*2]
- add x3,x3,#8*4
- b Loop_mul4x_1st_tail
-
-.align 5
-Lmul4x_proceed:
- ldr x24,[x2,#8*4]! // *++b
- adc x30,x0,xzr
- ldp x6,x7,[x11,#8*0] // a[0..3]
- sub x3,x3,x5 // rewind np
- ldp x8,x9,[x11,#8*2]
- add x1,x11,#8*4
-
- stp x19,x20,[x26,#8*0] // result!!!
- ldp x19,x20,[sp,#8*4] // t[0..3]
- stp x21,x22,[x26,#8*2] // result!!!
- ldp x21,x22,[sp,#8*6]
-
- ldp x14,x15,[x3,#8*0] // n[0..3]
- mov x26,sp
- ldp x16,x17,[x3,#8*2]
- adds x3,x3,#8*4 // clear carry bit
- mov x0,xzr
-
-.align 4
-Loop_mul4x_reduction:
- mul x10,x6,x24 // lo(a[0..3]*b[4])
- adc x0,x0,xzr // modulo-scheduled
- mul x11,x7,x24
- add x28,x28,#8
- mul x12,x8,x24
- and x28,x28,#31
- mul x13,x9,x24
- adds x19,x19,x10
- umulh x10,x6,x24 // hi(a[0..3]*b[4])
- adcs x20,x20,x11
- mul x25,x19,x4 // t[0]*n0
- adcs x21,x21,x12
- umulh x11,x7,x24
- adcs x22,x22,x13
- umulh x12,x8,x24
- adc x23,xzr,xzr
- umulh x13,x9,x24
- ldr x24,[x2,x28] // next b[i]
- adds x20,x20,x10
- // (*) mul x10,x14,x25
- str x25,[x26],#8 // put aside t[0]*n0 for tail processing
- adcs x21,x21,x11
- mul x11,x15,x25 // lo(n[0..3]*t[0]*n0
- adcs x22,x22,x12
- mul x12,x16,x25
- adc x23,x23,x13 // can't overflow
- mul x13,x17,x25
- // (*) adds xzr,x19,x10
- subs xzr,x19,#1 // (*)
- umulh x10,x14,x25 // hi(n[0..3]*t[0]*n0
- adcs x19,x20,x11
- umulh x11,x15,x25
- adcs x20,x21,x12
- umulh x12,x16,x25
- adcs x21,x22,x13
- umulh x13,x17,x25
- adcs x22,x23,x0
- adc x0,xzr,xzr
- adds x19,x19,x10
- adcs x20,x20,x11
- adcs x21,x21,x12
- adcs x22,x22,x13
- //adc x0,x0,xzr
- cbnz x28,Loop_mul4x_reduction
-
- adc x0,x0,xzr
- ldp x10,x11,[x26,#8*4] // t[4..7]
- ldp x12,x13,[x26,#8*6]
- ldp x6,x7,[x1,#8*0] // a[4..7]
- ldp x8,x9,[x1,#8*2]
- add x1,x1,#8*4
- adds x19,x19,x10
- adcs x20,x20,x11
- adcs x21,x21,x12
- adcs x22,x22,x13
- //adc x0,x0,xzr
-
- ldr x25,[sp] // t[0]*n0
- ldp x14,x15,[x3,#8*0] // n[4..7]
- ldp x16,x17,[x3,#8*2]
- add x3,x3,#8*4
-
-.align 4
-Loop_mul4x_tail:
- mul x10,x6,x24 // lo(a[4..7]*b[4])
- adc x0,x0,xzr // modulo-scheduled
- mul x11,x7,x24
- add x28,x28,#8
- mul x12,x8,x24
- and x28,x28,#31
- mul x13,x9,x24
- adds x19,x19,x10
- umulh x10,x6,x24 // hi(a[4..7]*b[4])
- adcs x20,x20,x11
- umulh x11,x7,x24
- adcs x21,x21,x12
- umulh x12,x8,x24
- adcs x22,x22,x13
- umulh x13,x9,x24
- adc x23,xzr,xzr
- ldr x24,[x2,x28] // next b[i]
- adds x20,x20,x10
- mul x10,x14,x25 // lo(n[4..7]*t[0]*n0)
- adcs x21,x21,x11
- mul x11,x15,x25
- adcs x22,x22,x12
- mul x12,x16,x25
- adc x23,x23,x13 // can't overflow
- mul x13,x17,x25
- adds x19,x19,x10
- umulh x10,x14,x25 // hi(n[4..7]*t[0]*n0)
- adcs x20,x20,x11
- umulh x11,x15,x25
- adcs x21,x21,x12
- umulh x12,x16,x25
- adcs x22,x22,x13
- umulh x13,x17,x25
- adcs x23,x23,x0
- ldr x25,[sp,x28] // next a[0]*n0
- adc x0,xzr,xzr
- str x19,[x26],#8 // result!!!
- adds x19,x20,x10
- sub x10,x27,x1 // done yet?
- adcs x20,x21,x11
- adcs x21,x22,x12
- adcs x22,x23,x13
- //adc x0,x0,xzr
- cbnz x28,Loop_mul4x_tail
-
- sub x11,x3,x5 // rewinded np?
- adc x0,x0,xzr
- cbz x10,Loop_mul4x_break
-
- ldp x10,x11,[x26,#8*4]
- ldp x12,x13,[x26,#8*6]
- ldp x6,x7,[x1,#8*0]
- ldp x8,x9,[x1,#8*2]
- add x1,x1,#8*4
- adds x19,x19,x10
- adcs x20,x20,x11
- adcs x21,x21,x12
- adcs x22,x22,x13
- //adc x0,x0,xzr
- ldp x14,x15,[x3,#8*0]
- ldp x16,x17,[x3,#8*2]
- add x3,x3,#8*4
- b Loop_mul4x_tail
-
-.align 4
-Loop_mul4x_break:
- ldp x12,x13,[x29,#96] // pull rp and &b[num]
- adds x19,x19,x30
- add x2,x2,#8*4 // bp++
- adcs x20,x20,xzr
- sub x1,x1,x5 // rewind ap
- adcs x21,x21,xzr
- stp x19,x20,[x26,#8*0] // result!!!
- adcs x22,x22,xzr
- ldp x19,x20,[sp,#8*4] // t[0..3]
- adc x30,x0,xzr
- stp x21,x22,[x26,#8*2] // result!!!
- cmp x2,x13 // done yet?
- ldp x21,x22,[sp,#8*6]
- ldp x14,x15,[x11,#8*0] // n[0..3]
- ldp x16,x17,[x11,#8*2]
- add x3,x11,#8*4
- b.eq Lmul4x_post
-
- ldr x24,[x2]
- ldp x6,x7,[x1,#8*0] // a[0..3]
- ldp x8,x9,[x1,#8*2]
- adds x1,x1,#8*4 // clear carry bit
- mov x0,xzr
- mov x26,sp
- b Loop_mul4x_reduction
-
-.align 4
-Lmul4x_post:
- // Final step. We see if result is larger than modulus, and
- // if it is, subtract the modulus. But comparison implies
- // subtraction. So we subtract modulus, see if it borrowed,
- // and conditionally copy original value.
- mov x0,x12
- mov x27,x12 // x0 copy
- subs x10,x19,x14
- add x26,sp,#8*8
- sbcs x11,x20,x15
- sub x28,x5,#8*4
-
-Lmul4x_sub:
- sbcs x12,x21,x16
- ldp x14,x15,[x3,#8*0]
- sub x28,x28,#8*4
- ldp x19,x20,[x26,#8*0]
- sbcs x13,x22,x17
- ldp x16,x17,[x3,#8*2]
- add x3,x3,#8*4
- ldp x21,x22,[x26,#8*2]
- add x26,x26,#8*4
- stp x10,x11,[x0,#8*0]
- sbcs x10,x19,x14
- stp x12,x13,[x0,#8*2]
- add x0,x0,#8*4
- sbcs x11,x20,x15
- cbnz x28,Lmul4x_sub
-
- sbcs x12,x21,x16
- mov x26,sp
- add x1,sp,#8*4
- ldp x6,x7,[x27,#8*0]
- sbcs x13,x22,x17
- stp x10,x11,[x0,#8*0]
- ldp x8,x9,[x27,#8*2]
- stp x12,x13,[x0,#8*2]
- ldp x19,x20,[x1,#8*0]
- ldp x21,x22,[x1,#8*2]
- sbcs xzr,x30,xzr // did it borrow?
- ldr x30,[x29,#8] // pull return address
-
- sub x28,x5,#8*4
-Lmul4x_cond_copy:
- sub x28,x28,#8*4
- csel x10,x19,x6,lo
- stp xzr,xzr,[x26,#8*0]
- csel x11,x20,x7,lo
- ldp x6,x7,[x27,#8*4]
- ldp x19,x20,[x1,#8*4]
- csel x12,x21,x8,lo
- stp xzr,xzr,[x26,#8*2]
- add x26,x26,#8*4
- csel x13,x22,x9,lo
- ldp x8,x9,[x27,#8*6]
- ldp x21,x22,[x1,#8*6]
- add x1,x1,#8*4
- stp x10,x11,[x27,#8*0]
- stp x12,x13,[x27,#8*2]
- add x27,x27,#8*4
- cbnz x28,Lmul4x_cond_copy
-
- csel x10,x19,x6,lo
- stp xzr,xzr,[x26,#8*0]
- csel x11,x20,x7,lo
- stp xzr,xzr,[x26,#8*2]
- csel x12,x21,x8,lo
- stp xzr,xzr,[x26,#8*3]
- csel x13,x22,x9,lo
- stp xzr,xzr,[x26,#8*4]
- stp x10,x11,[x27,#8*0]
- stp x12,x13,[x27,#8*2]
-
- b Lmul4x_done
-
-.align 4
-Lmul4x4_post_condition:
- adc x0,x0,xzr
- ldr x1,[x29,#96] // pull rp
- // x19-3,x0 hold result, x14-7 hold modulus
- subs x6,x19,x14
- ldr x30,[x29,#8] // pull return address
- sbcs x7,x20,x15
- stp xzr,xzr,[sp,#8*0]
- sbcs x8,x21,x16
- stp xzr,xzr,[sp,#8*2]
- sbcs x9,x22,x17
- stp xzr,xzr,[sp,#8*4]
- sbcs xzr,x0,xzr // did it borrow?
- stp xzr,xzr,[sp,#8*6]
-
- // x6-3 hold result-modulus
- csel x6,x19,x6,lo
- csel x7,x20,x7,lo
- csel x8,x21,x8,lo
- csel x9,x22,x9,lo
- stp x6,x7,[x1,#8*0]
- stp x8,x9,[x1,#8*2]
-
-Lmul4x_done:
- ldp x19,x20,[x29,#16]
- mov sp,x29
- ldp x21,x22,[x29,#32]
- mov x0,#1
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- ldr x29,[sp],#128
- // x30 is popped earlier
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-.byte 77,111,110,116,103,111,109,101,114,121,32,77,117,108,116,105,112,108,105,99,97,116,105,111,110,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.align 2
-.align 4
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
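
For orientation, the file above is OpenSSL's word-serial Montgomery multiplication for AArch64 (bn_mul_mont plus the 8x-squaring and 4x paths). The "(*)" lines in it exploit the fact that, by construction of n0 = -np[0]^-1 mod 2^64, lo(np[0]*m) is exactly the two's complement of the limb being cancelled, so the multiply-and-add can be replaced by a subs that merely regenerates the carry (set when that limb is nonzero). A compact, non-constant-time C reference of the algorithm, with illustrative names and an explicit scratch buffer (a sketch, not the real implementation):

    #include <stdint.h>
    #include <stddef.h>

    /* rp = ap*bp*R^-1 mod np, R = 2^(64*num), n0 = -np[0]^-1 mod 2^64.
     * Sketch only: uses __uint128_t, makes no constant-time claims, and
     * expects a zeroed scratch buffer t of num+2 limbs from the caller. */
    static void mont_mul_ref(uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                             const uint64_t *np, uint64_t n0, size_t num,
                             uint64_t *t) {
        for (size_t i = 0; i < num; i++) {
            /* t += ap * bp[i] */
            __uint128_t acc = 0;
            for (size_t j = 0; j < num; j++) {
                acc += (__uint128_t)ap[j] * bp[i] + t[j];
                t[j] = (uint64_t)acc;
                acc >>= 64;
            }
            acc += t[num];
            t[num] = (uint64_t)acc;
            t[num + 1] = (uint64_t)(acc >> 64);

            /* t += m*np with m chosen so t[0] becomes 0, then drop that limb */
            uint64_t m = t[0] * n0;
            acc = ((__uint128_t)np[0] * m + t[0]) >> 64;  /* low limb is 0 */
            for (size_t j = 1; j < num; j++) {
                acc += (__uint128_t)np[j] * m + t[j];
                t[j - 1] = (uint64_t)acc;
                acc >>= 64;
            }
            acc += t[num];
            t[num - 1] = (uint64_t)acc;
            t[num] = t[num + 1] + (uint64_t)(acc >> 64);
            t[num + 1] = 0;
        }
        /* Final conditional subtraction, as in the Lsub/Lcond_copy tails. */
        uint64_t borrow = 0;
        for (size_t j = 0; j < num; j++) {
            __uint128_t d = (__uint128_t)t[j] - np[j] - borrow;
            rp[j] = (uint64_t)d;
            borrow = (uint64_t)(d >> 64) & 1;
        }
        if (t[num] < borrow) {          /* borrowed past the top: t < np */
            for (size_t j = 0; j < num; j++) rp[j] = t[j];
        }
    }
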
diff --git a/win-aarch64/crypto/fipsmodule/bn-armv8-win.S b/win-aarch64/crypto/fipsmodule/bn-armv8-win.S
deleted file mode 100644
index af970800..00000000
--- a/win-aarch64/crypto/fipsmodule/bn-armv8-win.S
+++ /dev/null
@@ -1,89 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-
-.text
-
-// BN_ULONG bn_add_words(BN_ULONG *rp, const BN_ULONG *ap, const BN_ULONG *bp,
-// size_t num);
-
-.globl bn_add_words
-
-.align 4
-bn_add_words:
- AARCH64_VALID_CALL_TARGET
- # Clear the carry flag.
- cmn xzr, xzr
-
- # aarch64 can load two registers at a time, so we do two loop iterations at
- # at a time. Split x3 = 2 * x8 + x3. This allows loop
- # operations to use CBNZ without clobbering the carry flag.
- lsr x8, x3, #1
- and x3, x3, #1
-
- cbz x8, Ladd_tail
-Ladd_loop:
- ldp x4, x5, [x1], #16
- ldp x6, x7, [x2], #16
- sub x8, x8, #1
- adcs x4, x4, x6
- adcs x5, x5, x7
- stp x4, x5, [x0], #16
- cbnz x8, Ladd_loop
-
-Ladd_tail:
- cbz x3, Ladd_exit
- ldr x4, [x1], #8
- ldr x6, [x2], #8
- adcs x4, x4, x6
- str x4, [x0], #8
-
-Ladd_exit:
- cset x0, cs
- ret
-
-
-// BN_ULONG bn_sub_words(BN_ULONG *rp, const BN_ULONG *ap, const BN_ULONG *bp,
-// size_t num);
-
-.globl bn_sub_words
-
-.align 4
-bn_sub_words:
- AARCH64_VALID_CALL_TARGET
- # Set the carry flag. Arm's borrow bit is flipped from the carry flag,
- # so we want C = 1 here.
- cmp xzr, xzr
-
- # aarch64 can load two registers at a time, so we do two loop iterations at
- # at a time. Split x3 = 2 * x8 + x3. This allows loop
- # operations to use CBNZ without clobbering the carry flag.
- lsr x8, x3, #1
- and x3, x3, #1
-
- cbz x8, Lsub_tail
-Lsub_loop:
- ldp x4, x5, [x1], #16
- ldp x6, x7, [x2], #16
- sub x8, x8, #1
- sbcs x4, x4, x6
- sbcs x5, x5, x7
- stp x4, x5, [x0], #16
- cbnz x8, Lsub_loop
-
-Lsub_tail:
- cbz x3, Lsub_exit
- ldr x4, [x1], #8
- ldr x6, [x2], #8
- sbcs x4, x4, x6
- str x4, [x0], #8
-
-Lsub_exit:
- cset x0, cc
- ret
-
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
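
Note: bn_add_words and bn_sub_words above are plain multi-limb add/subtract with carry; the only subtlety is splitting the count in two so the loop can use CBNZ without clobbering the carry flag. The contract, as a C sketch with an illustrative name (the assembly above is the real implementation):

    #include <stdint.h>
    #include <stddef.h>

    /* r = a + b limb-wise, returning the final carry.  bn_sub_words is the
     * mirror image with a borrow. */
    static uint64_t bn_add_words_ref(uint64_t *r, const uint64_t *a,
                                     const uint64_t *b, size_t num) {
        uint64_t carry = 0;
        for (size_t i = 0; i < num; i++) {
            uint64_t s = a[i] + carry;
            uint64_t c1 = s < carry;      /* overflow from adding the old carry */
            r[i] = s + b[i];
            carry = c1 | (r[i] < b[i]);   /* overflow from adding b[i] */
        }
        return carry;
    }
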
diff --git a/win-aarch64/crypto/fipsmodule/ghash-neon-armv8-win.S b/win-aarch64/crypto/fipsmodule/ghash-neon-armv8-win.S
deleted file mode 100644
index d9688931..00000000
--- a/win-aarch64/crypto/fipsmodule/ghash-neon-armv8-win.S
+++ /dev/null
@@ -1,341 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-
-.text
-
-.globl gcm_init_neon
-
-.def gcm_init_neon
- .type 32
-.endef
-.align 4
-gcm_init_neon:
- AARCH64_VALID_CALL_TARGET
- // This function is adapted from gcm_init_v8. xC2 is t3.
- ld1 {v17.2d}, [x1] // load H
- movi v19.16b, #0xe1
- shl v19.2d, v19.2d, #57 // 0xc2.0
- ext v3.16b, v17.16b, v17.16b, #8
- ushr v18.2d, v19.2d, #63
- dup v17.4s, v17.s[1]
- ext v16.16b, v18.16b, v19.16b, #8 // t0=0xc2....01
- ushr v18.2d, v3.2d, #63
- sshr v17.4s, v17.4s, #31 // broadcast carry bit
- and v18.16b, v18.16b, v16.16b
- shl v3.2d, v3.2d, #1
- ext v18.16b, v18.16b, v18.16b, #8
- and v16.16b, v16.16b, v17.16b
- orr v3.16b, v3.16b, v18.16b // H<<<=1
- eor v5.16b, v3.16b, v16.16b // twisted H
- st1 {v5.2d}, [x0] // store Htable[0]
- ret
-
-
-.globl gcm_gmult_neon
-
-.def gcm_gmult_neon
- .type 32
-.endef
-.align 4
-gcm_gmult_neon:
- AARCH64_VALID_CALL_TARGET
- ld1 {v3.16b}, [x0] // load Xi
- ld1 {v5.1d}, [x1], #8 // load twisted H
- ld1 {v6.1d}, [x1]
- adrp x9, Lmasks // load constants
- add x9, x9, :lo12:Lmasks
- ld1 {v24.2d, v25.2d}, [x9]
- rev64 v3.16b, v3.16b // byteswap Xi
- ext v3.16b, v3.16b, v3.16b, #8
- eor v7.8b, v5.8b, v6.8b // Karatsuba pre-processing
-
- mov x3, #16
- b Lgmult_neon
-
-
-.globl gcm_ghash_neon
-
-.def gcm_ghash_neon
- .type 32
-.endef
-.align 4
-gcm_ghash_neon:
- AARCH64_VALID_CALL_TARGET
- ld1 {v0.16b}, [x0] // load Xi
- ld1 {v5.1d}, [x1], #8 // load twisted H
- ld1 {v6.1d}, [x1]
- adrp x9, Lmasks // load constants
- add x9, x9, :lo12:Lmasks
- ld1 {v24.2d, v25.2d}, [x9]
- rev64 v0.16b, v0.16b // byteswap Xi
- ext v0.16b, v0.16b, v0.16b, #8
- eor v7.8b, v5.8b, v6.8b // Karatsuba pre-processing
-
-Loop_neon:
- ld1 {v3.16b}, [x2], #16 // load inp
- rev64 v3.16b, v3.16b // byteswap inp
- ext v3.16b, v3.16b, v3.16b, #8
- eor v3.16b, v3.16b, v0.16b // inp ^= Xi
-
-Lgmult_neon:
- // Split the input into v3 and v4. (The upper halves are unused,
- // so it is okay to leave them alone.)
- ins v4.d[0], v3.d[1]
- ext v16.8b, v5.8b, v5.8b, #1 // A1
- pmull v16.8h, v16.8b, v3.8b // F = A1*B
- ext v0.8b, v3.8b, v3.8b, #1 // B1
- pmull v0.8h, v5.8b, v0.8b // E = A*B1
- ext v17.8b, v5.8b, v5.8b, #2 // A2
- pmull v17.8h, v17.8b, v3.8b // H = A2*B
- ext v19.8b, v3.8b, v3.8b, #2 // B2
- pmull v19.8h, v5.8b, v19.8b // G = A*B2
- ext v18.8b, v5.8b, v5.8b, #3 // A3
- eor v16.16b, v16.16b, v0.16b // L = E + F
- pmull v18.8h, v18.8b, v3.8b // J = A3*B
- ext v0.8b, v3.8b, v3.8b, #3 // B3
- eor v17.16b, v17.16b, v19.16b // M = G + H
- pmull v0.8h, v5.8b, v0.8b // I = A*B3
-
- // Here we diverge from the 32-bit version. It computes the following
- // (instructions reordered for clarity):
- //
- // veor $t0#lo, $t0#lo, $t0#hi @ t0 = P0 + P1 (L)
- // vand $t0#hi, $t0#hi, $k48
- // veor $t0#lo, $t0#lo, $t0#hi
- //
- // veor $t1#lo, $t1#lo, $t1#hi @ t1 = P2 + P3 (M)
- // vand $t1#hi, $t1#hi, $k32
- // veor $t1#lo, $t1#lo, $t1#hi
- //
- // veor $t2#lo, $t2#lo, $t2#hi @ t2 = P4 + P5 (N)
- // vand $t2#hi, $t2#hi, $k16
- // veor $t2#lo, $t2#lo, $t2#hi
- //
- // veor $t3#lo, $t3#lo, $t3#hi @ t3 = P6 + P7 (K)
- // vmov.i64 $t3#hi, #0
- //
- // $kN is a mask with the bottom N bits set. AArch64 cannot compute on
- // upper halves of SIMD registers, so we must split each half into
- // separate registers. To compensate, we pair computations up and
- // parallelize.
-
- ext v19.8b, v3.8b, v3.8b, #4 // B4
- eor v18.16b, v18.16b, v0.16b // N = I + J
- pmull v19.8h, v5.8b, v19.8b // K = A*B4
-
- // This can probably be scheduled more efficiently. For now, we just
- // pair up independent instructions.
- zip1 v20.2d, v16.2d, v17.2d
- zip1 v22.2d, v18.2d, v19.2d
- zip2 v21.2d, v16.2d, v17.2d
- zip2 v23.2d, v18.2d, v19.2d
- eor v20.16b, v20.16b, v21.16b
- eor v22.16b, v22.16b, v23.16b
- and v21.16b, v21.16b, v24.16b
- and v23.16b, v23.16b, v25.16b
- eor v20.16b, v20.16b, v21.16b
- eor v22.16b, v22.16b, v23.16b
- zip1 v16.2d, v20.2d, v21.2d
- zip1 v18.2d, v22.2d, v23.2d
- zip2 v17.2d, v20.2d, v21.2d
- zip2 v19.2d, v22.2d, v23.2d
-
- ext v16.16b, v16.16b, v16.16b, #15 // t0 = t0 << 8
- ext v17.16b, v17.16b, v17.16b, #14 // t1 = t1 << 16
- pmull v0.8h, v5.8b, v3.8b // D = A*B
- ext v19.16b, v19.16b, v19.16b, #12 // t3 = t3 << 32
- ext v18.16b, v18.16b, v18.16b, #13 // t2 = t2 << 24
- eor v16.16b, v16.16b, v17.16b
- eor v18.16b, v18.16b, v19.16b
- eor v0.16b, v0.16b, v16.16b
- eor v0.16b, v0.16b, v18.16b
- eor v3.8b, v3.8b, v4.8b // Karatsuba pre-processing
- ext v16.8b, v7.8b, v7.8b, #1 // A1
- pmull v16.8h, v16.8b, v3.8b // F = A1*B
- ext v1.8b, v3.8b, v3.8b, #1 // B1
- pmull v1.8h, v7.8b, v1.8b // E = A*B1
- ext v17.8b, v7.8b, v7.8b, #2 // A2
- pmull v17.8h, v17.8b, v3.8b // H = A2*B
- ext v19.8b, v3.8b, v3.8b, #2 // B2
- pmull v19.8h, v7.8b, v19.8b // G = A*B2
- ext v18.8b, v7.8b, v7.8b, #3 // A3
- eor v16.16b, v16.16b, v1.16b // L = E + F
- pmull v18.8h, v18.8b, v3.8b // J = A3*B
- ext v1.8b, v3.8b, v3.8b, #3 // B3
- eor v17.16b, v17.16b, v19.16b // M = G + H
- pmull v1.8h, v7.8b, v1.8b // I = A*B3
-
- // Here we diverge from the 32-bit version. It computes the following
- // (instructions reordered for clarity):
- //
- // veor $t0#lo, $t0#lo, $t0#hi @ t0 = P0 + P1 (L)
- // vand $t0#hi, $t0#hi, $k48
- // veor $t0#lo, $t0#lo, $t0#hi
- //
- // veor $t1#lo, $t1#lo, $t1#hi @ t1 = P2 + P3 (M)
- // vand $t1#hi, $t1#hi, $k32
- // veor $t1#lo, $t1#lo, $t1#hi
- //
- // veor $t2#lo, $t2#lo, $t2#hi @ t2 = P4 + P5 (N)
- // vand $t2#hi, $t2#hi, $k16
- // veor $t2#lo, $t2#lo, $t2#hi
- //
- // veor $t3#lo, $t3#lo, $t3#hi @ t3 = P6 + P7 (K)
- // vmov.i64 $t3#hi, #0
- //
- // $kN is a mask with the bottom N bits set. AArch64 cannot compute on
- // upper halves of SIMD registers, so we must split each half into
- // separate registers. To compensate, we pair computations up and
- // parallelize.
-
- ext v19.8b, v3.8b, v3.8b, #4 // B4
- eor v18.16b, v18.16b, v1.16b // N = I + J
- pmull v19.8h, v7.8b, v19.8b // K = A*B4
-
- // This can probably be scheduled more efficiently. For now, we just
- // pair up independent instructions.
- zip1 v20.2d, v16.2d, v17.2d
- zip1 v22.2d, v18.2d, v19.2d
- zip2 v21.2d, v16.2d, v17.2d
- zip2 v23.2d, v18.2d, v19.2d
- eor v20.16b, v20.16b, v21.16b
- eor v22.16b, v22.16b, v23.16b
- and v21.16b, v21.16b, v24.16b
- and v23.16b, v23.16b, v25.16b
- eor v20.16b, v20.16b, v21.16b
- eor v22.16b, v22.16b, v23.16b
- zip1 v16.2d, v20.2d, v21.2d
- zip1 v18.2d, v22.2d, v23.2d
- zip2 v17.2d, v20.2d, v21.2d
- zip2 v19.2d, v22.2d, v23.2d
-
- ext v16.16b, v16.16b, v16.16b, #15 // t0 = t0 << 8
- ext v17.16b, v17.16b, v17.16b, #14 // t1 = t1 << 16
- pmull v1.8h, v7.8b, v3.8b // D = A*B
- ext v19.16b, v19.16b, v19.16b, #12 // t3 = t3 << 32
- ext v18.16b, v18.16b, v18.16b, #13 // t2 = t2 << 24
- eor v16.16b, v16.16b, v17.16b
- eor v18.16b, v18.16b, v19.16b
- eor v1.16b, v1.16b, v16.16b
- eor v1.16b, v1.16b, v18.16b
- ext v16.8b, v6.8b, v6.8b, #1 // A1
- pmull v16.8h, v16.8b, v4.8b // F = A1*B
- ext v2.8b, v4.8b, v4.8b, #1 // B1
- pmull v2.8h, v6.8b, v2.8b // E = A*B1
- ext v17.8b, v6.8b, v6.8b, #2 // A2
- pmull v17.8h, v17.8b, v4.8b // H = A2*B
- ext v19.8b, v4.8b, v4.8b, #2 // B2
- pmull v19.8h, v6.8b, v19.8b // G = A*B2
- ext v18.8b, v6.8b, v6.8b, #3 // A3
- eor v16.16b, v16.16b, v2.16b // L = E + F
- pmull v18.8h, v18.8b, v4.8b // J = A3*B
- ext v2.8b, v4.8b, v4.8b, #3 // B3
- eor v17.16b, v17.16b, v19.16b // M = G + H
- pmull v2.8h, v6.8b, v2.8b // I = A*B3
-
- // Here we diverge from the 32-bit version. It computes the following
- // (instructions reordered for clarity):
- //
- // veor $t0#lo, $t0#lo, $t0#hi @ t0 = P0 + P1 (L)
- // vand $t0#hi, $t0#hi, $k48
- // veor $t0#lo, $t0#lo, $t0#hi
- //
- // veor $t1#lo, $t1#lo, $t1#hi @ t1 = P2 + P3 (M)
- // vand $t1#hi, $t1#hi, $k32
- // veor $t1#lo, $t1#lo, $t1#hi
- //
- // veor $t2#lo, $t2#lo, $t2#hi @ t2 = P4 + P5 (N)
- // vand $t2#hi, $t2#hi, $k16
- // veor $t2#lo, $t2#lo, $t2#hi
- //
- // veor $t3#lo, $t3#lo, $t3#hi @ t3 = P6 + P7 (K)
- // vmov.i64 $t3#hi, #0
- //
- // $kN is a mask with the bottom N bits set. AArch64 cannot compute on
- // upper halves of SIMD registers, so we must split each half into
- // separate registers. To compensate, we pair computations up and
- // parallelize.
-
- ext v19.8b, v4.8b, v4.8b, #4 // B4
- eor v18.16b, v18.16b, v2.16b // N = I + J
- pmull v19.8h, v6.8b, v19.8b // K = A*B4
-
- // This can probably be scheduled more efficiently. For now, we just
- // pair up independent instructions.
- zip1 v20.2d, v16.2d, v17.2d
- zip1 v22.2d, v18.2d, v19.2d
- zip2 v21.2d, v16.2d, v17.2d
- zip2 v23.2d, v18.2d, v19.2d
- eor v20.16b, v20.16b, v21.16b
- eor v22.16b, v22.16b, v23.16b
- and v21.16b, v21.16b, v24.16b
- and v23.16b, v23.16b, v25.16b
- eor v20.16b, v20.16b, v21.16b
- eor v22.16b, v22.16b, v23.16b
- zip1 v16.2d, v20.2d, v21.2d
- zip1 v18.2d, v22.2d, v23.2d
- zip2 v17.2d, v20.2d, v21.2d
- zip2 v19.2d, v22.2d, v23.2d
-
- ext v16.16b, v16.16b, v16.16b, #15 // t0 = t0 << 8
- ext v17.16b, v17.16b, v17.16b, #14 // t1 = t1 << 16
- pmull v2.8h, v6.8b, v4.8b // D = A*B
- ext v19.16b, v19.16b, v19.16b, #12 // t3 = t3 << 32
- ext v18.16b, v18.16b, v18.16b, #13 // t2 = t2 << 24
- eor v16.16b, v16.16b, v17.16b
- eor v18.16b, v18.16b, v19.16b
- eor v2.16b, v2.16b, v16.16b
- eor v2.16b, v2.16b, v18.16b
- ext v16.16b, v0.16b, v2.16b, #8
- eor v1.16b, v1.16b, v0.16b // Karatsuba post-processing
- eor v1.16b, v1.16b, v2.16b
- eor v1.16b, v1.16b, v16.16b // Xm overlaps Xh.lo and Xl.hi
- ins v0.d[1], v1.d[0] // Xh|Xl - 256-bit result
- // This is a no-op due to the ins instruction below.
- // ins v2.d[0], v1.d[1]
-
- // equivalent of reduction_avx from ghash-x86_64.pl
- shl v17.2d, v0.2d, #57 // 1st phase
- shl v18.2d, v0.2d, #62
- eor v18.16b, v18.16b, v17.16b //
- shl v17.2d, v0.2d, #63
- eor v18.16b, v18.16b, v17.16b //
- // Note Xm contains {Xl.d[1], Xh.d[0]}.
- eor v18.16b, v18.16b, v1.16b
- ins v0.d[1], v18.d[0] // Xl.d[1] ^= t2.d[0]
- ins v2.d[0], v18.d[1] // Xh.d[0] ^= t2.d[1]
-
- ushr v18.2d, v0.2d, #1 // 2nd phase
- eor v2.16b, v2.16b,v0.16b
- eor v0.16b, v0.16b,v18.16b //
- ushr v18.2d, v18.2d, #6
- ushr v0.2d, v0.2d, #1 //
- eor v0.16b, v0.16b, v2.16b //
- eor v0.16b, v0.16b, v18.16b //
-
- subs x3, x3, #16
- bne Loop_neon
-
- rev64 v0.16b, v0.16b // byteswap Xi and write
- ext v0.16b, v0.16b, v0.16b, #8
- st1 {v0.16b}, [x0]
-
- ret
-
-
-.section .rodata
-.align 4
-Lmasks:
-.quad 0x0000ffffffffffff // k48
-.quad 0x00000000ffffffff // k32
-.quad 0x000000000000ffff // k16
-.quad 0x0000000000000000 // k0
-.byte 71,72,65,83,72,32,102,111,114,32,65,82,77,118,56,44,32,100,101,114,105,118,101,100,32,102,114,111,109,32,65,82,77,118,52,32,118,101,114,115,105,111,110,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.align 2
-.align 2
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
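
Note: the repeated "Karatsuba pre-/post-processing" steps above rely on the fact that in GF(2)[x], where addition is XOR, a 128-bit carryless product can be built from three 64-bit ones (lo*lo, hi*hi and (lo^hi)*(lo^hi)), with the middle term recovered by XORing the outer two back in. A standalone check of that identity with a naive shift-and-XOR multiply (illustration only, not BoringSSL code):

    #include <assert.h>
    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128x;     /* carryless 128-bit product */

    static u128x clmul64(uint64_t a, uint64_t b) { /* naive 64x64 carryless mul */
        u128x r = {0, 0};
        for (int i = 0; i < 64; i++) {
            if ((b >> i) & 1) {
                r.lo ^= a << i;
                r.hi ^= i ? a >> (64 - i) : 0;
            }
        }
        return r;
    }

    int main(void) {
        uint64_t hl = 0x0123456789abcdefULL, hh = 0xfedcba9876543210ULL;
        uint64_t xl = 0xdeadbeefcafef00dULL, xh = 0x0f1e2d3c4b5a6978ULL;

        /* Schoolbook middle term: hl*xh + hh*xl. */
        u128x ll = clmul64(hl, xl), lh = clmul64(hl, xh);
        u128x hl_ = clmul64(hh, xl), hh_ = clmul64(hh, xh);
        uint64_t mid_lo = lh.lo ^ hl_.lo, mid_hi = lh.hi ^ hl_.hi;

        /* Karatsuba: one product of the XORed halves, then post-processing. */
        u128x k = clmul64(hl ^ hh, xl ^ xh);
        uint64_t kmid_lo = k.lo ^ ll.lo ^ hh_.lo;
        uint64_t kmid_hi = k.hi ^ ll.hi ^ hh_.hi;

        assert(mid_lo == kmid_lo && mid_hi == kmid_hi);
        return 0;
    }
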
diff --git a/win-aarch64/crypto/fipsmodule/ghashv8-armv8-win.S b/win-aarch64/crypto/fipsmodule/ghashv8-armv8-win.S
deleted file mode 100644
index 0be9ac67..00000000
--- a/win-aarch64/crypto/fipsmodule/ghashv8-armv8-win.S
+++ /dev/null
@@ -1,573 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-
-#if __ARM_MAX_ARCH__>=7
-.text
-.arch armv8-a+crypto
-.globl gcm_init_v8
-
-.def gcm_init_v8
- .type 32
-.endef
-.align 4
-gcm_init_v8:
- AARCH64_VALID_CALL_TARGET
- ld1 {v17.2d},[x1] //load input H
- movi v19.16b,#0xe1
- shl v19.2d,v19.2d,#57 //0xc2.0
- ext v3.16b,v17.16b,v17.16b,#8
- ushr v18.2d,v19.2d,#63
- dup v17.4s,v17.s[1]
- ext v16.16b,v18.16b,v19.16b,#8 //t0=0xc2....01
- ushr v18.2d,v3.2d,#63
- sshr v17.4s,v17.4s,#31 //broadcast carry bit
- and v18.16b,v18.16b,v16.16b
- shl v3.2d,v3.2d,#1
- ext v18.16b,v18.16b,v18.16b,#8
- and v16.16b,v16.16b,v17.16b
- orr v3.16b,v3.16b,v18.16b //H<<<=1
- eor v20.16b,v3.16b,v16.16b //twisted H
- st1 {v20.2d},[x0],#16 //store Htable[0]
-
- //calculate H^2
- ext v16.16b,v20.16b,v20.16b,#8 //Karatsuba pre-processing
- pmull v0.1q,v20.1d,v20.1d
- eor v16.16b,v16.16b,v20.16b
- pmull2 v2.1q,v20.2d,v20.2d
- pmull v1.1q,v16.1d,v16.1d
-
- ext v17.16b,v0.16b,v2.16b,#8 //Karatsuba post-processing
- eor v18.16b,v0.16b,v2.16b
- eor v1.16b,v1.16b,v17.16b
- eor v1.16b,v1.16b,v18.16b
- pmull v18.1q,v0.1d,v19.1d //1st phase
-
- ins v2.d[0],v1.d[1]
- ins v1.d[1],v0.d[0]
- eor v0.16b,v1.16b,v18.16b
-
- ext v18.16b,v0.16b,v0.16b,#8 //2nd phase
- pmull v0.1q,v0.1d,v19.1d
- eor v18.16b,v18.16b,v2.16b
- eor v22.16b,v0.16b,v18.16b
-
- ext v17.16b,v22.16b,v22.16b,#8 //Karatsuba pre-processing
- eor v17.16b,v17.16b,v22.16b
- ext v21.16b,v16.16b,v17.16b,#8 //pack Karatsuba pre-processed
- st1 {v21.2d,v22.2d},[x0],#32 //store Htable[1..2]
- //calculate H^3 and H^4
- pmull v0.1q,v20.1d, v22.1d
- pmull v5.1q,v22.1d,v22.1d
- pmull2 v2.1q,v20.2d, v22.2d
- pmull2 v7.1q,v22.2d,v22.2d
- pmull v1.1q,v16.1d,v17.1d
- pmull v6.1q,v17.1d,v17.1d
-
- ext v16.16b,v0.16b,v2.16b,#8 //Karatsuba post-processing
- ext v17.16b,v5.16b,v7.16b,#8
- eor v18.16b,v0.16b,v2.16b
- eor v1.16b,v1.16b,v16.16b
- eor v4.16b,v5.16b,v7.16b
- eor v6.16b,v6.16b,v17.16b
- eor v1.16b,v1.16b,v18.16b
- pmull v18.1q,v0.1d,v19.1d //1st phase
- eor v6.16b,v6.16b,v4.16b
- pmull v4.1q,v5.1d,v19.1d
-
- ins v2.d[0],v1.d[1]
- ins v7.d[0],v6.d[1]
- ins v1.d[1],v0.d[0]
- ins v6.d[1],v5.d[0]
- eor v0.16b,v1.16b,v18.16b
- eor v5.16b,v6.16b,v4.16b
-
- ext v18.16b,v0.16b,v0.16b,#8 //2nd phase
- ext v4.16b,v5.16b,v5.16b,#8
- pmull v0.1q,v0.1d,v19.1d
- pmull v5.1q,v5.1d,v19.1d
- eor v18.16b,v18.16b,v2.16b
- eor v4.16b,v4.16b,v7.16b
- eor v20.16b, v0.16b,v18.16b //H^3
- eor v22.16b,v5.16b,v4.16b //H^4
-
- ext v16.16b,v20.16b, v20.16b,#8 //Karatsuba pre-processing
- ext v17.16b,v22.16b,v22.16b,#8
- eor v16.16b,v16.16b,v20.16b
- eor v17.16b,v17.16b,v22.16b
- ext v21.16b,v16.16b,v17.16b,#8 //pack Karatsuba pre-processed
- st1 {v20.2d,v21.2d,v22.2d},[x0] //store Htable[3..5]
- ret
-
-.globl gcm_gmult_v8
-
-.def gcm_gmult_v8
- .type 32
-.endef
-.align 4
-gcm_gmult_v8:
- AARCH64_VALID_CALL_TARGET
- ld1 {v17.2d},[x0] //load Xi
- movi v19.16b,#0xe1
- ld1 {v20.2d,v21.2d},[x1] //load twisted H, ...
- shl v19.2d,v19.2d,#57
-#ifndef __AARCH64EB__
- rev64 v17.16b,v17.16b
-#endif
- ext v3.16b,v17.16b,v17.16b,#8
-
- pmull v0.1q,v20.1d,v3.1d //H.lo·Xi.lo
- eor v17.16b,v17.16b,v3.16b //Karatsuba pre-processing
- pmull2 v2.1q,v20.2d,v3.2d //H.hi·Xi.hi
- pmull v1.1q,v21.1d,v17.1d //(H.lo+H.hi)·(Xi.lo+Xi.hi)
-
- ext v17.16b,v0.16b,v2.16b,#8 //Karatsuba post-processing
- eor v18.16b,v0.16b,v2.16b
- eor v1.16b,v1.16b,v17.16b
- eor v1.16b,v1.16b,v18.16b
- pmull v18.1q,v0.1d,v19.1d //1st phase of reduction
-
- ins v2.d[0],v1.d[1]
- ins v1.d[1],v0.d[0]
- eor v0.16b,v1.16b,v18.16b
-
- ext v18.16b,v0.16b,v0.16b,#8 //2nd phase of reduction
- pmull v0.1q,v0.1d,v19.1d
- eor v18.16b,v18.16b,v2.16b
- eor v0.16b,v0.16b,v18.16b
-
-#ifndef __AARCH64EB__
- rev64 v0.16b,v0.16b
-#endif
- ext v0.16b,v0.16b,v0.16b,#8
- st1 {v0.2d},[x0] //write out Xi
-
- ret
-
-.globl gcm_ghash_v8
-
-.def gcm_ghash_v8
- .type 32
-.endef
-.align 4
-gcm_ghash_v8:
- AARCH64_VALID_CALL_TARGET
- cmp x3,#64
- b.hs Lgcm_ghash_v8_4x
- ld1 {v0.2d},[x0] //load [rotated] Xi
- //"[rotated]" means that
- //loaded value would have
- //to be rotated in order to
- //make it appear as in
- //algorithm specification
- subs x3,x3,#32 //see if x3 is 32 or larger
- mov x12,#16 //x12 is used as post-
- //increment for input pointer;
- //as loop is modulo-scheduled
- //x12 is zeroed just in time
- //to preclude overstepping
- //inp[len], which means that
- //last block[s] are actually
- //loaded twice, but last
- //copy is not processed
- ld1 {v20.2d,v21.2d},[x1],#32 //load twisted H, ..., H^2
- movi v19.16b,#0xe1
- ld1 {v22.2d},[x1]
- csel x12,xzr,x12,eq //is it time to zero x12?
- ext v0.16b,v0.16b,v0.16b,#8 //rotate Xi
- ld1 {v16.2d},[x2],#16 //load [rotated] I[0]
- shl v19.2d,v19.2d,#57 //compose 0xc2.0 constant
-#ifndef __AARCH64EB__
- rev64 v16.16b,v16.16b
- rev64 v0.16b,v0.16b
-#endif
- ext v3.16b,v16.16b,v16.16b,#8 //rotate I[0]
- b.lo Lodd_tail_v8 //x3 was less than 32
- ld1 {v17.2d},[x2],x12 //load [rotated] I[1]
-#ifndef __AARCH64EB__
- rev64 v17.16b,v17.16b
-#endif
- ext v7.16b,v17.16b,v17.16b,#8
- eor v3.16b,v3.16b,v0.16b //I[i]^=Xi
- pmull v4.1q,v20.1d,v7.1d //H·Ii+1
- eor v17.16b,v17.16b,v7.16b //Karatsuba pre-processing
- pmull2 v6.1q,v20.2d,v7.2d
- b Loop_mod2x_v8
-
-.align 4
-Loop_mod2x_v8:
- ext v18.16b,v3.16b,v3.16b,#8
- subs x3,x3,#32 //is there more data?
- pmull v0.1q,v22.1d,v3.1d //H^2.lo·Xi.lo
- csel x12,xzr,x12,lo //is it time to zero x12?
-
- pmull v5.1q,v21.1d,v17.1d
- eor v18.16b,v18.16b,v3.16b //Karatsuba pre-processing
- pmull2 v2.1q,v22.2d,v3.2d //H^2.hi·Xi.hi
- eor v0.16b,v0.16b,v4.16b //accumulate
- pmull2 v1.1q,v21.2d,v18.2d //(H^2.lo+H^2.hi)·(Xi.lo+Xi.hi)
- ld1 {v16.2d},[x2],x12 //load [rotated] I[i+2]
-
- eor v2.16b,v2.16b,v6.16b
- csel x12,xzr,x12,eq //is it time to zero x12?
- eor v1.16b,v1.16b,v5.16b
-
- ext v17.16b,v0.16b,v2.16b,#8 //Karatsuba post-processing
- eor v18.16b,v0.16b,v2.16b
- eor v1.16b,v1.16b,v17.16b
- ld1 {v17.2d},[x2],x12 //load [rotated] I[i+3]
-#ifndef __AARCH64EB__
- rev64 v16.16b,v16.16b
-#endif
- eor v1.16b,v1.16b,v18.16b
- pmull v18.1q,v0.1d,v19.1d //1st phase of reduction
-
-#ifndef __AARCH64EB__
- rev64 v17.16b,v17.16b
-#endif
- ins v2.d[0],v1.d[1]
- ins v1.d[1],v0.d[0]
- ext v7.16b,v17.16b,v17.16b,#8
- ext v3.16b,v16.16b,v16.16b,#8
- eor v0.16b,v1.16b,v18.16b
- pmull v4.1q,v20.1d,v7.1d //H·Ii+1
- eor v3.16b,v3.16b,v2.16b //accumulate v3.16b early
-
- ext v18.16b,v0.16b,v0.16b,#8 //2nd phase of reduction
- pmull v0.1q,v0.1d,v19.1d
- eor v3.16b,v3.16b,v18.16b
- eor v17.16b,v17.16b,v7.16b //Karatsuba pre-processing
- eor v3.16b,v3.16b,v0.16b
- pmull2 v6.1q,v20.2d,v7.2d
- b.hs Loop_mod2x_v8 //there was at least 32 more bytes
-
- eor v2.16b,v2.16b,v18.16b
- ext v3.16b,v16.16b,v16.16b,#8 //re-construct v3.16b
- adds x3,x3,#32 //re-construct x3
- eor v0.16b,v0.16b,v2.16b //re-construct v0.16b
- b.eq Ldone_v8 //is x3 zero?
-Lodd_tail_v8:
- ext v18.16b,v0.16b,v0.16b,#8
- eor v3.16b,v3.16b,v0.16b //inp^=Xi
- eor v17.16b,v16.16b,v18.16b //v17.16b is rotated inp^Xi
-
- pmull v0.1q,v20.1d,v3.1d //H.lo·Xi.lo
- eor v17.16b,v17.16b,v3.16b //Karatsuba pre-processing
- pmull2 v2.1q,v20.2d,v3.2d //H.hi·Xi.hi
- pmull v1.1q,v21.1d,v17.1d //(H.lo+H.hi)·(Xi.lo+Xi.hi)
-
- ext v17.16b,v0.16b,v2.16b,#8 //Karatsuba post-processing
- eor v18.16b,v0.16b,v2.16b
- eor v1.16b,v1.16b,v17.16b
- eor v1.16b,v1.16b,v18.16b
- pmull v18.1q,v0.1d,v19.1d //1st phase of reduction
-
- ins v2.d[0],v1.d[1]
- ins v1.d[1],v0.d[0]
- eor v0.16b,v1.16b,v18.16b
-
- ext v18.16b,v0.16b,v0.16b,#8 //2nd phase of reduction
- pmull v0.1q,v0.1d,v19.1d
- eor v18.16b,v18.16b,v2.16b
- eor v0.16b,v0.16b,v18.16b
-
-Ldone_v8:
-#ifndef __AARCH64EB__
- rev64 v0.16b,v0.16b
-#endif
- ext v0.16b,v0.16b,v0.16b,#8
- st1 {v0.2d},[x0] //write out Xi
-
- ret
-
-.def gcm_ghash_v8_4x
- .type 32
-.endef
-.align 4
-gcm_ghash_v8_4x:
-Lgcm_ghash_v8_4x:
- ld1 {v0.2d},[x0] //load [rotated] Xi
- ld1 {v20.2d,v21.2d,v22.2d},[x1],#48 //load twisted H, ..., H^2
- movi v19.16b,#0xe1
- ld1 {v26.2d,v27.2d,v28.2d},[x1] //load twisted H^3, ..., H^4
- shl v19.2d,v19.2d,#57 //compose 0xc2.0 constant
-
- ld1 {v4.2d,v5.2d,v6.2d,v7.2d},[x2],#64
-#ifndef __AARCH64EB__
- rev64 v0.16b,v0.16b
- rev64 v5.16b,v5.16b
- rev64 v6.16b,v6.16b
- rev64 v7.16b,v7.16b
- rev64 v4.16b,v4.16b
-#endif
- ext v25.16b,v7.16b,v7.16b,#8
- ext v24.16b,v6.16b,v6.16b,#8
- ext v23.16b,v5.16b,v5.16b,#8
-
- pmull v29.1q,v20.1d,v25.1d //H·Ii+3
- eor v7.16b,v7.16b,v25.16b
- pmull2 v31.1q,v20.2d,v25.2d
- pmull v30.1q,v21.1d,v7.1d
-
- pmull v16.1q,v22.1d,v24.1d //H^2·Ii+2
- eor v6.16b,v6.16b,v24.16b
- pmull2 v24.1q,v22.2d,v24.2d
- pmull2 v6.1q,v21.2d,v6.2d
-
- eor v29.16b,v29.16b,v16.16b
- eor v31.16b,v31.16b,v24.16b
- eor v30.16b,v30.16b,v6.16b
-
- pmull v7.1q,v26.1d,v23.1d //H^3·Ii+1
- eor v5.16b,v5.16b,v23.16b
- pmull2 v23.1q,v26.2d,v23.2d
- pmull v5.1q,v27.1d,v5.1d
-
- eor v29.16b,v29.16b,v7.16b
- eor v31.16b,v31.16b,v23.16b
- eor v30.16b,v30.16b,v5.16b
-
- subs x3,x3,#128
- b.lo Ltail4x
-
- b Loop4x
-
-.align 4
-Loop4x:
- eor v16.16b,v4.16b,v0.16b
- ld1 {v4.2d,v5.2d,v6.2d,v7.2d},[x2],#64
- ext v3.16b,v16.16b,v16.16b,#8
-#ifndef __AARCH64EB__
- rev64 v5.16b,v5.16b
- rev64 v6.16b,v6.16b
- rev64 v7.16b,v7.16b
- rev64 v4.16b,v4.16b
-#endif
-
- pmull v0.1q,v28.1d,v3.1d //H^4·(Xi+Ii)
- eor v16.16b,v16.16b,v3.16b
- pmull2 v2.1q,v28.2d,v3.2d
- ext v25.16b,v7.16b,v7.16b,#8
- pmull2 v1.1q,v27.2d,v16.2d
-
- eor v0.16b,v0.16b,v29.16b
- eor v2.16b,v2.16b,v31.16b
- ext v24.16b,v6.16b,v6.16b,#8
- eor v1.16b,v1.16b,v30.16b
- ext v23.16b,v5.16b,v5.16b,#8
-
- ext v17.16b,v0.16b,v2.16b,#8 //Karatsuba post-processing
- eor v18.16b,v0.16b,v2.16b
- pmull v29.1q,v20.1d,v25.1d //H·Ii+3
- eor v7.16b,v7.16b,v25.16b
- eor v1.16b,v1.16b,v17.16b
- pmull2 v31.1q,v20.2d,v25.2d
- eor v1.16b,v1.16b,v18.16b
- pmull v30.1q,v21.1d,v7.1d
-
- pmull v18.1q,v0.1d,v19.1d //1st phase of reduction
- ins v2.d[0],v1.d[1]
- ins v1.d[1],v0.d[0]
- pmull v16.1q,v22.1d,v24.1d //H^2·Ii+2
- eor v6.16b,v6.16b,v24.16b
- pmull2 v24.1q,v22.2d,v24.2d
- eor v0.16b,v1.16b,v18.16b
- pmull2 v6.1q,v21.2d,v6.2d
-
- eor v29.16b,v29.16b,v16.16b
- eor v31.16b,v31.16b,v24.16b
- eor v30.16b,v30.16b,v6.16b
-
- ext v18.16b,v0.16b,v0.16b,#8 //2nd phase of reduction
- pmull v0.1q,v0.1d,v19.1d
- pmull v7.1q,v26.1d,v23.1d //H^3·Ii+1
- eor v5.16b,v5.16b,v23.16b
- eor v18.16b,v18.16b,v2.16b
- pmull2 v23.1q,v26.2d,v23.2d
- pmull v5.1q,v27.1d,v5.1d
-
- eor v0.16b,v0.16b,v18.16b
- eor v29.16b,v29.16b,v7.16b
- eor v31.16b,v31.16b,v23.16b
- ext v0.16b,v0.16b,v0.16b,#8
- eor v30.16b,v30.16b,v5.16b
-
- subs x3,x3,#64
- b.hs Loop4x
-
-Ltail4x:
- eor v16.16b,v4.16b,v0.16b
- ext v3.16b,v16.16b,v16.16b,#8
-
- pmull v0.1q,v28.1d,v3.1d //H^4·(Xi+Ii)
- eor v16.16b,v16.16b,v3.16b
- pmull2 v2.1q,v28.2d,v3.2d
- pmull2 v1.1q,v27.2d,v16.2d
-
- eor v0.16b,v0.16b,v29.16b
- eor v2.16b,v2.16b,v31.16b
- eor v1.16b,v1.16b,v30.16b
-
- adds x3,x3,#64
- b.eq Ldone4x
-
- cmp x3,#32
- b.lo Lone
- b.eq Ltwo
-Lthree:
- ext v17.16b,v0.16b,v2.16b,#8 //Karatsuba post-processing
- eor v18.16b,v0.16b,v2.16b
- eor v1.16b,v1.16b,v17.16b
- ld1 {v4.2d,v5.2d,v6.2d},[x2]
- eor v1.16b,v1.16b,v18.16b
-#ifndef __AARCH64EB__
- rev64 v5.16b,v5.16b
- rev64 v6.16b,v6.16b
- rev64 v4.16b,v4.16b
-#endif
-
- pmull v18.1q,v0.1d,v19.1d //1st phase of reduction
- ins v2.d[0],v1.d[1]
- ins v1.d[1],v0.d[0]
- ext v24.16b,v6.16b,v6.16b,#8
- ext v23.16b,v5.16b,v5.16b,#8
- eor v0.16b,v1.16b,v18.16b
-
- pmull v29.1q,v20.1d,v24.1d //H·Ii+2
- eor v6.16b,v6.16b,v24.16b
-
- ext v18.16b,v0.16b,v0.16b,#8 //2nd phase of reduction
- pmull v0.1q,v0.1d,v19.1d
- eor v18.16b,v18.16b,v2.16b
- pmull2 v31.1q,v20.2d,v24.2d
- pmull v30.1q,v21.1d,v6.1d
- eor v0.16b,v0.16b,v18.16b
- pmull v7.1q,v22.1d,v23.1d //H^2·Ii+1
- eor v5.16b,v5.16b,v23.16b
- ext v0.16b,v0.16b,v0.16b,#8
-
- pmull2 v23.1q,v22.2d,v23.2d
- eor v16.16b,v4.16b,v0.16b
- pmull2 v5.1q,v21.2d,v5.2d
- ext v3.16b,v16.16b,v16.16b,#8
-
- eor v29.16b,v29.16b,v7.16b
- eor v31.16b,v31.16b,v23.16b
- eor v30.16b,v30.16b,v5.16b
-
- pmull v0.1q,v26.1d,v3.1d //H^3·(Xi+Ii)
- eor v16.16b,v16.16b,v3.16b
- pmull2 v2.1q,v26.2d,v3.2d
- pmull v1.1q,v27.1d,v16.1d
-
- eor v0.16b,v0.16b,v29.16b
- eor v2.16b,v2.16b,v31.16b
- eor v1.16b,v1.16b,v30.16b
- b Ldone4x
-
-.align 4
-Ltwo:
- ext v17.16b,v0.16b,v2.16b,#8 //Karatsuba post-processing
- eor v18.16b,v0.16b,v2.16b
- eor v1.16b,v1.16b,v17.16b
- ld1 {v4.2d,v5.2d},[x2]
- eor v1.16b,v1.16b,v18.16b
-#ifndef __AARCH64EB__
- rev64 v5.16b,v5.16b
- rev64 v4.16b,v4.16b
-#endif
-
- pmull v18.1q,v0.1d,v19.1d //1st phase of reduction
- ins v2.d[0],v1.d[1]
- ins v1.d[1],v0.d[0]
- ext v23.16b,v5.16b,v5.16b,#8
- eor v0.16b,v1.16b,v18.16b
-
- ext v18.16b,v0.16b,v0.16b,#8 //2nd phase of reduction
- pmull v0.1q,v0.1d,v19.1d
- eor v18.16b,v18.16b,v2.16b
- eor v0.16b,v0.16b,v18.16b
- ext v0.16b,v0.16b,v0.16b,#8
-
- pmull v29.1q,v20.1d,v23.1d //H·Ii+1
- eor v5.16b,v5.16b,v23.16b
-
- eor v16.16b,v4.16b,v0.16b
- ext v3.16b,v16.16b,v16.16b,#8
-
- pmull2 v31.1q,v20.2d,v23.2d
- pmull v30.1q,v21.1d,v5.1d
-
- pmull v0.1q,v22.1d,v3.1d //H^2·(Xi+Ii)
- eor v16.16b,v16.16b,v3.16b
- pmull2 v2.1q,v22.2d,v3.2d
- pmull2 v1.1q,v21.2d,v16.2d
-
- eor v0.16b,v0.16b,v29.16b
- eor v2.16b,v2.16b,v31.16b
- eor v1.16b,v1.16b,v30.16b
- b Ldone4x
-
-.align 4
-Lone:
- ext v17.16b,v0.16b,v2.16b,#8 //Karatsuba post-processing
- eor v18.16b,v0.16b,v2.16b
- eor v1.16b,v1.16b,v17.16b
- ld1 {v4.2d},[x2]
- eor v1.16b,v1.16b,v18.16b
-#ifndef __AARCH64EB__
- rev64 v4.16b,v4.16b
-#endif
-
- pmull v18.1q,v0.1d,v19.1d //1st phase of reduction
- ins v2.d[0],v1.d[1]
- ins v1.d[1],v0.d[0]
- eor v0.16b,v1.16b,v18.16b
-
- ext v18.16b,v0.16b,v0.16b,#8 //2nd phase of reduction
- pmull v0.1q,v0.1d,v19.1d
- eor v18.16b,v18.16b,v2.16b
- eor v0.16b,v0.16b,v18.16b
- ext v0.16b,v0.16b,v0.16b,#8
-
- eor v16.16b,v4.16b,v0.16b
- ext v3.16b,v16.16b,v16.16b,#8
-
- pmull v0.1q,v20.1d,v3.1d
- eor v16.16b,v16.16b,v3.16b
- pmull2 v2.1q,v20.2d,v3.2d
- pmull v1.1q,v21.1d,v16.1d
-
-Ldone4x:
- ext v17.16b,v0.16b,v2.16b,#8 //Karatsuba post-processing
- eor v18.16b,v0.16b,v2.16b
- eor v1.16b,v1.16b,v17.16b
- eor v1.16b,v1.16b,v18.16b
-
- pmull v18.1q,v0.1d,v19.1d //1st phase of reduction
- ins v2.d[0],v1.d[1]
- ins v1.d[1],v0.d[0]
- eor v0.16b,v1.16b,v18.16b
-
- ext v18.16b,v0.16b,v0.16b,#8 //2nd phase of reduction
- pmull v0.1q,v0.1d,v19.1d
- eor v18.16b,v18.16b,v2.16b
- eor v0.16b,v0.16b,v18.16b
- ext v0.16b,v0.16b,v0.16b,#8
-
-#ifndef __AARCH64EB__
- rev64 v0.16b,v0.16b
-#endif
- st1 {v0.2d},[x0] //write out Xi
-
- ret
-
-.byte 71,72,65,83,72,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.align 2
-.align 2
-#endif
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
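
Note: gcm_ghash_v8_4x above folds four blocks per iteration using the powers H^2..H^4 that gcm_init_v8 stores in Htable. The rewrite it depends on, checked here with ordinary modular arithmetic standing in for GF(2^128) multiplication (only ring identities are used, so the check carries over; illustration only):

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        const uint64_t p = 1000000007ULL;   /* small stand-in modulus */
        uint64_t H = 123456789, X = 987654321;
        uint64_t C[4] = {11, 22, 33, 44};   /* four "blocks" */

        /* Serial GHASH update: X = (X + C[i]) * H for each block. */
        uint64_t serial = X;
        for (int i = 0; i < 4; i++) serial = (serial + C[i]) % p * H % p;

        /* Batched form used by the 4x path:
         * (X + C0)*H^4 + C1*H^3 + C2*H^2 + C3*H. */
        uint64_t H2 = H * H % p, H3 = H2 * H % p, H4 = H3 * H % p;
        uint64_t batched = (X + C[0]) % p * H4 % p;
        batched = (batched + C[1] * H3) % p;
        batched = (batched + C[2] * H2) % p;
        batched = (batched + C[3] * H) % p;

        assert(serial == batched);
        return 0;
    }
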
diff --git a/win-aarch64/crypto/fipsmodule/p256-armv8-asm-win.S b/win-aarch64/crypto/fipsmodule/p256-armv8-asm-win.S
deleted file mode 100644
index a55d20d2..00000000
--- a/win-aarch64/crypto/fipsmodule/p256-armv8-asm-win.S
+++ /dev/null
@@ -1,1766 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include "openssl/arm_arch.h"
-
-.section .rodata
-.align 5
-Lpoly:
-.quad 0xffffffffffffffff,0x00000000ffffffff,0x0000000000000000,0xffffffff00000001
-LRR: // 2^512 mod P precomputed for NIST P256 polynomial
-.quad 0x0000000000000003,0xfffffffbffffffff,0xfffffffffffffffe,0x00000004fffffffd
-Lone_mont:
-.quad 0x0000000000000001,0xffffffff00000000,0xffffffffffffffff,0x00000000fffffffe
-Lone:
-.quad 1,0,0,0
-Lord:
-.quad 0xf3b9cac2fc632551,0xbce6faada7179e84,0xffffffffffffffff,0xffffffff00000000
-LordK:
-.quad 0xccd1c8aaee00bc4f
-.byte 69,67,80,95,78,73,83,84,90,50,53,54,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.align 2
-.text
-
-// void ecp_nistz256_mul_mont(BN_ULONG x0[4],const BN_ULONG x1[4],
-// const BN_ULONG x2[4]);
-.globl ecp_nistz256_mul_mont
-
-.def ecp_nistz256_mul_mont
- .type 32
-.endef
-.align 4
-ecp_nistz256_mul_mont:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-32]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
-
- ldr x3,[x2] // bp[0]
- ldp x4,x5,[x1]
- ldp x6,x7,[x1,#16]
- adrp x13,Lpoly
- add x13,x13,:lo12:Lpoly
- ldr x12,[x13,#8]
- ldr x13,[x13,#24]
-
- bl __ecp_nistz256_mul_mont
-
- ldp x19,x20,[sp,#16]
- ldp x29,x30,[sp],#32
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-// void ecp_nistz256_sqr_mont(BN_ULONG x0[4],const BN_ULONG x1[4]);
-.globl ecp_nistz256_sqr_mont
-
-.def ecp_nistz256_sqr_mont
- .type 32
-.endef
-.align 4
-ecp_nistz256_sqr_mont:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-32]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
-
- ldp x4,x5,[x1]
- ldp x6,x7,[x1,#16]
- adrp x13,Lpoly
- add x13,x13,:lo12:Lpoly
- ldr x12,[x13,#8]
- ldr x13,[x13,#24]
-
- bl __ecp_nistz256_sqr_mont
-
- ldp x19,x20,[sp,#16]
- ldp x29,x30,[sp],#32
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-// void ecp_nistz256_div_by_2(BN_ULONG x0[4],const BN_ULONG x1[4]);
-.globl ecp_nistz256_div_by_2
-
-.def ecp_nistz256_div_by_2
- .type 32
-.endef
-.align 4
-ecp_nistz256_div_by_2:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- ldp x14,x15,[x1]
- ldp x16,x17,[x1,#16]
- adrp x13,Lpoly
- add x13,x13,:lo12:Lpoly
- ldr x12,[x13,#8]
- ldr x13,[x13,#24]
-
- bl __ecp_nistz256_div_by_2
-
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-// void ecp_nistz256_mul_by_2(BN_ULONG x0[4],const BN_ULONG x1[4]);
-.globl ecp_nistz256_mul_by_2
-
-.def ecp_nistz256_mul_by_2
- .type 32
-.endef
-.align 4
-ecp_nistz256_mul_by_2:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- ldp x14,x15,[x1]
- ldp x16,x17,[x1,#16]
- adrp x13,Lpoly
- add x13,x13,:lo12:Lpoly
- ldr x12,[x13,#8]
- ldr x13,[x13,#24]
- mov x8,x14
- mov x9,x15
- mov x10,x16
- mov x11,x17
-
- bl __ecp_nistz256_add_to // ret = a+a // 2*a
-
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-// void ecp_nistz256_mul_by_3(BN_ULONG x0[4],const BN_ULONG x1[4]);
-.globl ecp_nistz256_mul_by_3
-
-.def ecp_nistz256_mul_by_3
- .type 32
-.endef
-.align 4
-ecp_nistz256_mul_by_3:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- ldp x14,x15,[x1]
- ldp x16,x17,[x1,#16]
- adrp x13,Lpoly
- add x13,x13,:lo12:Lpoly
- ldr x12,[x13,#8]
- ldr x13,[x13,#24]
- mov x8,x14
- mov x9,x15
- mov x10,x16
- mov x11,x17
- mov x4,x14
- mov x5,x15
- mov x6,x16
- mov x7,x17
-
- bl __ecp_nistz256_add_to // ret = a+a // 2*a
-
- mov x8,x4
- mov x9,x5
- mov x10,x6
- mov x11,x7
-
- bl __ecp_nistz256_add_to // ret += a // 2*a+a=3*a
-
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-// void ecp_nistz256_sub(BN_ULONG x0[4],const BN_ULONG x1[4],
-// const BN_ULONG x2[4]);
-.globl ecp_nistz256_sub
-
-.def ecp_nistz256_sub
- .type 32
-.endef
-.align 4
-ecp_nistz256_sub:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- ldp x14,x15,[x1]
- ldp x16,x17,[x1,#16]
- adrp x13,Lpoly
- add x13,x13,:lo12:Lpoly
- ldr x12,[x13,#8]
- ldr x13,[x13,#24]
-
- bl __ecp_nistz256_sub_from
-
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-// void ecp_nistz256_neg(BN_ULONG x0[4],const BN_ULONG x1[4]);
-.globl ecp_nistz256_neg
-
-.def ecp_nistz256_neg
- .type 32
-.endef
-.align 4
-ecp_nistz256_neg:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- mov x2,x1
- mov x14,xzr // a = 0
- mov x15,xzr
- mov x16,xzr
- mov x17,xzr
- adrp x13,Lpoly
- add x13,x13,:lo12:Lpoly
- ldr x12,[x13,#8]
- ldr x13,[x13,#24]
-
- bl __ecp_nistz256_sub_from
-
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-// note that __ecp_nistz256_mul_mont expects a[0-3] input pre-loaded
-// to x4-x7 and b[0] - to x3
-.def __ecp_nistz256_mul_mont
- .type 32
-.endef
-.align 4
-__ecp_nistz256_mul_mont:
- mul x14,x4,x3 // a[0]*b[0]
- umulh x8,x4,x3
-
- mul x15,x5,x3 // a[1]*b[0]
- umulh x9,x5,x3
-
- mul x16,x6,x3 // a[2]*b[0]
- umulh x10,x6,x3
-
- mul x17,x7,x3 // a[3]*b[0]
- umulh x11,x7,x3
- ldr x3,[x2,#8] // b[1]
-
- adds x15,x15,x8 // accumulate high parts of multiplication
- lsl x8,x14,#32
- adcs x16,x16,x9
- lsr x9,x14,#32
- adcs x17,x17,x10
- adc x19,xzr,x11
- mov x20,xzr
- subs x10,x14,x8 // "*0xffff0001"
- sbc x11,x14,x9
- adds x14,x15,x8 // +=acc[0]<<96 and omit acc[0]
- mul x8,x4,x3 // lo(a[0]*b[i])
- adcs x15,x16,x9
- mul x9,x5,x3 // lo(a[1]*b[i])
- adcs x16,x17,x10 // +=acc[0]*0xffff0001
- mul x10,x6,x3 // lo(a[2]*b[i])
- adcs x17,x19,x11
- mul x11,x7,x3 // lo(a[3]*b[i])
- adc x19,x20,xzr
-
- adds x14,x14,x8 // accumulate low parts of multiplication
- umulh x8,x4,x3 // hi(a[0]*b[i])
- adcs x15,x15,x9
- umulh x9,x5,x3 // hi(a[1]*b[i])
- adcs x16,x16,x10
- umulh x10,x6,x3 // hi(a[2]*b[i])
- adcs x17,x17,x11
- umulh x11,x7,x3 // hi(a[3]*b[i])
- adc x19,x19,xzr
- ldr x3,[x2,#8*(1+1)] // b[1+1]
- adds x15,x15,x8 // accumulate high parts of multiplication
- lsl x8,x14,#32
- adcs x16,x16,x9
- lsr x9,x14,#32
- adcs x17,x17,x10
- adcs x19,x19,x11
- adc x20,xzr,xzr
- subs x10,x14,x8 // "*0xffff0001"
- sbc x11,x14,x9
- adds x14,x15,x8 // +=acc[0]<<96 and omit acc[0]
- mul x8,x4,x3 // lo(a[0]*b[i])
- adcs x15,x16,x9
- mul x9,x5,x3 // lo(a[1]*b[i])
- adcs x16,x17,x10 // +=acc[0]*0xffff0001
- mul x10,x6,x3 // lo(a[2]*b[i])
- adcs x17,x19,x11
- mul x11,x7,x3 // lo(a[3]*b[i])
- adc x19,x20,xzr
-
- adds x14,x14,x8 // accumulate low parts of multiplication
- umulh x8,x4,x3 // hi(a[0]*b[i])
- adcs x15,x15,x9
- umulh x9,x5,x3 // hi(a[1]*b[i])
- adcs x16,x16,x10
- umulh x10,x6,x3 // hi(a[2]*b[i])
- adcs x17,x17,x11
- umulh x11,x7,x3 // hi(a[3]*b[i])
- adc x19,x19,xzr
- ldr x3,[x2,#8*(2+1)] // b[2+1]
- adds x15,x15,x8 // accumulate high parts of multiplication
- lsl x8,x14,#32
- adcs x16,x16,x9
- lsr x9,x14,#32
- adcs x17,x17,x10
- adcs x19,x19,x11
- adc x20,xzr,xzr
- subs x10,x14,x8 // "*0xffff0001"
- sbc x11,x14,x9
- adds x14,x15,x8 // +=acc[0]<<96 and omit acc[0]
- mul x8,x4,x3 // lo(a[0]*b[i])
- adcs x15,x16,x9
- mul x9,x5,x3 // lo(a[1]*b[i])
- adcs x16,x17,x10 // +=acc[0]*0xffff0001
- mul x10,x6,x3 // lo(a[2]*b[i])
- adcs x17,x19,x11
- mul x11,x7,x3 // lo(a[3]*b[i])
- adc x19,x20,xzr
-
- adds x14,x14,x8 // accumulate low parts of multiplication
- umulh x8,x4,x3 // hi(a[0]*b[i])
- adcs x15,x15,x9
- umulh x9,x5,x3 // hi(a[1]*b[i])
- adcs x16,x16,x10
- umulh x10,x6,x3 // hi(a[2]*b[i])
- adcs x17,x17,x11
- umulh x11,x7,x3 // hi(a[3]*b[i])
- adc x19,x19,xzr
- adds x15,x15,x8 // accumulate high parts of multiplication
- lsl x8,x14,#32
- adcs x16,x16,x9
- lsr x9,x14,#32
- adcs x17,x17,x10
- adcs x19,x19,x11
- adc x20,xzr,xzr
- // last reduction
- subs x10,x14,x8 // "*0xffff0001"
- sbc x11,x14,x9
- adds x14,x15,x8 // +=acc[0]<<96 and omit acc[0]
- adcs x15,x16,x9
- adcs x16,x17,x10 // +=acc[0]*0xffff0001
- adcs x17,x19,x11
- adc x19,x20,xzr
-
- adds x8,x14,#1 // subs x8,x14,#-1 // tmp = ret-modulus
- sbcs x9,x15,x12
- sbcs x10,x16,xzr
- sbcs x11,x17,x13
- sbcs xzr,x19,xzr // did it borrow?
-
- csel x14,x14,x8,lo // ret = borrow ? ret : ret-modulus
- csel x15,x15,x9,lo
- csel x16,x16,x10,lo
- stp x14,x15,[x0]
- csel x17,x17,x11,lo
- stp x16,x17,[x0,#16]
-
- ret
-
-
-// note that __ecp_nistz256_sqr_mont expects a[0-3] input pre-loaded
-// to x4-x7
-.def __ecp_nistz256_sqr_mont
- .type 32
-.endef
-.align 4
-__ecp_nistz256_sqr_mont:
- // | | | | | |a1*a0| |
- // | | | | |a2*a0| | |
- // | |a3*a2|a3*a0| | | |
- // | | | |a2*a1| | | |
- // | | |a3*a1| | | | |
- // *| | | | | | | | 2|
- // +|a3*a3|a2*a2|a1*a1|a0*a0|
- // |--+--+--+--+--+--+--+--|
-	// |A7|A6|A5|A4|A3|A2|A1|A0|, where each Ax is a 64-bit word of the result
-	//
-	// The "can't overflow" notes below mark carries into the high part of a
-	// multiplication result, which cannot overflow because the high word of
-	// a product can never be all ones.
-
- mul x15,x5,x4 // a[1]*a[0]
- umulh x9,x5,x4
- mul x16,x6,x4 // a[2]*a[0]
- umulh x10,x6,x4
- mul x17,x7,x4 // a[3]*a[0]
- umulh x19,x7,x4
-
- adds x16,x16,x9 // accumulate high parts of multiplication
- mul x8,x6,x5 // a[2]*a[1]
- umulh x9,x6,x5
- adcs x17,x17,x10
- mul x10,x7,x5 // a[3]*a[1]
- umulh x11,x7,x5
- adc x19,x19,xzr // can't overflow
-
- mul x20,x7,x6 // a[3]*a[2]
- umulh x1,x7,x6
-
- adds x9,x9,x10 // accumulate high parts of multiplication
- mul x14,x4,x4 // a[0]*a[0]
- adc x10,x11,xzr // can't overflow
-
- adds x17,x17,x8 // accumulate low parts of multiplication
- umulh x4,x4,x4
- adcs x19,x19,x9
- mul x9,x5,x5 // a[1]*a[1]
- adcs x20,x20,x10
- umulh x5,x5,x5
- adc x1,x1,xzr // can't overflow
-
- adds x15,x15,x15 // acc[1-6]*=2
- mul x10,x6,x6 // a[2]*a[2]
- adcs x16,x16,x16
- umulh x6,x6,x6
- adcs x17,x17,x17
- mul x11,x7,x7 // a[3]*a[3]
- adcs x19,x19,x19
- umulh x7,x7,x7
- adcs x20,x20,x20
- adcs x1,x1,x1
- adc x2,xzr,xzr
-
- adds x15,x15,x4 // +a[i]*a[i]
- adcs x16,x16,x9
- adcs x17,x17,x5
- adcs x19,x19,x10
- adcs x20,x20,x6
- lsl x8,x14,#32
- adcs x1,x1,x11
- lsr x9,x14,#32
- adc x2,x2,x7
- subs x10,x14,x8 // "*0xffff0001"
- sbc x11,x14,x9
- adds x14,x15,x8 // +=acc[0]<<96 and omit acc[0]
- adcs x15,x16,x9
- lsl x8,x14,#32
- adcs x16,x17,x10 // +=acc[0]*0xffff0001
- lsr x9,x14,#32
- adc x17,x11,xzr // can't overflow
- subs x10,x14,x8 // "*0xffff0001"
- sbc x11,x14,x9
- adds x14,x15,x8 // +=acc[0]<<96 and omit acc[0]
- adcs x15,x16,x9
- lsl x8,x14,#32
- adcs x16,x17,x10 // +=acc[0]*0xffff0001
- lsr x9,x14,#32
- adc x17,x11,xzr // can't overflow
- subs x10,x14,x8 // "*0xffff0001"
- sbc x11,x14,x9
- adds x14,x15,x8 // +=acc[0]<<96 and omit acc[0]
- adcs x15,x16,x9
- lsl x8,x14,#32
- adcs x16,x17,x10 // +=acc[0]*0xffff0001
- lsr x9,x14,#32
- adc x17,x11,xzr // can't overflow
- subs x10,x14,x8 // "*0xffff0001"
- sbc x11,x14,x9
- adds x14,x15,x8 // +=acc[0]<<96 and omit acc[0]
- adcs x15,x16,x9
- adcs x16,x17,x10 // +=acc[0]*0xffff0001
- adc x17,x11,xzr // can't overflow
-
- adds x14,x14,x19 // accumulate upper half
- adcs x15,x15,x20
- adcs x16,x16,x1
- adcs x17,x17,x2
- adc x19,xzr,xzr
-
- adds x8,x14,#1 // subs x8,x14,#-1 // tmp = ret-modulus
- sbcs x9,x15,x12
- sbcs x10,x16,xzr
- sbcs x11,x17,x13
- sbcs xzr,x19,xzr // did it borrow?
-
- csel x14,x14,x8,lo // ret = borrow ? ret : ret-modulus
- csel x15,x15,x9,lo
- csel x16,x16,x10,lo
- stp x14,x15,[x0]
- csel x17,x17,x11,lo
- stp x16,x17,[x0,#16]
-
- ret
-
-
-// Note that __ecp_nistz256_add_to expects both input vectors pre-loaded to
-// x4-x7 and x8-x11. This is done because it's used in multiple
-// contexts, e.g. in multiplication by 2 and 3...
-.def __ecp_nistz256_add_to
- .type 32
-.endef
-.align 4
-__ecp_nistz256_add_to:
- adds x14,x14,x8 // ret = a+b
- adcs x15,x15,x9
- adcs x16,x16,x10
- adcs x17,x17,x11
- adc x1,xzr,xzr // zap x1
-
- adds x8,x14,#1 // subs x8,x4,#-1 // tmp = ret-modulus
- sbcs x9,x15,x12
- sbcs x10,x16,xzr
- sbcs x11,x17,x13
- sbcs xzr,x1,xzr // did subtraction borrow?
-
- csel x14,x14,x8,lo // ret = borrow ? ret : ret-modulus
- csel x15,x15,x9,lo
- csel x16,x16,x10,lo
- stp x14,x15,[x0]
- csel x17,x17,x11,lo
- stp x16,x17,[x0,#16]
-
- ret
-
-
-.def __ecp_nistz256_sub_from
- .type 32
-.endef
-.align 4
-__ecp_nistz256_sub_from:
- ldp x8,x9,[x2]
- ldp x10,x11,[x2,#16]
- subs x14,x14,x8 // ret = a-b
- sbcs x15,x15,x9
- sbcs x16,x16,x10
- sbcs x17,x17,x11
- sbc x1,xzr,xzr // zap x1
-
- subs x8,x14,#1 // adds x8,x4,#-1 // tmp = ret+modulus
- adcs x9,x15,x12
- adcs x10,x16,xzr
- adc x11,x17,x13
- cmp x1,xzr // did subtraction borrow?
-
- csel x14,x14,x8,eq // ret = borrow ? ret+modulus : ret
- csel x15,x15,x9,eq
- csel x16,x16,x10,eq
- stp x14,x15,[x0]
- csel x17,x17,x11,eq
- stp x16,x17,[x0,#16]
-
- ret
-
-
-.def __ecp_nistz256_sub_morf
- .type 32
-.endef
-.align 4
-__ecp_nistz256_sub_morf:
- ldp x8,x9,[x2]
- ldp x10,x11,[x2,#16]
- subs x14,x8,x14 // ret = b-a
- sbcs x15,x9,x15
- sbcs x16,x10,x16
- sbcs x17,x11,x17
- sbc x1,xzr,xzr // zap x1
-
- subs x8,x14,#1 // adds x8,x4,#-1 // tmp = ret+modulus
- adcs x9,x15,x12
- adcs x10,x16,xzr
- adc x11,x17,x13
- cmp x1,xzr // did subtraction borrow?
-
- csel x14,x14,x8,eq // ret = borrow ? ret+modulus : ret
- csel x15,x15,x9,eq
- csel x16,x16,x10,eq
- stp x14,x15,[x0]
- csel x17,x17,x11,eq
- stp x16,x17,[x0,#16]
-
- ret
-
-
-.def __ecp_nistz256_div_by_2
- .type 32
-.endef
-.align 4
-__ecp_nistz256_div_by_2:
- subs x8,x14,#1 // adds x8,x4,#-1 // tmp = a+modulus
- adcs x9,x15,x12
- adcs x10,x16,xzr
- adcs x11,x17,x13
- adc x1,xzr,xzr // zap x1
- tst x14,#1 // is a even?
-
- csel x14,x14,x8,eq // ret = even ? a : a+modulus
- csel x15,x15,x9,eq
- csel x16,x16,x10,eq
- csel x17,x17,x11,eq
- csel x1,xzr,x1,eq
-
- lsr x14,x14,#1 // ret >>= 1
- orr x14,x14,x15,lsl#63
- lsr x15,x15,#1
- orr x15,x15,x16,lsl#63
- lsr x16,x16,#1
- orr x16,x16,x17,lsl#63
- lsr x17,x17,#1
- stp x14,x15,[x0]
- orr x17,x17,x1,lsl#63
- stp x16,x17,[x0,#16]
-
- ret
-
-.globl ecp_nistz256_point_double
-
-.def ecp_nistz256_point_double
- .type 32
-.endef
-.align 5
-ecp_nistz256_point_double:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-96]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- sub sp,sp,#32*4
-
-Ldouble_shortcut:
- ldp x14,x15,[x1,#32]
- mov x21,x0
- ldp x16,x17,[x1,#48]
- mov x22,x1
- adrp x13,Lpoly
- add x13,x13,:lo12:Lpoly
- ldr x12,[x13,#8]
- mov x8,x14
- ldr x13,[x13,#24]
- mov x9,x15
- ldp x4,x5,[x22,#64] // forward load for p256_sqr_mont
- mov x10,x16
- mov x11,x17
- ldp x6,x7,[x22,#64+16]
- add x0,sp,#0
- bl __ecp_nistz256_add_to // p256_mul_by_2(S, in_y);
-
- add x0,sp,#64
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(Zsqr, in_z);
-
- ldp x8,x9,[x22]
- ldp x10,x11,[x22,#16]
- mov x4,x14 // put Zsqr aside for p256_sub
- mov x5,x15
- mov x6,x16
- mov x7,x17
- add x0,sp,#32
- bl __ecp_nistz256_add_to // p256_add(M, Zsqr, in_x);
-
- add x2,x22,#0
- mov x14,x4 // restore Zsqr
- mov x15,x5
- ldp x4,x5,[sp,#0] // forward load for p256_sqr_mont
- mov x16,x6
- mov x17,x7
- ldp x6,x7,[sp,#0+16]
- add x0,sp,#64
- bl __ecp_nistz256_sub_morf // p256_sub(Zsqr, in_x, Zsqr);
-
- add x0,sp,#0
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(S, S);
-
- ldr x3,[x22,#32]
- ldp x4,x5,[x22,#64]
- ldp x6,x7,[x22,#64+16]
- add x2,x22,#32
- add x0,sp,#96
- bl __ecp_nistz256_mul_mont // p256_mul_mont(tmp0, in_z, in_y);
-
- mov x8,x14
- mov x9,x15
- ldp x4,x5,[sp,#0] // forward load for p256_sqr_mont
- mov x10,x16
- mov x11,x17
- ldp x6,x7,[sp,#0+16]
- add x0,x21,#64
- bl __ecp_nistz256_add_to // p256_mul_by_2(res_z, tmp0);
-
- add x0,sp,#96
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(tmp0, S);
-
- ldr x3,[sp,#64] // forward load for p256_mul_mont
- ldp x4,x5,[sp,#32]
- ldp x6,x7,[sp,#32+16]
- add x0,x21,#32
- bl __ecp_nistz256_div_by_2 // p256_div_by_2(res_y, tmp0);
-
- add x2,sp,#64
- add x0,sp,#32
- bl __ecp_nistz256_mul_mont // p256_mul_mont(M, M, Zsqr);
-
- mov x8,x14 // duplicate M
- mov x9,x15
- mov x10,x16
- mov x11,x17
- mov x4,x14 // put M aside
- mov x5,x15
- mov x6,x16
- mov x7,x17
- add x0,sp,#32
- bl __ecp_nistz256_add_to
- mov x8,x4 // restore M
- mov x9,x5
- ldr x3,[x22] // forward load for p256_mul_mont
- mov x10,x6
- ldp x4,x5,[sp,#0]
- mov x11,x7
- ldp x6,x7,[sp,#0+16]
- bl __ecp_nistz256_add_to // p256_mul_by_3(M, M);
-
- add x2,x22,#0
- add x0,sp,#0
- bl __ecp_nistz256_mul_mont // p256_mul_mont(S, S, in_x);
-
- mov x8,x14
- mov x9,x15
- ldp x4,x5,[sp,#32] // forward load for p256_sqr_mont
- mov x10,x16
- mov x11,x17
- ldp x6,x7,[sp,#32+16]
- add x0,sp,#96
- bl __ecp_nistz256_add_to // p256_mul_by_2(tmp0, S);
-
- add x0,x21,#0
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(res_x, M);
-
- add x2,sp,#96
- bl __ecp_nistz256_sub_from // p256_sub(res_x, res_x, tmp0);
-
- add x2,sp,#0
- add x0,sp,#0
- bl __ecp_nistz256_sub_morf // p256_sub(S, S, res_x);
-
- ldr x3,[sp,#32]
- mov x4,x14 // copy S
- mov x5,x15
- mov x6,x16
- mov x7,x17
- add x2,sp,#32
- bl __ecp_nistz256_mul_mont // p256_mul_mont(S, S, M);
-
- add x2,x21,#32
- add x0,x21,#32
- bl __ecp_nistz256_sub_from // p256_sub(res_y, S, res_y);
-
- add sp,x29,#0 // destroy frame
- ldp x19,x20,[x29,#16]
- ldp x21,x22,[x29,#32]
- ldp x29,x30,[sp],#96
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-.globl ecp_nistz256_point_add
-
-.def ecp_nistz256_point_add
- .type 32
-.endef
-.align 5
-ecp_nistz256_point_add:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-96]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- stp x27,x28,[sp,#80]
- sub sp,sp,#32*12
-
- ldp x4,x5,[x2,#64] // in2_z
- ldp x6,x7,[x2,#64+16]
- mov x21,x0
- mov x22,x1
- mov x23,x2
- adrp x13,Lpoly
- add x13,x13,:lo12:Lpoly
- ldr x12,[x13,#8]
- ldr x13,[x13,#24]
- orr x8,x4,x5
- orr x10,x6,x7
- orr x25,x8,x10
- cmp x25,#0
- csetm x25,ne // ~in2infty
- add x0,sp,#192
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(Z2sqr, in2_z);
-
- ldp x4,x5,[x22,#64] // in1_z
- ldp x6,x7,[x22,#64+16]
- orr x8,x4,x5
- orr x10,x6,x7
- orr x24,x8,x10
- cmp x24,#0
- csetm x24,ne // ~in1infty
- add x0,sp,#128
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(Z1sqr, in1_z);
-
- ldr x3,[x23,#64]
- ldp x4,x5,[sp,#192]
- ldp x6,x7,[sp,#192+16]
- add x2,x23,#64
- add x0,sp,#320
- bl __ecp_nistz256_mul_mont // p256_mul_mont(S1, Z2sqr, in2_z);
-
- ldr x3,[x22,#64]
- ldp x4,x5,[sp,#128]
- ldp x6,x7,[sp,#128+16]
- add x2,x22,#64
- add x0,sp,#352
- bl __ecp_nistz256_mul_mont // p256_mul_mont(S2, Z1sqr, in1_z);
-
- ldr x3,[x22,#32]
- ldp x4,x5,[sp,#320]
- ldp x6,x7,[sp,#320+16]
- add x2,x22,#32
- add x0,sp,#320
- bl __ecp_nistz256_mul_mont // p256_mul_mont(S1, S1, in1_y);
-
- ldr x3,[x23,#32]
- ldp x4,x5,[sp,#352]
- ldp x6,x7,[sp,#352+16]
- add x2,x23,#32
- add x0,sp,#352
- bl __ecp_nistz256_mul_mont // p256_mul_mont(S2, S2, in2_y);
-
- add x2,sp,#320
- ldr x3,[sp,#192] // forward load for p256_mul_mont
- ldp x4,x5,[x22]
- ldp x6,x7,[x22,#16]
- add x0,sp,#160
- bl __ecp_nistz256_sub_from // p256_sub(R, S2, S1);
-
- orr x14,x14,x15 // see if result is zero
- orr x16,x16,x17
- orr x26,x14,x16 // ~is_equal(S1,S2)
-
- add x2,sp,#192
- add x0,sp,#256
- bl __ecp_nistz256_mul_mont // p256_mul_mont(U1, in1_x, Z2sqr);
-
- ldr x3,[sp,#128]
- ldp x4,x5,[x23]
- ldp x6,x7,[x23,#16]
- add x2,sp,#128
- add x0,sp,#288
- bl __ecp_nistz256_mul_mont // p256_mul_mont(U2, in2_x, Z1sqr);
-
- add x2,sp,#256
- ldp x4,x5,[sp,#160] // forward load for p256_sqr_mont
- ldp x6,x7,[sp,#160+16]
- add x0,sp,#96
- bl __ecp_nistz256_sub_from // p256_sub(H, U2, U1);
-
- orr x14,x14,x15 // see if result is zero
- orr x16,x16,x17
- orr x14,x14,x16 // ~is_equal(U1,U2)
-
- mvn x27,x24 // -1/0 -> 0/-1
- mvn x28,x25 // -1/0 -> 0/-1
- orr x14,x14,x27
- orr x14,x14,x28
- orr x14,x14,x26
- cbnz x14,Ladd_proceed // if(~is_equal(U1,U2) | in1infty | in2infty | ~is_equal(S1,S2))
-
-Ladd_double:
- mov x1,x22
- mov x0,x21
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- add sp,sp,#256 // #256 is from #32*(12-4). difference in stack frames
- b Ldouble_shortcut
-
-.align 4
-Ladd_proceed:
- add x0,sp,#192
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(Rsqr, R);
-
- ldr x3,[x22,#64]
- ldp x4,x5,[sp,#96]
- ldp x6,x7,[sp,#96+16]
- add x2,x22,#64
- add x0,sp,#64
- bl __ecp_nistz256_mul_mont // p256_mul_mont(res_z, H, in1_z);
-
- ldp x4,x5,[sp,#96]
- ldp x6,x7,[sp,#96+16]
- add x0,sp,#128
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(Hsqr, H);
-
- ldr x3,[x23,#64]
- ldp x4,x5,[sp,#64]
- ldp x6,x7,[sp,#64+16]
- add x2,x23,#64
- add x0,sp,#64
- bl __ecp_nistz256_mul_mont // p256_mul_mont(res_z, res_z, in2_z);
-
- ldr x3,[sp,#96]
- ldp x4,x5,[sp,#128]
- ldp x6,x7,[sp,#128+16]
- add x2,sp,#96
- add x0,sp,#224
- bl __ecp_nistz256_mul_mont // p256_mul_mont(Hcub, Hsqr, H);
-
- ldr x3,[sp,#128]
- ldp x4,x5,[sp,#256]
- ldp x6,x7,[sp,#256+16]
- add x2,sp,#128
- add x0,sp,#288
- bl __ecp_nistz256_mul_mont // p256_mul_mont(U2, U1, Hsqr);
-
- mov x8,x14
- mov x9,x15
- mov x10,x16
- mov x11,x17
- add x0,sp,#128
- bl __ecp_nistz256_add_to // p256_mul_by_2(Hsqr, U2);
-
- add x2,sp,#192
- add x0,sp,#0
- bl __ecp_nistz256_sub_morf // p256_sub(res_x, Rsqr, Hsqr);
-
- add x2,sp,#224
- bl __ecp_nistz256_sub_from // p256_sub(res_x, res_x, Hcub);
-
- add x2,sp,#288
- ldr x3,[sp,#224] // forward load for p256_mul_mont
- ldp x4,x5,[sp,#320]
- ldp x6,x7,[sp,#320+16]
- add x0,sp,#32
- bl __ecp_nistz256_sub_morf // p256_sub(res_y, U2, res_x);
-
- add x2,sp,#224
- add x0,sp,#352
- bl __ecp_nistz256_mul_mont // p256_mul_mont(S2, S1, Hcub);
-
- ldr x3,[sp,#160]
- ldp x4,x5,[sp,#32]
- ldp x6,x7,[sp,#32+16]
- add x2,sp,#160
- add x0,sp,#32
- bl __ecp_nistz256_mul_mont // p256_mul_mont(res_y, res_y, R);
-
- add x2,sp,#352
- bl __ecp_nistz256_sub_from // p256_sub(res_y, res_y, S2);
-
- ldp x4,x5,[sp,#0] // res
- ldp x6,x7,[sp,#0+16]
- ldp x8,x9,[x23] // in2
- ldp x10,x11,[x23,#16]
- ldp x14,x15,[x22,#0] // in1
- cmp x24,#0 // ~, remember?
- ldp x16,x17,[x22,#0+16]
- csel x8,x4,x8,ne
- csel x9,x5,x9,ne
- ldp x4,x5,[sp,#0+0+32] // res
- csel x10,x6,x10,ne
- csel x11,x7,x11,ne
- cmp x25,#0 // ~, remember?
- ldp x6,x7,[sp,#0+0+48]
- csel x14,x8,x14,ne
- csel x15,x9,x15,ne
- ldp x8,x9,[x23,#0+32] // in2
- csel x16,x10,x16,ne
- csel x17,x11,x17,ne
- ldp x10,x11,[x23,#0+48]
- stp x14,x15,[x21,#0]
- stp x16,x17,[x21,#0+16]
- ldp x14,x15,[x22,#32] // in1
- cmp x24,#0 // ~, remember?
- ldp x16,x17,[x22,#32+16]
- csel x8,x4,x8,ne
- csel x9,x5,x9,ne
- ldp x4,x5,[sp,#0+32+32] // res
- csel x10,x6,x10,ne
- csel x11,x7,x11,ne
- cmp x25,#0 // ~, remember?
- ldp x6,x7,[sp,#0+32+48]
- csel x14,x8,x14,ne
- csel x15,x9,x15,ne
- ldp x8,x9,[x23,#32+32] // in2
- csel x16,x10,x16,ne
- csel x17,x11,x17,ne
- ldp x10,x11,[x23,#32+48]
- stp x14,x15,[x21,#32]
- stp x16,x17,[x21,#32+16]
- ldp x14,x15,[x22,#64] // in1
- cmp x24,#0 // ~, remember?
- ldp x16,x17,[x22,#64+16]
- csel x8,x4,x8,ne
- csel x9,x5,x9,ne
- csel x10,x6,x10,ne
- csel x11,x7,x11,ne
- cmp x25,#0 // ~, remember?
- csel x14,x8,x14,ne
- csel x15,x9,x15,ne
- csel x16,x10,x16,ne
- csel x17,x11,x17,ne
- stp x14,x15,[x21,#64]
- stp x16,x17,[x21,#64+16]
-
-Ladd_done:
- add sp,x29,#0 // destroy frame
- ldp x19,x20,[x29,#16]
- ldp x21,x22,[x29,#32]
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- ldp x29,x30,[sp],#96
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-.globl ecp_nistz256_point_add_affine
-
-.def ecp_nistz256_point_add_affine
- .type 32
-.endef
-.align 5
-ecp_nistz256_point_add_affine:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-80]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- sub sp,sp,#32*10
-
- mov x21,x0
- mov x22,x1
- mov x23,x2
- adrp x13,Lpoly
- add x13,x13,:lo12:Lpoly
- ldr x12,[x13,#8]
- ldr x13,[x13,#24]
-
- ldp x4,x5,[x1,#64] // in1_z
- ldp x6,x7,[x1,#64+16]
- orr x8,x4,x5
- orr x10,x6,x7
- orr x24,x8,x10
- cmp x24,#0
- csetm x24,ne // ~in1infty
-
- ldp x14,x15,[x2] // in2_x
- ldp x16,x17,[x2,#16]
- ldp x8,x9,[x2,#32] // in2_y
- ldp x10,x11,[x2,#48]
- orr x14,x14,x15
- orr x16,x16,x17
- orr x8,x8,x9
- orr x10,x10,x11
- orr x14,x14,x16
- orr x8,x8,x10
- orr x25,x14,x8
- cmp x25,#0
- csetm x25,ne // ~in2infty
-
- add x0,sp,#128
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(Z1sqr, in1_z);
-
- mov x4,x14
- mov x5,x15
- mov x6,x16
- mov x7,x17
- ldr x3,[x23]
- add x2,x23,#0
- add x0,sp,#96
- bl __ecp_nistz256_mul_mont // p256_mul_mont(U2, Z1sqr, in2_x);
-
- add x2,x22,#0
- ldr x3,[x22,#64] // forward load for p256_mul_mont
- ldp x4,x5,[sp,#128]
- ldp x6,x7,[sp,#128+16]
- add x0,sp,#160
- bl __ecp_nistz256_sub_from // p256_sub(H, U2, in1_x);
-
- add x2,x22,#64
- add x0,sp,#128
- bl __ecp_nistz256_mul_mont // p256_mul_mont(S2, Z1sqr, in1_z);
-
- ldr x3,[x22,#64]
- ldp x4,x5,[sp,#160]
- ldp x6,x7,[sp,#160+16]
- add x2,x22,#64
- add x0,sp,#64
- bl __ecp_nistz256_mul_mont // p256_mul_mont(res_z, H, in1_z);
-
- ldr x3,[x23,#32]
- ldp x4,x5,[sp,#128]
- ldp x6,x7,[sp,#128+16]
- add x2,x23,#32
- add x0,sp,#128
- bl __ecp_nistz256_mul_mont // p256_mul_mont(S2, S2, in2_y);
-
- add x2,x22,#32
- ldp x4,x5,[sp,#160] // forward load for p256_sqr_mont
- ldp x6,x7,[sp,#160+16]
- add x0,sp,#192
- bl __ecp_nistz256_sub_from // p256_sub(R, S2, in1_y);
-
- add x0,sp,#224
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(Hsqr, H);
-
- ldp x4,x5,[sp,#192]
- ldp x6,x7,[sp,#192+16]
- add x0,sp,#288
- bl __ecp_nistz256_sqr_mont // p256_sqr_mont(Rsqr, R);
-
- ldr x3,[sp,#160]
- ldp x4,x5,[sp,#224]
- ldp x6,x7,[sp,#224+16]
- add x2,sp,#160
- add x0,sp,#256
- bl __ecp_nistz256_mul_mont // p256_mul_mont(Hcub, Hsqr, H);
-
- ldr x3,[x22]
- ldp x4,x5,[sp,#224]
- ldp x6,x7,[sp,#224+16]
- add x2,x22,#0
- add x0,sp,#96
- bl __ecp_nistz256_mul_mont // p256_mul_mont(U2, in1_x, Hsqr);
-
- mov x8,x14
- mov x9,x15
- mov x10,x16
- mov x11,x17
- add x0,sp,#224
- bl __ecp_nistz256_add_to // p256_mul_by_2(Hsqr, U2);
-
- add x2,sp,#288
- add x0,sp,#0
- bl __ecp_nistz256_sub_morf // p256_sub(res_x, Rsqr, Hsqr);
-
- add x2,sp,#256
- bl __ecp_nistz256_sub_from // p256_sub(res_x, res_x, Hcub);
-
- add x2,sp,#96
- ldr x3,[x22,#32] // forward load for p256_mul_mont
- ldp x4,x5,[sp,#256]
- ldp x6,x7,[sp,#256+16]
- add x0,sp,#32
- bl __ecp_nistz256_sub_morf // p256_sub(res_y, U2, res_x);
-
- add x2,x22,#32
- add x0,sp,#128
- bl __ecp_nistz256_mul_mont // p256_mul_mont(S2, in1_y, Hcub);
-
- ldr x3,[sp,#192]
- ldp x4,x5,[sp,#32]
- ldp x6,x7,[sp,#32+16]
- add x2,sp,#192
- add x0,sp,#32
- bl __ecp_nistz256_mul_mont // p256_mul_mont(res_y, res_y, R);
-
- add x2,sp,#128
- bl __ecp_nistz256_sub_from // p256_sub(res_y, res_y, S2);
-
- ldp x4,x5,[sp,#0] // res
- ldp x6,x7,[sp,#0+16]
- ldp x8,x9,[x23] // in2
- ldp x10,x11,[x23,#16]
- ldp x14,x15,[x22,#0] // in1
- cmp x24,#0 // ~, remember?
- ldp x16,x17,[x22,#0+16]
- csel x8,x4,x8,ne
- csel x9,x5,x9,ne
- ldp x4,x5,[sp,#0+0+32] // res
- csel x10,x6,x10,ne
- csel x11,x7,x11,ne
- cmp x25,#0 // ~, remember?
- ldp x6,x7,[sp,#0+0+48]
- csel x14,x8,x14,ne
- csel x15,x9,x15,ne
- ldp x8,x9,[x23,#0+32] // in2
- csel x16,x10,x16,ne
- csel x17,x11,x17,ne
- ldp x10,x11,[x23,#0+48]
- stp x14,x15,[x21,#0]
- stp x16,x17,[x21,#0+16]
- adrp x23,Lone_mont-64
- add x23,x23,:lo12:Lone_mont-64
- ldp x14,x15,[x22,#32] // in1
- cmp x24,#0 // ~, remember?
- ldp x16,x17,[x22,#32+16]
- csel x8,x4,x8,ne
- csel x9,x5,x9,ne
- ldp x4,x5,[sp,#0+32+32] // res
- csel x10,x6,x10,ne
- csel x11,x7,x11,ne
- cmp x25,#0 // ~, remember?
- ldp x6,x7,[sp,#0+32+48]
- csel x14,x8,x14,ne
- csel x15,x9,x15,ne
- ldp x8,x9,[x23,#32+32] // in2
- csel x16,x10,x16,ne
- csel x17,x11,x17,ne
- ldp x10,x11,[x23,#32+48]
- stp x14,x15,[x21,#32]
- stp x16,x17,[x21,#32+16]
- ldp x14,x15,[x22,#64] // in1
- cmp x24,#0 // ~, remember?
- ldp x16,x17,[x22,#64+16]
- csel x8,x4,x8,ne
- csel x9,x5,x9,ne
- csel x10,x6,x10,ne
- csel x11,x7,x11,ne
- cmp x25,#0 // ~, remember?
- csel x14,x8,x14,ne
- csel x15,x9,x15,ne
- csel x16,x10,x16,ne
- csel x17,x11,x17,ne
- stp x14,x15,[x21,#64]
- stp x16,x17,[x21,#64+16]
-
- add sp,x29,#0 // destroy frame
- ldp x19,x20,[x29,#16]
- ldp x21,x22,[x29,#32]
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x29,x30,[sp],#80
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-////////////////////////////////////////////////////////////////////////
-// void ecp_nistz256_ord_mul_mont(uint64_t res[4], uint64_t a[4],
-// uint64_t b[4]);
-.globl ecp_nistz256_ord_mul_mont
-
-.def ecp_nistz256_ord_mul_mont
- .type 32
-.endef
-.align 4
-ecp_nistz256_ord_mul_mont:
- AARCH64_VALID_CALL_TARGET
- // Armv8.3-A PAuth: even though x30 is pushed to stack it is not popped later.
- stp x29,x30,[sp,#-64]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
-
- adrp x23,Lord
- add x23,x23,:lo12:Lord
- ldr x3,[x2] // bp[0]
- ldp x4,x5,[x1]
- ldp x6,x7,[x1,#16]
-
- ldp x12,x13,[x23,#0]
- ldp x21,x22,[x23,#16]
- ldr x23,[x23,#32]
-
- mul x14,x4,x3 // a[0]*b[0]
- umulh x8,x4,x3
-
- mul x15,x5,x3 // a[1]*b[0]
- umulh x9,x5,x3
-
- mul x16,x6,x3 // a[2]*b[0]
- umulh x10,x6,x3
-
- mul x17,x7,x3 // a[3]*b[0]
- umulh x19,x7,x3
-
- mul x24,x14,x23
-
- adds x15,x15,x8 // accumulate high parts of multiplication
- adcs x16,x16,x9
- adcs x17,x17,x10
- adc x19,x19,xzr
- mov x20,xzr
- ldr x3,[x2,#8*1] // b[i]
-
- lsl x8,x24,#32
- subs x16,x16,x24
- lsr x9,x24,#32
- sbcs x17,x17,x8
- sbcs x19,x19,x9
- sbc x20,x20,xzr
-
- subs xzr,x14,#1
- umulh x9,x12,x24
- mul x10,x13,x24
- umulh x11,x13,x24
-
- adcs x10,x10,x9
- mul x8,x4,x3
- adc x11,x11,xzr
- mul x9,x5,x3
-
- adds x14,x15,x10
- mul x10,x6,x3
- adcs x15,x16,x11
- mul x11,x7,x3
- adcs x16,x17,x24
- adcs x17,x19,x24
- adc x19,x20,xzr
-
- adds x14,x14,x8 // accumulate low parts
- umulh x8,x4,x3
- adcs x15,x15,x9
- umulh x9,x5,x3
- adcs x16,x16,x10
- umulh x10,x6,x3
- adcs x17,x17,x11
- umulh x11,x7,x3
- adc x19,x19,xzr
- mul x24,x14,x23
- adds x15,x15,x8 // accumulate high parts
- adcs x16,x16,x9
- adcs x17,x17,x10
- adcs x19,x19,x11
- adc x20,xzr,xzr
- ldr x3,[x2,#8*2] // b[i]
-
- lsl x8,x24,#32
- subs x16,x16,x24
- lsr x9,x24,#32
- sbcs x17,x17,x8
- sbcs x19,x19,x9
- sbc x20,x20,xzr
-
- subs xzr,x14,#1
- umulh x9,x12,x24
- mul x10,x13,x24
- umulh x11,x13,x24
-
- adcs x10,x10,x9
- mul x8,x4,x3
- adc x11,x11,xzr
- mul x9,x5,x3
-
- adds x14,x15,x10
- mul x10,x6,x3
- adcs x15,x16,x11
- mul x11,x7,x3
- adcs x16,x17,x24
- adcs x17,x19,x24
- adc x19,x20,xzr
-
- adds x14,x14,x8 // accumulate low parts
- umulh x8,x4,x3
- adcs x15,x15,x9
- umulh x9,x5,x3
- adcs x16,x16,x10
- umulh x10,x6,x3
- adcs x17,x17,x11
- umulh x11,x7,x3
- adc x19,x19,xzr
- mul x24,x14,x23
- adds x15,x15,x8 // accumulate high parts
- adcs x16,x16,x9
- adcs x17,x17,x10
- adcs x19,x19,x11
- adc x20,xzr,xzr
- ldr x3,[x2,#8*3] // b[i]
-
- lsl x8,x24,#32
- subs x16,x16,x24
- lsr x9,x24,#32
- sbcs x17,x17,x8
- sbcs x19,x19,x9
- sbc x20,x20,xzr
-
- subs xzr,x14,#1
- umulh x9,x12,x24
- mul x10,x13,x24
- umulh x11,x13,x24
-
- adcs x10,x10,x9
- mul x8,x4,x3
- adc x11,x11,xzr
- mul x9,x5,x3
-
- adds x14,x15,x10
- mul x10,x6,x3
- adcs x15,x16,x11
- mul x11,x7,x3
- adcs x16,x17,x24
- adcs x17,x19,x24
- adc x19,x20,xzr
-
- adds x14,x14,x8 // accumulate low parts
- umulh x8,x4,x3
- adcs x15,x15,x9
- umulh x9,x5,x3
- adcs x16,x16,x10
- umulh x10,x6,x3
- adcs x17,x17,x11
- umulh x11,x7,x3
- adc x19,x19,xzr
- mul x24,x14,x23
- adds x15,x15,x8 // accumulate high parts
- adcs x16,x16,x9
- adcs x17,x17,x10
- adcs x19,x19,x11
- adc x20,xzr,xzr
- lsl x8,x24,#32 // last reduction
- subs x16,x16,x24
- lsr x9,x24,#32
- sbcs x17,x17,x8
- sbcs x19,x19,x9
- sbc x20,x20,xzr
-
- subs xzr,x14,#1
- umulh x9,x12,x24
- mul x10,x13,x24
- umulh x11,x13,x24
-
- adcs x10,x10,x9
- adc x11,x11,xzr
-
- adds x14,x15,x10
- adcs x15,x16,x11
- adcs x16,x17,x24
- adcs x17,x19,x24
- adc x19,x20,xzr
-
- subs x8,x14,x12 // ret -= modulus
- sbcs x9,x15,x13
- sbcs x10,x16,x21
- sbcs x11,x17,x22
- sbcs xzr,x19,xzr
-
- csel x14,x14,x8,lo // ret = borrow ? ret : ret-modulus
- csel x15,x15,x9,lo
- csel x16,x16,x10,lo
- stp x14,x15,[x0]
- csel x17,x17,x11,lo
- stp x16,x17,[x0,#16]
-
- ldp x19,x20,[sp,#16]
- ldp x21,x22,[sp,#32]
- ldp x23,x24,[sp,#48]
- ldr x29,[sp],#64
- ret
-
-
-////////////////////////////////////////////////////////////////////////
-// void ecp_nistz256_ord_sqr_mont(uint64_t res[4], uint64_t a[4],
-// uint64_t rep);
-.globl ecp_nistz256_ord_sqr_mont
-
-.def ecp_nistz256_ord_sqr_mont
- .type 32
-.endef
-.align 4
-ecp_nistz256_ord_sqr_mont:
- AARCH64_VALID_CALL_TARGET
- // Armv8.3-A PAuth: even though x30 is pushed to stack it is not popped later.
- stp x29,x30,[sp,#-64]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
-
- adrp x23,Lord
- add x23,x23,:lo12:Lord
- ldp x4,x5,[x1]
- ldp x6,x7,[x1,#16]
-
- ldp x12,x13,[x23,#0]
- ldp x21,x22,[x23,#16]
- ldr x23,[x23,#32]
- b Loop_ord_sqr
-
-.align 4
-Loop_ord_sqr:
- sub x2,x2,#1
- ////////////////////////////////////////////////////////////////
- // | | | | | |a1*a0| |
- // | | | | |a2*a0| | |
- // | |a3*a2|a3*a0| | | |
- // | | | |a2*a1| | | |
- // | | |a3*a1| | | | |
- // *| | | | | | | | 2|
- // +|a3*a3|a2*a2|a1*a1|a0*a0|
- // |--+--+--+--+--+--+--+--|
-	// |A7|A6|A5|A4|A3|A2|A1|A0|, where each Ax is a 64-bit word of the result
-	//
-	// The "can't overflow" notes below mark carries into the high part of a
-	// multiplication result, which cannot overflow because the high word of
-	// a product can never be all ones.
-
- mul x15,x5,x4 // a[1]*a[0]
- umulh x9,x5,x4
- mul x16,x6,x4 // a[2]*a[0]
- umulh x10,x6,x4
- mul x17,x7,x4 // a[3]*a[0]
- umulh x19,x7,x4
-
- adds x16,x16,x9 // accumulate high parts of multiplication
- mul x8,x6,x5 // a[2]*a[1]
- umulh x9,x6,x5
- adcs x17,x17,x10
- mul x10,x7,x5 // a[3]*a[1]
- umulh x11,x7,x5
- adc x19,x19,xzr // can't overflow
-
- mul x20,x7,x6 // a[3]*a[2]
- umulh x1,x7,x6
-
- adds x9,x9,x10 // accumulate high parts of multiplication
- mul x14,x4,x4 // a[0]*a[0]
- adc x10,x11,xzr // can't overflow
-
- adds x17,x17,x8 // accumulate low parts of multiplication
- umulh x4,x4,x4
- adcs x19,x19,x9
- mul x9,x5,x5 // a[1]*a[1]
- adcs x20,x20,x10
- umulh x5,x5,x5
- adc x1,x1,xzr // can't overflow
-
- adds x15,x15,x15 // acc[1-6]*=2
- mul x10,x6,x6 // a[2]*a[2]
- adcs x16,x16,x16
- umulh x6,x6,x6
- adcs x17,x17,x17
- mul x11,x7,x7 // a[3]*a[3]
- adcs x19,x19,x19
- umulh x7,x7,x7
- adcs x20,x20,x20
- adcs x1,x1,x1
- adc x3,xzr,xzr
-
- adds x15,x15,x4 // +a[i]*a[i]
- mul x24,x14,x23
- adcs x16,x16,x9
- adcs x17,x17,x5
- adcs x19,x19,x10
- adcs x20,x20,x6
- adcs x1,x1,x11
- adc x3,x3,x7
- subs xzr,x14,#1
- umulh x9,x12,x24
- mul x10,x13,x24
- umulh x11,x13,x24
-
- adcs x10,x10,x9
- adc x11,x11,xzr
-
- adds x14,x15,x10
- adcs x15,x16,x11
- adcs x16,x17,x24
- adc x17,xzr,x24 // can't overflow
- mul x11,x14,x23
- lsl x8,x24,#32
- subs x15,x15,x24
- lsr x9,x24,#32
- sbcs x16,x16,x8
- sbc x17,x17,x9 // can't borrow
- subs xzr,x14,#1
- umulh x9,x12,x11
- mul x10,x13,x11
- umulh x24,x13,x11
-
- adcs x10,x10,x9
- adc x24,x24,xzr
-
- adds x14,x15,x10
- adcs x15,x16,x24
- adcs x16,x17,x11
- adc x17,xzr,x11 // can't overflow
- mul x24,x14,x23
- lsl x8,x11,#32
- subs x15,x15,x11
- lsr x9,x11,#32
- sbcs x16,x16,x8
- sbc x17,x17,x9 // can't borrow
- subs xzr,x14,#1
- umulh x9,x12,x24
- mul x10,x13,x24
- umulh x11,x13,x24
-
- adcs x10,x10,x9
- adc x11,x11,xzr
-
- adds x14,x15,x10
- adcs x15,x16,x11
- adcs x16,x17,x24
- adc x17,xzr,x24 // can't overflow
- mul x11,x14,x23
- lsl x8,x24,#32
- subs x15,x15,x24
- lsr x9,x24,#32
- sbcs x16,x16,x8
- sbc x17,x17,x9 // can't borrow
- subs xzr,x14,#1
- umulh x9,x12,x11
- mul x10,x13,x11
- umulh x24,x13,x11
-
- adcs x10,x10,x9
- adc x24,x24,xzr
-
- adds x14,x15,x10
- adcs x15,x16,x24
- adcs x16,x17,x11
- adc x17,xzr,x11 // can't overflow
- lsl x8,x11,#32
- subs x15,x15,x11
- lsr x9,x11,#32
- sbcs x16,x16,x8
- sbc x17,x17,x9 // can't borrow
- adds x14,x14,x19 // accumulate upper half
- adcs x15,x15,x20
- adcs x16,x16,x1
- adcs x17,x17,x3
- adc x19,xzr,xzr
-
- subs x8,x14,x12 // ret -= modulus
- sbcs x9,x15,x13
- sbcs x10,x16,x21
- sbcs x11,x17,x22
- sbcs xzr,x19,xzr
-
- csel x4,x14,x8,lo // ret = borrow ? ret : ret-modulus
- csel x5,x15,x9,lo
- csel x6,x16,x10,lo
- csel x7,x17,x11,lo
-
- cbnz x2,Loop_ord_sqr
-
- stp x4,x5,[x0]
- stp x6,x7,[x0,#16]
-
- ldp x19,x20,[sp,#16]
- ldp x21,x22,[sp,#32]
- ldp x23,x24,[sp,#48]
- ldr x29,[sp],#64
- ret
-
-////////////////////////////////////////////////////////////////////////
-// void ecp_nistz256_select_w5(uint64_t *val, uint64_t *in_t, int index);
-.globl ecp_nistz256_select_w5
-
-.def ecp_nistz256_select_w5
- .type 32
-.endef
-.align 4
-ecp_nistz256_select_w5:
- AARCH64_VALID_CALL_TARGET
-
- // x10 := x0
- // w9 := 0; loop counter and incremented internal index
- mov x10, x0
- mov w9, #0
-
- // [v16-v21] := 0
- movi v16.16b, #0
- movi v17.16b, #0
- movi v18.16b, #0
- movi v19.16b, #0
- movi v20.16b, #0
- movi v21.16b, #0
-
-Lselect_w5_loop:
- // Loop 16 times.
-
- // Increment index (loop counter); tested at the end of the loop
- add w9, w9, #1
-
- // [v22-v27] := Load a (3*256-bit = 6*128-bit) table entry starting at x1
- // and advance x1 to point to the next entry
- ld1 {v22.2d, v23.2d, v24.2d, v25.2d}, [x1],#64
-
- // x11 := (w9 == w2)? All 1s : All 0s
- cmp w9, w2
- csetm x11, eq
-
- // continue loading ...
- ld1 {v26.2d, v27.2d}, [x1],#32
-
- // duplicate mask_64 into Mask (all 0s or all 1s)
- dup v3.2d, x11
-
-	// [v16-v21] := (Mask == all 1s)? [v22-v27] : [v16-v21]
- // i.e., values in output registers will remain the same if w9 != w2
- bit v16.16b, v22.16b, v3.16b
- bit v17.16b, v23.16b, v3.16b
-
- bit v18.16b, v24.16b, v3.16b
- bit v19.16b, v25.16b, v3.16b
-
- bit v20.16b, v26.16b, v3.16b
- bit v21.16b, v27.16b, v3.16b
-
-	// Loop back if bit #4 of the counter is 0 (i.e. idx_ctr < 16)
- tbz w9, #4, Lselect_w5_loop
-
- // Write [v16-v21] to memory at the output pointer
- st1 {v16.2d, v17.2d, v18.2d, v19.2d}, [x10],#64
- st1 {v20.2d, v21.2d}, [x10]
-
- ret
-
-
-
-////////////////////////////////////////////////////////////////////////
-// void ecp_nistz256_select_w7(uint64_t *val, uint64_t *in_t, int index);
-.globl ecp_nistz256_select_w7
-
-.def ecp_nistz256_select_w7
- .type 32
-.endef
-.align 4
-ecp_nistz256_select_w7:
- AARCH64_VALID_CALL_TARGET
-
- // w9 := 0; loop counter and incremented internal index
- mov w9, #0
-
-	// [v16-v19] := 0
- movi v16.16b, #0
- movi v17.16b, #0
- movi v18.16b, #0
- movi v19.16b, #0
-
-Lselect_w7_loop:
- // Loop 64 times.
-
- // Increment index (loop counter); tested at the end of the loop
- add w9, w9, #1
-
- // [v22-v25] := Load a (2*256-bit = 4*128-bit) table entry starting at x1
- // and advance x1 to point to the next entry
- ld1 {v22.2d, v23.2d, v24.2d, v25.2d}, [x1],#64
-
- // x11 := (w9 == w2)? All 1s : All 0s
- cmp w9, w2
- csetm x11, eq
-
- // duplicate mask_64 into Mask (all 0s or all 1s)
- dup v3.2d, x11
-
- // [v16-v19] := (Mask == all 1s)? [v22-v25] : [v16-v19]
- // i.e., values in output registers will remain the same if w9 != w2
- bit v16.16b, v22.16b, v3.16b
- bit v17.16b, v23.16b, v3.16b
-
- bit v18.16b, v24.16b, v3.16b
- bit v19.16b, v25.16b, v3.16b
-
-	// Loop back if bit #6 of the counter is 0 (i.e. idx_ctr < 64)
- tbz w9, #6, Lselect_w7_loop
-
- // Write [v16-v19] to memory at the output pointer
- st1 {v16.2d, v17.2d, v18.2d, v19.2d}, [x0]
-
- ret
-
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
diff --git a/win-aarch64/crypto/fipsmodule/p256_beeu-armv8-asm-win.S b/win-aarch64/crypto/fipsmodule/p256_beeu-armv8-asm-win.S
deleted file mode 100644
index ac6eb17c..00000000
--- a/win-aarch64/crypto/fipsmodule/p256_beeu-armv8-asm-win.S
+++ /dev/null
@@ -1,309 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include "openssl/arm_arch.h"
-
-.text
-.globl beeu_mod_inverse_vartime
-
-
-.align 4
-beeu_mod_inverse_vartime:
- // Reserve enough space for 14 8-byte registers on the stack
- // in the first stp call for x29, x30.
- // Then store the remaining callee-saved registers.
- //
- // | x29 | x30 | x19 | x20 | ... | x27 | x28 | x0 | x2 |
- // ^ ^
- // sp <------------------- 112 bytes ----------------> old sp
- // x29 (FP)
- //
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-112]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- stp x27,x28,[sp,#80]
- stp x0,x2,[sp,#96]
-
- // B = b3..b0 := a
- ldp x25,x26,[x1]
- ldp x27,x28,[x1,#16]
-
- // n3..n0 := n
-	// Note: the values of the input parameters are changed below.
- ldp x0,x1,[x2]
- ldp x2,x30,[x2,#16]
-
- // A = a3..a0 := n
- mov x21, x0
- mov x22, x1
- mov x23, x2
- mov x24, x30
-
- // X = x4..x0 := 1
- mov x3, #1
- eor x4, x4, x4
- eor x5, x5, x5
- eor x6, x6, x6
- eor x7, x7, x7
-
- // Y = y4..y0 := 0
- eor x8, x8, x8
- eor x9, x9, x9
- eor x10, x10, x10
- eor x11, x11, x11
- eor x12, x12, x12
-
-Lbeeu_loop:
-	// if B == 0, jump to Lbeeu_loop_end
- orr x14, x25, x26
- orr x14, x14, x27
-
-	// reverse the bit order of x25; this is needed for the clz that follows
- rbit x15, x25
-
- orr x14, x14, x28
- cbz x14,Lbeeu_loop_end
-
-
- // 0 < B < |n|,
- // 0 < A <= |n|,
- // (1) X*a == B (mod |n|),
- // (2) (-1)*Y*a == A (mod |n|)
-
- // Now divide B by the maximum possible power of two in the
- // integers, and divide X by the same value mod |n|.
- // When we're done, (1) still holds.
-
- // shift := number of trailing 0s in x25
-	// ( = number of leading 0s in x15; see the "rbit" instruction above)
- clz x13, x15
-
- // If there is no shift, goto shift_A_Y
- cbz x13, Lbeeu_shift_A_Y
-
- // Shift B right by "x13" bits
- neg x14, x13
- lsr x25, x25, x13
- lsl x15, x26, x14
-
- lsr x26, x26, x13
- lsl x19, x27, x14
-
- orr x25, x25, x15
-
- lsr x27, x27, x13
- lsl x20, x28, x14
-
- orr x26, x26, x19
-
- lsr x28, x28, x13
-
- orr x27, x27, x20
-
-
- // Shift X right by "x13" bits, adding n whenever X becomes odd.
- // x13--;
- // x14 := 0; needed in the addition to the most significant word in SHIFT1
- eor x14, x14, x14
-Lbeeu_shift_loop_X:
- tbz x3, #0, Lshift1_0
- adds x3, x3, x0
- adcs x4, x4, x1
- adcs x5, x5, x2
- adcs x6, x6, x30
- adc x7, x7, x14
-Lshift1_0:
- // var0 := [var1|var0]<64..1>;
- // i.e. concatenate var1 and var0,
- // extract bits <64..1> from the resulting 128-bit value
- // and put them in var0
- extr x3, x4, x3, #1
- extr x4, x5, x4, #1
- extr x5, x6, x5, #1
- extr x6, x7, x6, #1
- lsr x7, x7, #1
-
- subs x13, x13, #1
- bne Lbeeu_shift_loop_X
-
- // Note: the steps above perform the same sequence as in p256_beeu-x86_64-asm.pl
- // with the following differences:
- // - "x13" is set directly to the number of trailing 0s in B
- // (using rbit and clz instructions)
- // - The loop is only used to call SHIFT1(X)
- // and x13 is decreased while executing the X loop.
- // - SHIFT256(B, x13) is performed before right-shifting X; they are independent
-
-Lbeeu_shift_A_Y:
- // Same for A and Y.
- // Afterwards, (2) still holds.
- // Reverse the bit order of x21
- // x13 := number of trailing 0s in x21 (= number of leading 0s in x15)
- rbit x15, x21
- clz x13, x15
-
- // If there is no shift, goto |B-A|, X+Y update
- cbz x13, Lbeeu_update_B_X_or_A_Y
-
- // Shift A right by "x13" bits
- neg x14, x13
- lsr x21, x21, x13
- lsl x15, x22, x14
-
- lsr x22, x22, x13
- lsl x19, x23, x14
-
- orr x21, x21, x15
-
- lsr x23, x23, x13
- lsl x20, x24, x14
-
- orr x22, x22, x19
-
- lsr x24, x24, x13
-
- orr x23, x23, x20
-
-
- // Shift Y right by "x13" bits, adding n whenever Y becomes odd.
- // x13--;
- // x14 := 0; needed in the addition to the most significant word in SHIFT1
- eor x14, x14, x14
-Lbeeu_shift_loop_Y:
- tbz x8, #0, Lshift1_1
- adds x8, x8, x0
- adcs x9, x9, x1
- adcs x10, x10, x2
- adcs x11, x11, x30
- adc x12, x12, x14
-Lshift1_1:
- // var0 := [var1|var0]<64..1>;
- // i.e. concatenate var1 and var0,
- // extract bits <64..1> from the resulting 128-bit value
- // and put them in var0
- extr x8, x9, x8, #1
- extr x9, x10, x9, #1
- extr x10, x11, x10, #1
- extr x11, x12, x11, #1
- lsr x12, x12, #1
-
- subs x13, x13, #1
- bne Lbeeu_shift_loop_Y
-
-Lbeeu_update_B_X_or_A_Y:
- // Try T := B - A; if cs, continue with B > A (cs: carry set = no borrow)
-	// Note: this is unsigned arithmetic; T fits in 4 64-bit words with no
-	//       separate sign bit, so the lack of a carry (i.e. a borrow)
-	//       indicates a negative result. See, for example,
- // https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/condition-codes-1-condition-flags-and-codes
- subs x14, x25, x21
- sbcs x15, x26, x22
- sbcs x19, x27, x23
- sbcs x20, x28, x24
- bcs Lbeeu_B_greater_than_A
-
- // Else A > B =>
- // A := A - B; Y := Y + X; goto beginning of the loop
- subs x21, x21, x25
- sbcs x22, x22, x26
- sbcs x23, x23, x27
- sbcs x24, x24, x28
-
- adds x8, x8, x3
- adcs x9, x9, x4
- adcs x10, x10, x5
- adcs x11, x11, x6
- adc x12, x12, x7
- b Lbeeu_loop
-
-Lbeeu_B_greater_than_A:
- // Continue with B > A =>
- // B := B - A; X := X + Y; goto beginning of the loop
- mov x25, x14
- mov x26, x15
- mov x27, x19
- mov x28, x20
-
- adds x3, x3, x8
- adcs x4, x4, x9
- adcs x5, x5, x10
- adcs x6, x6, x11
- adc x7, x7, x12
- b Lbeeu_loop
-
-Lbeeu_loop_end:
-	// The Euclidean algorithm loop ends when A == gcd(a,n);
-	// this is 1 when a and n are co-prime (i.e. share no common factor).
-	// Since (-1)*Y*a == A (mod |n|) and Y > 0,
-	// the output is -Y mod n.
-
- // Verify that A = 1 ==> (-1)*Y*a = A = 1 (mod |n|)
- // Is A-1 == 0?
- // If not, fail.
- sub x14, x21, #1
- orr x14, x14, x22
- orr x14, x14, x23
- orr x14, x14, x24
- cbnz x14, Lbeeu_err
-
- // If Y>n ==> Y:=Y-n
-Lbeeu_reduction_loop:
- // x_i := y_i - n_i (X is no longer needed, use it as temp)
- // (x14 = 0 from above)
- subs x3, x8, x0
- sbcs x4, x9, x1
- sbcs x5, x10, x2
- sbcs x6, x11, x30
- sbcs x7, x12, x14
-
- // If result is non-negative (i.e., cs = carry set = no borrow),
- // y_i := x_i; goto reduce again
- // else
- // y_i := y_i; continue
- csel x8, x3, x8, cs
- csel x9, x4, x9, cs
- csel x10, x5, x10, cs
- csel x11, x6, x11, cs
- csel x12, x7, x12, cs
- bcs Lbeeu_reduction_loop
-
- // Now Y < n (Y cannot be equal to n, since the inverse cannot be 0)
- // out = -Y = n-Y
- subs x8, x0, x8
- sbcs x9, x1, x9
- sbcs x10, x2, x10
- sbcs x11, x30, x11
-
- // Save Y in output (out (x0) was saved on the stack)
- ldr x3, [sp,#96]
- stp x8, x9, [x3]
- stp x10, x11, [x3,#16]
- // return 1 (success)
- mov x0, #1
- b Lbeeu_finish
-
-Lbeeu_err:
- // return 0 (error)
- eor x0, x0, x0
-
-Lbeeu_finish:
- // Restore callee-saved registers, except x0, x2
- add sp,x29,#0
- ldp x19,x20,[sp,#16]
- ldp x21,x22,[sp,#32]
- ldp x23,x24,[sp,#48]
- ldp x25,x26,[sp,#64]
- ldp x27,x28,[sp,#80]
- ldp x29,x30,[sp],#112
-
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
diff --git a/win-aarch64/crypto/fipsmodule/sha1-armv8-win.S b/win-aarch64/crypto/fipsmodule/sha1-armv8-win.S
deleted file mode 100644
index f8c8b861..00000000
--- a/win-aarch64/crypto/fipsmodule/sha1-armv8-win.S
+++ /dev/null
@@ -1,1222 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-
-.text
-
-.globl sha1_block_data_order_nohw
-
-.def sha1_block_data_order_nohw
- .type 32
-.endef
-.align 6
-sha1_block_data_order_nohw:
- // Armv8.3-A PAuth: even though x30 is pushed to stack it is not popped later.
- AARCH64_VALID_CALL_TARGET
-
- stp x29,x30,[sp,#-96]!
- add x29,sp,#0
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- stp x27,x28,[sp,#80]
-
- ldp w20,w21,[x0]
- ldp w22,w23,[x0,#8]
- ldr w24,[x0,#16]
-
-Loop:
- ldr x3,[x1],#64
- movz w28,#0x7999
- sub x2,x2,#1
- movk w28,#0x5a82,lsl#16
-#ifdef __AARCH64EB__
- ror x3,x3,#32
-#else
- rev32 x3,x3
-#endif
- add w24,w24,w28 // warm it up
- add w24,w24,w3
- lsr x4,x3,#32
- ldr x5,[x1,#-56]
- bic w25,w23,w21
- and w26,w22,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- orr w25,w25,w26
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- add w23,w23,w4 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
-#ifdef __AARCH64EB__
- ror x5,x5,#32
-#else
- rev32 x5,x5
-#endif
- bic w25,w22,w20
- and w26,w21,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- orr w25,w25,w26
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- add w22,w22,w5 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- lsr x6,x5,#32
- ldr x7,[x1,#-48]
- bic w25,w21,w24
- and w26,w20,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- orr w25,w25,w26
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- add w21,w21,w6 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
-#ifdef __AARCH64EB__
- ror x7,x7,#32
-#else
- rev32 x7,x7
-#endif
- bic w25,w20,w23
- and w26,w24,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- orr w25,w25,w26
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- add w20,w20,w7 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- lsr x8,x7,#32
- ldr x9,[x1,#-40]
- bic w25,w24,w22
- and w26,w23,w22
- ror w27,w21,#27
- add w24,w24,w28 // future e+=K
- orr w25,w25,w26
- add w20,w20,w27 // e+=rot(a,5)
- ror w22,w22,#2
- add w24,w24,w8 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
-#ifdef __AARCH64EB__
- ror x9,x9,#32
-#else
- rev32 x9,x9
-#endif
- bic w25,w23,w21
- and w26,w22,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- orr w25,w25,w26
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- add w23,w23,w9 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- lsr x10,x9,#32
- ldr x11,[x1,#-32]
- bic w25,w22,w20
- and w26,w21,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- orr w25,w25,w26
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- add w22,w22,w10 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
-#ifdef __AARCH64EB__
- ror x11,x11,#32
-#else
- rev32 x11,x11
-#endif
- bic w25,w21,w24
- and w26,w20,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- orr w25,w25,w26
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- add w21,w21,w11 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- lsr x12,x11,#32
- ldr x13,[x1,#-24]
- bic w25,w20,w23
- and w26,w24,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- orr w25,w25,w26
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- add w20,w20,w12 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
-#ifdef __AARCH64EB__
- ror x13,x13,#32
-#else
- rev32 x13,x13
-#endif
- bic w25,w24,w22
- and w26,w23,w22
- ror w27,w21,#27
- add w24,w24,w28 // future e+=K
- orr w25,w25,w26
- add w20,w20,w27 // e+=rot(a,5)
- ror w22,w22,#2
- add w24,w24,w13 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- lsr x14,x13,#32
- ldr x15,[x1,#-16]
- bic w25,w23,w21
- and w26,w22,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- orr w25,w25,w26
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- add w23,w23,w14 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
-#ifdef __AARCH64EB__
- ror x15,x15,#32
-#else
- rev32 x15,x15
-#endif
- bic w25,w22,w20
- and w26,w21,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- orr w25,w25,w26
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- add w22,w22,w15 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- lsr x16,x15,#32
- ldr x17,[x1,#-8]
- bic w25,w21,w24
- and w26,w20,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- orr w25,w25,w26
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- add w21,w21,w16 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
-#ifdef __AARCH64EB__
- ror x17,x17,#32
-#else
- rev32 x17,x17
-#endif
- bic w25,w20,w23
- and w26,w24,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- orr w25,w25,w26
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- add w20,w20,w17 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- lsr x19,x17,#32
- eor w3,w3,w5
- bic w25,w24,w22
- and w26,w23,w22
- ror w27,w21,#27
- eor w3,w3,w11
- add w24,w24,w28 // future e+=K
- orr w25,w25,w26
- add w20,w20,w27 // e+=rot(a,5)
- eor w3,w3,w16
- ror w22,w22,#2
- add w24,w24,w19 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w3,w3,#31
- eor w4,w4,w6
- bic w25,w23,w21
- and w26,w22,w21
- ror w27,w20,#27
- eor w4,w4,w12
- add w23,w23,w28 // future e+=K
- orr w25,w25,w26
- add w24,w24,w27 // e+=rot(a,5)
- eor w4,w4,w17
- ror w21,w21,#2
- add w23,w23,w3 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w4,w4,#31
- eor w5,w5,w7
- bic w25,w22,w20
- and w26,w21,w20
- ror w27,w24,#27
- eor w5,w5,w13
- add w22,w22,w28 // future e+=K
- orr w25,w25,w26
- add w23,w23,w27 // e+=rot(a,5)
- eor w5,w5,w19
- ror w20,w20,#2
- add w22,w22,w4 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w5,w5,#31
- eor w6,w6,w8
- bic w25,w21,w24
- and w26,w20,w24
- ror w27,w23,#27
- eor w6,w6,w14
- add w21,w21,w28 // future e+=K
- orr w25,w25,w26
- add w22,w22,w27 // e+=rot(a,5)
- eor w6,w6,w3
- ror w24,w24,#2
- add w21,w21,w5 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w6,w6,#31
- eor w7,w7,w9
- bic w25,w20,w23
- and w26,w24,w23
- ror w27,w22,#27
- eor w7,w7,w15
- add w20,w20,w28 // future e+=K
- orr w25,w25,w26
- add w21,w21,w27 // e+=rot(a,5)
- eor w7,w7,w4
- ror w23,w23,#2
- add w20,w20,w6 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w7,w7,#31
- movz w28,#0xeba1
- movk w28,#0x6ed9,lsl#16
- eor w8,w8,w10
- bic w25,w24,w22
- and w26,w23,w22
- ror w27,w21,#27
- eor w8,w8,w16
- add w24,w24,w28 // future e+=K
- orr w25,w25,w26
- add w20,w20,w27 // e+=rot(a,5)
- eor w8,w8,w5
- ror w22,w22,#2
- add w24,w24,w7 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w8,w8,#31
- eor w9,w9,w11
- eor w25,w23,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- eor w9,w9,w17
- eor w25,w25,w22
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- eor w9,w9,w6
- add w23,w23,w8 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w9,w9,#31
- eor w10,w10,w12
- eor w25,w22,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- eor w10,w10,w19
- eor w25,w25,w21
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- eor w10,w10,w7
- add w22,w22,w9 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w10,w10,#31
- eor w11,w11,w13
- eor w25,w21,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- eor w11,w11,w3
- eor w25,w25,w20
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- eor w11,w11,w8
- add w21,w21,w10 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w11,w11,#31
- eor w12,w12,w14
- eor w25,w20,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- eor w12,w12,w4
- eor w25,w25,w24
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- eor w12,w12,w9
- add w20,w20,w11 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w12,w12,#31
- eor w13,w13,w15
- eor w25,w24,w22
- ror w27,w21,#27
- add w24,w24,w28 // future e+=K
- eor w13,w13,w5
- eor w25,w25,w23
- add w20,w20,w27 // e+=rot(a,5)
- ror w22,w22,#2
- eor w13,w13,w10
- add w24,w24,w12 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w13,w13,#31
- eor w14,w14,w16
- eor w25,w23,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- eor w14,w14,w6
- eor w25,w25,w22
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- eor w14,w14,w11
- add w23,w23,w13 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w14,w14,#31
- eor w15,w15,w17
- eor w25,w22,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- eor w15,w15,w7
- eor w25,w25,w21
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- eor w15,w15,w12
- add w22,w22,w14 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w15,w15,#31
- eor w16,w16,w19
- eor w25,w21,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- eor w16,w16,w8
- eor w25,w25,w20
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- eor w16,w16,w13
- add w21,w21,w15 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w16,w16,#31
- eor w17,w17,w3
- eor w25,w20,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- eor w17,w17,w9
- eor w25,w25,w24
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- eor w17,w17,w14
- add w20,w20,w16 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w17,w17,#31
- eor w19,w19,w4
- eor w25,w24,w22
- ror w27,w21,#27
- add w24,w24,w28 // future e+=K
- eor w19,w19,w10
- eor w25,w25,w23
- add w20,w20,w27 // e+=rot(a,5)
- ror w22,w22,#2
- eor w19,w19,w15
- add w24,w24,w17 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w19,w19,#31
- eor w3,w3,w5
- eor w25,w23,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- eor w3,w3,w11
- eor w25,w25,w22
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- eor w3,w3,w16
- add w23,w23,w19 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w3,w3,#31
- eor w4,w4,w6
- eor w25,w22,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- eor w4,w4,w12
- eor w25,w25,w21
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- eor w4,w4,w17
- add w22,w22,w3 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w4,w4,#31
- eor w5,w5,w7
- eor w25,w21,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- eor w5,w5,w13
- eor w25,w25,w20
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- eor w5,w5,w19
- add w21,w21,w4 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w5,w5,#31
- eor w6,w6,w8
- eor w25,w20,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- eor w6,w6,w14
- eor w25,w25,w24
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- eor w6,w6,w3
- add w20,w20,w5 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w6,w6,#31
- eor w7,w7,w9
- eor w25,w24,w22
- ror w27,w21,#27
- add w24,w24,w28 // future e+=K
- eor w7,w7,w15
- eor w25,w25,w23
- add w20,w20,w27 // e+=rot(a,5)
- ror w22,w22,#2
- eor w7,w7,w4
- add w24,w24,w6 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w7,w7,#31
- eor w8,w8,w10
- eor w25,w23,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- eor w8,w8,w16
- eor w25,w25,w22
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- eor w8,w8,w5
- add w23,w23,w7 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w8,w8,#31
- eor w9,w9,w11
- eor w25,w22,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- eor w9,w9,w17
- eor w25,w25,w21
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- eor w9,w9,w6
- add w22,w22,w8 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w9,w9,#31
- eor w10,w10,w12
- eor w25,w21,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- eor w10,w10,w19
- eor w25,w25,w20
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- eor w10,w10,w7
- add w21,w21,w9 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w10,w10,#31
- eor w11,w11,w13
- eor w25,w20,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- eor w11,w11,w3
- eor w25,w25,w24
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- eor w11,w11,w8
- add w20,w20,w10 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w11,w11,#31
- movz w28,#0xbcdc
- movk w28,#0x8f1b,lsl#16
- eor w12,w12,w14
- eor w25,w24,w22
- ror w27,w21,#27
- add w24,w24,w28 // future e+=K
- eor w12,w12,w4
- eor w25,w25,w23
- add w20,w20,w27 // e+=rot(a,5)
- ror w22,w22,#2
- eor w12,w12,w9
- add w24,w24,w11 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w12,w12,#31
- orr w25,w21,w22
- and w26,w21,w22
- eor w13,w13,w15
- ror w27,w20,#27
- and w25,w25,w23
- add w23,w23,w28 // future e+=K
- eor w13,w13,w5
- add w24,w24,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w21,w21,#2
- eor w13,w13,w10
- add w23,w23,w12 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w13,w13,#31
- orr w25,w20,w21
- and w26,w20,w21
- eor w14,w14,w16
- ror w27,w24,#27
- and w25,w25,w22
- add w22,w22,w28 // future e+=K
- eor w14,w14,w6
- add w23,w23,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w20,w20,#2
- eor w14,w14,w11
- add w22,w22,w13 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w14,w14,#31
- orr w25,w24,w20
- and w26,w24,w20
- eor w15,w15,w17
- ror w27,w23,#27
- and w25,w25,w21
- add w21,w21,w28 // future e+=K
- eor w15,w15,w7
- add w22,w22,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w24,w24,#2
- eor w15,w15,w12
- add w21,w21,w14 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w15,w15,#31
- orr w25,w23,w24
- and w26,w23,w24
- eor w16,w16,w19
- ror w27,w22,#27
- and w25,w25,w20
- add w20,w20,w28 // future e+=K
- eor w16,w16,w8
- add w21,w21,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w23,w23,#2
- eor w16,w16,w13
- add w20,w20,w15 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w16,w16,#31
- orr w25,w22,w23
- and w26,w22,w23
- eor w17,w17,w3
- ror w27,w21,#27
- and w25,w25,w24
- add w24,w24,w28 // future e+=K
- eor w17,w17,w9
- add w20,w20,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w22,w22,#2
- eor w17,w17,w14
- add w24,w24,w16 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w17,w17,#31
- orr w25,w21,w22
- and w26,w21,w22
- eor w19,w19,w4
- ror w27,w20,#27
- and w25,w25,w23
- add w23,w23,w28 // future e+=K
- eor w19,w19,w10
- add w24,w24,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w21,w21,#2
- eor w19,w19,w15
- add w23,w23,w17 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w19,w19,#31
- orr w25,w20,w21
- and w26,w20,w21
- eor w3,w3,w5
- ror w27,w24,#27
- and w25,w25,w22
- add w22,w22,w28 // future e+=K
- eor w3,w3,w11
- add w23,w23,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w20,w20,#2
- eor w3,w3,w16
- add w22,w22,w19 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w3,w3,#31
- orr w25,w24,w20
- and w26,w24,w20
- eor w4,w4,w6
- ror w27,w23,#27
- and w25,w25,w21
- add w21,w21,w28 // future e+=K
- eor w4,w4,w12
- add w22,w22,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w24,w24,#2
- eor w4,w4,w17
- add w21,w21,w3 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w4,w4,#31
- orr w25,w23,w24
- and w26,w23,w24
- eor w5,w5,w7
- ror w27,w22,#27
- and w25,w25,w20
- add w20,w20,w28 // future e+=K
- eor w5,w5,w13
- add w21,w21,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w23,w23,#2
- eor w5,w5,w19
- add w20,w20,w4 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w5,w5,#31
- orr w25,w22,w23
- and w26,w22,w23
- eor w6,w6,w8
- ror w27,w21,#27
- and w25,w25,w24
- add w24,w24,w28 // future e+=K
- eor w6,w6,w14
- add w20,w20,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w22,w22,#2
- eor w6,w6,w3
- add w24,w24,w5 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w6,w6,#31
- orr w25,w21,w22
- and w26,w21,w22
- eor w7,w7,w9
- ror w27,w20,#27
- and w25,w25,w23
- add w23,w23,w28 // future e+=K
- eor w7,w7,w15
- add w24,w24,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w21,w21,#2
- eor w7,w7,w4
- add w23,w23,w6 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w7,w7,#31
- orr w25,w20,w21
- and w26,w20,w21
- eor w8,w8,w10
- ror w27,w24,#27
- and w25,w25,w22
- add w22,w22,w28 // future e+=K
- eor w8,w8,w16
- add w23,w23,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w20,w20,#2
- eor w8,w8,w5
- add w22,w22,w7 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w8,w8,#31
- orr w25,w24,w20
- and w26,w24,w20
- eor w9,w9,w11
- ror w27,w23,#27
- and w25,w25,w21
- add w21,w21,w28 // future e+=K
- eor w9,w9,w17
- add w22,w22,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w24,w24,#2
- eor w9,w9,w6
- add w21,w21,w8 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w9,w9,#31
- orr w25,w23,w24
- and w26,w23,w24
- eor w10,w10,w12
- ror w27,w22,#27
- and w25,w25,w20
- add w20,w20,w28 // future e+=K
- eor w10,w10,w19
- add w21,w21,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w23,w23,#2
- eor w10,w10,w7
- add w20,w20,w9 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w10,w10,#31
- orr w25,w22,w23
- and w26,w22,w23
- eor w11,w11,w13
- ror w27,w21,#27
- and w25,w25,w24
- add w24,w24,w28 // future e+=K
- eor w11,w11,w3
- add w20,w20,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w22,w22,#2
- eor w11,w11,w8
- add w24,w24,w10 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w11,w11,#31
- orr w25,w21,w22
- and w26,w21,w22
- eor w12,w12,w14
- ror w27,w20,#27
- and w25,w25,w23
- add w23,w23,w28 // future e+=K
- eor w12,w12,w4
- add w24,w24,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w21,w21,#2
- eor w12,w12,w9
- add w23,w23,w11 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w12,w12,#31
- orr w25,w20,w21
- and w26,w20,w21
- eor w13,w13,w15
- ror w27,w24,#27
- and w25,w25,w22
- add w22,w22,w28 // future e+=K
- eor w13,w13,w5
- add w23,w23,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w20,w20,#2
- eor w13,w13,w10
- add w22,w22,w12 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w13,w13,#31
- orr w25,w24,w20
- and w26,w24,w20
- eor w14,w14,w16
- ror w27,w23,#27
- and w25,w25,w21
- add w21,w21,w28 // future e+=K
- eor w14,w14,w6
- add w22,w22,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w24,w24,#2
- eor w14,w14,w11
- add w21,w21,w13 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w14,w14,#31
- orr w25,w23,w24
- and w26,w23,w24
- eor w15,w15,w17
- ror w27,w22,#27
- and w25,w25,w20
- add w20,w20,w28 // future e+=K
- eor w15,w15,w7
- add w21,w21,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w23,w23,#2
- eor w15,w15,w12
- add w20,w20,w14 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w15,w15,#31
- movz w28,#0xc1d6
- movk w28,#0xca62,lsl#16
- orr w25,w22,w23
- and w26,w22,w23
- eor w16,w16,w19
- ror w27,w21,#27
- and w25,w25,w24
- add w24,w24,w28 // future e+=K
- eor w16,w16,w8
- add w20,w20,w27 // e+=rot(a,5)
- orr w25,w25,w26
- ror w22,w22,#2
- eor w16,w16,w13
- add w24,w24,w15 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w16,w16,#31
- eor w17,w17,w3
- eor w25,w23,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- eor w17,w17,w9
- eor w25,w25,w22
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- eor w17,w17,w14
- add w23,w23,w16 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w17,w17,#31
- eor w19,w19,w4
- eor w25,w22,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- eor w19,w19,w10
- eor w25,w25,w21
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- eor w19,w19,w15
- add w22,w22,w17 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w19,w19,#31
- eor w3,w3,w5
- eor w25,w21,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- eor w3,w3,w11
- eor w25,w25,w20
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- eor w3,w3,w16
- add w21,w21,w19 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w3,w3,#31
- eor w4,w4,w6
- eor w25,w20,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- eor w4,w4,w12
- eor w25,w25,w24
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- eor w4,w4,w17
- add w20,w20,w3 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w4,w4,#31
- eor w5,w5,w7
- eor w25,w24,w22
- ror w27,w21,#27
- add w24,w24,w28 // future e+=K
- eor w5,w5,w13
- eor w25,w25,w23
- add w20,w20,w27 // e+=rot(a,5)
- ror w22,w22,#2
- eor w5,w5,w19
- add w24,w24,w4 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w5,w5,#31
- eor w6,w6,w8
- eor w25,w23,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- eor w6,w6,w14
- eor w25,w25,w22
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- eor w6,w6,w3
- add w23,w23,w5 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w6,w6,#31
- eor w7,w7,w9
- eor w25,w22,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- eor w7,w7,w15
- eor w25,w25,w21
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- eor w7,w7,w4
- add w22,w22,w6 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w7,w7,#31
- eor w8,w8,w10
- eor w25,w21,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- eor w8,w8,w16
- eor w25,w25,w20
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- eor w8,w8,w5
- add w21,w21,w7 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w8,w8,#31
- eor w9,w9,w11
- eor w25,w20,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- eor w9,w9,w17
- eor w25,w25,w24
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- eor w9,w9,w6
- add w20,w20,w8 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w9,w9,#31
- eor w10,w10,w12
- eor w25,w24,w22
- ror w27,w21,#27
- add w24,w24,w28 // future e+=K
- eor w10,w10,w19
- eor w25,w25,w23
- add w20,w20,w27 // e+=rot(a,5)
- ror w22,w22,#2
- eor w10,w10,w7
- add w24,w24,w9 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w10,w10,#31
- eor w11,w11,w13
- eor w25,w23,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- eor w11,w11,w3
- eor w25,w25,w22
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- eor w11,w11,w8
- add w23,w23,w10 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w11,w11,#31
- eor w12,w12,w14
- eor w25,w22,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- eor w12,w12,w4
- eor w25,w25,w21
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- eor w12,w12,w9
- add w22,w22,w11 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w12,w12,#31
- eor w13,w13,w15
- eor w25,w21,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- eor w13,w13,w5
- eor w25,w25,w20
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- eor w13,w13,w10
- add w21,w21,w12 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w13,w13,#31
- eor w14,w14,w16
- eor w25,w20,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- eor w14,w14,w6
- eor w25,w25,w24
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- eor w14,w14,w11
- add w20,w20,w13 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ror w14,w14,#31
- eor w15,w15,w17
- eor w25,w24,w22
- ror w27,w21,#27
- add w24,w24,w28 // future e+=K
- eor w15,w15,w7
- eor w25,w25,w23
- add w20,w20,w27 // e+=rot(a,5)
- ror w22,w22,#2
- eor w15,w15,w12
- add w24,w24,w14 // future e+=X[i]
- add w20,w20,w25 // e+=F(b,c,d)
- ror w15,w15,#31
- eor w16,w16,w19
- eor w25,w23,w21
- ror w27,w20,#27
- add w23,w23,w28 // future e+=K
- eor w16,w16,w8
- eor w25,w25,w22
- add w24,w24,w27 // e+=rot(a,5)
- ror w21,w21,#2
- eor w16,w16,w13
- add w23,w23,w15 // future e+=X[i]
- add w24,w24,w25 // e+=F(b,c,d)
- ror w16,w16,#31
- eor w17,w17,w3
- eor w25,w22,w20
- ror w27,w24,#27
- add w22,w22,w28 // future e+=K
- eor w17,w17,w9
- eor w25,w25,w21
- add w23,w23,w27 // e+=rot(a,5)
- ror w20,w20,#2
- eor w17,w17,w14
- add w22,w22,w16 // future e+=X[i]
- add w23,w23,w25 // e+=F(b,c,d)
- ror w17,w17,#31
- eor w19,w19,w4
- eor w25,w21,w24
- ror w27,w23,#27
- add w21,w21,w28 // future e+=K
- eor w19,w19,w10
- eor w25,w25,w20
- add w22,w22,w27 // e+=rot(a,5)
- ror w24,w24,#2
- eor w19,w19,w15
- add w21,w21,w17 // future e+=X[i]
- add w22,w22,w25 // e+=F(b,c,d)
- ror w19,w19,#31
- ldp w4,w5,[x0]
- eor w25,w20,w23
- ror w27,w22,#27
- add w20,w20,w28 // future e+=K
- eor w25,w25,w24
- add w21,w21,w27 // e+=rot(a,5)
- ror w23,w23,#2
- add w20,w20,w19 // future e+=X[i]
- add w21,w21,w25 // e+=F(b,c,d)
- ldp w6,w7,[x0,#8]
- eor w25,w24,w22
- ror w27,w21,#27
- eor w25,w25,w23
- add w20,w20,w27 // e+=rot(a,5)
- ror w22,w22,#2
- ldr w8,[x0,#16]
- add w20,w20,w25 // e+=F(b,c,d)
- add w21,w21,w5
- add w22,w22,w6
- add w20,w20,w4
- add w23,w23,w7
- add w24,w24,w8
- stp w20,w21,[x0]
- stp w22,w23,[x0,#8]
- str w24,[x0,#16]
- cbnz x2,Loop
-
- ldp x19,x20,[sp,#16]
- ldp x21,x22,[sp,#32]
- ldp x23,x24,[sp,#48]
- ldp x25,x26,[sp,#64]
- ldp x27,x28,[sp,#80]
- ldr x29,[sp],#96
- ret
-
-.globl sha1_block_data_order_hw
-
-.def sha1_block_data_order_hw
- .type 32
-.endef
-.align 6
-sha1_block_data_order_hw:
- // Armv8.3-A PAuth: even though x30 is pushed to stack it is not popped later.
- AARCH64_VALID_CALL_TARGET
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- adrp x4,Lconst
- add x4,x4,:lo12:Lconst
- eor v1.16b,v1.16b,v1.16b
- ld1 {v0.4s},[x0],#16
- ld1 {v1.s}[0],[x0]
- sub x0,x0,#16
- ld1 {v16.4s,v17.4s,v18.4s,v19.4s},[x4]
-
-Loop_hw:
- ld1 {v4.16b,v5.16b,v6.16b,v7.16b},[x1],#64
- sub x2,x2,#1
- rev32 v4.16b,v4.16b
- rev32 v5.16b,v5.16b
-
- add v20.4s,v16.4s,v4.4s
- rev32 v6.16b,v6.16b
- orr v22.16b,v0.16b,v0.16b // offload
-
- add v21.4s,v16.4s,v5.4s
- rev32 v7.16b,v7.16b
-.long 0x5e280803 //sha1h v3.16b,v0.16b
-.long 0x5e140020 //sha1c v0.16b,v1.16b,v20.4s // 0
- add v20.4s,v16.4s,v6.4s
-.long 0x5e0630a4 //sha1su0 v4.16b,v5.16b,v6.16b
-.long 0x5e280802 //sha1h v2.16b,v0.16b // 1
-.long 0x5e150060 //sha1c v0.16b,v3.16b,v21.4s
- add v21.4s,v16.4s,v7.4s
-.long 0x5e2818e4 //sha1su1 v4.16b,v7.16b
-.long 0x5e0730c5 //sha1su0 v5.16b,v6.16b,v7.16b
-.long 0x5e280803 //sha1h v3.16b,v0.16b // 2
-.long 0x5e140040 //sha1c v0.16b,v2.16b,v20.4s
- add v20.4s,v16.4s,v4.4s
-.long 0x5e281885 //sha1su1 v5.16b,v4.16b
-.long 0x5e0430e6 //sha1su0 v6.16b,v7.16b,v4.16b
-.long 0x5e280802 //sha1h v2.16b,v0.16b // 3
-.long 0x5e150060 //sha1c v0.16b,v3.16b,v21.4s
- add v21.4s,v17.4s,v5.4s
-.long 0x5e2818a6 //sha1su1 v6.16b,v5.16b
-.long 0x5e053087 //sha1su0 v7.16b,v4.16b,v5.16b
-.long 0x5e280803 //sha1h v3.16b,v0.16b // 4
-.long 0x5e140040 //sha1c v0.16b,v2.16b,v20.4s
- add v20.4s,v17.4s,v6.4s
-.long 0x5e2818c7 //sha1su1 v7.16b,v6.16b
-.long 0x5e0630a4 //sha1su0 v4.16b,v5.16b,v6.16b
-.long 0x5e280802 //sha1h v2.16b,v0.16b // 5
-.long 0x5e151060 //sha1p v0.16b,v3.16b,v21.4s
- add v21.4s,v17.4s,v7.4s
-.long 0x5e2818e4 //sha1su1 v4.16b,v7.16b
-.long 0x5e0730c5 //sha1su0 v5.16b,v6.16b,v7.16b
-.long 0x5e280803 //sha1h v3.16b,v0.16b // 6
-.long 0x5e141040 //sha1p v0.16b,v2.16b,v20.4s
- add v20.4s,v17.4s,v4.4s
-.long 0x5e281885 //sha1su1 v5.16b,v4.16b
-.long 0x5e0430e6 //sha1su0 v6.16b,v7.16b,v4.16b
-.long 0x5e280802 //sha1h v2.16b,v0.16b // 7
-.long 0x5e151060 //sha1p v0.16b,v3.16b,v21.4s
- add v21.4s,v17.4s,v5.4s
-.long 0x5e2818a6 //sha1su1 v6.16b,v5.16b
-.long 0x5e053087 //sha1su0 v7.16b,v4.16b,v5.16b
-.long 0x5e280803 //sha1h v3.16b,v0.16b // 8
-.long 0x5e141040 //sha1p v0.16b,v2.16b,v20.4s
- add v20.4s,v18.4s,v6.4s
-.long 0x5e2818c7 //sha1su1 v7.16b,v6.16b
-.long 0x5e0630a4 //sha1su0 v4.16b,v5.16b,v6.16b
-.long 0x5e280802 //sha1h v2.16b,v0.16b // 9
-.long 0x5e151060 //sha1p v0.16b,v3.16b,v21.4s
- add v21.4s,v18.4s,v7.4s
-.long 0x5e2818e4 //sha1su1 v4.16b,v7.16b
-.long 0x5e0730c5 //sha1su0 v5.16b,v6.16b,v7.16b
-.long 0x5e280803 //sha1h v3.16b,v0.16b // 10
-.long 0x5e142040 //sha1m v0.16b,v2.16b,v20.4s
- add v20.4s,v18.4s,v4.4s
-.long 0x5e281885 //sha1su1 v5.16b,v4.16b
-.long 0x5e0430e6 //sha1su0 v6.16b,v7.16b,v4.16b
-.long 0x5e280802 //sha1h v2.16b,v0.16b // 11
-.long 0x5e152060 //sha1m v0.16b,v3.16b,v21.4s
- add v21.4s,v18.4s,v5.4s
-.long 0x5e2818a6 //sha1su1 v6.16b,v5.16b
-.long 0x5e053087 //sha1su0 v7.16b,v4.16b,v5.16b
-.long 0x5e280803 //sha1h v3.16b,v0.16b // 12
-.long 0x5e142040 //sha1m v0.16b,v2.16b,v20.4s
- add v20.4s,v18.4s,v6.4s
-.long 0x5e2818c7 //sha1su1 v7.16b,v6.16b
-.long 0x5e0630a4 //sha1su0 v4.16b,v5.16b,v6.16b
-.long 0x5e280802 //sha1h v2.16b,v0.16b // 13
-.long 0x5e152060 //sha1m v0.16b,v3.16b,v21.4s
- add v21.4s,v19.4s,v7.4s
-.long 0x5e2818e4 //sha1su1 v4.16b,v7.16b
-.long 0x5e0730c5 //sha1su0 v5.16b,v6.16b,v7.16b
-.long 0x5e280803 //sha1h v3.16b,v0.16b // 14
-.long 0x5e142040 //sha1m v0.16b,v2.16b,v20.4s
- add v20.4s,v19.4s,v4.4s
-.long 0x5e281885 //sha1su1 v5.16b,v4.16b
-.long 0x5e0430e6 //sha1su0 v6.16b,v7.16b,v4.16b
-.long 0x5e280802 //sha1h v2.16b,v0.16b // 15
-.long 0x5e151060 //sha1p v0.16b,v3.16b,v21.4s
- add v21.4s,v19.4s,v5.4s
-.long 0x5e2818a6 //sha1su1 v6.16b,v5.16b
-.long 0x5e053087 //sha1su0 v7.16b,v4.16b,v5.16b
-.long 0x5e280803 //sha1h v3.16b,v0.16b // 16
-.long 0x5e141040 //sha1p v0.16b,v2.16b,v20.4s
- add v20.4s,v19.4s,v6.4s
-.long 0x5e2818c7 //sha1su1 v7.16b,v6.16b
-.long 0x5e280802 //sha1h v2.16b,v0.16b // 17
-.long 0x5e151060 //sha1p v0.16b,v3.16b,v21.4s
- add v21.4s,v19.4s,v7.4s
-
-.long 0x5e280803 //sha1h v3.16b,v0.16b // 18
-.long 0x5e141040 //sha1p v0.16b,v2.16b,v20.4s
-
-.long 0x5e280802 //sha1h v2.16b,v0.16b // 19
-.long 0x5e151060 //sha1p v0.16b,v3.16b,v21.4s
-
- add v1.4s,v1.4s,v2.4s
- add v0.4s,v0.4s,v22.4s
-
- cbnz x2,Loop_hw
-
- st1 {v0.4s},[x0],#16
- st1 {v1.s}[0],[x0]
-
- ldr x29,[sp],#16
- ret
-
-.section .rodata
-.align 6
-Lconst:
-.long 0x5a827999,0x5a827999,0x5a827999,0x5a827999 //K_00_19
-.long 0x6ed9eba1,0x6ed9eba1,0x6ed9eba1,0x6ed9eba1 //K_20_39
-.long 0x8f1bbcdc,0x8f1bbcdc,0x8f1bbcdc,0x8f1bbcdc //K_40_59
-.long 0xca62c1d6,0xca62c1d6,0xca62c1d6,0xca62c1d6 //K_60_79
-.byte 83,72,65,49,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.align 2
-.align 2
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
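
The sha1_block_data_order_hw routine deleted above drives the Armv8 SHA-1 extension through hand-encoded .long opcodes (sha1h, sha1c, sha1p, sha1m, sha1su0, sha1su1) so that it assembles even with toolchains that lack the mnemonics. For orientation only, below is a minimal C sketch of the same per-block pattern using the standard ACLE intrinsics from <arm_neon.h>; the function name and single-block interface are illustrative assumptions, not the BoringSSL implementation, and the unrolled assembly above is the authoritative code. Each loop iteration corresponds to one group of four rounds, mirroring the interleaved sha1h/sha1c/sha1su0/sha1su1 sequence above.

/* Illustrative sketch only (not BoringSSL code): one SHA-1 block processed
 * with the Armv8 crypto intrinsics corresponding to the sha1h/sha1c/sha1p/
 * sha1m/sha1su0/sha1su1 encodings used above.
 * Assumes a toolchain with <arm_neon.h> and e.g. -march=armv8-a+crypto. */
#include <arm_neon.h>
#include <stdint.h>

void sha1_block_hw_sketch(uint32_t state[5], const uint8_t block[64]) {
  /* Round constants, as in Lconst above. */
  const uint32_t K[4] = {0x5a827999u, 0x6ed9eba1u, 0x8f1bbcdcu, 0xca62c1d6u};
  uint32x4_t abcd = vld1q_u32(state);  /* a,b,c,d */
  uint32_t e = state[4];

  /* Load and byte-swap the 16 message words (the rev32 step above). */
  uint32x4_t w[4];
  for (int i = 0; i < 4; i++) {
    w[i] = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(block + 16 * i)));
  }

  uint32x4_t abcd0 = abcd;  /* "offload", like v22 above */
  uint32_t e0 = e;

  /* 20 quad-rounds: sha1c for rounds 0-19, sha1p for 20-39 and 60-79,
   * sha1m for 40-59, with sha1su0/sha1su1 extending the schedule. */
  for (int r = 0; r < 20; r++) {
    uint32x4_t wk = vaddq_u32(w[r & 3], vdupq_n_u32(K[r / 5]));
    uint32_t e_next = vsha1h_u32(vgetq_lane_u32(abcd, 0));
    if (r < 5) {
      abcd = vsha1cq_u32(abcd, e, wk);
    } else if (r < 10 || r >= 15) {
      abcd = vsha1pq_u32(abcd, e, wk);
    } else {
      abcd = vsha1mq_u32(abcd, e, wk);
    }
    e = e_next;
    if (r < 16) {  /* schedule update feeding quad-round r+4 */
      w[r & 3] = vsha1su1q_u32(
          vsha1su0q_u32(w[r & 3], w[(r + 1) & 3], w[(r + 2) & 3]),
          w[(r + 3) & 3]);
    }
  }

  vst1q_u32(state, vaddq_u32(abcd, abcd0));
  state[4] = e + e0;
}
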
diff --git a/win-aarch64/crypto/fipsmodule/sha256-armv8-win.S b/win-aarch64/crypto/fipsmodule/sha256-armv8-win.S
deleted file mode 100644
index 89d3944a..00000000
--- a/win-aarch64/crypto/fipsmodule/sha256-armv8-win.S
+++ /dev/null
@@ -1,1197 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-// Copyright 2014-2020 The OpenSSL Project Authors. All Rights Reserved.
-//
-// Licensed under the OpenSSL license (the "License"). You may not use
-// this file except in compliance with the License. You can obtain a copy
-// in the file LICENSE in the source distribution or at
-// https://www.openssl.org/source/license.html
-
-// ====================================================================
-// Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
-// project. The module is, however, dual licensed under OpenSSL and
-// CRYPTOGAMS licenses depending on where you obtain it. For further
-// details see http://www.openssl.org/~appro/cryptogams/.
-//
-// Permission to use under GPLv2 terms is granted.
-// ====================================================================
-//
-// SHA256/512 for ARMv8.
-//
-// Performance in cycles per processed byte and improvement coefficient
-// over code generated with "default" compiler:
-//
-// SHA256-hw SHA256(*) SHA512
-// Apple A7 1.97 10.5 (+33%) 6.73 (-1%(**))
-// Cortex-A53 2.38 15.5 (+115%) 10.0 (+150%(***))
-// Cortex-A57 2.31 11.6 (+86%) 7.51 (+260%(***))
-// Denver 2.01 10.5 (+26%) 6.70 (+8%)
-// X-Gene 20.0 (+100%) 12.8 (+300%(***))
-// Mongoose 2.36 13.0 (+50%) 8.36 (+33%)
-// Kryo 1.92 17.4 (+30%) 11.2 (+8%)
-//
-// (*) Software SHA256 results are of lesser relevance, presented
-// mostly for informational purposes.
-// (**) The result is a trade-off: it's possible to improve it by
-// 10% (or by 1 cycle per round), but at the cost of 20% loss
-// on Cortex-A53 (or by 4 cycles per round).
-// (***) Super-impressive coefficients over gcc-generated code are
-// indication of some compiler "pathology", most notably code
-// generated with -mgeneral-regs-only is significantly faster
-// and the gap is only 40-90%.
-
-#ifndef __KERNEL__
-# include <openssl/arm_arch.h>
-#endif
-
-.text
-
-.globl sha256_block_data_order_nohw
-
-.def sha256_block_data_order_nohw
- .type 32
-.endef
-.align 6
-sha256_block_data_order_nohw:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-128]!
- add x29,sp,#0
-
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- stp x27,x28,[sp,#80]
- sub sp,sp,#4*4
-
- ldp w20,w21,[x0] // load context
- ldp w22,w23,[x0,#2*4]
- ldp w24,w25,[x0,#4*4]
- add x2,x1,x2,lsl#6 // end of input
- ldp w26,w27,[x0,#6*4]
- adrp x30,LK256
- add x30,x30,:lo12:LK256
- stp x0,x2,[x29,#96]
-
-Loop:
- ldp w3,w4,[x1],#2*4
- ldr w19,[x30],#4 // *K++
- eor w28,w21,w22 // magic seed
- str x1,[x29,#112]
-#ifndef __AARCH64EB__
- rev w3,w3 // 0
-#endif
- ror w16,w24,#6
- add w27,w27,w19 // h+=K[i]
- eor w6,w24,w24,ror#14
- and w17,w25,w24
- bic w19,w26,w24
- add w27,w27,w3 // h+=X[i]
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w20,w21 // a^b, b^c in next round
- eor w16,w16,w6,ror#11 // Sigma1(e)
- ror w6,w20,#2
- add w27,w27,w17 // h+=Ch(e,f,g)
- eor w17,w20,w20,ror#9
- add w27,w27,w16 // h+=Sigma1(e)
- and w28,w28,w19 // (b^c)&=(a^b)
- add w23,w23,w27 // d+=h
- eor w28,w28,w21 // Maj(a,b,c)
- eor w17,w6,w17,ror#13 // Sigma0(a)
- add w27,w27,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- //add w27,w27,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w4,w4 // 1
-#endif
- ldp w5,w6,[x1],#2*4
- add w27,w27,w17 // h+=Sigma0(a)
- ror w16,w23,#6
- add w26,w26,w28 // h+=K[i]
- eor w7,w23,w23,ror#14
- and w17,w24,w23
- bic w28,w25,w23
- add w26,w26,w4 // h+=X[i]
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w27,w20 // a^b, b^c in next round
- eor w16,w16,w7,ror#11 // Sigma1(e)
- ror w7,w27,#2
- add w26,w26,w17 // h+=Ch(e,f,g)
- eor w17,w27,w27,ror#9
- add w26,w26,w16 // h+=Sigma1(e)
- and w19,w19,w28 // (b^c)&=(a^b)
- add w22,w22,w26 // d+=h
- eor w19,w19,w20 // Maj(a,b,c)
- eor w17,w7,w17,ror#13 // Sigma0(a)
- add w26,w26,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- //add w26,w26,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w5,w5 // 2
-#endif
- add w26,w26,w17 // h+=Sigma0(a)
- ror w16,w22,#6
- add w25,w25,w19 // h+=K[i]
- eor w8,w22,w22,ror#14
- and w17,w23,w22
- bic w19,w24,w22
- add w25,w25,w5 // h+=X[i]
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w26,w27 // a^b, b^c in next round
- eor w16,w16,w8,ror#11 // Sigma1(e)
- ror w8,w26,#2
- add w25,w25,w17 // h+=Ch(e,f,g)
- eor w17,w26,w26,ror#9
- add w25,w25,w16 // h+=Sigma1(e)
- and w28,w28,w19 // (b^c)&=(a^b)
- add w21,w21,w25 // d+=h
- eor w28,w28,w27 // Maj(a,b,c)
- eor w17,w8,w17,ror#13 // Sigma0(a)
- add w25,w25,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- //add w25,w25,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w6,w6 // 3
-#endif
- ldp w7,w8,[x1],#2*4
- add w25,w25,w17 // h+=Sigma0(a)
- ror w16,w21,#6
- add w24,w24,w28 // h+=K[i]
- eor w9,w21,w21,ror#14
- and w17,w22,w21
- bic w28,w23,w21
- add w24,w24,w6 // h+=X[i]
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w25,w26 // a^b, b^c in next round
- eor w16,w16,w9,ror#11 // Sigma1(e)
- ror w9,w25,#2
- add w24,w24,w17 // h+=Ch(e,f,g)
- eor w17,w25,w25,ror#9
- add w24,w24,w16 // h+=Sigma1(e)
- and w19,w19,w28 // (b^c)&=(a^b)
- add w20,w20,w24 // d+=h
- eor w19,w19,w26 // Maj(a,b,c)
- eor w17,w9,w17,ror#13 // Sigma0(a)
- add w24,w24,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- //add w24,w24,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w7,w7 // 4
-#endif
- add w24,w24,w17 // h+=Sigma0(a)
- ror w16,w20,#6
- add w23,w23,w19 // h+=K[i]
- eor w10,w20,w20,ror#14
- and w17,w21,w20
- bic w19,w22,w20
- add w23,w23,w7 // h+=X[i]
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w24,w25 // a^b, b^c in next round
- eor w16,w16,w10,ror#11 // Sigma1(e)
- ror w10,w24,#2
- add w23,w23,w17 // h+=Ch(e,f,g)
- eor w17,w24,w24,ror#9
- add w23,w23,w16 // h+=Sigma1(e)
- and w28,w28,w19 // (b^c)&=(a^b)
- add w27,w27,w23 // d+=h
- eor w28,w28,w25 // Maj(a,b,c)
- eor w17,w10,w17,ror#13 // Sigma0(a)
- add w23,w23,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- //add w23,w23,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w8,w8 // 5
-#endif
- ldp w9,w10,[x1],#2*4
- add w23,w23,w17 // h+=Sigma0(a)
- ror w16,w27,#6
- add w22,w22,w28 // h+=K[i]
- eor w11,w27,w27,ror#14
- and w17,w20,w27
- bic w28,w21,w27
- add w22,w22,w8 // h+=X[i]
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w23,w24 // a^b, b^c in next round
- eor w16,w16,w11,ror#11 // Sigma1(e)
- ror w11,w23,#2
- add w22,w22,w17 // h+=Ch(e,f,g)
- eor w17,w23,w23,ror#9
- add w22,w22,w16 // h+=Sigma1(e)
- and w19,w19,w28 // (b^c)&=(a^b)
- add w26,w26,w22 // d+=h
- eor w19,w19,w24 // Maj(a,b,c)
- eor w17,w11,w17,ror#13 // Sigma0(a)
- add w22,w22,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- //add w22,w22,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w9,w9 // 6
-#endif
- add w22,w22,w17 // h+=Sigma0(a)
- ror w16,w26,#6
- add w21,w21,w19 // h+=K[i]
- eor w12,w26,w26,ror#14
- and w17,w27,w26
- bic w19,w20,w26
- add w21,w21,w9 // h+=X[i]
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w22,w23 // a^b, b^c in next round
- eor w16,w16,w12,ror#11 // Sigma1(e)
- ror w12,w22,#2
- add w21,w21,w17 // h+=Ch(e,f,g)
- eor w17,w22,w22,ror#9
- add w21,w21,w16 // h+=Sigma1(e)
- and w28,w28,w19 // (b^c)&=(a^b)
- add w25,w25,w21 // d+=h
- eor w28,w28,w23 // Maj(a,b,c)
- eor w17,w12,w17,ror#13 // Sigma0(a)
- add w21,w21,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- //add w21,w21,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w10,w10 // 7
-#endif
- ldp w11,w12,[x1],#2*4
- add w21,w21,w17 // h+=Sigma0(a)
- ror w16,w25,#6
- add w20,w20,w28 // h+=K[i]
- eor w13,w25,w25,ror#14
- and w17,w26,w25
- bic w28,w27,w25
- add w20,w20,w10 // h+=X[i]
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w21,w22 // a^b, b^c in next round
- eor w16,w16,w13,ror#11 // Sigma1(e)
- ror w13,w21,#2
- add w20,w20,w17 // h+=Ch(e,f,g)
- eor w17,w21,w21,ror#9
- add w20,w20,w16 // h+=Sigma1(e)
- and w19,w19,w28 // (b^c)&=(a^b)
- add w24,w24,w20 // d+=h
- eor w19,w19,w22 // Maj(a,b,c)
- eor w17,w13,w17,ror#13 // Sigma0(a)
- add w20,w20,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- //add w20,w20,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w11,w11 // 8
-#endif
- add w20,w20,w17 // h+=Sigma0(a)
- ror w16,w24,#6
- add w27,w27,w19 // h+=K[i]
- eor w14,w24,w24,ror#14
- and w17,w25,w24
- bic w19,w26,w24
- add w27,w27,w11 // h+=X[i]
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w20,w21 // a^b, b^c in next round
- eor w16,w16,w14,ror#11 // Sigma1(e)
- ror w14,w20,#2
- add w27,w27,w17 // h+=Ch(e,f,g)
- eor w17,w20,w20,ror#9
- add w27,w27,w16 // h+=Sigma1(e)
- and w28,w28,w19 // (b^c)&=(a^b)
- add w23,w23,w27 // d+=h
- eor w28,w28,w21 // Maj(a,b,c)
- eor w17,w14,w17,ror#13 // Sigma0(a)
- add w27,w27,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- //add w27,w27,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w12,w12 // 9
-#endif
- ldp w13,w14,[x1],#2*4
- add w27,w27,w17 // h+=Sigma0(a)
- ror w16,w23,#6
- add w26,w26,w28 // h+=K[i]
- eor w15,w23,w23,ror#14
- and w17,w24,w23
- bic w28,w25,w23
- add w26,w26,w12 // h+=X[i]
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w27,w20 // a^b, b^c in next round
- eor w16,w16,w15,ror#11 // Sigma1(e)
- ror w15,w27,#2
- add w26,w26,w17 // h+=Ch(e,f,g)
- eor w17,w27,w27,ror#9
- add w26,w26,w16 // h+=Sigma1(e)
- and w19,w19,w28 // (b^c)&=(a^b)
- add w22,w22,w26 // d+=h
- eor w19,w19,w20 // Maj(a,b,c)
- eor w17,w15,w17,ror#13 // Sigma0(a)
- add w26,w26,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- //add w26,w26,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w13,w13 // 10
-#endif
- add w26,w26,w17 // h+=Sigma0(a)
- ror w16,w22,#6
- add w25,w25,w19 // h+=K[i]
- eor w0,w22,w22,ror#14
- and w17,w23,w22
- bic w19,w24,w22
- add w25,w25,w13 // h+=X[i]
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w26,w27 // a^b, b^c in next round
- eor w16,w16,w0,ror#11 // Sigma1(e)
- ror w0,w26,#2
- add w25,w25,w17 // h+=Ch(e,f,g)
- eor w17,w26,w26,ror#9
- add w25,w25,w16 // h+=Sigma1(e)
- and w28,w28,w19 // (b^c)&=(a^b)
- add w21,w21,w25 // d+=h
- eor w28,w28,w27 // Maj(a,b,c)
- eor w17,w0,w17,ror#13 // Sigma0(a)
- add w25,w25,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- //add w25,w25,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w14,w14 // 11
-#endif
- ldp w15,w0,[x1],#2*4
- add w25,w25,w17 // h+=Sigma0(a)
- str w6,[sp,#12]
- ror w16,w21,#6
- add w24,w24,w28 // h+=K[i]
- eor w6,w21,w21,ror#14
- and w17,w22,w21
- bic w28,w23,w21
- add w24,w24,w14 // h+=X[i]
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w25,w26 // a^b, b^c in next round
- eor w16,w16,w6,ror#11 // Sigma1(e)
- ror w6,w25,#2
- add w24,w24,w17 // h+=Ch(e,f,g)
- eor w17,w25,w25,ror#9
- add w24,w24,w16 // h+=Sigma1(e)
- and w19,w19,w28 // (b^c)&=(a^b)
- add w20,w20,w24 // d+=h
- eor w19,w19,w26 // Maj(a,b,c)
- eor w17,w6,w17,ror#13 // Sigma0(a)
- add w24,w24,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- //add w24,w24,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w15,w15 // 12
-#endif
- add w24,w24,w17 // h+=Sigma0(a)
- str w7,[sp,#0]
- ror w16,w20,#6
- add w23,w23,w19 // h+=K[i]
- eor w7,w20,w20,ror#14
- and w17,w21,w20
- bic w19,w22,w20
- add w23,w23,w15 // h+=X[i]
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w24,w25 // a^b, b^c in next round
- eor w16,w16,w7,ror#11 // Sigma1(e)
- ror w7,w24,#2
- add w23,w23,w17 // h+=Ch(e,f,g)
- eor w17,w24,w24,ror#9
- add w23,w23,w16 // h+=Sigma1(e)
- and w28,w28,w19 // (b^c)&=(a^b)
- add w27,w27,w23 // d+=h
- eor w28,w28,w25 // Maj(a,b,c)
- eor w17,w7,w17,ror#13 // Sigma0(a)
- add w23,w23,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- //add w23,w23,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w0,w0 // 13
-#endif
- ldp w1,w2,[x1]
- add w23,w23,w17 // h+=Sigma0(a)
- str w8,[sp,#4]
- ror w16,w27,#6
- add w22,w22,w28 // h+=K[i]
- eor w8,w27,w27,ror#14
- and w17,w20,w27
- bic w28,w21,w27
- add w22,w22,w0 // h+=X[i]
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w23,w24 // a^b, b^c in next round
- eor w16,w16,w8,ror#11 // Sigma1(e)
- ror w8,w23,#2
- add w22,w22,w17 // h+=Ch(e,f,g)
- eor w17,w23,w23,ror#9
- add w22,w22,w16 // h+=Sigma1(e)
- and w19,w19,w28 // (b^c)&=(a^b)
- add w26,w26,w22 // d+=h
- eor w19,w19,w24 // Maj(a,b,c)
- eor w17,w8,w17,ror#13 // Sigma0(a)
- add w22,w22,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- //add w22,w22,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w1,w1 // 14
-#endif
- ldr w6,[sp,#12]
- add w22,w22,w17 // h+=Sigma0(a)
- str w9,[sp,#8]
- ror w16,w26,#6
- add w21,w21,w19 // h+=K[i]
- eor w9,w26,w26,ror#14
- and w17,w27,w26
- bic w19,w20,w26
- add w21,w21,w1 // h+=X[i]
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w22,w23 // a^b, b^c in next round
- eor w16,w16,w9,ror#11 // Sigma1(e)
- ror w9,w22,#2
- add w21,w21,w17 // h+=Ch(e,f,g)
- eor w17,w22,w22,ror#9
- add w21,w21,w16 // h+=Sigma1(e)
- and w28,w28,w19 // (b^c)&=(a^b)
- add w25,w25,w21 // d+=h
- eor w28,w28,w23 // Maj(a,b,c)
- eor w17,w9,w17,ror#13 // Sigma0(a)
- add w21,w21,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- //add w21,w21,w17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev w2,w2 // 15
-#endif
- ldr w7,[sp,#0]
- add w21,w21,w17 // h+=Sigma0(a)
- str w10,[sp,#12]
- ror w16,w25,#6
- add w20,w20,w28 // h+=K[i]
- ror w9,w4,#7
- and w17,w26,w25
- ror w8,w1,#17
- bic w28,w27,w25
- ror w10,w21,#2
- add w20,w20,w2 // h+=X[i]
- eor w16,w16,w25,ror#11
- eor w9,w9,w4,ror#18
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w21,w22 // a^b, b^c in next round
- eor w16,w16,w25,ror#25 // Sigma1(e)
- eor w10,w10,w21,ror#13
- add w20,w20,w17 // h+=Ch(e,f,g)
- and w19,w19,w28 // (b^c)&=(a^b)
- eor w8,w8,w1,ror#19
- eor w9,w9,w4,lsr#3 // sigma0(X[i+1])
- add w20,w20,w16 // h+=Sigma1(e)
- eor w19,w19,w22 // Maj(a,b,c)
- eor w17,w10,w21,ror#22 // Sigma0(a)
- eor w8,w8,w1,lsr#10 // sigma1(X[i+14])
- add w3,w3,w12
- add w24,w24,w20 // d+=h
- add w20,w20,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- add w3,w3,w9
- add w20,w20,w17 // h+=Sigma0(a)
- add w3,w3,w8
-Loop_16_xx:
- ldr w8,[sp,#4]
- str w11,[sp,#0]
- ror w16,w24,#6
- add w27,w27,w19 // h+=K[i]
- ror w10,w5,#7
- and w17,w25,w24
- ror w9,w2,#17
- bic w19,w26,w24
- ror w11,w20,#2
- add w27,w27,w3 // h+=X[i]
- eor w16,w16,w24,ror#11
- eor w10,w10,w5,ror#18
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w20,w21 // a^b, b^c in next round
- eor w16,w16,w24,ror#25 // Sigma1(e)
- eor w11,w11,w20,ror#13
- add w27,w27,w17 // h+=Ch(e,f,g)
- and w28,w28,w19 // (b^c)&=(a^b)
- eor w9,w9,w2,ror#19
- eor w10,w10,w5,lsr#3 // sigma0(X[i+1])
- add w27,w27,w16 // h+=Sigma1(e)
- eor w28,w28,w21 // Maj(a,b,c)
- eor w17,w11,w20,ror#22 // Sigma0(a)
- eor w9,w9,w2,lsr#10 // sigma1(X[i+14])
- add w4,w4,w13
- add w23,w23,w27 // d+=h
- add w27,w27,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- add w4,w4,w10
- add w27,w27,w17 // h+=Sigma0(a)
- add w4,w4,w9
- ldr w9,[sp,#8]
- str w12,[sp,#4]
- ror w16,w23,#6
- add w26,w26,w28 // h+=K[i]
- ror w11,w6,#7
- and w17,w24,w23
- ror w10,w3,#17
- bic w28,w25,w23
- ror w12,w27,#2
- add w26,w26,w4 // h+=X[i]
- eor w16,w16,w23,ror#11
- eor w11,w11,w6,ror#18
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w27,w20 // a^b, b^c in next round
- eor w16,w16,w23,ror#25 // Sigma1(e)
- eor w12,w12,w27,ror#13
- add w26,w26,w17 // h+=Ch(e,f,g)
- and w19,w19,w28 // (b^c)&=(a^b)
- eor w10,w10,w3,ror#19
- eor w11,w11,w6,lsr#3 // sigma0(X[i+1])
- add w26,w26,w16 // h+=Sigma1(e)
- eor w19,w19,w20 // Maj(a,b,c)
- eor w17,w12,w27,ror#22 // Sigma0(a)
- eor w10,w10,w3,lsr#10 // sigma1(X[i+14])
- add w5,w5,w14
- add w22,w22,w26 // d+=h
- add w26,w26,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- add w5,w5,w11
- add w26,w26,w17 // h+=Sigma0(a)
- add w5,w5,w10
- ldr w10,[sp,#12]
- str w13,[sp,#8]
- ror w16,w22,#6
- add w25,w25,w19 // h+=K[i]
- ror w12,w7,#7
- and w17,w23,w22
- ror w11,w4,#17
- bic w19,w24,w22
- ror w13,w26,#2
- add w25,w25,w5 // h+=X[i]
- eor w16,w16,w22,ror#11
- eor w12,w12,w7,ror#18
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w26,w27 // a^b, b^c in next round
- eor w16,w16,w22,ror#25 // Sigma1(e)
- eor w13,w13,w26,ror#13
- add w25,w25,w17 // h+=Ch(e,f,g)
- and w28,w28,w19 // (b^c)&=(a^b)
- eor w11,w11,w4,ror#19
- eor w12,w12,w7,lsr#3 // sigma0(X[i+1])
- add w25,w25,w16 // h+=Sigma1(e)
- eor w28,w28,w27 // Maj(a,b,c)
- eor w17,w13,w26,ror#22 // Sigma0(a)
- eor w11,w11,w4,lsr#10 // sigma1(X[i+14])
- add w6,w6,w15
- add w21,w21,w25 // d+=h
- add w25,w25,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- add w6,w6,w12
- add w25,w25,w17 // h+=Sigma0(a)
- add w6,w6,w11
- ldr w11,[sp,#0]
- str w14,[sp,#12]
- ror w16,w21,#6
- add w24,w24,w28 // h+=K[i]
- ror w13,w8,#7
- and w17,w22,w21
- ror w12,w5,#17
- bic w28,w23,w21
- ror w14,w25,#2
- add w24,w24,w6 // h+=X[i]
- eor w16,w16,w21,ror#11
- eor w13,w13,w8,ror#18
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w25,w26 // a^b, b^c in next round
- eor w16,w16,w21,ror#25 // Sigma1(e)
- eor w14,w14,w25,ror#13
- add w24,w24,w17 // h+=Ch(e,f,g)
- and w19,w19,w28 // (b^c)&=(a^b)
- eor w12,w12,w5,ror#19
- eor w13,w13,w8,lsr#3 // sigma0(X[i+1])
- add w24,w24,w16 // h+=Sigma1(e)
- eor w19,w19,w26 // Maj(a,b,c)
- eor w17,w14,w25,ror#22 // Sigma0(a)
- eor w12,w12,w5,lsr#10 // sigma1(X[i+14])
- add w7,w7,w0
- add w20,w20,w24 // d+=h
- add w24,w24,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- add w7,w7,w13
- add w24,w24,w17 // h+=Sigma0(a)
- add w7,w7,w12
- ldr w12,[sp,#4]
- str w15,[sp,#0]
- ror w16,w20,#6
- add w23,w23,w19 // h+=K[i]
- ror w14,w9,#7
- and w17,w21,w20
- ror w13,w6,#17
- bic w19,w22,w20
- ror w15,w24,#2
- add w23,w23,w7 // h+=X[i]
- eor w16,w16,w20,ror#11
- eor w14,w14,w9,ror#18
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w24,w25 // a^b, b^c in next round
- eor w16,w16,w20,ror#25 // Sigma1(e)
- eor w15,w15,w24,ror#13
- add w23,w23,w17 // h+=Ch(e,f,g)
- and w28,w28,w19 // (b^c)&=(a^b)
- eor w13,w13,w6,ror#19
- eor w14,w14,w9,lsr#3 // sigma0(X[i+1])
- add w23,w23,w16 // h+=Sigma1(e)
- eor w28,w28,w25 // Maj(a,b,c)
- eor w17,w15,w24,ror#22 // Sigma0(a)
- eor w13,w13,w6,lsr#10 // sigma1(X[i+14])
- add w8,w8,w1
- add w27,w27,w23 // d+=h
- add w23,w23,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- add w8,w8,w14
- add w23,w23,w17 // h+=Sigma0(a)
- add w8,w8,w13
- ldr w13,[sp,#8]
- str w0,[sp,#4]
- ror w16,w27,#6
- add w22,w22,w28 // h+=K[i]
- ror w15,w10,#7
- and w17,w20,w27
- ror w14,w7,#17
- bic w28,w21,w27
- ror w0,w23,#2
- add w22,w22,w8 // h+=X[i]
- eor w16,w16,w27,ror#11
- eor w15,w15,w10,ror#18
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w23,w24 // a^b, b^c in next round
- eor w16,w16,w27,ror#25 // Sigma1(e)
- eor w0,w0,w23,ror#13
- add w22,w22,w17 // h+=Ch(e,f,g)
- and w19,w19,w28 // (b^c)&=(a^b)
- eor w14,w14,w7,ror#19
- eor w15,w15,w10,lsr#3 // sigma0(X[i+1])
- add w22,w22,w16 // h+=Sigma1(e)
- eor w19,w19,w24 // Maj(a,b,c)
- eor w17,w0,w23,ror#22 // Sigma0(a)
- eor w14,w14,w7,lsr#10 // sigma1(X[i+14])
- add w9,w9,w2
- add w26,w26,w22 // d+=h
- add w22,w22,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- add w9,w9,w15
- add w22,w22,w17 // h+=Sigma0(a)
- add w9,w9,w14
- ldr w14,[sp,#12]
- str w1,[sp,#8]
- ror w16,w26,#6
- add w21,w21,w19 // h+=K[i]
- ror w0,w11,#7
- and w17,w27,w26
- ror w15,w8,#17
- bic w19,w20,w26
- ror w1,w22,#2
- add w21,w21,w9 // h+=X[i]
- eor w16,w16,w26,ror#11
- eor w0,w0,w11,ror#18
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w22,w23 // a^b, b^c in next round
- eor w16,w16,w26,ror#25 // Sigma1(e)
- eor w1,w1,w22,ror#13
- add w21,w21,w17 // h+=Ch(e,f,g)
- and w28,w28,w19 // (b^c)&=(a^b)
- eor w15,w15,w8,ror#19
- eor w0,w0,w11,lsr#3 // sigma0(X[i+1])
- add w21,w21,w16 // h+=Sigma1(e)
- eor w28,w28,w23 // Maj(a,b,c)
- eor w17,w1,w22,ror#22 // Sigma0(a)
- eor w15,w15,w8,lsr#10 // sigma1(X[i+14])
- add w10,w10,w3
- add w25,w25,w21 // d+=h
- add w21,w21,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- add w10,w10,w0
- add w21,w21,w17 // h+=Sigma0(a)
- add w10,w10,w15
- ldr w15,[sp,#0]
- str w2,[sp,#12]
- ror w16,w25,#6
- add w20,w20,w28 // h+=K[i]
- ror w1,w12,#7
- and w17,w26,w25
- ror w0,w9,#17
- bic w28,w27,w25
- ror w2,w21,#2
- add w20,w20,w10 // h+=X[i]
- eor w16,w16,w25,ror#11
- eor w1,w1,w12,ror#18
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w21,w22 // a^b, b^c in next round
- eor w16,w16,w25,ror#25 // Sigma1(e)
- eor w2,w2,w21,ror#13
- add w20,w20,w17 // h+=Ch(e,f,g)
- and w19,w19,w28 // (b^c)&=(a^b)
- eor w0,w0,w9,ror#19
- eor w1,w1,w12,lsr#3 // sigma0(X[i+1])
- add w20,w20,w16 // h+=Sigma1(e)
- eor w19,w19,w22 // Maj(a,b,c)
- eor w17,w2,w21,ror#22 // Sigma0(a)
- eor w0,w0,w9,lsr#10 // sigma1(X[i+14])
- add w11,w11,w4
- add w24,w24,w20 // d+=h
- add w20,w20,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- add w11,w11,w1
- add w20,w20,w17 // h+=Sigma0(a)
- add w11,w11,w0
- ldr w0,[sp,#4]
- str w3,[sp,#0]
- ror w16,w24,#6
- add w27,w27,w19 // h+=K[i]
- ror w2,w13,#7
- and w17,w25,w24
- ror w1,w10,#17
- bic w19,w26,w24
- ror w3,w20,#2
- add w27,w27,w11 // h+=X[i]
- eor w16,w16,w24,ror#11
- eor w2,w2,w13,ror#18
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w20,w21 // a^b, b^c in next round
- eor w16,w16,w24,ror#25 // Sigma1(e)
- eor w3,w3,w20,ror#13
- add w27,w27,w17 // h+=Ch(e,f,g)
- and w28,w28,w19 // (b^c)&=(a^b)
- eor w1,w1,w10,ror#19
- eor w2,w2,w13,lsr#3 // sigma0(X[i+1])
- add w27,w27,w16 // h+=Sigma1(e)
- eor w28,w28,w21 // Maj(a,b,c)
- eor w17,w3,w20,ror#22 // Sigma0(a)
- eor w1,w1,w10,lsr#10 // sigma1(X[i+14])
- add w12,w12,w5
- add w23,w23,w27 // d+=h
- add w27,w27,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- add w12,w12,w2
- add w27,w27,w17 // h+=Sigma0(a)
- add w12,w12,w1
- ldr w1,[sp,#8]
- str w4,[sp,#4]
- ror w16,w23,#6
- add w26,w26,w28 // h+=K[i]
- ror w3,w14,#7
- and w17,w24,w23
- ror w2,w11,#17
- bic w28,w25,w23
- ror w4,w27,#2
- add w26,w26,w12 // h+=X[i]
- eor w16,w16,w23,ror#11
- eor w3,w3,w14,ror#18
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w27,w20 // a^b, b^c in next round
- eor w16,w16,w23,ror#25 // Sigma1(e)
- eor w4,w4,w27,ror#13
- add w26,w26,w17 // h+=Ch(e,f,g)
- and w19,w19,w28 // (b^c)&=(a^b)
- eor w2,w2,w11,ror#19
- eor w3,w3,w14,lsr#3 // sigma0(X[i+1])
- add w26,w26,w16 // h+=Sigma1(e)
- eor w19,w19,w20 // Maj(a,b,c)
- eor w17,w4,w27,ror#22 // Sigma0(a)
- eor w2,w2,w11,lsr#10 // sigma1(X[i+14])
- add w13,w13,w6
- add w22,w22,w26 // d+=h
- add w26,w26,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- add w13,w13,w3
- add w26,w26,w17 // h+=Sigma0(a)
- add w13,w13,w2
- ldr w2,[sp,#12]
- str w5,[sp,#8]
- ror w16,w22,#6
- add w25,w25,w19 // h+=K[i]
- ror w4,w15,#7
- and w17,w23,w22
- ror w3,w12,#17
- bic w19,w24,w22
- ror w5,w26,#2
- add w25,w25,w13 // h+=X[i]
- eor w16,w16,w22,ror#11
- eor w4,w4,w15,ror#18
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w26,w27 // a^b, b^c in next round
- eor w16,w16,w22,ror#25 // Sigma1(e)
- eor w5,w5,w26,ror#13
- add w25,w25,w17 // h+=Ch(e,f,g)
- and w28,w28,w19 // (b^c)&=(a^b)
- eor w3,w3,w12,ror#19
- eor w4,w4,w15,lsr#3 // sigma0(X[i+1])
- add w25,w25,w16 // h+=Sigma1(e)
- eor w28,w28,w27 // Maj(a,b,c)
- eor w17,w5,w26,ror#22 // Sigma0(a)
- eor w3,w3,w12,lsr#10 // sigma1(X[i+14])
- add w14,w14,w7
- add w21,w21,w25 // d+=h
- add w25,w25,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- add w14,w14,w4
- add w25,w25,w17 // h+=Sigma0(a)
- add w14,w14,w3
- ldr w3,[sp,#0]
- str w6,[sp,#12]
- ror w16,w21,#6
- add w24,w24,w28 // h+=K[i]
- ror w5,w0,#7
- and w17,w22,w21
- ror w4,w13,#17
- bic w28,w23,w21
- ror w6,w25,#2
- add w24,w24,w14 // h+=X[i]
- eor w16,w16,w21,ror#11
- eor w5,w5,w0,ror#18
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w25,w26 // a^b, b^c in next round
- eor w16,w16,w21,ror#25 // Sigma1(e)
- eor w6,w6,w25,ror#13
- add w24,w24,w17 // h+=Ch(e,f,g)
- and w19,w19,w28 // (b^c)&=(a^b)
- eor w4,w4,w13,ror#19
- eor w5,w5,w0,lsr#3 // sigma0(X[i+1])
- add w24,w24,w16 // h+=Sigma1(e)
- eor w19,w19,w26 // Maj(a,b,c)
- eor w17,w6,w25,ror#22 // Sigma0(a)
- eor w4,w4,w13,lsr#10 // sigma1(X[i+14])
- add w15,w15,w8
- add w20,w20,w24 // d+=h
- add w24,w24,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- add w15,w15,w5
- add w24,w24,w17 // h+=Sigma0(a)
- add w15,w15,w4
- ldr w4,[sp,#4]
- str w7,[sp,#0]
- ror w16,w20,#6
- add w23,w23,w19 // h+=K[i]
- ror w6,w1,#7
- and w17,w21,w20
- ror w5,w14,#17
- bic w19,w22,w20
- ror w7,w24,#2
- add w23,w23,w15 // h+=X[i]
- eor w16,w16,w20,ror#11
- eor w6,w6,w1,ror#18
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w24,w25 // a^b, b^c in next round
- eor w16,w16,w20,ror#25 // Sigma1(e)
- eor w7,w7,w24,ror#13
- add w23,w23,w17 // h+=Ch(e,f,g)
- and w28,w28,w19 // (b^c)&=(a^b)
- eor w5,w5,w14,ror#19
- eor w6,w6,w1,lsr#3 // sigma0(X[i+1])
- add w23,w23,w16 // h+=Sigma1(e)
- eor w28,w28,w25 // Maj(a,b,c)
- eor w17,w7,w24,ror#22 // Sigma0(a)
- eor w5,w5,w14,lsr#10 // sigma1(X[i+14])
- add w0,w0,w9
- add w27,w27,w23 // d+=h
- add w23,w23,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- add w0,w0,w6
- add w23,w23,w17 // h+=Sigma0(a)
- add w0,w0,w5
- ldr w5,[sp,#8]
- str w8,[sp,#4]
- ror w16,w27,#6
- add w22,w22,w28 // h+=K[i]
- ror w7,w2,#7
- and w17,w20,w27
- ror w6,w15,#17
- bic w28,w21,w27
- ror w8,w23,#2
- add w22,w22,w0 // h+=X[i]
- eor w16,w16,w27,ror#11
- eor w7,w7,w2,ror#18
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w23,w24 // a^b, b^c in next round
- eor w16,w16,w27,ror#25 // Sigma1(e)
- eor w8,w8,w23,ror#13
- add w22,w22,w17 // h+=Ch(e,f,g)
- and w19,w19,w28 // (b^c)&=(a^b)
- eor w6,w6,w15,ror#19
- eor w7,w7,w2,lsr#3 // sigma0(X[i+1])
- add w22,w22,w16 // h+=Sigma1(e)
- eor w19,w19,w24 // Maj(a,b,c)
- eor w17,w8,w23,ror#22 // Sigma0(a)
- eor w6,w6,w15,lsr#10 // sigma1(X[i+14])
- add w1,w1,w10
- add w26,w26,w22 // d+=h
- add w22,w22,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- add w1,w1,w7
- add w22,w22,w17 // h+=Sigma0(a)
- add w1,w1,w6
- ldr w6,[sp,#12]
- str w9,[sp,#8]
- ror w16,w26,#6
- add w21,w21,w19 // h+=K[i]
- ror w8,w3,#7
- and w17,w27,w26
- ror w7,w0,#17
- bic w19,w20,w26
- ror w9,w22,#2
- add w21,w21,w1 // h+=X[i]
- eor w16,w16,w26,ror#11
- eor w8,w8,w3,ror#18
- orr w17,w17,w19 // Ch(e,f,g)
- eor w19,w22,w23 // a^b, b^c in next round
- eor w16,w16,w26,ror#25 // Sigma1(e)
- eor w9,w9,w22,ror#13
- add w21,w21,w17 // h+=Ch(e,f,g)
- and w28,w28,w19 // (b^c)&=(a^b)
- eor w7,w7,w0,ror#19
- eor w8,w8,w3,lsr#3 // sigma0(X[i+1])
- add w21,w21,w16 // h+=Sigma1(e)
- eor w28,w28,w23 // Maj(a,b,c)
- eor w17,w9,w22,ror#22 // Sigma0(a)
- eor w7,w7,w0,lsr#10 // sigma1(X[i+14])
- add w2,w2,w11
- add w25,w25,w21 // d+=h
- add w21,w21,w28 // h+=Maj(a,b,c)
- ldr w28,[x30],#4 // *K++, w19 in next round
- add w2,w2,w8
- add w21,w21,w17 // h+=Sigma0(a)
- add w2,w2,w7
- ldr w7,[sp,#0]
- str w10,[sp,#12]
- ror w16,w25,#6
- add w20,w20,w28 // h+=K[i]
- ror w9,w4,#7
- and w17,w26,w25
- ror w8,w1,#17
- bic w28,w27,w25
- ror w10,w21,#2
- add w20,w20,w2 // h+=X[i]
- eor w16,w16,w25,ror#11
- eor w9,w9,w4,ror#18
- orr w17,w17,w28 // Ch(e,f,g)
- eor w28,w21,w22 // a^b, b^c in next round
- eor w16,w16,w25,ror#25 // Sigma1(e)
- eor w10,w10,w21,ror#13
- add w20,w20,w17 // h+=Ch(e,f,g)
- and w19,w19,w28 // (b^c)&=(a^b)
- eor w8,w8,w1,ror#19
- eor w9,w9,w4,lsr#3 // sigma0(X[i+1])
- add w20,w20,w16 // h+=Sigma1(e)
- eor w19,w19,w22 // Maj(a,b,c)
- eor w17,w10,w21,ror#22 // Sigma0(a)
- eor w8,w8,w1,lsr#10 // sigma1(X[i+14])
- add w3,w3,w12
- add w24,w24,w20 // d+=h
- add w20,w20,w19 // h+=Maj(a,b,c)
- ldr w19,[x30],#4 // *K++, w28 in next round
- add w3,w3,w9
- add w20,w20,w17 // h+=Sigma0(a)
- add w3,w3,w8
- cbnz w19,Loop_16_xx
-
- ldp x0,x2,[x29,#96]
- ldr x1,[x29,#112]
- sub x30,x30,#260 // rewind
-
- ldp w3,w4,[x0]
- ldp w5,w6,[x0,#2*4]
- add x1,x1,#14*4 // advance input pointer
- ldp w7,w8,[x0,#4*4]
- add w20,w20,w3
- ldp w9,w10,[x0,#6*4]
- add w21,w21,w4
- add w22,w22,w5
- add w23,w23,w6
- stp w20,w21,[x0]
- add w24,w24,w7
- add w25,w25,w8
- stp w22,w23,[x0,#2*4]
- add w26,w26,w9
- add w27,w27,w10
- cmp x1,x2
- stp w24,w25,[x0,#4*4]
- stp w26,w27,[x0,#6*4]
- b.ne Loop
-
- ldp x19,x20,[x29,#16]
- add sp,sp,#4*4
- ldp x21,x22,[x29,#32]
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- ldp x29,x30,[sp],#128
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-.section .rodata
-.align 6
-
-LK256:
-.long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5
-.long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5
-.long 0xd807aa98,0x12835b01,0x243185be,0x550c7dc3
-.long 0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174
-.long 0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc
-.long 0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da
-.long 0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7
-.long 0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967
-.long 0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13
-.long 0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85
-.long 0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3
-.long 0xd192e819,0xd6990624,0xf40e3585,0x106aa070
-.long 0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5
-.long 0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3
-.long 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208
-.long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
-.long 0 //terminator
-
-.byte 83,72,65,50,53,54,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.align 2
-.align 2
-.text
-#ifndef __KERNEL__
-.globl sha256_block_data_order_hw
-
-.def sha256_block_data_order_hw
- .type 32
-.endef
-.align 6
-sha256_block_data_order_hw:
- // Armv8.3-A PAuth: even though x30 is pushed to stack it is not popped later.
- AARCH64_VALID_CALL_TARGET
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- ld1 {v0.4s,v1.4s},[x0]
- adrp x3,LK256
- add x3,x3,:lo12:LK256
-
-Loop_hw:
- ld1 {v4.16b,v5.16b,v6.16b,v7.16b},[x1],#64
- sub x2,x2,#1
- ld1 {v16.4s},[x3],#16
- rev32 v4.16b,v4.16b
- rev32 v5.16b,v5.16b
- rev32 v6.16b,v6.16b
- rev32 v7.16b,v7.16b
- orr v18.16b,v0.16b,v0.16b // offload
- orr v19.16b,v1.16b,v1.16b
- ld1 {v17.4s},[x3],#16
- add v16.4s,v16.4s,v4.4s
-.long 0x5e2828a4 //sha256su0 v4.16b,v5.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e104020 //sha256h v0.16b,v1.16b,v16.4s
-.long 0x5e105041 //sha256h2 v1.16b,v2.16b,v16.4s
-.long 0x5e0760c4 //sha256su1 v4.16b,v6.16b,v7.16b
- ld1 {v16.4s},[x3],#16
- add v17.4s,v17.4s,v5.4s
-.long 0x5e2828c5 //sha256su0 v5.16b,v6.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e114020 //sha256h v0.16b,v1.16b,v17.4s
-.long 0x5e115041 //sha256h2 v1.16b,v2.16b,v17.4s
-.long 0x5e0460e5 //sha256su1 v5.16b,v7.16b,v4.16b
- ld1 {v17.4s},[x3],#16
- add v16.4s,v16.4s,v6.4s
-.long 0x5e2828e6 //sha256su0 v6.16b,v7.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e104020 //sha256h v0.16b,v1.16b,v16.4s
-.long 0x5e105041 //sha256h2 v1.16b,v2.16b,v16.4s
-.long 0x5e056086 //sha256su1 v6.16b,v4.16b,v5.16b
- ld1 {v16.4s},[x3],#16
- add v17.4s,v17.4s,v7.4s
-.long 0x5e282887 //sha256su0 v7.16b,v4.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e114020 //sha256h v0.16b,v1.16b,v17.4s
-.long 0x5e115041 //sha256h2 v1.16b,v2.16b,v17.4s
-.long 0x5e0660a7 //sha256su1 v7.16b,v5.16b,v6.16b
- ld1 {v17.4s},[x3],#16
- add v16.4s,v16.4s,v4.4s
-.long 0x5e2828a4 //sha256su0 v4.16b,v5.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e104020 //sha256h v0.16b,v1.16b,v16.4s
-.long 0x5e105041 //sha256h2 v1.16b,v2.16b,v16.4s
-.long 0x5e0760c4 //sha256su1 v4.16b,v6.16b,v7.16b
- ld1 {v16.4s},[x3],#16
- add v17.4s,v17.4s,v5.4s
-.long 0x5e2828c5 //sha256su0 v5.16b,v6.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e114020 //sha256h v0.16b,v1.16b,v17.4s
-.long 0x5e115041 //sha256h2 v1.16b,v2.16b,v17.4s
-.long 0x5e0460e5 //sha256su1 v5.16b,v7.16b,v4.16b
- ld1 {v17.4s},[x3],#16
- add v16.4s,v16.4s,v6.4s
-.long 0x5e2828e6 //sha256su0 v6.16b,v7.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e104020 //sha256h v0.16b,v1.16b,v16.4s
-.long 0x5e105041 //sha256h2 v1.16b,v2.16b,v16.4s
-.long 0x5e056086 //sha256su1 v6.16b,v4.16b,v5.16b
- ld1 {v16.4s},[x3],#16
- add v17.4s,v17.4s,v7.4s
-.long 0x5e282887 //sha256su0 v7.16b,v4.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e114020 //sha256h v0.16b,v1.16b,v17.4s
-.long 0x5e115041 //sha256h2 v1.16b,v2.16b,v17.4s
-.long 0x5e0660a7 //sha256su1 v7.16b,v5.16b,v6.16b
- ld1 {v17.4s},[x3],#16
- add v16.4s,v16.4s,v4.4s
-.long 0x5e2828a4 //sha256su0 v4.16b,v5.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e104020 //sha256h v0.16b,v1.16b,v16.4s
-.long 0x5e105041 //sha256h2 v1.16b,v2.16b,v16.4s
-.long 0x5e0760c4 //sha256su1 v4.16b,v6.16b,v7.16b
- ld1 {v16.4s},[x3],#16
- add v17.4s,v17.4s,v5.4s
-.long 0x5e2828c5 //sha256su0 v5.16b,v6.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e114020 //sha256h v0.16b,v1.16b,v17.4s
-.long 0x5e115041 //sha256h2 v1.16b,v2.16b,v17.4s
-.long 0x5e0460e5 //sha256su1 v5.16b,v7.16b,v4.16b
- ld1 {v17.4s},[x3],#16
- add v16.4s,v16.4s,v6.4s
-.long 0x5e2828e6 //sha256su0 v6.16b,v7.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e104020 //sha256h v0.16b,v1.16b,v16.4s
-.long 0x5e105041 //sha256h2 v1.16b,v2.16b,v16.4s
-.long 0x5e056086 //sha256su1 v6.16b,v4.16b,v5.16b
- ld1 {v16.4s},[x3],#16
- add v17.4s,v17.4s,v7.4s
-.long 0x5e282887 //sha256su0 v7.16b,v4.16b
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e114020 //sha256h v0.16b,v1.16b,v17.4s
-.long 0x5e115041 //sha256h2 v1.16b,v2.16b,v17.4s
-.long 0x5e0660a7 //sha256su1 v7.16b,v5.16b,v6.16b
- ld1 {v17.4s},[x3],#16
- add v16.4s,v16.4s,v4.4s
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e104020 //sha256h v0.16b,v1.16b,v16.4s
-.long 0x5e105041 //sha256h2 v1.16b,v2.16b,v16.4s
-
- ld1 {v16.4s},[x3],#16
- add v17.4s,v17.4s,v5.4s
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e114020 //sha256h v0.16b,v1.16b,v17.4s
-.long 0x5e115041 //sha256h2 v1.16b,v2.16b,v17.4s
-
- ld1 {v17.4s},[x3]
- add v16.4s,v16.4s,v6.4s
- sub x3,x3,#64*4-16 // rewind
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e104020 //sha256h v0.16b,v1.16b,v16.4s
-.long 0x5e105041 //sha256h2 v1.16b,v2.16b,v16.4s
-
- add v17.4s,v17.4s,v7.4s
- orr v2.16b,v0.16b,v0.16b
-.long 0x5e114020 //sha256h v0.16b,v1.16b,v17.4s
-.long 0x5e115041 //sha256h2 v1.16b,v2.16b,v17.4s
-
- add v0.4s,v0.4s,v18.4s
- add v1.4s,v1.4s,v19.4s
-
- cbnz x2,Loop_hw
-
- st1 {v0.4s,v1.4s},[x0]
-
- ldr x29,[sp],#16
- ret
-
-#endif
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
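
The sha256_block_data_order_hw routine deleted above follows the same pattern for SHA-256: each quad-round adds four constants from LK256 to four message words, then applies sha256h and sha256h2 (again emitted as raw .long encodings), with sha256su0/sha256su1 extending the message schedule. A minimal C sketch of one block using the corresponding ACLE intrinsics is shown below; the function name and single-block interface are illustrative assumptions, not the BoringSSL code, and the constant table simply repeats LK256 from the deleted file.

/* Illustrative sketch only (not BoringSSL code): one SHA-256 block processed
 * with the Armv8 crypto intrinsics matching the sha256h/sha256h2/
 * sha256su0/sha256su1 encodings used above.
 * Assumes a toolchain with <arm_neon.h> and e.g. -march=armv8-a+crypto. */
#include <arm_neon.h>
#include <stdint.h>

/* Same 64 round constants as LK256 above. */
static const uint32_t K256[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
    0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
    0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
    0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
    0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
    0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
    0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
    0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
    0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

void sha256_block_hw_sketch(uint32_t state[8], const uint8_t block[64]) {
  uint32x4_t abcd = vld1q_u32(state);      /* a,b,c,d */
  uint32x4_t efgh = vld1q_u32(state + 4);  /* e,f,g,h */
  uint32x4_t abcd0 = abcd, efgh0 = efgh;   /* "offload", like v18/v19 above */

  /* Load and byte-swap the 16 message words (the rev32 step above). */
  uint32x4_t w[4];
  for (int i = 0; i < 4; i++) {
    w[i] = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(block + 16 * i)));
  }

  /* 16 quad-rounds; the last four skip the schedule update. */
  for (int r = 0; r < 16; r++) {
    uint32x4_t wk = vaddq_u32(w[r & 3], vld1q_u32(K256 + 4 * r));
    uint32x4_t abcd_prev = abcd;
    abcd = vsha256hq_u32(abcd, efgh, wk);
    efgh = vsha256h2q_u32(efgh, abcd_prev, wk);
    if (r < 12) {  /* schedule update feeding quad-round r+4 */
      w[r & 3] = vsha256su1q_u32(vsha256su0q_u32(w[r & 3], w[(r + 1) & 3]),
                                 w[(r + 2) & 3], w[(r + 3) & 3]);
    }
  }

  vst1q_u32(state, vaddq_u32(abcd, abcd0));
  vst1q_u32(state + 4, vaddq_u32(efgh, efgh0));
}
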
diff --git a/win-aarch64/crypto/fipsmodule/sha512-armv8-win.S b/win-aarch64/crypto/fipsmodule/sha512-armv8-win.S
deleted file mode 100644
index 220f4891..00000000
--- a/win-aarch64/crypto/fipsmodule/sha512-armv8-win.S
+++ /dev/null
@@ -1,1600 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-// Copyright 2014-2020 The OpenSSL Project Authors. All Rights Reserved.
-//
-// Licensed under the OpenSSL license (the "License"). You may not use
-// this file except in compliance with the License. You can obtain a copy
-// in the file LICENSE in the source distribution or at
-// https://www.openssl.org/source/license.html
-
-// ====================================================================
-// Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
-// project. The module is, however, dual licensed under OpenSSL and
-// CRYPTOGAMS licenses depending on where you obtain it. For further
-// details see http://www.openssl.org/~appro/cryptogams/.
-//
-// Permission to use under GPLv2 terms is granted.
-// ====================================================================
-//
-// SHA256/512 for ARMv8.
-//
-// Performance in cycles per processed byte and improvement coefficient
-// over code generated with "default" compiler:
-//
-// SHA256-hw SHA256(*) SHA512
-// Apple A7 1.97 10.5 (+33%) 6.73 (-1%(**))
-// Cortex-A53 2.38 15.5 (+115%) 10.0 (+150%(***))
-// Cortex-A57 2.31 11.6 (+86%) 7.51 (+260%(***))
-// Denver 2.01 10.5 (+26%) 6.70 (+8%)
-// X-Gene 20.0 (+100%) 12.8 (+300%(***))
-// Mongoose 2.36 13.0 (+50%) 8.36 (+33%)
-// Kryo 1.92 17.4 (+30%) 11.2 (+8%)
-//
-// (*) Software SHA256 results are of lesser relevance, presented
-// mostly for informational purposes.
-// (**) The result is a trade-off: it's possible to improve it by
-// 10% (or by 1 cycle per round), but at the cost of 20% loss
-// on Cortex-A53 (or by 4 cycles per round).
-// (***) Super-impressive coefficients over gcc-generated code are
-// indication of some compiler "pathology", most notably code
-// generated with -mgeneral-regs-only is significantly faster
-// and the gap is only 40-90%.
-
-#ifndef __KERNEL__
-# include <openssl/arm_arch.h>
-#endif
-
-.text
-
-.globl sha512_block_data_order_nohw
-
-.def sha512_block_data_order_nohw
- .type 32
-.endef
-.align 6
-sha512_block_data_order_nohw:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-128]!
- add x29,sp,#0
-
- stp x19,x20,[sp,#16]
- stp x21,x22,[sp,#32]
- stp x23,x24,[sp,#48]
- stp x25,x26,[sp,#64]
- stp x27,x28,[sp,#80]
- sub sp,sp,#4*8
-
- ldp x20,x21,[x0] // load context
- ldp x22,x23,[x0,#2*8]
- ldp x24,x25,[x0,#4*8]
- add x2,x1,x2,lsl#7 // end of input
- ldp x26,x27,[x0,#6*8]
- adrp x30,LK512
- add x30,x30,:lo12:LK512
- stp x0,x2,[x29,#96]
-
-Loop:
- ldp x3,x4,[x1],#2*8
- ldr x19,[x30],#8 // *K++
- eor x28,x21,x22 // magic seed
- str x1,[x29,#112]
-#ifndef __AARCH64EB__
- rev x3,x3 // 0
-#endif
- ror x16,x24,#14
- add x27,x27,x19 // h+=K[i]
- eor x6,x24,x24,ror#23
- and x17,x25,x24
- bic x19,x26,x24
- add x27,x27,x3 // h+=X[i]
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x20,x21 // a^b, b^c in next round
- eor x16,x16,x6,ror#18 // Sigma1(e)
- ror x6,x20,#28
- add x27,x27,x17 // h+=Ch(e,f,g)
- eor x17,x20,x20,ror#5
- add x27,x27,x16 // h+=Sigma1(e)
- and x28,x28,x19 // (b^c)&=(a^b)
- add x23,x23,x27 // d+=h
- eor x28,x28,x21 // Maj(a,b,c)
- eor x17,x6,x17,ror#34 // Sigma0(a)
- add x27,x27,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- //add x27,x27,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x4,x4 // 1
-#endif
- ldp x5,x6,[x1],#2*8
- add x27,x27,x17 // h+=Sigma0(a)
- ror x16,x23,#14
- add x26,x26,x28 // h+=K[i]
- eor x7,x23,x23,ror#23
- and x17,x24,x23
- bic x28,x25,x23
- add x26,x26,x4 // h+=X[i]
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x27,x20 // a^b, b^c in next round
- eor x16,x16,x7,ror#18 // Sigma1(e)
- ror x7,x27,#28
- add x26,x26,x17 // h+=Ch(e,f,g)
- eor x17,x27,x27,ror#5
- add x26,x26,x16 // h+=Sigma1(e)
- and x19,x19,x28 // (b^c)&=(a^b)
- add x22,x22,x26 // d+=h
- eor x19,x19,x20 // Maj(a,b,c)
- eor x17,x7,x17,ror#34 // Sigma0(a)
- add x26,x26,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- //add x26,x26,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x5,x5 // 2
-#endif
- add x26,x26,x17 // h+=Sigma0(a)
- ror x16,x22,#14
- add x25,x25,x19 // h+=K[i]
- eor x8,x22,x22,ror#23
- and x17,x23,x22
- bic x19,x24,x22
- add x25,x25,x5 // h+=X[i]
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x26,x27 // a^b, b^c in next round
- eor x16,x16,x8,ror#18 // Sigma1(e)
- ror x8,x26,#28
- add x25,x25,x17 // h+=Ch(e,f,g)
- eor x17,x26,x26,ror#5
- add x25,x25,x16 // h+=Sigma1(e)
- and x28,x28,x19 // (b^c)&=(a^b)
- add x21,x21,x25 // d+=h
- eor x28,x28,x27 // Maj(a,b,c)
- eor x17,x8,x17,ror#34 // Sigma0(a)
- add x25,x25,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- //add x25,x25,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x6,x6 // 3
-#endif
- ldp x7,x8,[x1],#2*8
- add x25,x25,x17 // h+=Sigma0(a)
- ror x16,x21,#14
- add x24,x24,x28 // h+=K[i]
- eor x9,x21,x21,ror#23
- and x17,x22,x21
- bic x28,x23,x21
- add x24,x24,x6 // h+=X[i]
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x25,x26 // a^b, b^c in next round
- eor x16,x16,x9,ror#18 // Sigma1(e)
- ror x9,x25,#28
- add x24,x24,x17 // h+=Ch(e,f,g)
- eor x17,x25,x25,ror#5
- add x24,x24,x16 // h+=Sigma1(e)
- and x19,x19,x28 // (b^c)&=(a^b)
- add x20,x20,x24 // d+=h
- eor x19,x19,x26 // Maj(a,b,c)
- eor x17,x9,x17,ror#34 // Sigma0(a)
- add x24,x24,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- //add x24,x24,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x7,x7 // 4
-#endif
- add x24,x24,x17 // h+=Sigma0(a)
- ror x16,x20,#14
- add x23,x23,x19 // h+=K[i]
- eor x10,x20,x20,ror#23
- and x17,x21,x20
- bic x19,x22,x20
- add x23,x23,x7 // h+=X[i]
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x24,x25 // a^b, b^c in next round
- eor x16,x16,x10,ror#18 // Sigma1(e)
- ror x10,x24,#28
- add x23,x23,x17 // h+=Ch(e,f,g)
- eor x17,x24,x24,ror#5
- add x23,x23,x16 // h+=Sigma1(e)
- and x28,x28,x19 // (b^c)&=(a^b)
- add x27,x27,x23 // d+=h
- eor x28,x28,x25 // Maj(a,b,c)
- eor x17,x10,x17,ror#34 // Sigma0(a)
- add x23,x23,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- //add x23,x23,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x8,x8 // 5
-#endif
- ldp x9,x10,[x1],#2*8
- add x23,x23,x17 // h+=Sigma0(a)
- ror x16,x27,#14
- add x22,x22,x28 // h+=K[i]
- eor x11,x27,x27,ror#23
- and x17,x20,x27
- bic x28,x21,x27
- add x22,x22,x8 // h+=X[i]
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x23,x24 // a^b, b^c in next round
- eor x16,x16,x11,ror#18 // Sigma1(e)
- ror x11,x23,#28
- add x22,x22,x17 // h+=Ch(e,f,g)
- eor x17,x23,x23,ror#5
- add x22,x22,x16 // h+=Sigma1(e)
- and x19,x19,x28 // (b^c)&=(a^b)
- add x26,x26,x22 // d+=h
- eor x19,x19,x24 // Maj(a,b,c)
- eor x17,x11,x17,ror#34 // Sigma0(a)
- add x22,x22,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- //add x22,x22,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x9,x9 // 6
-#endif
- add x22,x22,x17 // h+=Sigma0(a)
- ror x16,x26,#14
- add x21,x21,x19 // h+=K[i]
- eor x12,x26,x26,ror#23
- and x17,x27,x26
- bic x19,x20,x26
- add x21,x21,x9 // h+=X[i]
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x22,x23 // a^b, b^c in next round
- eor x16,x16,x12,ror#18 // Sigma1(e)
- ror x12,x22,#28
- add x21,x21,x17 // h+=Ch(e,f,g)
- eor x17,x22,x22,ror#5
- add x21,x21,x16 // h+=Sigma1(e)
- and x28,x28,x19 // (b^c)&=(a^b)
- add x25,x25,x21 // d+=h
- eor x28,x28,x23 // Maj(a,b,c)
- eor x17,x12,x17,ror#34 // Sigma0(a)
- add x21,x21,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- //add x21,x21,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x10,x10 // 7
-#endif
- ldp x11,x12,[x1],#2*8
- add x21,x21,x17 // h+=Sigma0(a)
- ror x16,x25,#14
- add x20,x20,x28 // h+=K[i]
- eor x13,x25,x25,ror#23
- and x17,x26,x25
- bic x28,x27,x25
- add x20,x20,x10 // h+=X[i]
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x21,x22 // a^b, b^c in next round
- eor x16,x16,x13,ror#18 // Sigma1(e)
- ror x13,x21,#28
- add x20,x20,x17 // h+=Ch(e,f,g)
- eor x17,x21,x21,ror#5
- add x20,x20,x16 // h+=Sigma1(e)
- and x19,x19,x28 // (b^c)&=(a^b)
- add x24,x24,x20 // d+=h
- eor x19,x19,x22 // Maj(a,b,c)
- eor x17,x13,x17,ror#34 // Sigma0(a)
- add x20,x20,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- //add x20,x20,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x11,x11 // 8
-#endif
- add x20,x20,x17 // h+=Sigma0(a)
- ror x16,x24,#14
- add x27,x27,x19 // h+=K[i]
- eor x14,x24,x24,ror#23
- and x17,x25,x24
- bic x19,x26,x24
- add x27,x27,x11 // h+=X[i]
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x20,x21 // a^b, b^c in next round
- eor x16,x16,x14,ror#18 // Sigma1(e)
- ror x14,x20,#28
- add x27,x27,x17 // h+=Ch(e,f,g)
- eor x17,x20,x20,ror#5
- add x27,x27,x16 // h+=Sigma1(e)
- and x28,x28,x19 // (b^c)&=(a^b)
- add x23,x23,x27 // d+=h
- eor x28,x28,x21 // Maj(a,b,c)
- eor x17,x14,x17,ror#34 // Sigma0(a)
- add x27,x27,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- //add x27,x27,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x12,x12 // 9
-#endif
- ldp x13,x14,[x1],#2*8
- add x27,x27,x17 // h+=Sigma0(a)
- ror x16,x23,#14
- add x26,x26,x28 // h+=K[i]
- eor x15,x23,x23,ror#23
- and x17,x24,x23
- bic x28,x25,x23
- add x26,x26,x12 // h+=X[i]
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x27,x20 // a^b, b^c in next round
- eor x16,x16,x15,ror#18 // Sigma1(e)
- ror x15,x27,#28
- add x26,x26,x17 // h+=Ch(e,f,g)
- eor x17,x27,x27,ror#5
- add x26,x26,x16 // h+=Sigma1(e)
- and x19,x19,x28 // (b^c)&=(a^b)
- add x22,x22,x26 // d+=h
- eor x19,x19,x20 // Maj(a,b,c)
- eor x17,x15,x17,ror#34 // Sigma0(a)
- add x26,x26,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- //add x26,x26,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x13,x13 // 10
-#endif
- add x26,x26,x17 // h+=Sigma0(a)
- ror x16,x22,#14
- add x25,x25,x19 // h+=K[i]
- eor x0,x22,x22,ror#23
- and x17,x23,x22
- bic x19,x24,x22
- add x25,x25,x13 // h+=X[i]
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x26,x27 // a^b, b^c in next round
- eor x16,x16,x0,ror#18 // Sigma1(e)
- ror x0,x26,#28
- add x25,x25,x17 // h+=Ch(e,f,g)
- eor x17,x26,x26,ror#5
- add x25,x25,x16 // h+=Sigma1(e)
- and x28,x28,x19 // (b^c)&=(a^b)
- add x21,x21,x25 // d+=h
- eor x28,x28,x27 // Maj(a,b,c)
- eor x17,x0,x17,ror#34 // Sigma0(a)
- add x25,x25,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- //add x25,x25,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x14,x14 // 11
-#endif
- ldp x15,x0,[x1],#2*8
- add x25,x25,x17 // h+=Sigma0(a)
- str x6,[sp,#24]
- ror x16,x21,#14
- add x24,x24,x28 // h+=K[i]
- eor x6,x21,x21,ror#23
- and x17,x22,x21
- bic x28,x23,x21
- add x24,x24,x14 // h+=X[i]
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x25,x26 // a^b, b^c in next round
- eor x16,x16,x6,ror#18 // Sigma1(e)
- ror x6,x25,#28
- add x24,x24,x17 // h+=Ch(e,f,g)
- eor x17,x25,x25,ror#5
- add x24,x24,x16 // h+=Sigma1(e)
- and x19,x19,x28 // (b^c)&=(a^b)
- add x20,x20,x24 // d+=h
- eor x19,x19,x26 // Maj(a,b,c)
- eor x17,x6,x17,ror#34 // Sigma0(a)
- add x24,x24,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- //add x24,x24,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x15,x15 // 12
-#endif
- add x24,x24,x17 // h+=Sigma0(a)
- str x7,[sp,#0]
- ror x16,x20,#14
- add x23,x23,x19 // h+=K[i]
- eor x7,x20,x20,ror#23
- and x17,x21,x20
- bic x19,x22,x20
- add x23,x23,x15 // h+=X[i]
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x24,x25 // a^b, b^c in next round
- eor x16,x16,x7,ror#18 // Sigma1(e)
- ror x7,x24,#28
- add x23,x23,x17 // h+=Ch(e,f,g)
- eor x17,x24,x24,ror#5
- add x23,x23,x16 // h+=Sigma1(e)
- and x28,x28,x19 // (b^c)&=(a^b)
- add x27,x27,x23 // d+=h
- eor x28,x28,x25 // Maj(a,b,c)
- eor x17,x7,x17,ror#34 // Sigma0(a)
- add x23,x23,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- //add x23,x23,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x0,x0 // 13
-#endif
- ldp x1,x2,[x1]
- add x23,x23,x17 // h+=Sigma0(a)
- str x8,[sp,#8]
- ror x16,x27,#14
- add x22,x22,x28 // h+=K[i]
- eor x8,x27,x27,ror#23
- and x17,x20,x27
- bic x28,x21,x27
- add x22,x22,x0 // h+=X[i]
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x23,x24 // a^b, b^c in next round
- eor x16,x16,x8,ror#18 // Sigma1(e)
- ror x8,x23,#28
- add x22,x22,x17 // h+=Ch(e,f,g)
- eor x17,x23,x23,ror#5
- add x22,x22,x16 // h+=Sigma1(e)
- and x19,x19,x28 // (b^c)&=(a^b)
- add x26,x26,x22 // d+=h
- eor x19,x19,x24 // Maj(a,b,c)
- eor x17,x8,x17,ror#34 // Sigma0(a)
- add x22,x22,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- //add x22,x22,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x1,x1 // 14
-#endif
- ldr x6,[sp,#24]
- add x22,x22,x17 // h+=Sigma0(a)
- str x9,[sp,#16]
- ror x16,x26,#14
- add x21,x21,x19 // h+=K[i]
- eor x9,x26,x26,ror#23
- and x17,x27,x26
- bic x19,x20,x26
- add x21,x21,x1 // h+=X[i]
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x22,x23 // a^b, b^c in next round
- eor x16,x16,x9,ror#18 // Sigma1(e)
- ror x9,x22,#28
- add x21,x21,x17 // h+=Ch(e,f,g)
- eor x17,x22,x22,ror#5
- add x21,x21,x16 // h+=Sigma1(e)
- and x28,x28,x19 // (b^c)&=(a^b)
- add x25,x25,x21 // d+=h
- eor x28,x28,x23 // Maj(a,b,c)
- eor x17,x9,x17,ror#34 // Sigma0(a)
- add x21,x21,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- //add x21,x21,x17 // h+=Sigma0(a)
-#ifndef __AARCH64EB__
- rev x2,x2 // 15
-#endif
- ldr x7,[sp,#0]
- add x21,x21,x17 // h+=Sigma0(a)
- str x10,[sp,#24]
- ror x16,x25,#14
- add x20,x20,x28 // h+=K[i]
- ror x9,x4,#1
- and x17,x26,x25
- ror x8,x1,#19
- bic x28,x27,x25
- ror x10,x21,#28
- add x20,x20,x2 // h+=X[i]
- eor x16,x16,x25,ror#18
- eor x9,x9,x4,ror#8
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x21,x22 // a^b, b^c in next round
- eor x16,x16,x25,ror#41 // Sigma1(e)
- eor x10,x10,x21,ror#34
- add x20,x20,x17 // h+=Ch(e,f,g)
- and x19,x19,x28 // (b^c)&=(a^b)
- eor x8,x8,x1,ror#61
- eor x9,x9,x4,lsr#7 // sigma0(X[i+1])
- add x20,x20,x16 // h+=Sigma1(e)
- eor x19,x19,x22 // Maj(a,b,c)
- eor x17,x10,x21,ror#39 // Sigma0(a)
- eor x8,x8,x1,lsr#6 // sigma1(X[i+14])
- add x3,x3,x12
- add x24,x24,x20 // d+=h
- add x20,x20,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- add x3,x3,x9
- add x20,x20,x17 // h+=Sigma0(a)
- add x3,x3,x8
-Loop_16_xx:
- ldr x8,[sp,#8]
- str x11,[sp,#0]
- ror x16,x24,#14
- add x27,x27,x19 // h+=K[i]
- ror x10,x5,#1
- and x17,x25,x24
- ror x9,x2,#19
- bic x19,x26,x24
- ror x11,x20,#28
- add x27,x27,x3 // h+=X[i]
- eor x16,x16,x24,ror#18
- eor x10,x10,x5,ror#8
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x20,x21 // a^b, b^c in next round
- eor x16,x16,x24,ror#41 // Sigma1(e)
- eor x11,x11,x20,ror#34
- add x27,x27,x17 // h+=Ch(e,f,g)
- and x28,x28,x19 // (b^c)&=(a^b)
- eor x9,x9,x2,ror#61
- eor x10,x10,x5,lsr#7 // sigma0(X[i+1])
- add x27,x27,x16 // h+=Sigma1(e)
- eor x28,x28,x21 // Maj(a,b,c)
- eor x17,x11,x20,ror#39 // Sigma0(a)
- eor x9,x9,x2,lsr#6 // sigma1(X[i+14])
- add x4,x4,x13
- add x23,x23,x27 // d+=h
- add x27,x27,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- add x4,x4,x10
- add x27,x27,x17 // h+=Sigma0(a)
- add x4,x4,x9
- ldr x9,[sp,#16]
- str x12,[sp,#8]
- ror x16,x23,#14
- add x26,x26,x28 // h+=K[i]
- ror x11,x6,#1
- and x17,x24,x23
- ror x10,x3,#19
- bic x28,x25,x23
- ror x12,x27,#28
- add x26,x26,x4 // h+=X[i]
- eor x16,x16,x23,ror#18
- eor x11,x11,x6,ror#8
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x27,x20 // a^b, b^c in next round
- eor x16,x16,x23,ror#41 // Sigma1(e)
- eor x12,x12,x27,ror#34
- add x26,x26,x17 // h+=Ch(e,f,g)
- and x19,x19,x28 // (b^c)&=(a^b)
- eor x10,x10,x3,ror#61
- eor x11,x11,x6,lsr#7 // sigma0(X[i+1])
- add x26,x26,x16 // h+=Sigma1(e)
- eor x19,x19,x20 // Maj(a,b,c)
- eor x17,x12,x27,ror#39 // Sigma0(a)
- eor x10,x10,x3,lsr#6 // sigma1(X[i+14])
- add x5,x5,x14
- add x22,x22,x26 // d+=h
- add x26,x26,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- add x5,x5,x11
- add x26,x26,x17 // h+=Sigma0(a)
- add x5,x5,x10
- ldr x10,[sp,#24]
- str x13,[sp,#16]
- ror x16,x22,#14
- add x25,x25,x19 // h+=K[i]
- ror x12,x7,#1
- and x17,x23,x22
- ror x11,x4,#19
- bic x19,x24,x22
- ror x13,x26,#28
- add x25,x25,x5 // h+=X[i]
- eor x16,x16,x22,ror#18
- eor x12,x12,x7,ror#8
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x26,x27 // a^b, b^c in next round
- eor x16,x16,x22,ror#41 // Sigma1(e)
- eor x13,x13,x26,ror#34
- add x25,x25,x17 // h+=Ch(e,f,g)
- and x28,x28,x19 // (b^c)&=(a^b)
- eor x11,x11,x4,ror#61
- eor x12,x12,x7,lsr#7 // sigma0(X[i+1])
- add x25,x25,x16 // h+=Sigma1(e)
- eor x28,x28,x27 // Maj(a,b,c)
- eor x17,x13,x26,ror#39 // Sigma0(a)
- eor x11,x11,x4,lsr#6 // sigma1(X[i+14])
- add x6,x6,x15
- add x21,x21,x25 // d+=h
- add x25,x25,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- add x6,x6,x12
- add x25,x25,x17 // h+=Sigma0(a)
- add x6,x6,x11
- ldr x11,[sp,#0]
- str x14,[sp,#24]
- ror x16,x21,#14
- add x24,x24,x28 // h+=K[i]
- ror x13,x8,#1
- and x17,x22,x21
- ror x12,x5,#19
- bic x28,x23,x21
- ror x14,x25,#28
- add x24,x24,x6 // h+=X[i]
- eor x16,x16,x21,ror#18
- eor x13,x13,x8,ror#8
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x25,x26 // a^b, b^c in next round
- eor x16,x16,x21,ror#41 // Sigma1(e)
- eor x14,x14,x25,ror#34
- add x24,x24,x17 // h+=Ch(e,f,g)
- and x19,x19,x28 // (b^c)&=(a^b)
- eor x12,x12,x5,ror#61
- eor x13,x13,x8,lsr#7 // sigma0(X[i+1])
- add x24,x24,x16 // h+=Sigma1(e)
- eor x19,x19,x26 // Maj(a,b,c)
- eor x17,x14,x25,ror#39 // Sigma0(a)
- eor x12,x12,x5,lsr#6 // sigma1(X[i+14])
- add x7,x7,x0
- add x20,x20,x24 // d+=h
- add x24,x24,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- add x7,x7,x13
- add x24,x24,x17 // h+=Sigma0(a)
- add x7,x7,x12
- ldr x12,[sp,#8]
- str x15,[sp,#0]
- ror x16,x20,#14
- add x23,x23,x19 // h+=K[i]
- ror x14,x9,#1
- and x17,x21,x20
- ror x13,x6,#19
- bic x19,x22,x20
- ror x15,x24,#28
- add x23,x23,x7 // h+=X[i]
- eor x16,x16,x20,ror#18
- eor x14,x14,x9,ror#8
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x24,x25 // a^b, b^c in next round
- eor x16,x16,x20,ror#41 // Sigma1(e)
- eor x15,x15,x24,ror#34
- add x23,x23,x17 // h+=Ch(e,f,g)
- and x28,x28,x19 // (b^c)&=(a^b)
- eor x13,x13,x6,ror#61
- eor x14,x14,x9,lsr#7 // sigma0(X[i+1])
- add x23,x23,x16 // h+=Sigma1(e)
- eor x28,x28,x25 // Maj(a,b,c)
- eor x17,x15,x24,ror#39 // Sigma0(a)
- eor x13,x13,x6,lsr#6 // sigma1(X[i+14])
- add x8,x8,x1
- add x27,x27,x23 // d+=h
- add x23,x23,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- add x8,x8,x14
- add x23,x23,x17 // h+=Sigma0(a)
- add x8,x8,x13
- ldr x13,[sp,#16]
- str x0,[sp,#8]
- ror x16,x27,#14
- add x22,x22,x28 // h+=K[i]
- ror x15,x10,#1
- and x17,x20,x27
- ror x14,x7,#19
- bic x28,x21,x27
- ror x0,x23,#28
- add x22,x22,x8 // h+=X[i]
- eor x16,x16,x27,ror#18
- eor x15,x15,x10,ror#8
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x23,x24 // a^b, b^c in next round
- eor x16,x16,x27,ror#41 // Sigma1(e)
- eor x0,x0,x23,ror#34
- add x22,x22,x17 // h+=Ch(e,f,g)
- and x19,x19,x28 // (b^c)&=(a^b)
- eor x14,x14,x7,ror#61
- eor x15,x15,x10,lsr#7 // sigma0(X[i+1])
- add x22,x22,x16 // h+=Sigma1(e)
- eor x19,x19,x24 // Maj(a,b,c)
- eor x17,x0,x23,ror#39 // Sigma0(a)
- eor x14,x14,x7,lsr#6 // sigma1(X[i+14])
- add x9,x9,x2
- add x26,x26,x22 // d+=h
- add x22,x22,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- add x9,x9,x15
- add x22,x22,x17 // h+=Sigma0(a)
- add x9,x9,x14
- ldr x14,[sp,#24]
- str x1,[sp,#16]
- ror x16,x26,#14
- add x21,x21,x19 // h+=K[i]
- ror x0,x11,#1
- and x17,x27,x26
- ror x15,x8,#19
- bic x19,x20,x26
- ror x1,x22,#28
- add x21,x21,x9 // h+=X[i]
- eor x16,x16,x26,ror#18
- eor x0,x0,x11,ror#8
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x22,x23 // a^b, b^c in next round
- eor x16,x16,x26,ror#41 // Sigma1(e)
- eor x1,x1,x22,ror#34
- add x21,x21,x17 // h+=Ch(e,f,g)
- and x28,x28,x19 // (b^c)&=(a^b)
- eor x15,x15,x8,ror#61
- eor x0,x0,x11,lsr#7 // sigma0(X[i+1])
- add x21,x21,x16 // h+=Sigma1(e)
- eor x28,x28,x23 // Maj(a,b,c)
- eor x17,x1,x22,ror#39 // Sigma0(a)
- eor x15,x15,x8,lsr#6 // sigma1(X[i+14])
- add x10,x10,x3
- add x25,x25,x21 // d+=h
- add x21,x21,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- add x10,x10,x0
- add x21,x21,x17 // h+=Sigma0(a)
- add x10,x10,x15
- ldr x15,[sp,#0]
- str x2,[sp,#24]
- ror x16,x25,#14
- add x20,x20,x28 // h+=K[i]
- ror x1,x12,#1
- and x17,x26,x25
- ror x0,x9,#19
- bic x28,x27,x25
- ror x2,x21,#28
- add x20,x20,x10 // h+=X[i]
- eor x16,x16,x25,ror#18
- eor x1,x1,x12,ror#8
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x21,x22 // a^b, b^c in next round
- eor x16,x16,x25,ror#41 // Sigma1(e)
- eor x2,x2,x21,ror#34
- add x20,x20,x17 // h+=Ch(e,f,g)
- and x19,x19,x28 // (b^c)&=(a^b)
- eor x0,x0,x9,ror#61
- eor x1,x1,x12,lsr#7 // sigma0(X[i+1])
- add x20,x20,x16 // h+=Sigma1(e)
- eor x19,x19,x22 // Maj(a,b,c)
- eor x17,x2,x21,ror#39 // Sigma0(a)
- eor x0,x0,x9,lsr#6 // sigma1(X[i+14])
- add x11,x11,x4
- add x24,x24,x20 // d+=h
- add x20,x20,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- add x11,x11,x1
- add x20,x20,x17 // h+=Sigma0(a)
- add x11,x11,x0
- ldr x0,[sp,#8]
- str x3,[sp,#0]
- ror x16,x24,#14
- add x27,x27,x19 // h+=K[i]
- ror x2,x13,#1
- and x17,x25,x24
- ror x1,x10,#19
- bic x19,x26,x24
- ror x3,x20,#28
- add x27,x27,x11 // h+=X[i]
- eor x16,x16,x24,ror#18
- eor x2,x2,x13,ror#8
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x20,x21 // a^b, b^c in next round
- eor x16,x16,x24,ror#41 // Sigma1(e)
- eor x3,x3,x20,ror#34
- add x27,x27,x17 // h+=Ch(e,f,g)
- and x28,x28,x19 // (b^c)&=(a^b)
- eor x1,x1,x10,ror#61
- eor x2,x2,x13,lsr#7 // sigma0(X[i+1])
- add x27,x27,x16 // h+=Sigma1(e)
- eor x28,x28,x21 // Maj(a,b,c)
- eor x17,x3,x20,ror#39 // Sigma0(a)
- eor x1,x1,x10,lsr#6 // sigma1(X[i+14])
- add x12,x12,x5
- add x23,x23,x27 // d+=h
- add x27,x27,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- add x12,x12,x2
- add x27,x27,x17 // h+=Sigma0(a)
- add x12,x12,x1
- ldr x1,[sp,#16]
- str x4,[sp,#8]
- ror x16,x23,#14
- add x26,x26,x28 // h+=K[i]
- ror x3,x14,#1
- and x17,x24,x23
- ror x2,x11,#19
- bic x28,x25,x23
- ror x4,x27,#28
- add x26,x26,x12 // h+=X[i]
- eor x16,x16,x23,ror#18
- eor x3,x3,x14,ror#8
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x27,x20 // a^b, b^c in next round
- eor x16,x16,x23,ror#41 // Sigma1(e)
- eor x4,x4,x27,ror#34
- add x26,x26,x17 // h+=Ch(e,f,g)
- and x19,x19,x28 // (b^c)&=(a^b)
- eor x2,x2,x11,ror#61
- eor x3,x3,x14,lsr#7 // sigma0(X[i+1])
- add x26,x26,x16 // h+=Sigma1(e)
- eor x19,x19,x20 // Maj(a,b,c)
- eor x17,x4,x27,ror#39 // Sigma0(a)
- eor x2,x2,x11,lsr#6 // sigma1(X[i+14])
- add x13,x13,x6
- add x22,x22,x26 // d+=h
- add x26,x26,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- add x13,x13,x3
- add x26,x26,x17 // h+=Sigma0(a)
- add x13,x13,x2
- ldr x2,[sp,#24]
- str x5,[sp,#16]
- ror x16,x22,#14
- add x25,x25,x19 // h+=K[i]
- ror x4,x15,#1
- and x17,x23,x22
- ror x3,x12,#19
- bic x19,x24,x22
- ror x5,x26,#28
- add x25,x25,x13 // h+=X[i]
- eor x16,x16,x22,ror#18
- eor x4,x4,x15,ror#8
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x26,x27 // a^b, b^c in next round
- eor x16,x16,x22,ror#41 // Sigma1(e)
- eor x5,x5,x26,ror#34
- add x25,x25,x17 // h+=Ch(e,f,g)
- and x28,x28,x19 // (b^c)&=(a^b)
- eor x3,x3,x12,ror#61
- eor x4,x4,x15,lsr#7 // sigma0(X[i+1])
- add x25,x25,x16 // h+=Sigma1(e)
- eor x28,x28,x27 // Maj(a,b,c)
- eor x17,x5,x26,ror#39 // Sigma0(a)
- eor x3,x3,x12,lsr#6 // sigma1(X[i+14])
- add x14,x14,x7
- add x21,x21,x25 // d+=h
- add x25,x25,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- add x14,x14,x4
- add x25,x25,x17 // h+=Sigma0(a)
- add x14,x14,x3
- ldr x3,[sp,#0]
- str x6,[sp,#24]
- ror x16,x21,#14
- add x24,x24,x28 // h+=K[i]
- ror x5,x0,#1
- and x17,x22,x21
- ror x4,x13,#19
- bic x28,x23,x21
- ror x6,x25,#28
- add x24,x24,x14 // h+=X[i]
- eor x16,x16,x21,ror#18
- eor x5,x5,x0,ror#8
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x25,x26 // a^b, b^c in next round
- eor x16,x16,x21,ror#41 // Sigma1(e)
- eor x6,x6,x25,ror#34
- add x24,x24,x17 // h+=Ch(e,f,g)
- and x19,x19,x28 // (b^c)&=(a^b)
- eor x4,x4,x13,ror#61
- eor x5,x5,x0,lsr#7 // sigma0(X[i+1])
- add x24,x24,x16 // h+=Sigma1(e)
- eor x19,x19,x26 // Maj(a,b,c)
- eor x17,x6,x25,ror#39 // Sigma0(a)
- eor x4,x4,x13,lsr#6 // sigma1(X[i+14])
- add x15,x15,x8
- add x20,x20,x24 // d+=h
- add x24,x24,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- add x15,x15,x5
- add x24,x24,x17 // h+=Sigma0(a)
- add x15,x15,x4
- ldr x4,[sp,#8]
- str x7,[sp,#0]
- ror x16,x20,#14
- add x23,x23,x19 // h+=K[i]
- ror x6,x1,#1
- and x17,x21,x20
- ror x5,x14,#19
- bic x19,x22,x20
- ror x7,x24,#28
- add x23,x23,x15 // h+=X[i]
- eor x16,x16,x20,ror#18
- eor x6,x6,x1,ror#8
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x24,x25 // a^b, b^c in next round
- eor x16,x16,x20,ror#41 // Sigma1(e)
- eor x7,x7,x24,ror#34
- add x23,x23,x17 // h+=Ch(e,f,g)
- and x28,x28,x19 // (b^c)&=(a^b)
- eor x5,x5,x14,ror#61
- eor x6,x6,x1,lsr#7 // sigma0(X[i+1])
- add x23,x23,x16 // h+=Sigma1(e)
- eor x28,x28,x25 // Maj(a,b,c)
- eor x17,x7,x24,ror#39 // Sigma0(a)
- eor x5,x5,x14,lsr#6 // sigma1(X[i+14])
- add x0,x0,x9
- add x27,x27,x23 // d+=h
- add x23,x23,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- add x0,x0,x6
- add x23,x23,x17 // h+=Sigma0(a)
- add x0,x0,x5
- ldr x5,[sp,#16]
- str x8,[sp,#8]
- ror x16,x27,#14
- add x22,x22,x28 // h+=K[i]
- ror x7,x2,#1
- and x17,x20,x27
- ror x6,x15,#19
- bic x28,x21,x27
- ror x8,x23,#28
- add x22,x22,x0 // h+=X[i]
- eor x16,x16,x27,ror#18
- eor x7,x7,x2,ror#8
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x23,x24 // a^b, b^c in next round
- eor x16,x16,x27,ror#41 // Sigma1(e)
- eor x8,x8,x23,ror#34
- add x22,x22,x17 // h+=Ch(e,f,g)
- and x19,x19,x28 // (b^c)&=(a^b)
- eor x6,x6,x15,ror#61
- eor x7,x7,x2,lsr#7 // sigma0(X[i+1])
- add x22,x22,x16 // h+=Sigma1(e)
- eor x19,x19,x24 // Maj(a,b,c)
- eor x17,x8,x23,ror#39 // Sigma0(a)
- eor x6,x6,x15,lsr#6 // sigma1(X[i+14])
- add x1,x1,x10
- add x26,x26,x22 // d+=h
- add x22,x22,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- add x1,x1,x7
- add x22,x22,x17 // h+=Sigma0(a)
- add x1,x1,x6
- ldr x6,[sp,#24]
- str x9,[sp,#16]
- ror x16,x26,#14
- add x21,x21,x19 // h+=K[i]
- ror x8,x3,#1
- and x17,x27,x26
- ror x7,x0,#19
- bic x19,x20,x26
- ror x9,x22,#28
- add x21,x21,x1 // h+=X[i]
- eor x16,x16,x26,ror#18
- eor x8,x8,x3,ror#8
- orr x17,x17,x19 // Ch(e,f,g)
- eor x19,x22,x23 // a^b, b^c in next round
- eor x16,x16,x26,ror#41 // Sigma1(e)
- eor x9,x9,x22,ror#34
- add x21,x21,x17 // h+=Ch(e,f,g)
- and x28,x28,x19 // (b^c)&=(a^b)
- eor x7,x7,x0,ror#61
- eor x8,x8,x3,lsr#7 // sigma0(X[i+1])
- add x21,x21,x16 // h+=Sigma1(e)
- eor x28,x28,x23 // Maj(a,b,c)
- eor x17,x9,x22,ror#39 // Sigma0(a)
- eor x7,x7,x0,lsr#6 // sigma1(X[i+14])
- add x2,x2,x11
- add x25,x25,x21 // d+=h
- add x21,x21,x28 // h+=Maj(a,b,c)
- ldr x28,[x30],#8 // *K++, x19 in next round
- add x2,x2,x8
- add x21,x21,x17 // h+=Sigma0(a)
- add x2,x2,x7
- ldr x7,[sp,#0]
- str x10,[sp,#24]
- ror x16,x25,#14
- add x20,x20,x28 // h+=K[i]
- ror x9,x4,#1
- and x17,x26,x25
- ror x8,x1,#19
- bic x28,x27,x25
- ror x10,x21,#28
- add x20,x20,x2 // h+=X[i]
- eor x16,x16,x25,ror#18
- eor x9,x9,x4,ror#8
- orr x17,x17,x28 // Ch(e,f,g)
- eor x28,x21,x22 // a^b, b^c in next round
- eor x16,x16,x25,ror#41 // Sigma1(e)
- eor x10,x10,x21,ror#34
- add x20,x20,x17 // h+=Ch(e,f,g)
- and x19,x19,x28 // (b^c)&=(a^b)
- eor x8,x8,x1,ror#61
- eor x9,x9,x4,lsr#7 // sigma0(X[i+1])
- add x20,x20,x16 // h+=Sigma1(e)
- eor x19,x19,x22 // Maj(a,b,c)
- eor x17,x10,x21,ror#39 // Sigma0(a)
- eor x8,x8,x1,lsr#6 // sigma1(X[i+14])
- add x3,x3,x12
- add x24,x24,x20 // d+=h
- add x20,x20,x19 // h+=Maj(a,b,c)
- ldr x19,[x30],#8 // *K++, x28 in next round
- add x3,x3,x9
- add x20,x20,x17 // h+=Sigma0(a)
- add x3,x3,x8
- cbnz x19,Loop_16_xx
-
- ldp x0,x2,[x29,#96]
- ldr x1,[x29,#112]
- sub x30,x30,#648 // rewind
-
- ldp x3,x4,[x0]
- ldp x5,x6,[x0,#2*8]
- add x1,x1,#14*8 // advance input pointer
- ldp x7,x8,[x0,#4*8]
- add x20,x20,x3
- ldp x9,x10,[x0,#6*8]
- add x21,x21,x4
- add x22,x22,x5
- add x23,x23,x6
- stp x20,x21,[x0]
- add x24,x24,x7
- add x25,x25,x8
- stp x22,x23,[x0,#2*8]
- add x26,x26,x9
- add x27,x27,x10
- cmp x1,x2
- stp x24,x25,[x0,#4*8]
- stp x26,x27,[x0,#6*8]
- b.ne Loop
-
- ldp x19,x20,[x29,#16]
- add sp,sp,#4*8
- ldp x21,x22,[x29,#32]
- ldp x23,x24,[x29,#48]
- ldp x25,x26,[x29,#64]
- ldp x27,x28,[x29,#80]
- ldp x29,x30,[sp],#128
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-.section .rodata
-.align 6
-
-LK512:
-.quad 0x428a2f98d728ae22,0x7137449123ef65cd
-.quad 0xb5c0fbcfec4d3b2f,0xe9b5dba58189dbbc
-.quad 0x3956c25bf348b538,0x59f111f1b605d019
-.quad 0x923f82a4af194f9b,0xab1c5ed5da6d8118
-.quad 0xd807aa98a3030242,0x12835b0145706fbe
-.quad 0x243185be4ee4b28c,0x550c7dc3d5ffb4e2
-.quad 0x72be5d74f27b896f,0x80deb1fe3b1696b1
-.quad 0x9bdc06a725c71235,0xc19bf174cf692694
-.quad 0xe49b69c19ef14ad2,0xefbe4786384f25e3
-.quad 0x0fc19dc68b8cd5b5,0x240ca1cc77ac9c65
-.quad 0x2de92c6f592b0275,0x4a7484aa6ea6e483
-.quad 0x5cb0a9dcbd41fbd4,0x76f988da831153b5
-.quad 0x983e5152ee66dfab,0xa831c66d2db43210
-.quad 0xb00327c898fb213f,0xbf597fc7beef0ee4
-.quad 0xc6e00bf33da88fc2,0xd5a79147930aa725
-.quad 0x06ca6351e003826f,0x142929670a0e6e70
-.quad 0x27b70a8546d22ffc,0x2e1b21385c26c926
-.quad 0x4d2c6dfc5ac42aed,0x53380d139d95b3df
-.quad 0x650a73548baf63de,0x766a0abb3c77b2a8
-.quad 0x81c2c92e47edaee6,0x92722c851482353b
-.quad 0xa2bfe8a14cf10364,0xa81a664bbc423001
-.quad 0xc24b8b70d0f89791,0xc76c51a30654be30
-.quad 0xd192e819d6ef5218,0xd69906245565a910
-.quad 0xf40e35855771202a,0x106aa07032bbd1b8
-.quad 0x19a4c116b8d2d0c8,0x1e376c085141ab53
-.quad 0x2748774cdf8eeb99,0x34b0bcb5e19b48a8
-.quad 0x391c0cb3c5c95a63,0x4ed8aa4ae3418acb
-.quad 0x5b9cca4f7763e373,0x682e6ff3d6b2b8a3
-.quad 0x748f82ee5defb2fc,0x78a5636f43172f60
-.quad 0x84c87814a1f0ab72,0x8cc702081a6439ec
-.quad 0x90befffa23631e28,0xa4506cebde82bde9
-.quad 0xbef9a3f7b2c67915,0xc67178f2e372532b
-.quad 0xca273eceea26619c,0xd186b8c721c0c207
-.quad 0xeada7dd6cde0eb1e,0xf57d4f7fee6ed178
-.quad 0x06f067aa72176fba,0x0a637dc5a2c898a6
-.quad 0x113f9804bef90dae,0x1b710b35131c471b
-.quad 0x28db77f523047d84,0x32caab7b40c72493
-.quad 0x3c9ebe0a15c9bebc,0x431d67c49c100d4c
-.quad 0x4cc5d4becb3e42b6,0x597f299cfc657e2a
-.quad 0x5fcb6fab3ad6faec,0x6c44198c4a475817
-.quad 0 // terminator
-
-.byte 83,72,65,53,49,50,32,98,108,111,99,107,32,116,114,97,110,115,102,111,114,109,32,102,111,114,32,65,82,77,118,56,44,32,67,82,89,80,84,79,71,65,77,83,32,98,121,32,60,97,112,112,114,111,64,111,112,101,110,115,115,108,46,111,114,103,62,0
-.align 2
-.align 2
-.text
-#ifndef __KERNEL__
-.globl sha512_block_data_order_hw
-
-.def sha512_block_data_order_hw
- .type 32
-.endef
-.align 6
-sha512_block_data_order_hw:
- // Armv8.3-A PAuth: even though x30 is pushed to stack it is not popped later.
- AARCH64_VALID_CALL_TARGET
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- ld1 {v16.16b,v17.16b,v18.16b,v19.16b},[x1],#64 // load input
- ld1 {v20.16b,v21.16b,v22.16b,v23.16b},[x1],#64
-
- ld1 {v0.2d,v1.2d,v2.2d,v3.2d},[x0] // load context
- adrp x3,LK512
- add x3,x3,:lo12:LK512
-
- rev64 v16.16b,v16.16b
- rev64 v17.16b,v17.16b
- rev64 v18.16b,v18.16b
- rev64 v19.16b,v19.16b
- rev64 v20.16b,v20.16b
- rev64 v21.16b,v21.16b
- rev64 v22.16b,v22.16b
- rev64 v23.16b,v23.16b
- b Loop_hw
-
-.align 4
-Loop_hw:
- ld1 {v24.2d},[x3],#16
- subs x2,x2,#1
- sub x4,x1,#128
- orr v26.16b,v0.16b,v0.16b // offload
- orr v27.16b,v1.16b,v1.16b
- orr v28.16b,v2.16b,v2.16b
- orr v29.16b,v3.16b,v3.16b
- csel x1,x1,x4,ne // conditional rewind
- add v24.2d,v24.2d,v16.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v2.16b,v3.16b,#8
- ext v6.16b,v1.16b,v2.16b,#8
- add v3.2d,v3.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec08230 //sha512su0 v16.16b,v17.16b
- ext v7.16b,v20.16b,v21.16b,#8
-.long 0xce6680a3 //sha512h v3.16b,v5.16b,v6.16b
-.long 0xce678af0 //sha512su1 v16.16b,v23.16b,v7.16b
- add v4.2d,v1.2d,v3.2d // "D + T1"
-.long 0xce608423 //sha512h2 v3.16b,v1.16b,v0.16b
- add v25.2d,v25.2d,v17.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v4.16b,v2.16b,#8
- ext v6.16b,v0.16b,v4.16b,#8
- add v2.2d,v2.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08251 //sha512su0 v17.16b,v18.16b
- ext v7.16b,v21.16b,v22.16b,#8
-.long 0xce6680a2 //sha512h v2.16b,v5.16b,v6.16b
-.long 0xce678a11 //sha512su1 v17.16b,v16.16b,v7.16b
- add v1.2d,v0.2d,v2.2d // "D + T1"
-.long 0xce638402 //sha512h2 v2.16b,v0.16b,v3.16b
- add v24.2d,v24.2d,v18.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v1.16b,v4.16b,#8
- ext v6.16b,v3.16b,v1.16b,#8
- add v4.2d,v4.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec08272 //sha512su0 v18.16b,v19.16b
- ext v7.16b,v22.16b,v23.16b,#8
-.long 0xce6680a4 //sha512h v4.16b,v5.16b,v6.16b
-.long 0xce678a32 //sha512su1 v18.16b,v17.16b,v7.16b
- add v0.2d,v3.2d,v4.2d // "D + T1"
-.long 0xce628464 //sha512h2 v4.16b,v3.16b,v2.16b
- add v25.2d,v25.2d,v19.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v0.16b,v1.16b,#8
- ext v6.16b,v2.16b,v0.16b,#8
- add v1.2d,v1.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08293 //sha512su0 v19.16b,v20.16b
- ext v7.16b,v23.16b,v16.16b,#8
-.long 0xce6680a1 //sha512h v1.16b,v5.16b,v6.16b
-.long 0xce678a53 //sha512su1 v19.16b,v18.16b,v7.16b
- add v3.2d,v2.2d,v1.2d // "D + T1"
-.long 0xce648441 //sha512h2 v1.16b,v2.16b,v4.16b
- add v24.2d,v24.2d,v20.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v3.16b,v0.16b,#8
- ext v6.16b,v4.16b,v3.16b,#8
- add v0.2d,v0.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec082b4 //sha512su0 v20.16b,v21.16b
- ext v7.16b,v16.16b,v17.16b,#8
-.long 0xce6680a0 //sha512h v0.16b,v5.16b,v6.16b
-.long 0xce678a74 //sha512su1 v20.16b,v19.16b,v7.16b
- add v2.2d,v4.2d,v0.2d // "D + T1"
-.long 0xce618480 //sha512h2 v0.16b,v4.16b,v1.16b
- add v25.2d,v25.2d,v21.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v2.16b,v3.16b,#8
- ext v6.16b,v1.16b,v2.16b,#8
- add v3.2d,v3.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec082d5 //sha512su0 v21.16b,v22.16b
- ext v7.16b,v17.16b,v18.16b,#8
-.long 0xce6680a3 //sha512h v3.16b,v5.16b,v6.16b
-.long 0xce678a95 //sha512su1 v21.16b,v20.16b,v7.16b
- add v4.2d,v1.2d,v3.2d // "D + T1"
-.long 0xce608423 //sha512h2 v3.16b,v1.16b,v0.16b
- add v24.2d,v24.2d,v22.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v4.16b,v2.16b,#8
- ext v6.16b,v0.16b,v4.16b,#8
- add v2.2d,v2.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec082f6 //sha512su0 v22.16b,v23.16b
- ext v7.16b,v18.16b,v19.16b,#8
-.long 0xce6680a2 //sha512h v2.16b,v5.16b,v6.16b
-.long 0xce678ab6 //sha512su1 v22.16b,v21.16b,v7.16b
- add v1.2d,v0.2d,v2.2d // "D + T1"
-.long 0xce638402 //sha512h2 v2.16b,v0.16b,v3.16b
- add v25.2d,v25.2d,v23.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v1.16b,v4.16b,#8
- ext v6.16b,v3.16b,v1.16b,#8
- add v4.2d,v4.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08217 //sha512su0 v23.16b,v16.16b
- ext v7.16b,v19.16b,v20.16b,#8
-.long 0xce6680a4 //sha512h v4.16b,v5.16b,v6.16b
-.long 0xce678ad7 //sha512su1 v23.16b,v22.16b,v7.16b
- add v0.2d,v3.2d,v4.2d // "D + T1"
-.long 0xce628464 //sha512h2 v4.16b,v3.16b,v2.16b
- add v24.2d,v24.2d,v16.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v0.16b,v1.16b,#8
- ext v6.16b,v2.16b,v0.16b,#8
- add v1.2d,v1.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec08230 //sha512su0 v16.16b,v17.16b
- ext v7.16b,v20.16b,v21.16b,#8
-.long 0xce6680a1 //sha512h v1.16b,v5.16b,v6.16b
-.long 0xce678af0 //sha512su1 v16.16b,v23.16b,v7.16b
- add v3.2d,v2.2d,v1.2d // "D + T1"
-.long 0xce648441 //sha512h2 v1.16b,v2.16b,v4.16b
- add v25.2d,v25.2d,v17.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v3.16b,v0.16b,#8
- ext v6.16b,v4.16b,v3.16b,#8
- add v0.2d,v0.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08251 //sha512su0 v17.16b,v18.16b
- ext v7.16b,v21.16b,v22.16b,#8
-.long 0xce6680a0 //sha512h v0.16b,v5.16b,v6.16b
-.long 0xce678a11 //sha512su1 v17.16b,v16.16b,v7.16b
- add v2.2d,v4.2d,v0.2d // "D + T1"
-.long 0xce618480 //sha512h2 v0.16b,v4.16b,v1.16b
- add v24.2d,v24.2d,v18.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v2.16b,v3.16b,#8
- ext v6.16b,v1.16b,v2.16b,#8
- add v3.2d,v3.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec08272 //sha512su0 v18.16b,v19.16b
- ext v7.16b,v22.16b,v23.16b,#8
-.long 0xce6680a3 //sha512h v3.16b,v5.16b,v6.16b
-.long 0xce678a32 //sha512su1 v18.16b,v17.16b,v7.16b
- add v4.2d,v1.2d,v3.2d // "D + T1"
-.long 0xce608423 //sha512h2 v3.16b,v1.16b,v0.16b
- add v25.2d,v25.2d,v19.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v4.16b,v2.16b,#8
- ext v6.16b,v0.16b,v4.16b,#8
- add v2.2d,v2.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08293 //sha512su0 v19.16b,v20.16b
- ext v7.16b,v23.16b,v16.16b,#8
-.long 0xce6680a2 //sha512h v2.16b,v5.16b,v6.16b
-.long 0xce678a53 //sha512su1 v19.16b,v18.16b,v7.16b
- add v1.2d,v0.2d,v2.2d // "D + T1"
-.long 0xce638402 //sha512h2 v2.16b,v0.16b,v3.16b
- add v24.2d,v24.2d,v20.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v1.16b,v4.16b,#8
- ext v6.16b,v3.16b,v1.16b,#8
- add v4.2d,v4.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec082b4 //sha512su0 v20.16b,v21.16b
- ext v7.16b,v16.16b,v17.16b,#8
-.long 0xce6680a4 //sha512h v4.16b,v5.16b,v6.16b
-.long 0xce678a74 //sha512su1 v20.16b,v19.16b,v7.16b
- add v0.2d,v3.2d,v4.2d // "D + T1"
-.long 0xce628464 //sha512h2 v4.16b,v3.16b,v2.16b
- add v25.2d,v25.2d,v21.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v0.16b,v1.16b,#8
- ext v6.16b,v2.16b,v0.16b,#8
- add v1.2d,v1.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec082d5 //sha512su0 v21.16b,v22.16b
- ext v7.16b,v17.16b,v18.16b,#8
-.long 0xce6680a1 //sha512h v1.16b,v5.16b,v6.16b
-.long 0xce678a95 //sha512su1 v21.16b,v20.16b,v7.16b
- add v3.2d,v2.2d,v1.2d // "D + T1"
-.long 0xce648441 //sha512h2 v1.16b,v2.16b,v4.16b
- add v24.2d,v24.2d,v22.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v3.16b,v0.16b,#8
- ext v6.16b,v4.16b,v3.16b,#8
- add v0.2d,v0.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec082f6 //sha512su0 v22.16b,v23.16b
- ext v7.16b,v18.16b,v19.16b,#8
-.long 0xce6680a0 //sha512h v0.16b,v5.16b,v6.16b
-.long 0xce678ab6 //sha512su1 v22.16b,v21.16b,v7.16b
- add v2.2d,v4.2d,v0.2d // "D + T1"
-.long 0xce618480 //sha512h2 v0.16b,v4.16b,v1.16b
- add v25.2d,v25.2d,v23.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v2.16b,v3.16b,#8
- ext v6.16b,v1.16b,v2.16b,#8
- add v3.2d,v3.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08217 //sha512su0 v23.16b,v16.16b
- ext v7.16b,v19.16b,v20.16b,#8
-.long 0xce6680a3 //sha512h v3.16b,v5.16b,v6.16b
-.long 0xce678ad7 //sha512su1 v23.16b,v22.16b,v7.16b
- add v4.2d,v1.2d,v3.2d // "D + T1"
-.long 0xce608423 //sha512h2 v3.16b,v1.16b,v0.16b
- add v24.2d,v24.2d,v16.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v4.16b,v2.16b,#8
- ext v6.16b,v0.16b,v4.16b,#8
- add v2.2d,v2.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec08230 //sha512su0 v16.16b,v17.16b
- ext v7.16b,v20.16b,v21.16b,#8
-.long 0xce6680a2 //sha512h v2.16b,v5.16b,v6.16b
-.long 0xce678af0 //sha512su1 v16.16b,v23.16b,v7.16b
- add v1.2d,v0.2d,v2.2d // "D + T1"
-.long 0xce638402 //sha512h2 v2.16b,v0.16b,v3.16b
- add v25.2d,v25.2d,v17.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v1.16b,v4.16b,#8
- ext v6.16b,v3.16b,v1.16b,#8
- add v4.2d,v4.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08251 //sha512su0 v17.16b,v18.16b
- ext v7.16b,v21.16b,v22.16b,#8
-.long 0xce6680a4 //sha512h v4.16b,v5.16b,v6.16b
-.long 0xce678a11 //sha512su1 v17.16b,v16.16b,v7.16b
- add v0.2d,v3.2d,v4.2d // "D + T1"
-.long 0xce628464 //sha512h2 v4.16b,v3.16b,v2.16b
- add v24.2d,v24.2d,v18.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v0.16b,v1.16b,#8
- ext v6.16b,v2.16b,v0.16b,#8
- add v1.2d,v1.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec08272 //sha512su0 v18.16b,v19.16b
- ext v7.16b,v22.16b,v23.16b,#8
-.long 0xce6680a1 //sha512h v1.16b,v5.16b,v6.16b
-.long 0xce678a32 //sha512su1 v18.16b,v17.16b,v7.16b
- add v3.2d,v2.2d,v1.2d // "D + T1"
-.long 0xce648441 //sha512h2 v1.16b,v2.16b,v4.16b
- add v25.2d,v25.2d,v19.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v3.16b,v0.16b,#8
- ext v6.16b,v4.16b,v3.16b,#8
- add v0.2d,v0.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08293 //sha512su0 v19.16b,v20.16b
- ext v7.16b,v23.16b,v16.16b,#8
-.long 0xce6680a0 //sha512h v0.16b,v5.16b,v6.16b
-.long 0xce678a53 //sha512su1 v19.16b,v18.16b,v7.16b
- add v2.2d,v4.2d,v0.2d // "D + T1"
-.long 0xce618480 //sha512h2 v0.16b,v4.16b,v1.16b
- add v24.2d,v24.2d,v20.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v2.16b,v3.16b,#8
- ext v6.16b,v1.16b,v2.16b,#8
- add v3.2d,v3.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec082b4 //sha512su0 v20.16b,v21.16b
- ext v7.16b,v16.16b,v17.16b,#8
-.long 0xce6680a3 //sha512h v3.16b,v5.16b,v6.16b
-.long 0xce678a74 //sha512su1 v20.16b,v19.16b,v7.16b
- add v4.2d,v1.2d,v3.2d // "D + T1"
-.long 0xce608423 //sha512h2 v3.16b,v1.16b,v0.16b
- add v25.2d,v25.2d,v21.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v4.16b,v2.16b,#8
- ext v6.16b,v0.16b,v4.16b,#8
- add v2.2d,v2.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec082d5 //sha512su0 v21.16b,v22.16b
- ext v7.16b,v17.16b,v18.16b,#8
-.long 0xce6680a2 //sha512h v2.16b,v5.16b,v6.16b
-.long 0xce678a95 //sha512su1 v21.16b,v20.16b,v7.16b
- add v1.2d,v0.2d,v2.2d // "D + T1"
-.long 0xce638402 //sha512h2 v2.16b,v0.16b,v3.16b
- add v24.2d,v24.2d,v22.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v1.16b,v4.16b,#8
- ext v6.16b,v3.16b,v1.16b,#8
- add v4.2d,v4.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec082f6 //sha512su0 v22.16b,v23.16b
- ext v7.16b,v18.16b,v19.16b,#8
-.long 0xce6680a4 //sha512h v4.16b,v5.16b,v6.16b
-.long 0xce678ab6 //sha512su1 v22.16b,v21.16b,v7.16b
- add v0.2d,v3.2d,v4.2d // "D + T1"
-.long 0xce628464 //sha512h2 v4.16b,v3.16b,v2.16b
- add v25.2d,v25.2d,v23.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v0.16b,v1.16b,#8
- ext v6.16b,v2.16b,v0.16b,#8
- add v1.2d,v1.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08217 //sha512su0 v23.16b,v16.16b
- ext v7.16b,v19.16b,v20.16b,#8
-.long 0xce6680a1 //sha512h v1.16b,v5.16b,v6.16b
-.long 0xce678ad7 //sha512su1 v23.16b,v22.16b,v7.16b
- add v3.2d,v2.2d,v1.2d // "D + T1"
-.long 0xce648441 //sha512h2 v1.16b,v2.16b,v4.16b
- add v24.2d,v24.2d,v16.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v3.16b,v0.16b,#8
- ext v6.16b,v4.16b,v3.16b,#8
- add v0.2d,v0.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec08230 //sha512su0 v16.16b,v17.16b
- ext v7.16b,v20.16b,v21.16b,#8
-.long 0xce6680a0 //sha512h v0.16b,v5.16b,v6.16b
-.long 0xce678af0 //sha512su1 v16.16b,v23.16b,v7.16b
- add v2.2d,v4.2d,v0.2d // "D + T1"
-.long 0xce618480 //sha512h2 v0.16b,v4.16b,v1.16b
- add v25.2d,v25.2d,v17.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v2.16b,v3.16b,#8
- ext v6.16b,v1.16b,v2.16b,#8
- add v3.2d,v3.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08251 //sha512su0 v17.16b,v18.16b
- ext v7.16b,v21.16b,v22.16b,#8
-.long 0xce6680a3 //sha512h v3.16b,v5.16b,v6.16b
-.long 0xce678a11 //sha512su1 v17.16b,v16.16b,v7.16b
- add v4.2d,v1.2d,v3.2d // "D + T1"
-.long 0xce608423 //sha512h2 v3.16b,v1.16b,v0.16b
- add v24.2d,v24.2d,v18.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v4.16b,v2.16b,#8
- ext v6.16b,v0.16b,v4.16b,#8
- add v2.2d,v2.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec08272 //sha512su0 v18.16b,v19.16b
- ext v7.16b,v22.16b,v23.16b,#8
-.long 0xce6680a2 //sha512h v2.16b,v5.16b,v6.16b
-.long 0xce678a32 //sha512su1 v18.16b,v17.16b,v7.16b
- add v1.2d,v0.2d,v2.2d // "D + T1"
-.long 0xce638402 //sha512h2 v2.16b,v0.16b,v3.16b
- add v25.2d,v25.2d,v19.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v1.16b,v4.16b,#8
- ext v6.16b,v3.16b,v1.16b,#8
- add v4.2d,v4.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08293 //sha512su0 v19.16b,v20.16b
- ext v7.16b,v23.16b,v16.16b,#8
-.long 0xce6680a4 //sha512h v4.16b,v5.16b,v6.16b
-.long 0xce678a53 //sha512su1 v19.16b,v18.16b,v7.16b
- add v0.2d,v3.2d,v4.2d // "D + T1"
-.long 0xce628464 //sha512h2 v4.16b,v3.16b,v2.16b
- add v24.2d,v24.2d,v20.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v0.16b,v1.16b,#8
- ext v6.16b,v2.16b,v0.16b,#8
- add v1.2d,v1.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec082b4 //sha512su0 v20.16b,v21.16b
- ext v7.16b,v16.16b,v17.16b,#8
-.long 0xce6680a1 //sha512h v1.16b,v5.16b,v6.16b
-.long 0xce678a74 //sha512su1 v20.16b,v19.16b,v7.16b
- add v3.2d,v2.2d,v1.2d // "D + T1"
-.long 0xce648441 //sha512h2 v1.16b,v2.16b,v4.16b
- add v25.2d,v25.2d,v21.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v3.16b,v0.16b,#8
- ext v6.16b,v4.16b,v3.16b,#8
- add v0.2d,v0.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec082d5 //sha512su0 v21.16b,v22.16b
- ext v7.16b,v17.16b,v18.16b,#8
-.long 0xce6680a0 //sha512h v0.16b,v5.16b,v6.16b
-.long 0xce678a95 //sha512su1 v21.16b,v20.16b,v7.16b
- add v2.2d,v4.2d,v0.2d // "D + T1"
-.long 0xce618480 //sha512h2 v0.16b,v4.16b,v1.16b
- add v24.2d,v24.2d,v22.2d
- ld1 {v25.2d},[x3],#16
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v2.16b,v3.16b,#8
- ext v6.16b,v1.16b,v2.16b,#8
- add v3.2d,v3.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xcec082f6 //sha512su0 v22.16b,v23.16b
- ext v7.16b,v18.16b,v19.16b,#8
-.long 0xce6680a3 //sha512h v3.16b,v5.16b,v6.16b
-.long 0xce678ab6 //sha512su1 v22.16b,v21.16b,v7.16b
- add v4.2d,v1.2d,v3.2d // "D + T1"
-.long 0xce608423 //sha512h2 v3.16b,v1.16b,v0.16b
- add v25.2d,v25.2d,v23.2d
- ld1 {v24.2d},[x3],#16
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v4.16b,v2.16b,#8
- ext v6.16b,v0.16b,v4.16b,#8
- add v2.2d,v2.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xcec08217 //sha512su0 v23.16b,v16.16b
- ext v7.16b,v19.16b,v20.16b,#8
-.long 0xce6680a2 //sha512h v2.16b,v5.16b,v6.16b
-.long 0xce678ad7 //sha512su1 v23.16b,v22.16b,v7.16b
- add v1.2d,v0.2d,v2.2d // "D + T1"
-.long 0xce638402 //sha512h2 v2.16b,v0.16b,v3.16b
- ld1 {v25.2d},[x3],#16
- add v24.2d,v24.2d,v16.2d
- ld1 {v16.16b},[x1],#16 // load next input
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v1.16b,v4.16b,#8
- ext v6.16b,v3.16b,v1.16b,#8
- add v4.2d,v4.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xce6680a4 //sha512h v4.16b,v5.16b,v6.16b
- rev64 v16.16b,v16.16b
- add v0.2d,v3.2d,v4.2d // "D + T1"
-.long 0xce628464 //sha512h2 v4.16b,v3.16b,v2.16b
- ld1 {v24.2d},[x3],#16
- add v25.2d,v25.2d,v17.2d
- ld1 {v17.16b},[x1],#16 // load next input
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v0.16b,v1.16b,#8
- ext v6.16b,v2.16b,v0.16b,#8
- add v1.2d,v1.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xce6680a1 //sha512h v1.16b,v5.16b,v6.16b
- rev64 v17.16b,v17.16b
- add v3.2d,v2.2d,v1.2d // "D + T1"
-.long 0xce648441 //sha512h2 v1.16b,v2.16b,v4.16b
- ld1 {v25.2d},[x3],#16
- add v24.2d,v24.2d,v18.2d
- ld1 {v18.16b},[x1],#16 // load next input
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v3.16b,v0.16b,#8
- ext v6.16b,v4.16b,v3.16b,#8
- add v0.2d,v0.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xce6680a0 //sha512h v0.16b,v5.16b,v6.16b
- rev64 v18.16b,v18.16b
- add v2.2d,v4.2d,v0.2d // "D + T1"
-.long 0xce618480 //sha512h2 v0.16b,v4.16b,v1.16b
- ld1 {v24.2d},[x3],#16
- add v25.2d,v25.2d,v19.2d
- ld1 {v19.16b},[x1],#16 // load next input
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v2.16b,v3.16b,#8
- ext v6.16b,v1.16b,v2.16b,#8
- add v3.2d,v3.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xce6680a3 //sha512h v3.16b,v5.16b,v6.16b
- rev64 v19.16b,v19.16b
- add v4.2d,v1.2d,v3.2d // "D + T1"
-.long 0xce608423 //sha512h2 v3.16b,v1.16b,v0.16b
- ld1 {v25.2d},[x3],#16
- add v24.2d,v24.2d,v20.2d
- ld1 {v20.16b},[x1],#16 // load next input
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v4.16b,v2.16b,#8
- ext v6.16b,v0.16b,v4.16b,#8
- add v2.2d,v2.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xce6680a2 //sha512h v2.16b,v5.16b,v6.16b
- rev64 v20.16b,v20.16b
- add v1.2d,v0.2d,v2.2d // "D + T1"
-.long 0xce638402 //sha512h2 v2.16b,v0.16b,v3.16b
- ld1 {v24.2d},[x3],#16
- add v25.2d,v25.2d,v21.2d
- ld1 {v21.16b},[x1],#16 // load next input
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v1.16b,v4.16b,#8
- ext v6.16b,v3.16b,v1.16b,#8
- add v4.2d,v4.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xce6680a4 //sha512h v4.16b,v5.16b,v6.16b
- rev64 v21.16b,v21.16b
- add v0.2d,v3.2d,v4.2d // "D + T1"
-.long 0xce628464 //sha512h2 v4.16b,v3.16b,v2.16b
- ld1 {v25.2d},[x3],#16
- add v24.2d,v24.2d,v22.2d
- ld1 {v22.16b},[x1],#16 // load next input
- ext v24.16b,v24.16b,v24.16b,#8
- ext v5.16b,v0.16b,v1.16b,#8
- ext v6.16b,v2.16b,v0.16b,#8
- add v1.2d,v1.2d,v24.2d // "T1 + H + K512[i]"
-.long 0xce6680a1 //sha512h v1.16b,v5.16b,v6.16b
- rev64 v22.16b,v22.16b
- add v3.2d,v2.2d,v1.2d // "D + T1"
-.long 0xce648441 //sha512h2 v1.16b,v2.16b,v4.16b
- sub x3,x3,#80*8 // rewind
- add v25.2d,v25.2d,v23.2d
- ld1 {v23.16b},[x1],#16 // load next input
- ext v25.16b,v25.16b,v25.16b,#8
- ext v5.16b,v3.16b,v0.16b,#8
- ext v6.16b,v4.16b,v3.16b,#8
- add v0.2d,v0.2d,v25.2d // "T1 + H + K512[i]"
-.long 0xce6680a0 //sha512h v0.16b,v5.16b,v6.16b
- rev64 v23.16b,v23.16b
- add v2.2d,v4.2d,v0.2d // "D + T1"
-.long 0xce618480 //sha512h2 v0.16b,v4.16b,v1.16b
- add v0.2d,v0.2d,v26.2d // accumulate
- add v1.2d,v1.2d,v27.2d
- add v2.2d,v2.2d,v28.2d
- add v3.2d,v3.2d,v29.2d
-
- cbnz x2,Loop_hw
-
- st1 {v0.2d,v1.2d,v2.2d,v3.2d},[x0] // store context
-
- ldr x29,[sp],#16
- ret
-
-#endif
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
diff --git a/win-aarch64/crypto/fipsmodule/vpaes-armv8-win.S b/win-aarch64/crypto/fipsmodule/vpaes-armv8-win.S
deleted file mode 100644
index d399d229..00000000
--- a/win-aarch64/crypto/fipsmodule/vpaes-armv8-win.S
+++ /dev/null
@@ -1,1262 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-
-.section .rodata
-
-
-.align 7 // totally strategic alignment
-_vpaes_consts:
-Lk_mc_forward: // mc_forward
-.quad 0x0407060500030201, 0x0C0F0E0D080B0A09
-.quad 0x080B0A0904070605, 0x000302010C0F0E0D
-.quad 0x0C0F0E0D080B0A09, 0x0407060500030201
-.quad 0x000302010C0F0E0D, 0x080B0A0904070605
-Lk_mc_backward: // mc_backward
-.quad 0x0605040702010003, 0x0E0D0C0F0A09080B
-.quad 0x020100030E0D0C0F, 0x0A09080B06050407
-.quad 0x0E0D0C0F0A09080B, 0x0605040702010003
-.quad 0x0A09080B06050407, 0x020100030E0D0C0F
-Lk_sr: // sr
-.quad 0x0706050403020100, 0x0F0E0D0C0B0A0908
-.quad 0x030E09040F0A0500, 0x0B06010C07020D08
-.quad 0x0F060D040B020900, 0x070E050C030A0108
-.quad 0x0B0E0104070A0D00, 0x0306090C0F020508
-
-//
-// "Hot" constants
-//
-Lk_inv: // inv, inva
-.quad 0x0E05060F0D080180, 0x040703090A0B0C02
-.quad 0x01040A060F0B0780, 0x030D0E0C02050809
-Lk_ipt: // input transform (lo, hi)
-.quad 0xC2B2E8985A2A7000, 0xCABAE09052227808
-.quad 0x4C01307D317C4D00, 0xCD80B1FCB0FDCC81
-Lk_sbo: // sbou, sbot
-.quad 0xD0D26D176FBDC700, 0x15AABF7AC502A878
-.quad 0xCFE474A55FBB6A00, 0x8E1E90D1412B35FA
-Lk_sb1: // sb1u, sb1t
-.quad 0x3618D415FAE22300, 0x3BF7CCC10D2ED9EF
-.quad 0xB19BE18FCB503E00, 0xA5DF7A6E142AF544
-Lk_sb2: // sb2u, sb2t
-.quad 0x69EB88400AE12900, 0xC2A163C8AB82234A
-.quad 0xE27A93C60B712400, 0x5EB7E955BC982FCD
-
-//
-// Decryption stuff
-//
-Lk_dipt: // decryption input transform
-.quad 0x0F505B040B545F00, 0x154A411E114E451A
-.quad 0x86E383E660056500, 0x12771772F491F194
-Lk_dsbo: // decryption sbox final output
-.quad 0x1387EA537EF94000, 0xC7AA6DB9D4943E2D
-.quad 0x12D7560F93441D00, 0xCA4B8159D8C58E9C
-Lk_dsb9: // decryption sbox output *9*u, *9*t
-.quad 0x851C03539A86D600, 0xCAD51F504F994CC9
-.quad 0xC03B1789ECD74900, 0x725E2C9EB2FBA565
-Lk_dsbd: // decryption sbox output *D*u, *D*t
-.quad 0x7D57CCDFE6B1A200, 0xF56E9B13882A4439
-.quad 0x3CE2FAF724C6CB00, 0x2931180D15DEEFD3
-Lk_dsbb: // decryption sbox output *B*u, *B*t
-.quad 0xD022649296B44200, 0x602646F6B0F2D404
-.quad 0xC19498A6CD596700, 0xF3FF0C3E3255AA6B
-Lk_dsbe: // decryption sbox output *E*u, *E*t
-.quad 0x46F2929626D4D000, 0x2242600464B4F6B0
-.quad 0x0C55A6CDFFAAC100, 0x9467F36B98593E32
-
-//
-// Key schedule constants
-//
-Lk_dksd: // decryption key schedule: invskew x*D
-.quad 0xFEB91A5DA3E44700, 0x0740E3A45A1DBEF9
-.quad 0x41C277F4B5368300, 0x5FDC69EAAB289D1E
-Lk_dksb: // decryption key schedule: invskew x*B
-.quad 0x9A4FCA1F8550D500, 0x03D653861CC94C99
-.quad 0x115BEDA7B6FC4A00, 0xD993256F7E3482C8
-Lk_dkse: // decryption key schedule: invskew x*E + 0x63
-.quad 0xD5031CCA1FC9D600, 0x53859A4C994F5086
-.quad 0xA23196054FDC7BE8, 0xCD5EF96A20B31487
-Lk_dks9: // decryption key schedule: invskew x*9
-.quad 0xB6116FC87ED9A700, 0x4AED933482255BFC
-.quad 0x4576516227143300, 0x8BB89FACE9DAFDCE
-
-Lk_rcon: // rcon
-.quad 0x1F8391B9AF9DEEB6, 0x702A98084D7C7D81
-
-Lk_opt: // output transform
-.quad 0xFF9F4929D6B66000, 0xF7974121DEBE6808
-.quad 0x01EDBD5150BCEC00, 0xE10D5DB1B05C0CE0
-Lk_deskew: // deskew tables: inverts the sbox's "skew"
-.quad 0x07E4A34047A4E300, 0x1DFEB95A5DBEF91A
-.quad 0x5F36B5DC83EA6900, 0x2841C2ABF49D1E77
-
-.byte 86,101,99,116,111,114,32,80,101,114,109,117,116,97,116,105,111,110,32,65,69,83,32,102,111,114,32,65,82,77,118,56,44,32,77,105,107,101,32,72,97,109,98,117,114,103,32,40,83,116,97,110,102,111,114,100,32,85,110,105,118,101,114,115,105,116,121,41,0
-.align 2
-
-.align 6
-
-.text
-##
-## _aes_preheat
-##
-## Fills register %r10 -> .aes_consts (so you can -fPIC)
-## and %xmm9-%xmm15 as specified below.
-##
-.def _vpaes_encrypt_preheat
- .type 32
-.endef
-.align 4
-_vpaes_encrypt_preheat:
- adrp x10, Lk_inv
- add x10, x10, :lo12:Lk_inv
- movi v17.16b, #0x0f
- ld1 {v18.2d,v19.2d}, [x10],#32 // Lk_inv
- ld1 {v20.2d,v21.2d,v22.2d,v23.2d}, [x10],#64 // Lk_ipt, Lk_sbo
- ld1 {v24.2d,v25.2d,v26.2d,v27.2d}, [x10] // Lk_sb1, Lk_sb2
- ret
-
-
-##
-## _aes_encrypt_core
-##
-## AES-encrypt %xmm0.
-##
-## Inputs:
-## %xmm0 = input
-## %xmm9-%xmm15 as in _vpaes_preheat
-## (%rdx) = scheduled keys
-##
-## Output in %xmm0
-## Clobbers %xmm1-%xmm5, %r9, %r10, %r11, %rax
-## Preserves %xmm6 - %xmm8 so you get some local vectors
-##
-##
-.def _vpaes_encrypt_core
- .type 32
-.endef
-.align 4
-_vpaes_encrypt_core:
- mov x9, x2
- ldr w8, [x2,#240] // pull rounds
- adrp x11, Lk_mc_forward+16
- add x11, x11, :lo12:Lk_mc_forward+16
- // vmovdqa .Lk_ipt(%rip), %xmm2 # iptlo
- ld1 {v16.2d}, [x9], #16 // vmovdqu (%r9), %xmm5 # round0 key
- and v1.16b, v7.16b, v17.16b // vpand %xmm9, %xmm0, %xmm1
- ushr v0.16b, v7.16b, #4 // vpsrlb $4, %xmm0, %xmm0
- tbl v1.16b, {v20.16b}, v1.16b // vpshufb %xmm1, %xmm2, %xmm1
- // vmovdqa .Lk_ipt+16(%rip), %xmm3 # ipthi
- tbl v2.16b, {v21.16b}, v0.16b // vpshufb %xmm0, %xmm3, %xmm2
- eor v0.16b, v1.16b, v16.16b // vpxor %xmm5, %xmm1, %xmm0
- eor v0.16b, v0.16b, v2.16b // vpxor %xmm2, %xmm0, %xmm0
- b Lenc_entry
-
-.align 4
-Lenc_loop:
- // middle of middle round
- add x10, x11, #0x40
- tbl v4.16b, {v25.16b}, v2.16b // vpshufb %xmm2, %xmm13, %xmm4 # 4 = sb1u
- ld1 {v1.2d}, [x11], #16 // vmovdqa -0x40(%r11,%r10), %xmm1 # Lk_mc_forward[]
- tbl v0.16b, {v24.16b}, v3.16b // vpshufb %xmm3, %xmm12, %xmm0 # 0 = sb1t
- eor v4.16b, v4.16b, v16.16b // vpxor %xmm5, %xmm4, %xmm4 # 4 = sb1u + k
- tbl v5.16b, {v27.16b}, v2.16b // vpshufb %xmm2, %xmm15, %xmm5 # 4 = sb2u
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 0 = A
- tbl v2.16b, {v26.16b}, v3.16b // vpshufb %xmm3, %xmm14, %xmm2 # 2 = sb2t
- ld1 {v4.2d}, [x10] // vmovdqa (%r11,%r10), %xmm4 # Lk_mc_backward[]
- tbl v3.16b, {v0.16b}, v1.16b // vpshufb %xmm1, %xmm0, %xmm3 # 0 = B
- eor v2.16b, v2.16b, v5.16b // vpxor %xmm5, %xmm2, %xmm2 # 2 = 2A
- tbl v0.16b, {v0.16b}, v4.16b // vpshufb %xmm4, %xmm0, %xmm0 # 3 = D
- eor v3.16b, v3.16b, v2.16b // vpxor %xmm2, %xmm3, %xmm3 # 0 = 2A+B
- tbl v4.16b, {v3.16b}, v1.16b // vpshufb %xmm1, %xmm3, %xmm4 # 0 = 2B+C
- eor v0.16b, v0.16b, v3.16b // vpxor %xmm3, %xmm0, %xmm0 # 3 = 2A+B+D
- and x11, x11, #~(1<<6) // and $0x30, %r11 # ... mod 4
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 0 = 2A+3B+C+D
- sub w8, w8, #1 // nr--
-
-Lenc_entry:
- // top of round
- and v1.16b, v0.16b, v17.16b // vpand %xmm0, %xmm9, %xmm1 # 0 = k
- ushr v0.16b, v0.16b, #4 // vpsrlb $4, %xmm0, %xmm0 # 1 = i
- tbl v5.16b, {v19.16b}, v1.16b // vpshufb %xmm1, %xmm11, %xmm5 # 2 = a/k
- eor v1.16b, v1.16b, v0.16b // vpxor %xmm0, %xmm1, %xmm1 # 0 = j
- tbl v3.16b, {v18.16b}, v0.16b // vpshufb %xmm0, %xmm10, %xmm3 # 3 = 1/i
- tbl v4.16b, {v18.16b}, v1.16b // vpshufb %xmm1, %xmm10, %xmm4 # 4 = 1/j
- eor v3.16b, v3.16b, v5.16b // vpxor %xmm5, %xmm3, %xmm3 # 3 = iak = 1/i + a/k
- eor v4.16b, v4.16b, v5.16b // vpxor %xmm5, %xmm4, %xmm4 # 4 = jak = 1/j + a/k
- tbl v2.16b, {v18.16b}, v3.16b // vpshufb %xmm3, %xmm10, %xmm2 # 2 = 1/iak
- tbl v3.16b, {v18.16b}, v4.16b // vpshufb %xmm4, %xmm10, %xmm3 # 3 = 1/jak
- eor v2.16b, v2.16b, v1.16b // vpxor %xmm1, %xmm2, %xmm2 # 2 = io
- eor v3.16b, v3.16b, v0.16b // vpxor %xmm0, %xmm3, %xmm3 # 3 = jo
- ld1 {v16.2d}, [x9],#16 // vmovdqu (%r9), %xmm5
- cbnz w8, Lenc_loop
-
- // middle of last round
- add x10, x11, #0x80
- // vmovdqa -0x60(%r10), %xmm4 # 3 : sbou .Lk_sbo
- // vmovdqa -0x50(%r10), %xmm0 # 0 : sbot .Lk_sbo+16
- tbl v4.16b, {v22.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sbou
- ld1 {v1.2d}, [x10] // vmovdqa 0x40(%r11,%r10), %xmm1 # Lk_sr[]
- tbl v0.16b, {v23.16b}, v3.16b // vpshufb %xmm3, %xmm0, %xmm0 # 0 = sb1t
- eor v4.16b, v4.16b, v16.16b // vpxor %xmm5, %xmm4, %xmm4 # 4 = sb1u + k
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 0 = A
- tbl v0.16b, {v0.16b}, v1.16b // vpshufb %xmm1, %xmm0, %xmm0
- ret
-
-
-.globl vpaes_encrypt
-
-.def vpaes_encrypt
- .type 32
-.endef
-.align 4
-vpaes_encrypt:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- ld1 {v7.16b}, [x0]
- bl _vpaes_encrypt_preheat
- bl _vpaes_encrypt_core
- st1 {v0.16b}, [x1]
-
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-.def _vpaes_encrypt_2x
- .type 32
-.endef
-.align 4
-_vpaes_encrypt_2x:
- mov x9, x2
- ldr w8, [x2,#240] // pull rounds
- adrp x11, Lk_mc_forward+16
- add x11, x11, :lo12:Lk_mc_forward+16
- // vmovdqa .Lk_ipt(%rip), %xmm2 # iptlo
- ld1 {v16.2d}, [x9], #16 // vmovdqu (%r9), %xmm5 # round0 key
- and v1.16b, v14.16b, v17.16b // vpand %xmm9, %xmm0, %xmm1
- ushr v0.16b, v14.16b, #4 // vpsrlb $4, %xmm0, %xmm0
- and v9.16b, v15.16b, v17.16b
- ushr v8.16b, v15.16b, #4
- tbl v1.16b, {v20.16b}, v1.16b // vpshufb %xmm1, %xmm2, %xmm1
- tbl v9.16b, {v20.16b}, v9.16b
- // vmovdqa .Lk_ipt+16(%rip), %xmm3 # ipthi
- tbl v2.16b, {v21.16b}, v0.16b // vpshufb %xmm0, %xmm3, %xmm2
- tbl v10.16b, {v21.16b}, v8.16b
- eor v0.16b, v1.16b, v16.16b // vpxor %xmm5, %xmm1, %xmm0
- eor v8.16b, v9.16b, v16.16b
- eor v0.16b, v0.16b, v2.16b // vpxor %xmm2, %xmm0, %xmm0
- eor v8.16b, v8.16b, v10.16b
- b Lenc_2x_entry
-
-.align 4
-Lenc_2x_loop:
- // middle of middle round
- add x10, x11, #0x40
- tbl v4.16b, {v25.16b}, v2.16b // vpshufb %xmm2, %xmm13, %xmm4 # 4 = sb1u
- tbl v12.16b, {v25.16b}, v10.16b
- ld1 {v1.2d}, [x11], #16 // vmovdqa -0x40(%r11,%r10), %xmm1 # Lk_mc_forward[]
- tbl v0.16b, {v24.16b}, v3.16b // vpshufb %xmm3, %xmm12, %xmm0 # 0 = sb1t
- tbl v8.16b, {v24.16b}, v11.16b
- eor v4.16b, v4.16b, v16.16b // vpxor %xmm5, %xmm4, %xmm4 # 4 = sb1u + k
- eor v12.16b, v12.16b, v16.16b
- tbl v5.16b, {v27.16b}, v2.16b // vpshufb %xmm2, %xmm15, %xmm5 # 4 = sb2u
- tbl v13.16b, {v27.16b}, v10.16b
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 0 = A
- eor v8.16b, v8.16b, v12.16b
- tbl v2.16b, {v26.16b}, v3.16b // vpshufb %xmm3, %xmm14, %xmm2 # 2 = sb2t
- tbl v10.16b, {v26.16b}, v11.16b
- ld1 {v4.2d}, [x10] // vmovdqa (%r11,%r10), %xmm4 # Lk_mc_backward[]
- tbl v3.16b, {v0.16b}, v1.16b // vpshufb %xmm1, %xmm0, %xmm3 # 0 = B
- tbl v11.16b, {v8.16b}, v1.16b
- eor v2.16b, v2.16b, v5.16b // vpxor %xmm5, %xmm2, %xmm2 # 2 = 2A
- eor v10.16b, v10.16b, v13.16b
- tbl v0.16b, {v0.16b}, v4.16b // vpshufb %xmm4, %xmm0, %xmm0 # 3 = D
- tbl v8.16b, {v8.16b}, v4.16b
- eor v3.16b, v3.16b, v2.16b // vpxor %xmm2, %xmm3, %xmm3 # 0 = 2A+B
- eor v11.16b, v11.16b, v10.16b
- tbl v4.16b, {v3.16b}, v1.16b // vpshufb %xmm1, %xmm3, %xmm4 # 0 = 2B+C
- tbl v12.16b, {v11.16b},v1.16b
- eor v0.16b, v0.16b, v3.16b // vpxor %xmm3, %xmm0, %xmm0 # 3 = 2A+B+D
- eor v8.16b, v8.16b, v11.16b
- and x11, x11, #~(1<<6) // and $0x30, %r11 # ... mod 4
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 0 = 2A+3B+C+D
- eor v8.16b, v8.16b, v12.16b
- sub w8, w8, #1 // nr--
-
-Lenc_2x_entry:
- // top of round
- and v1.16b, v0.16b, v17.16b // vpand %xmm0, %xmm9, %xmm1 # 0 = k
- ushr v0.16b, v0.16b, #4 // vpsrlb $4, %xmm0, %xmm0 # 1 = i
- and v9.16b, v8.16b, v17.16b
- ushr v8.16b, v8.16b, #4
- tbl v5.16b, {v19.16b},v1.16b // vpshufb %xmm1, %xmm11, %xmm5 # 2 = a/k
- tbl v13.16b, {v19.16b},v9.16b
- eor v1.16b, v1.16b, v0.16b // vpxor %xmm0, %xmm1, %xmm1 # 0 = j
- eor v9.16b, v9.16b, v8.16b
- tbl v3.16b, {v18.16b},v0.16b // vpshufb %xmm0, %xmm10, %xmm3 # 3 = 1/i
- tbl v11.16b, {v18.16b},v8.16b
- tbl v4.16b, {v18.16b},v1.16b // vpshufb %xmm1, %xmm10, %xmm4 # 4 = 1/j
- tbl v12.16b, {v18.16b},v9.16b
- eor v3.16b, v3.16b, v5.16b // vpxor %xmm5, %xmm3, %xmm3 # 3 = iak = 1/i + a/k
- eor v11.16b, v11.16b, v13.16b
- eor v4.16b, v4.16b, v5.16b // vpxor %xmm5, %xmm4, %xmm4 # 4 = jak = 1/j + a/k
- eor v12.16b, v12.16b, v13.16b
- tbl v2.16b, {v18.16b},v3.16b // vpshufb %xmm3, %xmm10, %xmm2 # 2 = 1/iak
- tbl v10.16b, {v18.16b},v11.16b
- tbl v3.16b, {v18.16b},v4.16b // vpshufb %xmm4, %xmm10, %xmm3 # 3 = 1/jak
- tbl v11.16b, {v18.16b},v12.16b
- eor v2.16b, v2.16b, v1.16b // vpxor %xmm1, %xmm2, %xmm2 # 2 = io
- eor v10.16b, v10.16b, v9.16b
- eor v3.16b, v3.16b, v0.16b // vpxor %xmm0, %xmm3, %xmm3 # 3 = jo
- eor v11.16b, v11.16b, v8.16b
- ld1 {v16.2d}, [x9],#16 // vmovdqu (%r9), %xmm5
- cbnz w8, Lenc_2x_loop
-
- // middle of last round
- add x10, x11, #0x80
- // vmovdqa -0x60(%r10), %xmm4 # 3 : sbou .Lk_sbo
- // vmovdqa -0x50(%r10), %xmm0 # 0 : sbot .Lk_sbo+16
- tbl v4.16b, {v22.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sbou
- tbl v12.16b, {v22.16b}, v10.16b
- ld1 {v1.2d}, [x10] // vmovdqa 0x40(%r11,%r10), %xmm1 # Lk_sr[]
- tbl v0.16b, {v23.16b}, v3.16b // vpshufb %xmm3, %xmm0, %xmm0 # 0 = sb1t
- tbl v8.16b, {v23.16b}, v11.16b
- eor v4.16b, v4.16b, v16.16b // vpxor %xmm5, %xmm4, %xmm4 # 4 = sb1u + k
- eor v12.16b, v12.16b, v16.16b
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 0 = A
- eor v8.16b, v8.16b, v12.16b
- tbl v0.16b, {v0.16b},v1.16b // vpshufb %xmm1, %xmm0, %xmm0
- tbl v1.16b, {v8.16b},v1.16b
- ret
-
-
-.def _vpaes_decrypt_preheat
- .type 32
-.endef
-.align 4
-_vpaes_decrypt_preheat:
- adrp x10, Lk_inv
- add x10, x10, :lo12:Lk_inv
- movi v17.16b, #0x0f
- adrp x11, Lk_dipt
- add x11, x11, :lo12:Lk_dipt
- ld1 {v18.2d,v19.2d}, [x10],#32 // Lk_inv
- ld1 {v20.2d,v21.2d,v22.2d,v23.2d}, [x11],#64 // Lk_dipt, Lk_dsbo
- ld1 {v24.2d,v25.2d,v26.2d,v27.2d}, [x11],#64 // Lk_dsb9, Lk_dsbd
- ld1 {v28.2d,v29.2d,v30.2d,v31.2d}, [x11] // Lk_dsbb, Lk_dsbe
- ret
-
-
-##
-## Decryption core
-##
-## Same API as encryption core.
-##
-.def _vpaes_decrypt_core
- .type 32
-.endef
-.align 4
-_vpaes_decrypt_core:
- mov x9, x2
- ldr w8, [x2,#240] // pull rounds
-
- // vmovdqa .Lk_dipt(%rip), %xmm2 # iptlo
- lsl x11, x8, #4 // mov %rax, %r11; shl $4, %r11
- eor x11, x11, #0x30 // xor $0x30, %r11
- adrp x10, Lk_sr
- add x10, x10, :lo12:Lk_sr
- and x11, x11, #0x30 // and $0x30, %r11
- add x11, x11, x10
- adrp x10, Lk_mc_forward+48
- add x10, x10, :lo12:Lk_mc_forward+48
-
- ld1 {v16.2d}, [x9],#16 // vmovdqu (%r9), %xmm4 # round0 key
- and v1.16b, v7.16b, v17.16b // vpand %xmm9, %xmm0, %xmm1
- ushr v0.16b, v7.16b, #4 // vpsrlb $4, %xmm0, %xmm0
- tbl v2.16b, {v20.16b}, v1.16b // vpshufb %xmm1, %xmm2, %xmm2
- ld1 {v5.2d}, [x10] // vmovdqa Lk_mc_forward+48(%rip), %xmm5
- // vmovdqa .Lk_dipt+16(%rip), %xmm1 # ipthi
- tbl v0.16b, {v21.16b}, v0.16b // vpshufb %xmm0, %xmm1, %xmm0
- eor v2.16b, v2.16b, v16.16b // vpxor %xmm4, %xmm2, %xmm2
- eor v0.16b, v0.16b, v2.16b // vpxor %xmm2, %xmm0, %xmm0
- b Ldec_entry
-
-.align 4
-Ldec_loop:
-//
-// Inverse mix columns
-//
- // vmovdqa -0x20(%r10),%xmm4 # 4 : sb9u
- // vmovdqa -0x10(%r10),%xmm1 # 0 : sb9t
- tbl v4.16b, {v24.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sb9u
- tbl v1.16b, {v25.16b}, v3.16b // vpshufb %xmm3, %xmm1, %xmm1 # 0 = sb9t
- eor v0.16b, v4.16b, v16.16b // vpxor %xmm4, %xmm0, %xmm0
- // vmovdqa 0x00(%r10),%xmm4 # 4 : sbdu
- eor v0.16b, v0.16b, v1.16b // vpxor %xmm1, %xmm0, %xmm0 # 0 = ch
- // vmovdqa 0x10(%r10),%xmm1 # 0 : sbdt
-
- tbl v4.16b, {v26.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sbdu
- tbl v0.16b, {v0.16b}, v5.16b // vpshufb %xmm5, %xmm0, %xmm0 # MC ch
- tbl v1.16b, {v27.16b}, v3.16b // vpshufb %xmm3, %xmm1, %xmm1 # 0 = sbdt
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 4 = ch
- // vmovdqa 0x20(%r10), %xmm4 # 4 : sbbu
- eor v0.16b, v0.16b, v1.16b // vpxor %xmm1, %xmm0, %xmm0 # 0 = ch
- // vmovdqa 0x30(%r10), %xmm1 # 0 : sbbt
-
- tbl v4.16b, {v28.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sbbu
- tbl v0.16b, {v0.16b}, v5.16b // vpshufb %xmm5, %xmm0, %xmm0 # MC ch
- tbl v1.16b, {v29.16b}, v3.16b // vpshufb %xmm3, %xmm1, %xmm1 # 0 = sbbt
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 4 = ch
- // vmovdqa 0x40(%r10), %xmm4 # 4 : sbeu
- eor v0.16b, v0.16b, v1.16b // vpxor %xmm1, %xmm0, %xmm0 # 0 = ch
- // vmovdqa 0x50(%r10), %xmm1 # 0 : sbet
-
- tbl v4.16b, {v30.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sbeu
- tbl v0.16b, {v0.16b}, v5.16b // vpshufb %xmm5, %xmm0, %xmm0 # MC ch
- tbl v1.16b, {v31.16b}, v3.16b // vpshufb %xmm3, %xmm1, %xmm1 # 0 = sbet
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 4 = ch
- ext v5.16b, v5.16b, v5.16b, #12 // vpalignr $12, %xmm5, %xmm5, %xmm5
- eor v0.16b, v0.16b, v1.16b // vpxor %xmm1, %xmm0, %xmm0 # 0 = ch
- sub w8, w8, #1 // sub $1,%rax # nr--
-
-Ldec_entry:
- // top of round
- and v1.16b, v0.16b, v17.16b // vpand %xmm9, %xmm0, %xmm1 # 0 = k
- ushr v0.16b, v0.16b, #4 // vpsrlb $4, %xmm0, %xmm0 # 1 = i
- tbl v2.16b, {v19.16b}, v1.16b // vpshufb %xmm1, %xmm11, %xmm2 # 2 = a/k
- eor v1.16b, v1.16b, v0.16b // vpxor %xmm0, %xmm1, %xmm1 # 0 = j
- tbl v3.16b, {v18.16b}, v0.16b // vpshufb %xmm0, %xmm10, %xmm3 # 3 = 1/i
- tbl v4.16b, {v18.16b}, v1.16b // vpshufb %xmm1, %xmm10, %xmm4 # 4 = 1/j
- eor v3.16b, v3.16b, v2.16b // vpxor %xmm2, %xmm3, %xmm3 # 3 = iak = 1/i + a/k
- eor v4.16b, v4.16b, v2.16b // vpxor %xmm2, %xmm4, %xmm4 # 4 = jak = 1/j + a/k
- tbl v2.16b, {v18.16b}, v3.16b // vpshufb %xmm3, %xmm10, %xmm2 # 2 = 1/iak
- tbl v3.16b, {v18.16b}, v4.16b // vpshufb %xmm4, %xmm10, %xmm3 # 3 = 1/jak
- eor v2.16b, v2.16b, v1.16b // vpxor %xmm1, %xmm2, %xmm2 # 2 = io
- eor v3.16b, v3.16b, v0.16b // vpxor %xmm0, %xmm3, %xmm3 # 3 = jo
- ld1 {v16.2d}, [x9],#16 // vmovdqu (%r9), %xmm0
- cbnz w8, Ldec_loop
-
- // middle of last round
- // vmovdqa 0x60(%r10), %xmm4 # 3 : sbou
- tbl v4.16b, {v22.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sbou
- // vmovdqa 0x70(%r10), %xmm1 # 0 : sbot
- ld1 {v2.2d}, [x11] // vmovdqa -0x160(%r11), %xmm2 # Lk_sr-Lk_dsbd=-0x160
- tbl v1.16b, {v23.16b}, v3.16b // vpshufb %xmm3, %xmm1, %xmm1 # 0 = sb1t
- eor v4.16b, v4.16b, v16.16b // vpxor %xmm0, %xmm4, %xmm4 # 4 = sb1u + k
- eor v0.16b, v1.16b, v4.16b // vpxor %xmm4, %xmm1, %xmm0 # 0 = A
- tbl v0.16b, {v0.16b}, v2.16b // vpshufb %xmm2, %xmm0, %xmm0
- ret
-
-
-.globl vpaes_decrypt
-
-.def vpaes_decrypt
- .type 32
-.endef
-.align 4
-vpaes_decrypt:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- ld1 {v7.16b}, [x0]
- bl _vpaes_decrypt_preheat
- bl _vpaes_decrypt_core
- st1 {v0.16b}, [x1]
-
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-// v14-v15 input, v0-v1 output
-.def _vpaes_decrypt_2x
- .type 32
-.endef
-.align 4
-_vpaes_decrypt_2x:
- mov x9, x2
- ldr w8, [x2,#240] // pull rounds
-
- // vmovdqa .Lk_dipt(%rip), %xmm2 # iptlo
- lsl x11, x8, #4 // mov %rax, %r11; shl $4, %r11
- eor x11, x11, #0x30 // xor $0x30, %r11
- adrp x10, Lk_sr
- add x10, x10, :lo12:Lk_sr
- and x11, x11, #0x30 // and $0x30, %r11
- add x11, x11, x10
- adrp x10, Lk_mc_forward+48
- add x10, x10, :lo12:Lk_mc_forward+48
-
- ld1 {v16.2d}, [x9],#16 // vmovdqu (%r9), %xmm4 # round0 key
- and v1.16b, v14.16b, v17.16b // vpand %xmm9, %xmm0, %xmm1
- ushr v0.16b, v14.16b, #4 // vpsrlb $4, %xmm0, %xmm0
- and v9.16b, v15.16b, v17.16b
- ushr v8.16b, v15.16b, #4
- tbl v2.16b, {v20.16b},v1.16b // vpshufb %xmm1, %xmm2, %xmm2
- tbl v10.16b, {v20.16b},v9.16b
- ld1 {v5.2d}, [x10] // vmovdqa Lk_mc_forward+48(%rip), %xmm5
- // vmovdqa .Lk_dipt+16(%rip), %xmm1 # ipthi
- tbl v0.16b, {v21.16b},v0.16b // vpshufb %xmm0, %xmm1, %xmm0
- tbl v8.16b, {v21.16b},v8.16b
- eor v2.16b, v2.16b, v16.16b // vpxor %xmm4, %xmm2, %xmm2
- eor v10.16b, v10.16b, v16.16b
- eor v0.16b, v0.16b, v2.16b // vpxor %xmm2, %xmm0, %xmm0
- eor v8.16b, v8.16b, v10.16b
- b Ldec_2x_entry
-
-.align 4
-Ldec_2x_loop:
-//
-// Inverse mix columns
-//
- // vmovdqa -0x20(%r10),%xmm4 # 4 : sb9u
- // vmovdqa -0x10(%r10),%xmm1 # 0 : sb9t
- tbl v4.16b, {v24.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sb9u
- tbl v12.16b, {v24.16b}, v10.16b
- tbl v1.16b, {v25.16b}, v3.16b // vpshufb %xmm3, %xmm1, %xmm1 # 0 = sb9t
- tbl v9.16b, {v25.16b}, v11.16b
- eor v0.16b, v4.16b, v16.16b // vpxor %xmm4, %xmm0, %xmm0
- eor v8.16b, v12.16b, v16.16b
- // vmovdqa 0x00(%r10),%xmm4 # 4 : sbdu
- eor v0.16b, v0.16b, v1.16b // vpxor %xmm1, %xmm0, %xmm0 # 0 = ch
- eor v8.16b, v8.16b, v9.16b // vpxor %xmm1, %xmm0, %xmm0 # 0 = ch
- // vmovdqa 0x10(%r10),%xmm1 # 0 : sbdt
-
- tbl v4.16b, {v26.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sbdu
- tbl v12.16b, {v26.16b}, v10.16b
- tbl v0.16b, {v0.16b},v5.16b // vpshufb %xmm5, %xmm0, %xmm0 # MC ch
- tbl v8.16b, {v8.16b},v5.16b
- tbl v1.16b, {v27.16b}, v3.16b // vpshufb %xmm3, %xmm1, %xmm1 # 0 = sbdt
- tbl v9.16b, {v27.16b}, v11.16b
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 4 = ch
- eor v8.16b, v8.16b, v12.16b
- // vmovdqa 0x20(%r10), %xmm4 # 4 : sbbu
- eor v0.16b, v0.16b, v1.16b // vpxor %xmm1, %xmm0, %xmm0 # 0 = ch
- eor v8.16b, v8.16b, v9.16b
- // vmovdqa 0x30(%r10), %xmm1 # 0 : sbbt
-
- tbl v4.16b, {v28.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sbbu
- tbl v12.16b, {v28.16b}, v10.16b
- tbl v0.16b, {v0.16b},v5.16b // vpshufb %xmm5, %xmm0, %xmm0 # MC ch
- tbl v8.16b, {v8.16b},v5.16b
- tbl v1.16b, {v29.16b}, v3.16b // vpshufb %xmm3, %xmm1, %xmm1 # 0 = sbbt
- tbl v9.16b, {v29.16b}, v11.16b
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 4 = ch
- eor v8.16b, v8.16b, v12.16b
- // vmovdqa 0x40(%r10), %xmm4 # 4 : sbeu
- eor v0.16b, v0.16b, v1.16b // vpxor %xmm1, %xmm0, %xmm0 # 0 = ch
- eor v8.16b, v8.16b, v9.16b
- // vmovdqa 0x50(%r10), %xmm1 # 0 : sbet
-
- tbl v4.16b, {v30.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sbeu
- tbl v12.16b, {v30.16b}, v10.16b
- tbl v0.16b, {v0.16b},v5.16b // vpshufb %xmm5, %xmm0, %xmm0 # MC ch
- tbl v8.16b, {v8.16b},v5.16b
- tbl v1.16b, {v31.16b}, v3.16b // vpshufb %xmm3, %xmm1, %xmm1 # 0 = sbet
- tbl v9.16b, {v31.16b}, v11.16b
- eor v0.16b, v0.16b, v4.16b // vpxor %xmm4, %xmm0, %xmm0 # 4 = ch
- eor v8.16b, v8.16b, v12.16b
- ext v5.16b, v5.16b, v5.16b, #12 // vpalignr $12, %xmm5, %xmm5, %xmm5
- eor v0.16b, v0.16b, v1.16b // vpxor %xmm1, %xmm0, %xmm0 # 0 = ch
- eor v8.16b, v8.16b, v9.16b
- sub w8, w8, #1 // sub $1,%rax # nr--
-
-Ldec_2x_entry:
- // top of round
- and v1.16b, v0.16b, v17.16b // vpand %xmm9, %xmm0, %xmm1 # 0 = k
- ushr v0.16b, v0.16b, #4 // vpsrlb $4, %xmm0, %xmm0 # 1 = i
- and v9.16b, v8.16b, v17.16b
- ushr v8.16b, v8.16b, #4
- tbl v2.16b, {v19.16b},v1.16b // vpshufb %xmm1, %xmm11, %xmm2 # 2 = a/k
- tbl v10.16b, {v19.16b},v9.16b
- eor v1.16b, v1.16b, v0.16b // vpxor %xmm0, %xmm1, %xmm1 # 0 = j
- eor v9.16b, v9.16b, v8.16b
- tbl v3.16b, {v18.16b},v0.16b // vpshufb %xmm0, %xmm10, %xmm3 # 3 = 1/i
- tbl v11.16b, {v18.16b},v8.16b
- tbl v4.16b, {v18.16b},v1.16b // vpshufb %xmm1, %xmm10, %xmm4 # 4 = 1/j
- tbl v12.16b, {v18.16b},v9.16b
- eor v3.16b, v3.16b, v2.16b // vpxor %xmm2, %xmm3, %xmm3 # 3 = iak = 1/i + a/k
- eor v11.16b, v11.16b, v10.16b
- eor v4.16b, v4.16b, v2.16b // vpxor %xmm2, %xmm4, %xmm4 # 4 = jak = 1/j + a/k
- eor v12.16b, v12.16b, v10.16b
- tbl v2.16b, {v18.16b},v3.16b // vpshufb %xmm3, %xmm10, %xmm2 # 2 = 1/iak
- tbl v10.16b, {v18.16b},v11.16b
- tbl v3.16b, {v18.16b},v4.16b // vpshufb %xmm4, %xmm10, %xmm3 # 3 = 1/jak
- tbl v11.16b, {v18.16b},v12.16b
- eor v2.16b, v2.16b, v1.16b // vpxor %xmm1, %xmm2, %xmm2 # 2 = io
- eor v10.16b, v10.16b, v9.16b
- eor v3.16b, v3.16b, v0.16b // vpxor %xmm0, %xmm3, %xmm3 # 3 = jo
- eor v11.16b, v11.16b, v8.16b
- ld1 {v16.2d}, [x9],#16 // vmovdqu (%r9), %xmm0
- cbnz w8, Ldec_2x_loop
-
- // middle of last round
- // vmovdqa 0x60(%r10), %xmm4 # 3 : sbou
- tbl v4.16b, {v22.16b}, v2.16b // vpshufb %xmm2, %xmm4, %xmm4 # 4 = sbou
- tbl v12.16b, {v22.16b}, v10.16b
- // vmovdqa 0x70(%r10), %xmm1 # 0 : sbot
- tbl v1.16b, {v23.16b}, v3.16b // vpshufb %xmm3, %xmm1, %xmm1 # 0 = sb1t
- tbl v9.16b, {v23.16b}, v11.16b
- ld1 {v2.2d}, [x11] // vmovdqa -0x160(%r11), %xmm2 # Lk_sr-Lk_dsbd=-0x160
- eor v4.16b, v4.16b, v16.16b // vpxor %xmm0, %xmm4, %xmm4 # 4 = sb1u + k
- eor v12.16b, v12.16b, v16.16b
- eor v0.16b, v1.16b, v4.16b // vpxor %xmm4, %xmm1, %xmm0 # 0 = A
- eor v8.16b, v9.16b, v12.16b
- tbl v0.16b, {v0.16b},v2.16b // vpshufb %xmm2, %xmm0, %xmm0
- tbl v1.16b, {v8.16b},v2.16b
- ret
-
-########################################################
-## ##
-## AES key schedule ##
-## ##
-########################################################
-.def _vpaes_key_preheat
- .type 32
-.endef
-.align 4
-_vpaes_key_preheat:
- adrp x10, Lk_inv
- add x10, x10, :lo12:Lk_inv
- movi v16.16b, #0x5b // Lk_s63
- adrp x11, Lk_sb1
- add x11, x11, :lo12:Lk_sb1
- movi v17.16b, #0x0f // Lk_s0F
- ld1 {v18.2d,v19.2d,v20.2d,v21.2d}, [x10] // Lk_inv, Lk_ipt
- adrp x10, Lk_dksd
- add x10, x10, :lo12:Lk_dksd
- ld1 {v22.2d,v23.2d}, [x11] // Lk_sb1
- adrp x11, Lk_mc_forward
- add x11, x11, :lo12:Lk_mc_forward
- ld1 {v24.2d,v25.2d,v26.2d,v27.2d}, [x10],#64 // Lk_dksd, Lk_dksb
- ld1 {v28.2d,v29.2d,v30.2d,v31.2d}, [x10],#64 // Lk_dkse, Lk_dks9
- ld1 {v8.2d}, [x10] // Lk_rcon
- ld1 {v9.2d}, [x11] // Lk_mc_forward[0]
- ret
-
-
-.def _vpaes_schedule_core
- .type 32
-.endef
-.align 4
-_vpaes_schedule_core:
- AARCH64_SIGN_LINK_REGISTER
- stp x29, x30, [sp,#-16]!
- add x29,sp,#0
-
- bl _vpaes_key_preheat // load the tables
-
- ld1 {v0.16b}, [x0],#16 // vmovdqu (%rdi), %xmm0 # load key (unaligned)
-
- // input transform
- mov v3.16b, v0.16b // vmovdqa %xmm0, %xmm3
- bl _vpaes_schedule_transform
- mov v7.16b, v0.16b // vmovdqa %xmm0, %xmm7
-
- adrp x10, Lk_sr // lea Lk_sr(%rip),%r10
- add x10, x10, :lo12:Lk_sr
-
- add x8, x8, x10
- cbnz w3, Lschedule_am_decrypting
-
- // encrypting, output zeroth round key after transform
- st1 {v0.2d}, [x2] // vmovdqu %xmm0, (%rdx)
- b Lschedule_go
-
-Lschedule_am_decrypting:
- // decrypting, output zeroth round key after shiftrows
- ld1 {v1.2d}, [x8] // vmovdqa (%r8,%r10), %xmm1
- tbl v3.16b, {v3.16b}, v1.16b // vpshufb %xmm1, %xmm3, %xmm3
- st1 {v3.2d}, [x2] // vmovdqu %xmm3, (%rdx)
- eor x8, x8, #0x30 // xor $0x30, %r8
-
-Lschedule_go:
- cmp w1, #192 // cmp $192, %esi
- b.hi Lschedule_256
- b.eq Lschedule_192
- // 128: fall though
-
-##
-## .schedule_128
-##
-## 128-bit specific part of key schedule.
-##
-## This schedule is really simple, because all its parts
-## are accomplished by the subroutines.
-##
-Lschedule_128:
- mov x0, #10 // mov $10, %esi
-
-Loop_schedule_128:
- sub x0, x0, #1 // dec %esi
- bl _vpaes_schedule_round
- cbz x0, Lschedule_mangle_last
- bl _vpaes_schedule_mangle // write output
- b Loop_schedule_128
-
-##
-## .aes_schedule_192
-##
-## 192-bit specific part of key schedule.
-##
-## The main body of this schedule is the same as the 128-bit
-## schedule, but with more smearing. The long, high side is
-## stored in %xmm7 as before, and the short, low side is in
-## the high bits of %xmm6.
-##
-## This schedule is somewhat nastier, however, because each
-## round produces 192 bits of key material, or 1.5 round keys.
-## Therefore, on each cycle we do 2 rounds and produce 3 round
-## keys.
-##
-.align 4
-Lschedule_192:
- sub x0, x0, #8
- ld1 {v0.16b}, [x0] // vmovdqu 8(%rdi),%xmm0 # load key part 2 (very unaligned)
- bl _vpaes_schedule_transform // input transform
- mov v6.16b, v0.16b // vmovdqa %xmm0, %xmm6 # save short part
- eor v4.16b, v4.16b, v4.16b // vpxor %xmm4, %xmm4, %xmm4 # clear 4
- ins v6.d[0], v4.d[0] // vmovhlps %xmm4, %xmm6, %xmm6 # clobber low side with zeros
- mov x0, #4 // mov $4, %esi
-
-Loop_schedule_192:
- sub x0, x0, #1 // dec %esi
- bl _vpaes_schedule_round
- ext v0.16b, v6.16b, v0.16b, #8 // vpalignr $8,%xmm6,%xmm0,%xmm0
- bl _vpaes_schedule_mangle // save key n
- bl _vpaes_schedule_192_smear
- bl _vpaes_schedule_mangle // save key n+1
- bl _vpaes_schedule_round
- cbz x0, Lschedule_mangle_last
- bl _vpaes_schedule_mangle // save key n+2
- bl _vpaes_schedule_192_smear
- b Loop_schedule_192
-
-##
-## .aes_schedule_256
-##
-## 256-bit specific part of key schedule.
-##
-## The structure here is very similar to the 128-bit
-## schedule, but with an additional "low side" in
-## %xmm6. The low side's rounds are the same as the
-## high side's, except no rcon and no rotation.
-##
-.align 4
-Lschedule_256:
- ld1 {v0.16b}, [x0] // vmovdqu 16(%rdi),%xmm0 # load key part 2 (unaligned)
- bl _vpaes_schedule_transform // input transform
- mov x0, #7 // mov $7, %esi
-
-Loop_schedule_256:
- sub x0, x0, #1 // dec %esi
- bl _vpaes_schedule_mangle // output low result
- mov v6.16b, v0.16b // vmovdqa %xmm0, %xmm6 # save cur_lo in xmm6
-
- // high round
- bl _vpaes_schedule_round
- cbz x0, Lschedule_mangle_last
- bl _vpaes_schedule_mangle
-
- // low round. swap xmm7 and xmm6
- dup v0.4s, v0.s[3] // vpshufd $0xFF, %xmm0, %xmm0
- movi v4.16b, #0
- mov v5.16b, v7.16b // vmovdqa %xmm7, %xmm5
- mov v7.16b, v6.16b // vmovdqa %xmm6, %xmm7
- bl _vpaes_schedule_low_round
- mov v7.16b, v5.16b // vmovdqa %xmm5, %xmm7
-
- b Loop_schedule_256
-
-##
-## .aes_schedule_mangle_last
-##
-## Mangler for last round of key schedule
-## Mangles %xmm0
-## when encrypting, outputs out(%xmm0) ^ 63
-## when decrypting, outputs unskew(%xmm0)
-##
-## Always called right before return... jumps to cleanup and exits
-##
-.align 4
-Lschedule_mangle_last:
- // schedule last round key from xmm0
- adrp x11, Lk_deskew // lea Lk_deskew(%rip),%r11 # prepare to deskew
- add x11, x11, :lo12:Lk_deskew
-
- cbnz w3, Lschedule_mangle_last_dec
-
- // encrypting
- ld1 {v1.2d}, [x8] // vmovdqa (%r8,%r10),%xmm1
- adrp x11, Lk_opt // lea Lk_opt(%rip), %r11 # prepare to output transform
- add x11, x11, :lo12:Lk_opt
- add x2, x2, #32 // add $32, %rdx
- tbl v0.16b, {v0.16b}, v1.16b // vpshufb %xmm1, %xmm0, %xmm0 # output permute
-
-Lschedule_mangle_last_dec:
- ld1 {v20.2d,v21.2d}, [x11] // reload constants
- sub x2, x2, #16 // add $-16, %rdx
- eor v0.16b, v0.16b, v16.16b // vpxor Lk_s63(%rip), %xmm0, %xmm0
- bl _vpaes_schedule_transform // output transform
- st1 {v0.2d}, [x2] // vmovdqu %xmm0, (%rdx) # save last key
-
- // cleanup
- eor v0.16b, v0.16b, v0.16b // vpxor %xmm0, %xmm0, %xmm0
- eor v1.16b, v1.16b, v1.16b // vpxor %xmm1, %xmm1, %xmm1
- eor v2.16b, v2.16b, v2.16b // vpxor %xmm2, %xmm2, %xmm2
- eor v3.16b, v3.16b, v3.16b // vpxor %xmm3, %xmm3, %xmm3
- eor v4.16b, v4.16b, v4.16b // vpxor %xmm4, %xmm4, %xmm4
- eor v5.16b, v5.16b, v5.16b // vpxor %xmm5, %xmm5, %xmm5
- eor v6.16b, v6.16b, v6.16b // vpxor %xmm6, %xmm6, %xmm6
- eor v7.16b, v7.16b, v7.16b // vpxor %xmm7, %xmm7, %xmm7
- ldp x29, x30, [sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-##
-## .aes_schedule_192_smear
-##
-## Smear the short, low side in the 192-bit key schedule.
-##
-## Inputs:
-## %xmm7: high side, b a x y
-## %xmm6: low side, d c 0 0
-## %xmm13: 0
-##
-## Outputs:
-## %xmm6: b+c+d b+c 0 0
-## %xmm0: b+c+d b+c b a
-##
-.def _vpaes_schedule_192_smear
- .type 32
-.endef
-.align 4
-_vpaes_schedule_192_smear:
- movi v1.16b, #0
- dup v0.4s, v7.s[3]
- ins v1.s[3], v6.s[2] // vpshufd $0x80, %xmm6, %xmm1 # d c 0 0 -> c 0 0 0
- ins v0.s[0], v7.s[2] // vpshufd $0xFE, %xmm7, %xmm0 # b a _ _ -> b b b a
- eor v6.16b, v6.16b, v1.16b // vpxor %xmm1, %xmm6, %xmm6 # -> c+d c 0 0
- eor v1.16b, v1.16b, v1.16b // vpxor %xmm1, %xmm1, %xmm1
- eor v6.16b, v6.16b, v0.16b // vpxor %xmm0, %xmm6, %xmm6 # -> b+c+d b+c b a
- mov v0.16b, v6.16b // vmovdqa %xmm6, %xmm0
- ins v6.d[0], v1.d[0] // vmovhlps %xmm1, %xmm6, %xmm6 # clobber low side with zeros
- ret
-
-
-##
-## .aes_schedule_round
-##
-## Runs one main round of the key schedule on %xmm0, %xmm7
-##
-## Specifically, runs subbytes on the high dword of %xmm0
-## then rotates it by one byte and xors into the low dword of
-## %xmm7.
-##
-## Adds rcon from low byte of %xmm8, then rotates %xmm8 for
-## next rcon.
-##
-## Smears the dwords of %xmm7 by xoring the low into the
-## second low, result into third, result into highest.
-##
-## Returns results in %xmm7 = %xmm0.
-## Clobbers %xmm1-%xmm4, %r11.
-##
-.def _vpaes_schedule_round
- .type 32
-.endef
-.align 4
-_vpaes_schedule_round:
- // extract rcon from xmm8
- movi v4.16b, #0 // vpxor %xmm4, %xmm4, %xmm4
- ext v1.16b, v8.16b, v4.16b, #15 // vpalignr $15, %xmm8, %xmm4, %xmm1
- ext v8.16b, v8.16b, v8.16b, #15 // vpalignr $15, %xmm8, %xmm8, %xmm8
- eor v7.16b, v7.16b, v1.16b // vpxor %xmm1, %xmm7, %xmm7
-
- // rotate
- dup v0.4s, v0.s[3] // vpshufd $0xFF, %xmm0, %xmm0
- ext v0.16b, v0.16b, v0.16b, #1 // vpalignr $1, %xmm0, %xmm0, %xmm0
-
- // fall through...
-
- // low round: same as high round, but no rotation and no rcon.
-_vpaes_schedule_low_round:
- // smear xmm7
- ext v1.16b, v4.16b, v7.16b, #12 // vpslldq $4, %xmm7, %xmm1
- eor v7.16b, v7.16b, v1.16b // vpxor %xmm1, %xmm7, %xmm7
- ext v4.16b, v4.16b, v7.16b, #8 // vpslldq $8, %xmm7, %xmm4
-
- // subbytes
- and v1.16b, v0.16b, v17.16b // vpand %xmm9, %xmm0, %xmm1 # 0 = k
- ushr v0.16b, v0.16b, #4 // vpsrlb $4, %xmm0, %xmm0 # 1 = i
- eor v7.16b, v7.16b, v4.16b // vpxor %xmm4, %xmm7, %xmm7
- tbl v2.16b, {v19.16b}, v1.16b // vpshufb %xmm1, %xmm11, %xmm2 # 2 = a/k
- eor v1.16b, v1.16b, v0.16b // vpxor %xmm0, %xmm1, %xmm1 # 0 = j
- tbl v3.16b, {v18.16b}, v0.16b // vpshufb %xmm0, %xmm10, %xmm3 # 3 = 1/i
- eor v3.16b, v3.16b, v2.16b // vpxor %xmm2, %xmm3, %xmm3 # 3 = iak = 1/i + a/k
- tbl v4.16b, {v18.16b}, v1.16b // vpshufb %xmm1, %xmm10, %xmm4 # 4 = 1/j
- eor v7.16b, v7.16b, v16.16b // vpxor Lk_s63(%rip), %xmm7, %xmm7
- tbl v3.16b, {v18.16b}, v3.16b // vpshufb %xmm3, %xmm10, %xmm3 # 2 = 1/iak
- eor v4.16b, v4.16b, v2.16b // vpxor %xmm2, %xmm4, %xmm4 # 4 = jak = 1/j + a/k
- tbl v2.16b, {v18.16b}, v4.16b // vpshufb %xmm4, %xmm10, %xmm2 # 3 = 1/jak
- eor v3.16b, v3.16b, v1.16b // vpxor %xmm1, %xmm3, %xmm3 # 2 = io
- eor v2.16b, v2.16b, v0.16b // vpxor %xmm0, %xmm2, %xmm2 # 3 = jo
- tbl v4.16b, {v23.16b}, v3.16b // vpshufb %xmm3, %xmm13, %xmm4 # 4 = sbou
- tbl v1.16b, {v22.16b}, v2.16b // vpshufb %xmm2, %xmm12, %xmm1 # 0 = sb1t
- eor v1.16b, v1.16b, v4.16b // vpxor %xmm4, %xmm1, %xmm1 # 0 = sbox output
-
- // add in smeared stuff
- eor v0.16b, v1.16b, v7.16b // vpxor %xmm7, %xmm1, %xmm0
- eor v7.16b, v1.16b, v7.16b // vmovdqa %xmm0, %xmm7
- ret
-
-
-##
-## .aes_schedule_transform
-##
-## Linear-transform %xmm0 according to tables at (%r11)
-##
-## Requires that %xmm9 = 0x0F0F... as in preheat
-## Output in %xmm0
-## Clobbers %xmm1, %xmm2
-##
-.def _vpaes_schedule_transform
- .type 32
-.endef
-.align 4
-_vpaes_schedule_transform:
- and v1.16b, v0.16b, v17.16b // vpand %xmm9, %xmm0, %xmm1
- ushr v0.16b, v0.16b, #4 // vpsrlb $4, %xmm0, %xmm0
- // vmovdqa (%r11), %xmm2 # lo
- tbl v2.16b, {v20.16b}, v1.16b // vpshufb %xmm1, %xmm2, %xmm2
- // vmovdqa 16(%r11), %xmm1 # hi
- tbl v0.16b, {v21.16b}, v0.16b // vpshufb %xmm0, %xmm1, %xmm0
- eor v0.16b, v0.16b, v2.16b // vpxor %xmm2, %xmm0, %xmm0
- ret
-
-
-##
-## .aes_schedule_mangle
-##
-## Mangle xmm0 from (basis-transformed) standard version
-## to our version.
-##
-## On encrypt,
-## xor with 0x63
-## multiply by circulant 0,1,1,1
-## apply shiftrows transform
-##
-## On decrypt,
-## xor with 0x63
-## multiply by "inverse mixcolumns" circulant E,B,D,9
-## deskew
-## apply shiftrows transform
-##
-##
-## Writes out to (%rdx), and increments or decrements it
-## Keeps track of round number mod 4 in %r8
-## Preserves xmm0
-## Clobbers xmm1-xmm5
-##
-.def _vpaes_schedule_mangle
- .type 32
-.endef
-.align 4
-_vpaes_schedule_mangle:
- mov v4.16b, v0.16b // vmovdqa %xmm0, %xmm4 # save xmm0 for later
- // vmovdqa .Lk_mc_forward(%rip),%xmm5
- cbnz w3, Lschedule_mangle_dec
-
- // encrypting
- eor v4.16b, v0.16b, v16.16b // vpxor Lk_s63(%rip), %xmm0, %xmm4
- add x2, x2, #16 // add $16, %rdx
- tbl v4.16b, {v4.16b}, v9.16b // vpshufb %xmm5, %xmm4, %xmm4
- tbl v1.16b, {v4.16b}, v9.16b // vpshufb %xmm5, %xmm4, %xmm1
- tbl v3.16b, {v1.16b}, v9.16b // vpshufb %xmm5, %xmm1, %xmm3
- eor v4.16b, v4.16b, v1.16b // vpxor %xmm1, %xmm4, %xmm4
- ld1 {v1.2d}, [x8] // vmovdqa (%r8,%r10), %xmm1
- eor v3.16b, v3.16b, v4.16b // vpxor %xmm4, %xmm3, %xmm3
-
- b Lschedule_mangle_both
-.align 4
-Lschedule_mangle_dec:
- // inverse mix columns
- // lea .Lk_dksd(%rip),%r11
- ushr v1.16b, v4.16b, #4 // vpsrlb $4, %xmm4, %xmm1 # 1 = hi
- and v4.16b, v4.16b, v17.16b // vpand %xmm9, %xmm4, %xmm4 # 4 = lo
-
- // vmovdqa 0x00(%r11), %xmm2
- tbl v2.16b, {v24.16b}, v4.16b // vpshufb %xmm4, %xmm2, %xmm2
- // vmovdqa 0x10(%r11), %xmm3
- tbl v3.16b, {v25.16b}, v1.16b // vpshufb %xmm1, %xmm3, %xmm3
- eor v3.16b, v3.16b, v2.16b // vpxor %xmm2, %xmm3, %xmm3
- tbl v3.16b, {v3.16b}, v9.16b // vpshufb %xmm5, %xmm3, %xmm3
-
- // vmovdqa 0x20(%r11), %xmm2
- tbl v2.16b, {v26.16b}, v4.16b // vpshufb %xmm4, %xmm2, %xmm2
- eor v2.16b, v2.16b, v3.16b // vpxor %xmm3, %xmm2, %xmm2
- // vmovdqa 0x30(%r11), %xmm3
- tbl v3.16b, {v27.16b}, v1.16b // vpshufb %xmm1, %xmm3, %xmm3
- eor v3.16b, v3.16b, v2.16b // vpxor %xmm2, %xmm3, %xmm3
- tbl v3.16b, {v3.16b}, v9.16b // vpshufb %xmm5, %xmm3, %xmm3
-
- // vmovdqa 0x40(%r11), %xmm2
- tbl v2.16b, {v28.16b}, v4.16b // vpshufb %xmm4, %xmm2, %xmm2
- eor v2.16b, v2.16b, v3.16b // vpxor %xmm3, %xmm2, %xmm2
- // vmovdqa 0x50(%r11), %xmm3
- tbl v3.16b, {v29.16b}, v1.16b // vpshufb %xmm1, %xmm3, %xmm3
- eor v3.16b, v3.16b, v2.16b // vpxor %xmm2, %xmm3, %xmm3
-
- // vmovdqa 0x60(%r11), %xmm2
- tbl v2.16b, {v30.16b}, v4.16b // vpshufb %xmm4, %xmm2, %xmm2
- tbl v3.16b, {v3.16b}, v9.16b // vpshufb %xmm5, %xmm3, %xmm3
- // vmovdqa 0x70(%r11), %xmm4
- tbl v4.16b, {v31.16b}, v1.16b // vpshufb %xmm1, %xmm4, %xmm4
- ld1 {v1.2d}, [x8] // vmovdqa (%r8,%r10), %xmm1
- eor v2.16b, v2.16b, v3.16b // vpxor %xmm3, %xmm2, %xmm2
- eor v3.16b, v4.16b, v2.16b // vpxor %xmm2, %xmm4, %xmm3
-
- sub x2, x2, #16 // add $-16, %rdx
-
-Lschedule_mangle_both:
- tbl v3.16b, {v3.16b}, v1.16b // vpshufb %xmm1, %xmm3, %xmm3
- add x8, x8, #48 // add $-16, %r8
- and x8, x8, #~(1<<6) // and $0x30, %r8
- st1 {v3.2d}, [x2] // vmovdqu %xmm3, (%rdx)
- ret
-
-
-.globl vpaes_set_encrypt_key
-
-.def vpaes_set_encrypt_key
- .type 32
-.endef
-.align 4
-vpaes_set_encrypt_key:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
- stp d8,d9,[sp,#-16]! // ABI spec says so
-
- lsr w9, w1, #5 // shr $5,%eax
- add w9, w9, #5 // $5,%eax
- str w9, [x2,#240] // mov %eax,240(%rdx) # AES_KEY->rounds = nbits/32+5;
-
- mov w3, #0 // mov $0,%ecx
- mov x8, #0x30 // mov $0x30,%r8d
- bl _vpaes_schedule_core
- eor x0, x0, x0
-
- ldp d8,d9,[sp],#16
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-.globl vpaes_set_decrypt_key
-
-.def vpaes_set_decrypt_key
- .type 32
-.endef
-.align 4
-vpaes_set_decrypt_key:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
- stp d8,d9,[sp,#-16]! // ABI spec says so
-
- lsr w9, w1, #5 // shr $5,%eax
- add w9, w9, #5 // $5,%eax
- str w9, [x2,#240] // mov %eax,240(%rdx) # AES_KEY->rounds = nbits/32+5;
- lsl w9, w9, #4 // shl $4,%eax
- add x2, x2, #16 // lea 16(%rdx,%rax),%rdx
- add x2, x2, x9
-
- mov w3, #1 // mov $1,%ecx
- lsr w8, w1, #1 // shr $1,%r8d
- and x8, x8, #32 // and $32,%r8d
- eor x8, x8, #32 // xor $32,%r8d # nbits==192?0:32
- bl _vpaes_schedule_core
-
- ldp d8,d9,[sp],#16
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-.globl vpaes_cbc_encrypt
-
-.def vpaes_cbc_encrypt
- .type 32
-.endef
-.align 4
-vpaes_cbc_encrypt:
- AARCH64_SIGN_LINK_REGISTER
- cbz x2, Lcbc_abort
- cmp w5, #0 // check direction
- b.eq vpaes_cbc_decrypt
-
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
-
- mov x17, x2 // reassign
- mov x2, x3 // reassign
-
- ld1 {v0.16b}, [x4] // load ivec
- bl _vpaes_encrypt_preheat
- b Lcbc_enc_loop
-
-.align 4
-Lcbc_enc_loop:
- ld1 {v7.16b}, [x0],#16 // load input
- eor v7.16b, v7.16b, v0.16b // xor with ivec
- bl _vpaes_encrypt_core
- st1 {v0.16b}, [x1],#16 // save output
- subs x17, x17, #16
- b.hi Lcbc_enc_loop
-
- st1 {v0.16b}, [x4] // write ivec
-
- ldp x29,x30,[sp],#16
-Lcbc_abort:
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-.def vpaes_cbc_decrypt
- .type 32
-.endef
-.align 4
-vpaes_cbc_decrypt:
- // Not adding AARCH64_SIGN_LINK_REGISTER here because vpaes_cbc_decrypt is jumped to
- // only from vpaes_cbc_encrypt which has already signed the return address.
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
- stp d8,d9,[sp,#-16]! // ABI spec says so
- stp d10,d11,[sp,#-16]!
- stp d12,d13,[sp,#-16]!
- stp d14,d15,[sp,#-16]!
-
- mov x17, x2 // reassign
- mov x2, x3 // reassign
- ld1 {v6.16b}, [x4] // load ivec
- bl _vpaes_decrypt_preheat
- tst x17, #16
- b.eq Lcbc_dec_loop2x
-
- ld1 {v7.16b}, [x0], #16 // load input
- bl _vpaes_decrypt_core
- eor v0.16b, v0.16b, v6.16b // xor with ivec
- orr v6.16b, v7.16b, v7.16b // next ivec value
- st1 {v0.16b}, [x1], #16
- subs x17, x17, #16
- b.ls Lcbc_dec_done
-
-.align 4
-Lcbc_dec_loop2x:
- ld1 {v14.16b,v15.16b}, [x0], #32
- bl _vpaes_decrypt_2x
- eor v0.16b, v0.16b, v6.16b // xor with ivec
- eor v1.16b, v1.16b, v14.16b
- orr v6.16b, v15.16b, v15.16b
- st1 {v0.16b,v1.16b}, [x1], #32
- subs x17, x17, #32
- b.hi Lcbc_dec_loop2x
-
-Lcbc_dec_done:
- st1 {v6.16b}, [x4]
-
- ldp d14,d15,[sp],#16
- ldp d12,d13,[sp],#16
- ldp d10,d11,[sp],#16
- ldp d8,d9,[sp],#16
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-.globl vpaes_ctr32_encrypt_blocks
-
-.def vpaes_ctr32_encrypt_blocks
- .type 32
-.endef
-.align 4
-vpaes_ctr32_encrypt_blocks:
- AARCH64_SIGN_LINK_REGISTER
- stp x29,x30,[sp,#-16]!
- add x29,sp,#0
- stp d8,d9,[sp,#-16]! // ABI spec says so
- stp d10,d11,[sp,#-16]!
- stp d12,d13,[sp,#-16]!
- stp d14,d15,[sp,#-16]!
-
- cbz x2, Lctr32_done
-
- // Note, unlike the other functions, x2 here is measured in blocks,
- // not bytes.
- mov x17, x2
- mov x2, x3
-
- // Load the IV and counter portion.
- ldr w6, [x4, #12]
- ld1 {v7.16b}, [x4]
-
- bl _vpaes_encrypt_preheat
- tst x17, #1
- rev w6, w6 // The counter is big-endian.
- b.eq Lctr32_prep_loop
-
- // Handle one block so the remaining block count is even for
- // _vpaes_encrypt_2x.
- ld1 {v6.16b}, [x0], #16 // Load input ahead of time
- bl _vpaes_encrypt_core
- eor v0.16b, v0.16b, v6.16b // XOR input and result
- st1 {v0.16b}, [x1], #16
- subs x17, x17, #1
- // Update the counter.
- add w6, w6, #1
- rev w7, w6
- mov v7.s[3], w7
- b.ls Lctr32_done
-
-Lctr32_prep_loop:
- // _vpaes_encrypt_core takes its input from v7, while _vpaes_encrypt_2x
- // uses v14 and v15.
- mov v15.16b, v7.16b
- mov v14.16b, v7.16b
- add w6, w6, #1
- rev w7, w6
- mov v15.s[3], w7
-
-Lctr32_loop:
- ld1 {v6.16b,v7.16b}, [x0], #32 // Load input ahead of time
- bl _vpaes_encrypt_2x
- eor v0.16b, v0.16b, v6.16b // XOR input and result
- eor v1.16b, v1.16b, v7.16b // XOR input and result (#2)
- st1 {v0.16b,v1.16b}, [x1], #32
- subs x17, x17, #2
- // Update the counter.
- add w7, w6, #1
- add w6, w6, #2
- rev w7, w7
- mov v14.s[3], w7
- rev w7, w6
- mov v15.s[3], w7
- b.hi Lctr32_loop
-
-Lctr32_done:
- ldp d14,d15,[sp],#16
- ldp d12,d13,[sp],#16
- ldp d10,d11,[sp],#16
- ldp d8,d9,[sp],#16
- ldp x29,x30,[sp],#16
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
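
Note on the deleted vpaes file above: vpaes_ctr32_encrypt_blocks measures its length argument in blocks rather than bytes and treats the last four bytes of the 16-byte IV as a big-endian 32-bit counter, bumping it by one per block (one leading block when the count is odd, then two at a time via _vpaes_encrypt_2x). The C sketch below mirrors only that counter arithmetic; it is not BoringSSL code, and the helper names are made up for illustration.

    #include <stdint.h>

    static uint32_t load_be32(const uint8_t *p) {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    }

    static void store_be32(uint8_t *p, uint32_t v) {
        p[0] = (uint8_t)(v >> 24);
        p[1] = (uint8_t)(v >> 16);
        p[2] = (uint8_t)(v >> 8);
        p[3] = (uint8_t)v;
    }

    /* Advance the CTR32 IV by |blocks| blocks, as the per-iteration
     * add / rev / "mov v.s[3]" sequence in the assembly above does.
     * Only the low 32-bit word changes (and wraps); the rest of the
     * IV is left untouched. */
    static void ctr32_add(uint8_t ivec[16], uint32_t blocks) {
        store_be32(ivec + 12, load_be32(ivec + 12) + blocks);
    }

Calling ctr32_add(iv, 1) after the odd leading block and ctr32_add(iv, 2) after each pair reproduces the counter values the 2x loop feeds into v14 and v15.
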
diff --git a/win-aarch64/crypto/test/trampoline-armv8-win.S b/win-aarch64/crypto/test/trampoline-armv8-win.S
deleted file mode 100644
index 14773e3a..00000000
--- a/win-aarch64/crypto/test/trampoline-armv8-win.S
+++ /dev/null
@@ -1,750 +0,0 @@
-// This file is generated from a similarly-named Perl script in the BoringSSL
-// source tree. Do not edit by hand.
-
-#include <openssl/asm_base.h>
-
-#if !defined(OPENSSL_NO_ASM) && defined(OPENSSL_AARCH64) && defined(_WIN32)
-#include <openssl/arm_arch.h>
-
-.text
-
-// abi_test_trampoline loads callee-saved registers from |state|, calls |func|
-// with |argv|, then saves the callee-saved registers into |state|. It returns
-// the result of |func|. The |unwind| argument is unused.
-// uint64_t abi_test_trampoline(void (*func)(...), CallerState *state,
-// const uint64_t *argv, size_t argc,
-// uint64_t unwind);
-
-.globl abi_test_trampoline
-
-.align 4
-abi_test_trampoline:
-Labi_test_trampoline_begin:
- AARCH64_SIGN_LINK_REGISTER
- // Stack layout (low to high addresses)
- // x29,x30 (16 bytes)
- // d8-d15 (64 bytes)
- // x19-x28 (80 bytes)
- // x1 (8 bytes)
- // padding (8 bytes)
- stp x29, x30, [sp, #-176]!
- mov x29, sp
-
- // Saved callee-saved registers and |state|.
- stp d8, d9, [sp, #16]
- stp d10, d11, [sp, #32]
- stp d12, d13, [sp, #48]
- stp d14, d15, [sp, #64]
- stp x19, x20, [sp, #80]
- stp x21, x22, [sp, #96]
- stp x23, x24, [sp, #112]
- stp x25, x26, [sp, #128]
- stp x27, x28, [sp, #144]
- str x1, [sp, #160]
-
- // Load registers from |state|, with the exception of x29. x29 is the
- // frame pointer and also callee-saved, but AAPCS64 allows platforms to
- // mandate that x29 always point to a frame. iOS64 does so, which means
- // we cannot fill x29 with entropy without violating ABI rules
- // ourselves. x29 is tested separately below.
- ldp d8, d9, [x1], #16
- ldp d10, d11, [x1], #16
- ldp d12, d13, [x1], #16
- ldp d14, d15, [x1], #16
- ldp x19, x20, [x1], #16
- ldp x21, x22, [x1], #16
- ldp x23, x24, [x1], #16
- ldp x25, x26, [x1], #16
- ldp x27, x28, [x1], #16
-
- // Move parameters into temporary registers.
- mov x9, x0
- mov x10, x2
- mov x11, x3
-
- // Load parameters into registers.
- cbz x11, Largs_done
- ldr x0, [x10], #8
- subs x11, x11, #1
- b.eq Largs_done
- ldr x1, [x10], #8
- subs x11, x11, #1
- b.eq Largs_done
- ldr x2, [x10], #8
- subs x11, x11, #1
- b.eq Largs_done
- ldr x3, [x10], #8
- subs x11, x11, #1
- b.eq Largs_done
- ldr x4, [x10], #8
- subs x11, x11, #1
- b.eq Largs_done
- ldr x5, [x10], #8
- subs x11, x11, #1
- b.eq Largs_done
- ldr x6, [x10], #8
- subs x11, x11, #1
- b.eq Largs_done
- ldr x7, [x10], #8
-
-Largs_done:
- blr x9
-
- // Reload |state| and store registers.
- ldr x1, [sp, #160]
- stp d8, d9, [x1], #16
- stp d10, d11, [x1], #16
- stp d12, d13, [x1], #16
- stp d14, d15, [x1], #16
- stp x19, x20, [x1], #16
- stp x21, x22, [x1], #16
- stp x23, x24, [x1], #16
- stp x25, x26, [x1], #16
- stp x27, x28, [x1], #16
-
- // |func| is required to preserve x29, the frame pointer. We cannot load
- // random values into x29 (see comment above), so compare it against the
- // expected value and zero the field of |state| if corrupted.
- mov x9, sp
- cmp x29, x9
- b.eq Lx29_ok
- str xzr, [x1]
-
-Lx29_ok:
- // Restore callee-saved registers.
- ldp d8, d9, [sp, #16]
- ldp d10, d11, [sp, #32]
- ldp d12, d13, [sp, #48]
- ldp d14, d15, [sp, #64]
- ldp x19, x20, [sp, #80]
- ldp x21, x22, [sp, #96]
- ldp x23, x24, [sp, #112]
- ldp x25, x26, [sp, #128]
- ldp x27, x28, [sp, #144]
-
- ldp x29, x30, [sp], #176
- AARCH64_VALIDATE_LINK_REGISTER
- ret
-
-
-.globl abi_test_clobber_x0
-
-.align 4
-abi_test_clobber_x0:
- AARCH64_VALID_CALL_TARGET
- mov x0, xzr
- ret
-
-
-.globl abi_test_clobber_x1
-
-.align 4
-abi_test_clobber_x1:
- AARCH64_VALID_CALL_TARGET
- mov x1, xzr
- ret
-
-
-.globl abi_test_clobber_x2
-
-.align 4
-abi_test_clobber_x2:
- AARCH64_VALID_CALL_TARGET
- mov x2, xzr
- ret
-
-
-.globl abi_test_clobber_x3
-
-.align 4
-abi_test_clobber_x3:
- AARCH64_VALID_CALL_TARGET
- mov x3, xzr
- ret
-
-
-.globl abi_test_clobber_x4
-
-.align 4
-abi_test_clobber_x4:
- AARCH64_VALID_CALL_TARGET
- mov x4, xzr
- ret
-
-
-.globl abi_test_clobber_x5
-
-.align 4
-abi_test_clobber_x5:
- AARCH64_VALID_CALL_TARGET
- mov x5, xzr
- ret
-
-
-.globl abi_test_clobber_x6
-
-.align 4
-abi_test_clobber_x6:
- AARCH64_VALID_CALL_TARGET
- mov x6, xzr
- ret
-
-
-.globl abi_test_clobber_x7
-
-.align 4
-abi_test_clobber_x7:
- AARCH64_VALID_CALL_TARGET
- mov x7, xzr
- ret
-
-
-.globl abi_test_clobber_x8
-
-.align 4
-abi_test_clobber_x8:
- AARCH64_VALID_CALL_TARGET
- mov x8, xzr
- ret
-
-
-.globl abi_test_clobber_x9
-
-.align 4
-abi_test_clobber_x9:
- AARCH64_VALID_CALL_TARGET
- mov x9, xzr
- ret
-
-
-.globl abi_test_clobber_x10
-
-.align 4
-abi_test_clobber_x10:
- AARCH64_VALID_CALL_TARGET
- mov x10, xzr
- ret
-
-
-.globl abi_test_clobber_x11
-
-.align 4
-abi_test_clobber_x11:
- AARCH64_VALID_CALL_TARGET
- mov x11, xzr
- ret
-
-
-.globl abi_test_clobber_x12
-
-.align 4
-abi_test_clobber_x12:
- AARCH64_VALID_CALL_TARGET
- mov x12, xzr
- ret
-
-
-.globl abi_test_clobber_x13
-
-.align 4
-abi_test_clobber_x13:
- AARCH64_VALID_CALL_TARGET
- mov x13, xzr
- ret
-
-
-.globl abi_test_clobber_x14
-
-.align 4
-abi_test_clobber_x14:
- AARCH64_VALID_CALL_TARGET
- mov x14, xzr
- ret
-
-
-.globl abi_test_clobber_x15
-
-.align 4
-abi_test_clobber_x15:
- AARCH64_VALID_CALL_TARGET
- mov x15, xzr
- ret
-
-
-.globl abi_test_clobber_x16
-
-.align 4
-abi_test_clobber_x16:
- AARCH64_VALID_CALL_TARGET
- mov x16, xzr
- ret
-
-
-.globl abi_test_clobber_x17
-
-.align 4
-abi_test_clobber_x17:
- AARCH64_VALID_CALL_TARGET
- mov x17, xzr
- ret
-
-
-.globl abi_test_clobber_x19
-
-.align 4
-abi_test_clobber_x19:
- AARCH64_VALID_CALL_TARGET
- mov x19, xzr
- ret
-
-
-.globl abi_test_clobber_x20
-
-.align 4
-abi_test_clobber_x20:
- AARCH64_VALID_CALL_TARGET
- mov x20, xzr
- ret
-
-
-.globl abi_test_clobber_x21
-
-.align 4
-abi_test_clobber_x21:
- AARCH64_VALID_CALL_TARGET
- mov x21, xzr
- ret
-
-
-.globl abi_test_clobber_x22
-
-.align 4
-abi_test_clobber_x22:
- AARCH64_VALID_CALL_TARGET
- mov x22, xzr
- ret
-
-
-.globl abi_test_clobber_x23
-
-.align 4
-abi_test_clobber_x23:
- AARCH64_VALID_CALL_TARGET
- mov x23, xzr
- ret
-
-
-.globl abi_test_clobber_x24
-
-.align 4
-abi_test_clobber_x24:
- AARCH64_VALID_CALL_TARGET
- mov x24, xzr
- ret
-
-
-.globl abi_test_clobber_x25
-
-.align 4
-abi_test_clobber_x25:
- AARCH64_VALID_CALL_TARGET
- mov x25, xzr
- ret
-
-
-.globl abi_test_clobber_x26
-
-.align 4
-abi_test_clobber_x26:
- AARCH64_VALID_CALL_TARGET
- mov x26, xzr
- ret
-
-
-.globl abi_test_clobber_x27
-
-.align 4
-abi_test_clobber_x27:
- AARCH64_VALID_CALL_TARGET
- mov x27, xzr
- ret
-
-
-.globl abi_test_clobber_x28
-
-.align 4
-abi_test_clobber_x28:
- AARCH64_VALID_CALL_TARGET
- mov x28, xzr
- ret
-
-
-.globl abi_test_clobber_x29
-
-.align 4
-abi_test_clobber_x29:
- AARCH64_VALID_CALL_TARGET
- mov x29, xzr
- ret
-
-
-.globl abi_test_clobber_d0
-
-.align 4
-abi_test_clobber_d0:
- AARCH64_VALID_CALL_TARGET
- fmov d0, xzr
- ret
-
-
-.globl abi_test_clobber_d1
-
-.align 4
-abi_test_clobber_d1:
- AARCH64_VALID_CALL_TARGET
- fmov d1, xzr
- ret
-
-
-.globl abi_test_clobber_d2
-
-.align 4
-abi_test_clobber_d2:
- AARCH64_VALID_CALL_TARGET
- fmov d2, xzr
- ret
-
-
-.globl abi_test_clobber_d3
-
-.align 4
-abi_test_clobber_d3:
- AARCH64_VALID_CALL_TARGET
- fmov d3, xzr
- ret
-
-
-.globl abi_test_clobber_d4
-
-.align 4
-abi_test_clobber_d4:
- AARCH64_VALID_CALL_TARGET
- fmov d4, xzr
- ret
-
-
-.globl abi_test_clobber_d5
-
-.align 4
-abi_test_clobber_d5:
- AARCH64_VALID_CALL_TARGET
- fmov d5, xzr
- ret
-
-
-.globl abi_test_clobber_d6
-
-.align 4
-abi_test_clobber_d6:
- AARCH64_VALID_CALL_TARGET
- fmov d6, xzr
- ret
-
-
-.globl abi_test_clobber_d7
-
-.align 4
-abi_test_clobber_d7:
- AARCH64_VALID_CALL_TARGET
- fmov d7, xzr
- ret
-
-
-.globl abi_test_clobber_d8
-
-.align 4
-abi_test_clobber_d8:
- AARCH64_VALID_CALL_TARGET
- fmov d8, xzr
- ret
-
-
-.globl abi_test_clobber_d9
-
-.align 4
-abi_test_clobber_d9:
- AARCH64_VALID_CALL_TARGET
- fmov d9, xzr
- ret
-
-
-.globl abi_test_clobber_d10
-
-.align 4
-abi_test_clobber_d10:
- AARCH64_VALID_CALL_TARGET
- fmov d10, xzr
- ret
-
-
-.globl abi_test_clobber_d11
-
-.align 4
-abi_test_clobber_d11:
- AARCH64_VALID_CALL_TARGET
- fmov d11, xzr
- ret
-
-
-.globl abi_test_clobber_d12
-
-.align 4
-abi_test_clobber_d12:
- AARCH64_VALID_CALL_TARGET
- fmov d12, xzr
- ret
-
-
-.globl abi_test_clobber_d13
-
-.align 4
-abi_test_clobber_d13:
- AARCH64_VALID_CALL_TARGET
- fmov d13, xzr
- ret
-
-
-.globl abi_test_clobber_d14
-
-.align 4
-abi_test_clobber_d14:
- AARCH64_VALID_CALL_TARGET
- fmov d14, xzr
- ret
-
-
-.globl abi_test_clobber_d15
-
-.align 4
-abi_test_clobber_d15:
- AARCH64_VALID_CALL_TARGET
- fmov d15, xzr
- ret
-
-
-.globl abi_test_clobber_d16
-
-.align 4
-abi_test_clobber_d16:
- AARCH64_VALID_CALL_TARGET
- fmov d16, xzr
- ret
-
-
-.globl abi_test_clobber_d17
-
-.align 4
-abi_test_clobber_d17:
- AARCH64_VALID_CALL_TARGET
- fmov d17, xzr
- ret
-
-
-.globl abi_test_clobber_d18
-
-.align 4
-abi_test_clobber_d18:
- AARCH64_VALID_CALL_TARGET
- fmov d18, xzr
- ret
-
-
-.globl abi_test_clobber_d19
-
-.align 4
-abi_test_clobber_d19:
- AARCH64_VALID_CALL_TARGET
- fmov d19, xzr
- ret
-
-
-.globl abi_test_clobber_d20
-
-.align 4
-abi_test_clobber_d20:
- AARCH64_VALID_CALL_TARGET
- fmov d20, xzr
- ret
-
-
-.globl abi_test_clobber_d21
-
-.align 4
-abi_test_clobber_d21:
- AARCH64_VALID_CALL_TARGET
- fmov d21, xzr
- ret
-
-
-.globl abi_test_clobber_d22
-
-.align 4
-abi_test_clobber_d22:
- AARCH64_VALID_CALL_TARGET
- fmov d22, xzr
- ret
-
-
-.globl abi_test_clobber_d23
-
-.align 4
-abi_test_clobber_d23:
- AARCH64_VALID_CALL_TARGET
- fmov d23, xzr
- ret
-
-
-.globl abi_test_clobber_d24
-
-.align 4
-abi_test_clobber_d24:
- AARCH64_VALID_CALL_TARGET
- fmov d24, xzr
- ret
-
-
-.globl abi_test_clobber_d25
-
-.align 4
-abi_test_clobber_d25:
- AARCH64_VALID_CALL_TARGET
- fmov d25, xzr
- ret
-
-
-.globl abi_test_clobber_d26
-
-.align 4
-abi_test_clobber_d26:
- AARCH64_VALID_CALL_TARGET
- fmov d26, xzr
- ret
-
-
-.globl abi_test_clobber_d27
-
-.align 4
-abi_test_clobber_d27:
- AARCH64_VALID_CALL_TARGET
- fmov d27, xzr
- ret
-
-
-.globl abi_test_clobber_d28
-
-.align 4
-abi_test_clobber_d28:
- AARCH64_VALID_CALL_TARGET
- fmov d28, xzr
- ret
-
-
-.globl abi_test_clobber_d29
-
-.align 4
-abi_test_clobber_d29:
- AARCH64_VALID_CALL_TARGET
- fmov d29, xzr
- ret
-
-
-.globl abi_test_clobber_d30
-
-.align 4
-abi_test_clobber_d30:
- AARCH64_VALID_CALL_TARGET
- fmov d30, xzr
- ret
-
-
-.globl abi_test_clobber_d31
-
-.align 4
-abi_test_clobber_d31:
- AARCH64_VALID_CALL_TARGET
- fmov d31, xzr
- ret
-
-
-.globl abi_test_clobber_v8_upper
-
-.align 4
-abi_test_clobber_v8_upper:
- AARCH64_VALID_CALL_TARGET
- fmov v8.d[1], xzr
- ret
-
-
-.globl abi_test_clobber_v9_upper
-
-.align 4
-abi_test_clobber_v9_upper:
- AARCH64_VALID_CALL_TARGET
- fmov v9.d[1], xzr
- ret
-
-
-.globl abi_test_clobber_v10_upper
-
-.align 4
-abi_test_clobber_v10_upper:
- AARCH64_VALID_CALL_TARGET
- fmov v10.d[1], xzr
- ret
-
-
-.globl abi_test_clobber_v11_upper
-
-.align 4
-abi_test_clobber_v11_upper:
- AARCH64_VALID_CALL_TARGET
- fmov v11.d[1], xzr
- ret
-
-
-.globl abi_test_clobber_v12_upper
-
-.align 4
-abi_test_clobber_v12_upper:
- AARCH64_VALID_CALL_TARGET
- fmov v12.d[1], xzr
- ret
-
-
-.globl abi_test_clobber_v13_upper
-
-.align 4
-abi_test_clobber_v13_upper:
- AARCH64_VALID_CALL_TARGET
- fmov v13.d[1], xzr
- ret
-
-
-.globl abi_test_clobber_v14_upper
-
-.align 4
-abi_test_clobber_v14_upper:
- AARCH64_VALID_CALL_TARGET
- fmov v14.d[1], xzr
- ret
-
-
-.globl abi_test_clobber_v15_upper
-
-.align 4
-abi_test_clobber_v15_upper:
- AARCH64_VALID_CALL_TARGET
- fmov v15.d[1], xzr
- ret
-
-#endif // !OPENSSL_NO_ASM && defined(OPENSSL_AARCH64) && defined(_WIN32)
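
Note on the deleted trampoline file above: abi_test_trampoline exists so ABI tests can load the AArch64 callee-saved registers (d8-d15 and x19-x28, with x29 checked separately against the frame pointer) with known values, run a function, and then read them back to see whether the function preserved them. The C sketch below only illustrates the shape of such a check; the struct layout and helper are hypothetical stand-ins, not BoringSSL's real CallerState or abi_test API.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical stand-in for the register block the trampoline loads from
     * and stores back to |state|: d8-d15 followed by x19-x28, in the order
     * the assembly above reads them. (The real state also tracks x29, which
     * the trampoline handles specially.) */
    struct caller_state {
        uint64_t d[8];
        uint64_t x[10];
    };

    /* Prototype of the (deleted) assembly entry point, per its header
     * comment; it cannot be linked without that assembly. */
    uint64_t abi_test_trampoline(void (*func)(void), struct caller_state *state,
                                 const uint64_t *argv, size_t argc,
                                 uint64_t unwind);

    /* Fill the state with sentinel bit patterns, run |func| through the
     * trampoline, and report whether every tracked callee-saved register
     * came back unchanged. */
    static int preserves_callee_saved(void (*func)(void)) {
        struct caller_state before, after;
        memset(&before, 0xa5, sizeof(before));
        after = before;
        abi_test_trampoline(func, &after, NULL, 0, /*unwind=*/0);
        return memcmp(&before, &after, sizeof(before)) == 0;
    }
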