Age | Commit message | Author |
|
The outgoing license was MIT only. The new dual license allows
using the code under Apache-2.0 WITH LLVM-exception license too.
|
|
Scripted copyright year updates based on git committer date.
|
|
This implementation is a wrapper around the scalar pow with the appropriate
call ABI. As such it is not expected to be faster than scalar calls; the
new double-precision vector pow symbols are provided for completeness.
|
|
Same design as in expf. Worst-case error of __v_exp2f and __v_exp2f_1u
is 1.96 and 0.88 ulp respectively.
It is not clear whether round/convert instructions or the +- Shift
approach is better. For expf the latter and for exp2f the former seems
more consistently faster, but both options are kept in the code for now.
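The "+- Shift" alternative mentioned above can be sketched in scalar C as
follows. This is a minimal illustration, not the code from the patch: the
function names are hypothetical, the constant 0x1.8p23f is an assumed
choice of Shift, and the trick is only valid for |x| < 2^22 in the default
round-to-nearest mode without -ffast-math.

```c
#include <stdint.h>
#include <string.h>

/* Assumed shift constant: 1.5 * 2^23. Floats of this magnitude have a
   spacing of 1, so adding it rounds x to the nearest integer. */
static const float Shift = 0x1.8p23f;

/* Round to nearest integer (ties to even) using only an add and a sub. */
float round_by_shift(float x)
{
    volatile float z = x + Shift; /* volatile keeps the rounding step */
    return z - Shift;
}

/* The rounded integer is also available in the low mantissa bits of
   x + Shift, which avoids a separate float-to-int conversion. */
int32_t int_by_shift(float x)
{
    float z = x + Shift;
    uint32_t zbits, sbits;
    memcpy(&zbits, &z, sizeof zbits);
    memcpy(&sbits, &Shift, sizeof sbits);
    return (int32_t)(zbits - sbits);
}
```

A round/convert based version would instead use a rounding instruction
followed by a float-to-integer conversion; which is faster depends on the
target, as noted above.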
|
|
Worst-case error is 1.67 ulp; the polynomial was generated by sollya.
Uses a 128 entry (2KB) lookup table. Special cases fall back to scalar
log call.
|
|
Worst-case error is 3.5 ulp; the polynomial was generated by sollya.
For large (>2^23) and special inputs the code falls back to scalar
sin and cos.
|
|
Essentially the scalar powf algorithm is used for each element in the
vector, just inlined for better scheduling and simpler special case
handling. The log polynomial is smaller as less accuracy is enough.
Worst-case error is 2.6 ulp.
|
|
The polynomials were produced by searching the coefficient space using
heuristics and ideas from https://arxiv.org/abs/1508.03211
The worst-case error is 1.886 ulp; large inputs (> 2^20) and other
special cases use scalar sinf and cosf.
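The per-lane fallback for large and special inputs can be sketched in
portable C as below. This is only an illustration under stated
assumptions: vec4f, v_apply, is_special and the two demo functions are
hypothetical stand-ins (the real code works on AdvSIMD registers and
calls sinf/cosf), and only the 2^20 threshold comes from the message
above.

```c
#include <stdint.h>
#include <string.h>

#define LANES 4
/* Hypothetical stand-in for a 4 x float AdvSIMD vector. */
typedef struct { float lane[LANES]; } vec4f;

/* True when |x| > 2^20 or x is inf/nan, tested on the bit pattern. */
static int is_special(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return (u & 0x7fffffff) > 0x49800000; /* bits of 0x1p20f */
}

/* Placeholder fast path and scalar fallback, for demonstration only. */
static float demo_fast(float x) { return 2.0f * x; }
static float demo_scalar(float x) { (void)x; return -1.0f; }

/* Each lane is decided on its own value only, so the result for a value
   does not depend on its position or on the other lanes. */
vec4f v_apply(vec4f x, float (*fast)(float), float (*scalar)(float))
{
    vec4f r;
    for (int i = 0; i < LANES; i++)
        r.lane[i] = is_special(x.lane[i]) ? scalar(x.lane[i])
                                          : fast(x.lane[i]);
    return r;
}
```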
|
|
The polynomial was produced by searching the coefficient space using
heuristics and ideas from https://arxiv.org/abs/1508.03211
The worst-case error is 3.34 ulp; subnormal range inputs and other
special cases use scalar logf.
|
|
Vector math routines are added to the same libmathlib library as scalar
ones. The difficulty is that they are not always available: the external
ABI depends on the compiler version used for the build. Currently only
aarch64 AdvSIMD is supported. For a scalar math function foo there are
4 new sets of symbols:
  __s_foo is a scalar function with identical result to the vector one,
  __v_foo is a vector function using the base PCS,
  __vn_foo uses the vector PCS and
  _ZGV*_foo is the vector ABI symbol alias of __vn_foo.
The test and benchmark code was extended to handle vector functions.
Vector functions aim for < 5 ulp worst case error, only support nearest
rounding mode and don't support floating-point exceptions. Vector
functions may call scalar functions to handle special cases, but for a
single value they should return the same result independently of values
in other vector lanes or the position of the value in the vector.
The __v_expf and __v_expf_1u polynomials were produced by searching the
coefficient space with some heuristics and ideas from
https://arxiv.org/abs/1508.03211
Their worst case error is 1.95 and 0.866 ulp respectively.
The exp polynomial was produced by sollya; it uses a 128 element (1KB)
lookup table and has 2.38 ulp worst case error.
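For reference, an ulp error figure such as the ones quoted above can be
computed roughly as follows. This is a sketch under an assumed
convention (error measured in units of the float spacing at the
reference value, normal range only); the helper names are hypothetical
and this is not claimed to be how the test code measures it.

```c
#include <stdint.h>
#include <string.h>

/* Spacing of single precision values near ref. Only valid when ref is a
   normal float with biased exponent above 23 (no subnormal, inf, nan). */
static double ulp_of(float ref)
{
    uint32_t u, b;
    float r;
    memcpy(&u, &ref, sizeof u);
    b = (((u >> 23) & 0xff) - 23) << 23; /* 2^(e - 23) as float bits */
    memcpy(&r, &b, sizeof r);
    return r;
}

/* Error of a float result against a higher precision reference value,
   expressed in ulps of the reference rounded to float. */
double ulp_error(float got, double want)
{
    double d = (double)got - want;
    if (d < 0)
        d = -d;
    return d / ulp_of((float)want);
}
```

With this convention a correctly rounded result has at most 0.5 ulp
error, and the < 5 ulp target bounds the same quantity over all inputs.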
|
|
Removed tanf declaration since the implementation got removed too.
|
|
|
|
Update mathlib.h to use GNU style declarations and add missing pow.
|
|
A similar algorithm is used as in log, but there are more operations
(and more error) due to the 1/ln2 multiplier.
There is a separate code path when the fma instruction is not available
for computing x/c - 1 precisely, for which the table size is doubled,
and to compute (x/c - 1)/ln2 precisely.
The worst case error is 0.547 ULP (0.55 without fma), the read only
global data size is 1168 bytes (2192 without fma). The non-nearest
rounding error is less than 1 ULP.
Improvements on Cortex-A72 compared to current glibc master:
log latency: 2.04x
log throughput: 1.87x
|
|
|
|
This patch is a complete rewrite of sinf, cosf and sincosf. The new version
is significantly faster, as well as simple and accurate.
The worst-case error is 0.56072 ULP, maximum relative error is 0.5303p-23 over
all 4 billion inputs. In non-nearest rounding modes the error is 1 ULP.
The algorithm uses 3 main cases: small inputs which don't need argument
reduction, small inputs which need a simple range reduction and large inputs
requiring complex range reduction. The code uses approximate integer
comparisons to quickly decide between these cases - on some targets this may
be slow, so this can be configured to use floating point comparisons.
The small range reducer uses a single reduction step to handle values up to
120.0. It is fastest on targets which support inlined round instructions.
The large range reducer uses integer arithmetic for simplicity. It does a
32x96 bit multiply to compute a 64-bit modulo result. This is more than
accurate enough to handle the worst-case cancellation for values close to
an integer multiple of PI/4. It could be further optimized, however it is
already much faster than necessary.
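The three-way case selection with approximate integer comparisons can be
sketched as below. This is an illustration only: the enum, the function
names and the use of PI/4 as the first threshold are assumptions, while
the 120.0 bound is taken from the description above.

```c
#include <stdint.h>
#include <string.h>

enum sinf_path {
    PATH_NO_REDUCTION,     /* small input, polynomial applied directly */
    PATH_SIMPLE_REDUCTION, /* single-step range reduction */
    PATH_LARGE_REDUCTION   /* integer based reduction for large inputs */
};

/* Top 12 bits of the float representation with the sign masked off. */
static uint32_t abstop12(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return (u >> 20) & 0x7ff;
}

/* Comparing only the top bits is approximate: inputs slightly below a
   threshold that share its top 12 bits take the next, more general path,
   which is still correct for them. */
enum sinf_path classify_sinf_input(float x)
{
    const float pio4 = 0x1.921fb6p-1f; /* assumed first threshold */

    if (abstop12(x) < abstop12(pio4))
        return PATH_NO_REDUCTION;
    if (abstop12(x) < abstop12(120.0f))
        return PATH_SIMPLE_REDUCTION;
    return PATH_LARGE_REDUCTION;
}
```

On targets where moving a float to an integer register is slow, the same
decisions can be made with floating point comparisons on fabsf(x).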
|
|
Use standard math symbols so it's easy to override libm functions.
The arm_math.h header is no longer necessary, user code can just use
math.h, but keep a header for freestanding code.
|
|
Use standard name for the LICENSE file.
Use consistent license text across files:
- "ARM" is changed to Arm,
- "All Rights Reserved" is dropped (not needed),
- "This file is part of.." is dropped,
- Text is formatted as is recommended by the LICENSE file.
|
|
|
|
|