author | Rose, James <james.rose@intel.com> | 2014-04-22 12:08:06 +0800 |
---|---|---|
committer | Xiaofei Wan <xiaofei.wan@intel.com> | 2014-04-22 12:08:27 +0800 |
commit | 7b7060c61e4182b29186849c5a857ea5f0898e56 (patch) | |
tree | 329c1d4403c3542757db63fb1fb230e74f78b0c1 /cpu_ref/rsCpuIntrinsicConvolve5x5.cpp | |
parent | 33c565f4766f961f4302c3e007a5ceaee312cc8c (diff) | |
download | rs-7b7060c61e4182b29186849c5a857ea5f0898e56.tar.gz |
Improve RS intrinsics performance.
Renderscript CPU performance for intrinsics is poor on x86 platforms.
In many cases it is significantly slower, even with SIMD intrinsics. The current x86 implementation
uses full 32-bit multiplies, which are not well supported on current Atom platforms.
This patch uses the pmaddwd instruction (16-bit multiply with 32-bit add) where appropriate.
It also adds Atom-specific optimizations to improve RS intrinsics performance.
Change-Id: Ifc01b5a6d6f7430d2dc218f1618b9df3fb7937fe
Signed-off-by: Xiaofei Wan <xiaofei.wan@intel.com>
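
The commit message relies on pmaddwd, which multiplies signed 16-bit lanes and adds each adjacent pair of 32-bit products (the SSE2 intrinsic `_mm_madd_epi16` maps to it). The following is a portable scalar model of that operation, not the patch's kernel; the function name `maddPairs` is hypothetical and used only for illustration.

```cpp
#include <array>
#include <cstdint>

// Scalar model of one pmaddwd on eight signed 16-bit lanes (illustrative only).
// out[i] = a[2i] * b[2i] + a[2i+1] * b[2i+1], each product widened to 32 bits.
std::array<int32_t, 4> maddPairs(const std::array<int16_t, 8>& a,
                                 const std::array<int16_t, 8>& b) {
    std::array<int32_t, 4> out{};
    for (int i = 0; i < 4; ++i) {
        const int32_t lo = static_cast<int32_t>(a[2 * i]) * b[2 * i];
        const int32_t hi = static_cast<int32_t>(a[2 * i + 1]) * b[2 * i + 1];
        // Add as unsigned so the one overflowing case (all four inputs equal
        // to -32768) wraps like the hardware instead of being undefined.
        out[i] = static_cast<int32_t>(static_cast<uint32_t>(lo) +
                                      static_cast<uint32_t>(hi));
    }
    return out;
}
```

Because each pair of products is summed in the same step, a convolution whose coefficients fit in 16 bits can accumulate two taps per lane without a separate 32-bit multiply, which is the trade the message describes.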
Diffstat (limited to 'cpu_ref/rsCpuIntrinsicConvolve5x5.cpp')
-rw-r--r-- | cpu_ref/rsCpuIntrinsicConvolve5x5.cpp | 11 |
1 files changed, 11 insertions, 0 deletions
diff --git a/cpu_ref/rsCpuIntrinsicConvolve5x5.cpp b/cpu_ref/rsCpuIntrinsicConvolve5x5.cpp
index 11dda592..bcffe9ac 100644
--- a/cpu_ref/rsCpuIntrinsicConvolve5x5.cpp
+++ b/cpu_ref/rsCpuIntrinsicConvolve5x5.cpp
@@ -378,6 +378,17 @@ void RsdCpuScriptIntrinsicConvolve5x5::kernelU4(const RsForEachStubParamStruct *
         out++;
         x1++;
     }
+#if defined(ARCH_X86_HAVE_SSSE3)
+    // for x86 SIMD, require minimum of 7 elements (4 for SIMD,
+    // 3 for end boundary where x may hit the end boundary)
+    if (gArchUseSIMD &&((x1 + 6) < x2)) {
+        // subtract 3 for end boundary
+        uint32_t len = (x2 - x1 - 3) >> 2;
+        rsdIntrinsicConvolve5x5_K(out, py0, py1, py2, py3, py4, cp->mIp, len);
+        out += len << 2;
+        x1 += len << 2;
+    }
+#endif
 #if defined(ARCH_ARM_HAVE_VFP)
     if(gArchUseSIMD && ((x1 + 3) < x2)) {
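
The length arithmetic in the added x86 block can be checked in isolation. Below is a minimal sketch that mirrors only the guard and the two shifts from the hunk; `SimdSplit` and `splitForSimd` are hypothetical names, the `gArchUseSIMD` check is omitted, and the call to `rsdIntrinsicConvolve5x5_K` is not modeled.

```cpp
#include <cstdint>

// Result of the split: how many 4-pixel groups the SIMD kernel would take,
// and where x1 ends up afterwards (illustrative names, not from the patch).
struct SimdSplit {
    uint32_t blocks;  // value passed as `len` in the hunk
    uint32_t nextX;   // x1 after `x1 += len << 2`
};

SimdSplit splitForSimd(uint32_t x1, uint32_t x2) {
    SimdSplit s{0, x1};
    // Same guard as the hunk: at least 7 pixels must remain.
    if ((x1 + 6) < x2) {
        // Hold back 3 pixels for the end boundary, then count groups of 4.
        s.blocks = (x2 - x1 - 3) >> 2;
        s.nextX = x1 + (s.blocks << 2);
    }
    return s;
}
```

For example, with `x1 = 2` and `x2 = 100` this gives 23 groups and leaves `x1` at 94. Whenever the guard passes, between 3 and 6 pixels remain after the SIMD step, so the pixels nearest the end boundary are never handed to the SIMD kernel.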