path: root/mali_kbase
Age | Commit message | Author
2023-06-20 | kbase: csf: Reboot on failed GPU reset [tags: android-u-beta-4_r0.4, android-u-beta-4_r0.3, android-u-beta-4_r0.2, android-gs-pantah-5.10-u-beta4, android-gs-lynx-5.10-u-beta4, android-gs-felix-5.10-u-beta4] | Varad Gautam
If reset failed, both KMD and the hardware are in an unrecoverable state. Any future attempts to process work or reset the GPU will fail, and it may take a long time (30mins) for the device to reboot and return to normal. Collect a system ramdump and reboot the device immediately when reset fails. Bug: 276855700 Test: Simulated failed reset and checked that a ramdump was generated. Change-Id: Iba901e1654d150b834303e0caa8fba2dc468b5ac Signed-off-by: Varad Gautam <varadgautam@google.com>
2023-06-15 | Add missing hwaccess_lock around atom_flags updates. | Michael Stokes
Commit 0935897 (pa/1761483) added two additional katom flags, but updates to these new flags were not protected by hwaccess_lock, and could thus race with other updates and ultimately corrupt atom_flags. Bug: 265931966 Test: SST soak test Change-Id: I95acc5e335d8013394b11149abf5d9b793648c6f
2023-06-15 | GPUCORE-35754: Add barrier before updating GLB_DB_REQ to ring CSG DB | Suzanne Candanedo
GPUCORE-35974: Add Memory Barrier between CS_REQ/ACK and CSG_DB_REQ/ACK The access to GLB_DB_REQ/ACK needs to be ordered with respect to CSG_REQ/ACK and CSG_DB_REQ/ACK to avoid a scenario where a CSI request overlaps with a CSG request or 2 CSI requests overlap and FW ends up missing the 2nd request. Memory barrier is required, both on Host and FW side, to guarantee the ordering. Bug: 286056062 Test: SST soak test Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/4688 Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/5435 Change-Id: I4de23e3f37b81749c6d668952b4f8dd21c669fea
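Illustration (not part of the commit): the ordering requirement can be sketched in user space with C11 atomics. The kernel driver uses its own barrier primitives, which are not shown here. The `fake_iface` struct, the `ring_csg_doorbell` helper and the toggle-a-bit doorbell convention are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the memory shared between host and firmware. */
struct fake_iface {
    _Atomic uint32_t csg_req;    /* models a CSG_REQ-style request word */
    _Atomic uint32_t glb_db_req; /* models a GLB_DB_REQ-style doorbell word */
};

/* Publish a request, then ring the doorbell, keeping the two stores ordered. */
void ring_csg_doorbell(struct fake_iface *iface, uint32_t new_req, uint32_t csg_bit)
{
    assert(iface != NULL);

    atomic_store_explicit(&iface->csg_req, new_req, memory_order_relaxed);

    /*
     * Fence: the request store above must not be reordered after the
     * doorbell store below. The reader needs a matching fence on its side.
     */
    atomic_thread_fence(memory_order_seq_cst);

    /* Assumed convention for this sketch: the doorbell is rung by toggling a bit. */
    uint32_t db = atomic_load_explicit(&iface->glb_db_req, memory_order_relaxed);
    atomic_store_explicit(&iface->glb_db_req, db ^ csg_bit, memory_order_relaxed);
}
```

Without the fence, nothing in the C memory model would stop the doorbell store from becoming visible before the request it announces.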
2023-06-11 | Merge android13-gs-pixel-5.10-tm-qpr3 into android13-gs-pixel-5.10-udc | PixelBot AutoMerger
SBMerger: 526756187 Change-Id: Ibe152c3a5f6bde3b32b1349e33175811bc895c38 Signed-off-by: SecurityBot <android-nexus-securitybot@system.gserviceaccount.com>
2023-06-06 | GPUCORE-36682 Lock MMU while disabling AS to prevent use after free [tags: android-13.0.0_r0.117, android-13.0.0_r0.116, android-13.0.0_r0.115, android-13.0.0_r0.114, android-13.0.0_r0.113, android-13.0.0_r0.112, android-gs-felix-5.10-android13-qpr3] | Suzanne Candanedo
During an invalid GPU page fault, kbase will try to flush the GPU cache and disable the faulting address space (AS). There is a small window between flushing of the GPU L2 cache (MMU resumes) and when the AS is disabled where existing jobs on the GPU may access memory for that AS, dirtying the GPU cache. This is a problem as the kctx->as_nr is marked as KBASEP_AS_NR_INVALID and thus no cache maintenance will be performed on the AS of the faulty context when cleaning up the csg_slot and releasing the context. This patch addresses that issue by: 1. locking the AS via a GPU command 2. flushing the cache 3. disabling the AS 4. unlocking the AS This ensures that any jobs remaining on the GPU will not be able to access the memory due to the locked AS. Once the AS is unlocked, any memory access will fail as the AS is now disabled. The issue only happens on CSF GPUs. To avoid any issues, the code path for non-CSF GPUs is left undisturbed. (cherry picked from commit 566789dffda3dfec00ecf00f9819e7a515fb2c61) Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/5071 Bug: 274014055 Change-Id: I2028182878b4f88505cc135a5f53ae4c7e734650
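Illustration (not part of the commit): the four-step sequence above can be modelled in user space to show that the address space is never reachable once the sequence starts. `fake_as`, `as_access_possible` and `disable_faulting_as` are hypothetical names for this sketch and do not correspond to kbase functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum as_step { AS_STEP_LOCK, AS_STEP_FLUSH, AS_STEP_DISABLE, AS_STEP_UNLOCK, AS_STEP_COUNT };

/* Hypothetical model of one GPU address space (AS); not a kbase structure. */
struct fake_as {
    bool locked;
    bool enabled;
    enum as_step order[AS_STEP_COUNT];   /* steps in the order they ran */
    bool access_possible[AS_STEP_COUNT]; /* could a job reach memory after each step? */
    size_t nsteps;
};

/* In this model a job can reach memory only through an enabled, unlocked AS. */
bool as_access_possible(const struct fake_as *as)
{
    return as->enabled && !as->locked;
}

/* Record which step ran and whether memory was reachable right after it. */
static void as_record(struct fake_as *as, enum as_step step)
{
    assert(as->nsteps < AS_STEP_COUNT);
    as->order[as->nsteps] = step;
    as->access_possible[as->nsteps] = as_access_possible(as);
    as->nsteps++;
}

/* Lock, flush, disable, unlock: the order described in the commit message. */
void disable_faulting_as(struct fake_as *as)
{
    as->locked = true;
    as_record(as, AS_STEP_LOCK);

    /* The cache flush would be issued here, while the AS is still locked. */
    as_record(as, AS_STEP_FLUSH);

    as->enabled = false;
    as_record(as, AS_STEP_DISABLE);

    as->locked = false;
    as_record(as, AS_STEP_UNLOCK);
}
```

In this model the AS goes straight from locked to disabled, so there is no point after the flush at which a job could dirty the cache again.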
2023-05-30 | GPUCORE-37961 Deadlock issue due to lock ordering issue | kirdev01
This patch addresses the deadlock condition caused by a circular locking dependency between hwaccess_lock and clk_rtm->lock. hwaccess_lock needs to be taken before clk_rtm->lock to avoid the circular dependency. Change-Id: I1064dbbac7800282bf3a1ac167c9c476177aefd8 (cherry picked from commit e0dfe9669c3456ada4b860f6ba9859c59ffec9a7) Bug: 274687461 Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/5258
2023-05-24 | Make sure jobs are flushed before kbasep_platform_context_term | Mattias Simonsson
If kbase_release is called while jobs are in progress, the driver will start by calling kbasep_platform_context_term before waiting for jobs to finish in kbase_context_flush_jobs. When the jobs do finish, the driver will call kbasep_platform_event_work_end, which leads to issues since the platform callback has already cleaned up resources for the kbase_context. Make sure kbase_context_flush_jobs is called before kbasep_platform_context_term. Test: start/stop processes over and over Bug: 278366794 Change-Id: Iee0297f4b64a3f6b59a5df0c26e46d446257a652
2023-05-23 | [Official] MIDCET-4546, GPUCORE-37946: Synchronize GPU cache flush cmds with silent reset on GPU power up | Suzanne Candanedo
Commands for GPU cache maintenance and TLB invalidation were sent after acquiring 'hwaccess_lock' and checking if the 'gpu_powered' flag is set. The combination of lock and the flag ensured that GPU registers remained accessible whilst the commands were in progress. If the flag was not set then the GPU power up was not performed and the commands were rightfully skipped. The 'gpu_powered' flag is set immediately after the Top-level power up of GPU is done by the platform specific power_on_callback() and so the registers can be safely accessed. If the callback returns 1 then a silent soft-reset of the GPU is performed after setting the flag. This led to a race between the cache maintenance commands and the soft reset of GPU, due to which the commands did not complete or got lost and there was a timeout. This commit replaces the 'gpu_powered' flag with the 'gpu_ready' flag as the latter is set after the soft-reset is done and all the in-use GPU address spaces have been enabled. It is okay to skip the commands when the flag is false, as L2 cache would be in powered down state. The page migrate function is also updated to use 'gpu_ready' flag as that was also affected by the similar race with silent reset in GPUCORE-35861 and 'kbdev->pm.lock' had to be used. Change-Id: I4cefe3add2863d7b29f111d437061031b66e7080 (cherry picked from commit e31494f5b7b9e9101aab4bd75fa4dc7d7f47b66a) Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/5284 Bug: 281540759
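Illustration (not part of the commit): a small model of the power-up phases shows why gating on 'gpu_ready' closes the race window. The flag names come from the commit message. The `fake_pm_state` struct, the phase enum and the helper functions are assumptions for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical power-up phases, in the order the commit message describes. */
enum powerup_phase {
    PHASE_OFF,          /* GPU is powered down */
    PHASE_TOP_LEVEL_ON, /* power_on_callback() done, silent soft reset may be running */
    PHASE_RESET_DONE    /* soft reset finished, in-use address spaces enabled */
};

/* Hypothetical subset of the power-management state. */
struct fake_pm_state {
    bool gpu_powered;
    bool gpu_ready;
};

void set_phase(struct fake_pm_state *pm, enum powerup_phase phase)
{
    assert(pm != NULL);
    pm->gpu_powered = (phase != PHASE_OFF);
    pm->gpu_ready = (phase == PHASE_RESET_DONE);
}

/* Old gate: commands may be issued while the silent soft reset is in flight. */
bool may_issue_cache_cmd_old(const struct fake_pm_state *pm)
{
    return pm->gpu_powered;
}

/* New gate: commands are skipped until the reset has completed. */
bool may_issue_cache_cmd_new(const struct fake_pm_state *pm)
{
    return pm->gpu_ready;
}
```

The two gates differ only in `PHASE_TOP_LEVEL_ON`, which is the window in which the commit message says commands could be lost.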
2023-05-18 | Merge android13-gs-pixel-5.10-tm-qpr3 into android13-gs-pixel-5.10-udc | PixelBot AutoMerger
Bug: 281607159 SBMerger: 526756187 Change-Id: I15bc929d24d73e636f12cf880126e8192ba7d9cb Signed-off-by: SecurityBot <android-nexus-securitybot@system.gserviceaccount.com>
2023-05-11 | mali_kbase: hold GPU utilization for premature update. [tags: android-u-beta-3_r0.3, android-u-beta-3_r0.2, android-u-beta-2.1_r0.4, android-u-beta-2.1_r0.3, android-u-beta-2.1_r0.2, android-gs-raviole-5.10-u-beta3, android-gs-raviole-5.10-u-beta2, android-gs-pantah-5.10-u-beta2, android-gs-bluejay-5.10-u-beta3, android-gs-bluejay-5.10-u-beta2] | Wei Wang
The GPU internal counter is updated every IPA_CONTROL_TIMER_DEFAULT_VALUE_MS milliseconds. If a utilization update occurs prematurely and the counter has not been updated, the same counter value will be obtained, resulting in a difference of zero. To handle this scenario, this change will skip the utilization update if the counter difference is zero and the update occurred less than 1.5 times the internal update period (IPA_CONTROL_TIMER_DEFAULT_VALUE_MS) after the previous one. Bug: 277649158 Test: boot and trace Change-Id: I6e3063355f560a2872297fd32d66be8a468cdf79 Signed-off-by: Wei Wang <wvw@google.com>
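Illustration (not part of the commit): the skip condition can be written as a pure function. `TIMER_PERIOD_MS` stands in for IPA_CONTROL_TIMER_DEFAULT_VALUE_MS with an assumed value of 10. The function name is hypothetical, and measuring the elapsed time from the previous update is this sketch's reading of the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration only; the real constant lives in the driver. */
#define TIMER_PERIOD_MS 10u

/*
 * Skip the utilization update when the counter did not move and less than
 * 1.5 periods have passed. The 1.5x comparison is done in integers:
 * elapsed < 1.5 * period  <=>  2 * elapsed < 3 * period.
 */
bool should_skip_util_update(uint64_t counter_diff, uint64_t elapsed_ms)
{
    return counter_diff == 0 && elapsed_ms * 2 < (uint64_t)TIMER_PERIOD_MS * 3;
}
```

With the assumed 10 ms period, a zero difference is skipped at 14 ms but accepted at 15 ms, since by then the counter should have advanced at least once.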
2023-05-09 | mali_kbase: Remove incorrect WARN() | Debarshi Dutta
The WARN() call at the beginning of the function schedule_on_tick() is incorrect, as kbase_gpu_interrupt() might enqueue another tick_work(2) into the scheduler before the already in-flight worker tick_work(1) sets the tick_timer_active variable to true. This could result in a condition where the hrtimer still hasn't expired and tick_work(1) starts executing, resulting in the WARN_ON() being fired. The timer works asynchronously with tick_work(), and hence this warning can be removed from here. Bug 207824944 Change-Id: I873624c76b0de102bbcdd451a8402cb1c096edda
2023-05-09 | MIDCET-4324/GPUCORE-35611 Unmapping of aliased sink-page memory | Suzanne Candanedo
Aliased regions containing the BASE_MEM_WRITE_ALLOC_PAGES_HANDLE MMU sink-page were not previously being unmapped correctly, in particular the PGD entries for these pages. This change addresses that issue. Further, care is taken to ensure the flush_pa_range path operates correctly, for applicable GPUs. Also updated various WARN_ONs to WARN_ONCEs in the MMU layer, in places where these could potentially occur rapidly and in large numbers, to reduce the chance of the kind of system stress this issue could have caused. GPUCORE-36048 Remove SAME_VA flag from regular allocation This patchset removes the SAME_VA flag from the regular allocation done in the defect test for GPUCORE-35611. The test was failing on 32-bit systems because there was no way to enforce that the aliased memory and the regular allocation would fall into the same region, and thus a later assumption in the test would not hold. Change-Id: Ie665fb9330a7338b7e148d1c1db13fe3cc98ee5c (cherry picked from commit 823c7b2de1933ca42cf179862d033d79d1289073) Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/4800 Bug: 260122837
2023-05-09 | [Official] MIDCET-4458, GPUCORE-36402: Check for process exit before page alloc from kthread | Suzanne Candanedo
The backing pages for native GPU allocations aren't always allocated in the ioctl context. A JIT_ALLOC softjob or KCPU command can get processed in the kernel worker thread. GPU page fault handling is done in a kernel thread anyway. Userspace can make Kbase allocate a large number of backing pages from the kernel thread to cause an out-of-memory situation, which would eventually lead to a kernel panic as the OoM killer would run out of suitable processes to kill. Though Kbase will account for the backing pages and the OoM killer will try to kill the culprit process, the memory already allocated by the process won't get freed, as context termination would remain blocked or won't kick in while the kernel thread keeps trying to allocate the backing pages. For an allocation that is done from the context of a kernel thread, the OoM killer won't consider the kernel thread for killing and the kernel would keep retrying to allocate a physical page as long as the OoM killer is able to kill processes. For a memory allocation done from the ioctl context, the kernel would eventually stop retrying when it sees that the process has been marked for killing by the OoM killer. This commit adds a check for process exit in the page allocation loop. The check allows the kernel thread to swiftly exit the page allocation loop once the OoM killer has initiated the killing of the culprit process (for which the kernel thread is trying to allocate pages), thereby unblocking context termination and freeing of GPU memory already allocated by the process. This helps in preventing the kernel panic and also limits the number of innocent processes that get killed. The use of the __GFP_RETRY_MAYFAIL flag didn't help in all the scenarios. The flag ensures that the OoM killer is not invoked directly and the kernel doesn't keep retrying to allocate the page. But when the system is running low on memory, other threads can invoke the OoM killer and the page allocation request from the kthread could continue to get satisfied due to the killing of other processes, and so the kthread may not always exit the page allocation loop in a timely manner. (cherry picked from commit 3c5c9328a7fc552e61972c1bbff4b56696682d30) GPUCORE-36402: Fix potential memleak and NULL ptr deref issue in Kbase The commit 3c5c9328a7fc552e61972c1bbff4b56696682d30 updated Kbase to check for the process exit in every iteration of the page allocation loop when the allocation is done from the context of a kernel worker thread. The commit introduced a potential memleak and NULL pointer dereference issue (which was reported by Coverity). This commit adds the required fix for the 2 issues and also sets the task pointer only for the Userspace created contexts and not for the contexts created by Kbase, i.e. the privileged context created for the HW counter dumping and for the WA of HW issue TRYM-3485. Bug: 275614526 Change-Id: I8107edce09a2cb52d8586fc9f7990a25166f590e Signed-off-by: Guus Sliepen <gsliepen@google.com> Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/5169 (cherry picked from commit 8294169160ebb0d11d7d22b11311ddf887fb0b63)
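Illustration (not part of the commit): the per-iteration exit check can be sketched in user space. `fake_task`, `alloc_pages_for_owner` and the demo allocator are hypothetical. A NULL owner models a context created by Kbase itself, for which the message says no task pointer is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the process that owns a GPU context. */
struct fake_task {
    bool exiting; /* set once the process has been marked for killing */
};

/* Hypothetical single-page allocator; returns false on failure. */
typedef bool (*alloc_page_fn)(void *ctx);

/*
 * Allocate up to nr_pages on behalf of `owner`, checking on every iteration
 * whether the owner is exiting. Returns the number of pages allocated.
 */
size_t alloc_pages_for_owner(const struct fake_task *owner, size_t nr_pages,
                             alloc_page_fn alloc_page, void *ctx)
{
    size_t done = 0;

    assert(alloc_page != NULL);
    while (done < nr_pages) {
        /* Bail out early so context termination is not blocked. */
        if (owner != NULL && owner->exiting)
            break;
        if (!alloc_page(ctx))
            break;
        done++;
    }
    return done;
}

/* Demo allocator: always succeeds, marks the owner as exiting after N pages. */
struct demo_ctx {
    struct fake_task *owner;
    size_t kill_after;
    size_t count;
};

bool demo_alloc_page(void *ctx)
{
    struct demo_ctx *d = ctx;

    d->count++;
    if (d->count >= d->kill_after)
        d->owner->exiting = true;
    return true;
}
```

The demo allocator never fails, so without the exit check the loop would run to completion. With the check it stops on the first iteration after the owner is marked.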
2023-05-05 | Revert "Revert "GPUCORE-36748 Fix kbase_gpu_mmap() error handling"" [tags: android-13.0.0_r0.107, android-13.0.0_r0.106, android-13.0.0_r0.105, android-13.0.0_r0.104, android-13.0.0_r0.103, android-13.0.0_r0.100] | Guus Sliepen
This reverts commit 75b4a4ab15df252b112439300203dbc9b6d46922. Bug: 274002431 Change-Id: I7055294a6615e8ff282b47f822d67ecb709307a3
2023-04-27 | mali_kbase: [SLC-VK] Add CCTX memory class for explicit SLC allocations. | Aleks Todorov
Bug: 265007605 Test: build_slider.sh UMD: http://ag/22336262 Change-Id: Ifc22c6b961860ad7955e974d21c2b7960fa55647
2023-04-21 | Revert "GPUCORE-36748 Fix kbase_gpu_mmap() error handling" [tags: android-t-qpr3-beta-3.1_r0.5, android-t-qpr3-beta-3.1_r0.4, android-t-qpr3-beta-3.1_r0.3, android-13.0.0_r0.92, android-13.0.0_r0.85, android-13.0.0_r0.84, android-13.0.0_r0.83, android-13.0.0_r0.82, android-gs-raviole-5.10-t-qpr3-beta-3, android-gs-pantah-5.10-t-qpr3-beta-3, android-gs-bluejay-5.10-t-qpr3-beta-3] | Kevin DuBois
This reverts commit 04bf4049652e9aa3e952bdc30c560054e1c0f060. Bug: 274827412 Reason for revert: stability Change-Id: I923387539eabbf72f51376decf95526f13339656
2023-04-21 | Revert "GPUCORE-36682 Lock MMU while disabling AS to prevent use after free" [tags: android-u-beta-2_r0.4, android-u-beta-2_r0.3, android-u-beta-2_r0.2] | Kevin DuBois
This reverts commit d4a9cc691fdde6aae0f5d40ad3d949ab76518e42. Bug: 274827412 Reason for revert: stability Change-Id: Id952d2656a642b0f363d579a51843a03e7750c2c
2023-04-21 | Revert "GPUCORE-36748 Fix kbase_gpu_mmap() error handling" | Kevin DuBois
This reverts commit 04bf4049652e9aa3e952bdc30c560054e1c0f060. Bug: 274827412 Reason for revert: stability Change-Id: I530dc9425d9cb52ab88e8211c789def29b7607ac
2023-04-21 | Revert "GPUCORE-36682 Lock MMU while disabling AS to prevent use after free" | Kevin DuBois
This reverts commit d4a9cc691fdde6aae0f5d40ad3d949ab76518e42. Bug: 274827412 Reason for revert: stability Change-Id: I929c4e7b11bd5b62a0c14a5b960b32127b26233a
2023-04-21 | [Official] MIDCET-4458, GPUCORE-36429: Prevent JIT allocations following unmap of tracking page | Suzanne Candanedo
This commit introduces new checks to ensure that, like allocations of native memory, JIT memory allocations are blocked after the unmap of the tracking page. Bug: 275615867 Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/5168/ Change-Id: I32460df4e8898784e75084193e038a912f67b33e (cherry picked from commit 240d4e9206528a43340c22aa69b124436f9a4e01)
2023-04-21 | [Official] MIDCET-4458, GPUCORE-36635 Fix memory leak via GROUP_SUSPEND | Suzanne Candanedo
Userspace can cause a memory leak for physical pages of SAME_VA allocations through the GROUP_SUSPEND kcpu command. This commit fixes the memleak issue. Bug: 275620394 Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/5167 Change-Id: Iec155e23ea135cf1ea7592f38934dc617cc6b10e (cherry picked from commit 1f565b867e7bff3b3307db0960fabf028f95d981)
2023-04-17 | Flush mmu updates regardless of coherency mode | Varad Gautam
kbase avoids flushing MMU updates on coherent systems, as these systems are expected to snoop CPU caches instead. This presents a problem on GS101/GS201 devices, where GPU->CPU cache snoop requests do not work as intended when the GPU is in protected mode (b/192236116) and the GPU ends up seeing stale memory / runs into page faults. As a software workaround, always flush MMU updates regardless of coherency mode, so that the GPU page tables are accurate. Note: This was initially added in I5473345d and reverted in I2a41a2044. Bug: 200555454 Change-Id: I51187cd7c042bde42c4fcdf976a9f7f8828155e1 Signed-off-by: Varad Gautam <varadgautam@google.com>
2023-04-17 | kbase: Add a debugfs file to test GPU uevents | Varad Gautam
This adds a trigger_uevent debugfs node that takes the uevent type and info as write parameter and fires the corresponding uevent. Bug: 275367216 Bug: 275367223 Test: Combined with userspace patches: b/276704984#comment2 Change-Id: Ic1e069259e5d068a4677c8d1472d74485b8a904c Signed-off-by: Varad Gautam <varadgautam@google.com>
2023-04-17 | kbase: Add new GPU uevents to kbase | Varad Gautam
Add the following types of GPU uevents: 1. KMD_ERROR: Reports incidents where kbase runs into an error (includes FW errors). 2. GPU_RESET: Reports failed or successful GPU reset incidents. Bug: 275367216 Bug: 275367223 Test: Combined with userspace patches: b/276704984#comment2 Change-Id: Ie0d18f96c590cba561e8425eba210136bfef039d Signed-off-by: Varad Gautam <varadgautam@google.com>
2023-04-17 | pixel: Introduce GPU uevents to notify userspace of GPU failures | Varad Gautam
Add an interface to emit uevents with env GPU_UEVENT_TYPE and GPU_UEVENT_INFO from kbase. This will be used to report common GPU failure conditions. To avoid flooding the userspace with uevents, these are ratelimited to one uevent per GPU_UEVENT_TYPE per GPU_UEVENT_TIMEOUT_MS. Bug: 275367216 Bug: 275367223 Test: Combined with userspace patches: b/276704984#comment2 Change-Id: I557df22c87f435aca4d05e0038609e1c9f82de54 Doc: go/pixel-gpu-instability-monitoring Signed-off-by: Varad Gautam <varadgautam@google.com>
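Illustration (not part of the commit): per-type rate limiting can be sketched as a table of last-emit timestamps. GPU_UEVENT_TIMEOUT_MS is named in the message, but its value here (1000) is assumed. The enum members, the `uevent_limiter` struct and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Name from the commit message; the value is assumed for this sketch. */
#define GPU_UEVENT_TIMEOUT_MS 1000u

/* Hypothetical event types for illustration. */
enum demo_uevent_type { DEMO_UEVENT_KMD_ERROR, DEMO_UEVENT_GPU_RESET, DEMO_UEVENT_TYPE_COUNT };

struct uevent_limiter {
    bool seen[DEMO_UEVENT_TYPE_COUNT];        /* has this type ever been emitted? */
    uint64_t last_ms[DEMO_UEVENT_TYPE_COUNT]; /* time of the last emitted event */
};

void uevent_limiter_init(struct uevent_limiter *lim)
{
    assert(lim != NULL);
    for (size_t i = 0; i < DEMO_UEVENT_TYPE_COUNT; i++) {
        lim->seen[i] = false;
        lim->last_ms[i] = 0;
    }
}

/* Returns true if an event of `type` may be emitted at time `now_ms`. */
bool uevent_should_emit(struct uevent_limiter *lim, enum demo_uevent_type type,
                        uint64_t now_ms)
{
    assert(lim != NULL && type < DEMO_UEVENT_TYPE_COUNT);

    if (lim->seen[type] && now_ms - lim->last_ms[type] < GPU_UEVENT_TIMEOUT_MS)
        return false; /* same type emitted too recently */

    lim->seen[type] = true;
    lim->last_ms[type] = now_ms;
    return true;
}
```

Each type has its own timestamp, so a burst of one kind of failure does not suppress a report of a different kind.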
2023-04-17 | [Official] MIDCET-4458, GPUCORE-36654 Use %pK on GPU bus fault | Suzanne Candanedo
The physical address of a GPU bus fault is useful for debugging purposes. However, the physical address (emitted via 0x%016llX) is also sensitive information, so it should be exposed to user space cautiously. The Linux kernel provides control over physical address exposure via 'kptr_restrict'. To allow this control to work, '%pK' must be used. Bug: 275623256 Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/5166 Change-Id: I23171bafc47e96045e42dad533ed28fc8bbcef6b (cherry picked from commit d35be16de81d9bc55dc0a586d661391e1989d6c0)
2023-04-14 | Merge "Merge android13-gs-pixel-5.10-tm-qpr3 into android13-gs-pixel-5.10-udc" into android13-gs-pixel-5.10-udc | Pindar Yang
2023-04-13 | mali_kbase: platform: Init GPU SLC context | Jack Diver
Init call for SLC portion of the GPU context was missing. Bug: 276392249 Change-Id: I7f9b8a89a463f66845f5da91adca63d30f138c83 Signed-off-by: Jack Diver <diverj@google.com>
2023-04-13 | Add partial term support to pixel gpu init | Ankit Goyal
Bug: 254279889 Test: Boots to home Signed-off-by: Jack Diver <diverj@google.com> (cherry picked from https://partner-android-review.googlesource.com/q/commit:08be62386b8b087e1979c0396a23847246ca36bb) Change-Id: I1427019107b67139381390a5a73bf518b99927c8
2023-04-12 | mali_kbase: Add missing wake_up(poweroff_wait) when cancelling poweroff. | Michael Stokes
kbase_pm_update_active() may cancel an ongoing poweroff before the poweroff has completed and enqueued a gpu_poweroff_wait_work item, which when executed would have unblocked any waiters in kbase_pm_wait_for_poweroff_work_complete(). kbase_pm_update_active() must therefore also call wake_up(poweroff_wait) after resetting poweroff_wait_in_progress to false, to prevent kbase_pm_wait_for_poweroff_work_complete() from waiting indefinitely. This change also modifies the diagnostic patch in kbase_pm_wait_for_poweroff_work_complete() to avoid triggering a subsystem coredump if a gpu_poweroff_wait_work item is actually pending. Bug: 274137481 Test: Stability soak testing Change-Id: I9009a6eed7aa305ae04179263e308ba4259afc6a
2023-04-09 | Merge android13-gs-pixel-5.10-tm-qpr3 into android13-gs-pixel-5.10-udc | PixelBot AutoMerger
SBMerger: 516612970 Change-Id: I2e07185ea841a4f0de9998a41ddfbef7d9e6aa8e Signed-off-by: SecurityBot <android-nexus-securitybot@system.gserviceaccount.com>
2023-04-06 | mali_kbase: platform: mgm: Get accurate SLC partition size | Jack Diver
Use mgm_resize_callback to update memory group size. Add entry point allowing memory group size to be queried. Bug: 264990406 Test: Boot to home Test: gfx-bench mh3.1 Change-Id: I80f595724c7418b97e07679719d2b76e4ee7b96f Signed-off-by: Jack Diver <diverj@google.com>
2023-04-05 | mali_kbase: Remove redundant if check to unblock suspend | Jack Diver
Completed atoms are expected to always have a flag indicating they were submitted. A warning is present to assert this fact. Currently, if the flag is not present it will block GPU suspend. Remove the if to unblock suspend and prevent a kernel lockup. Bug: 233522199 Change-Id: I541ac835ec36562f7724b35e171d71537e763ed9 Signed-off-by: Jack Diver <diverj@google.com>
2023-04-04 | mali_kbase: reset: Flush SSCD worker before resetting the GPU | Varad Gautam
coredump_work isn't guaranteed to happen before reset, which means the resulting SSCD can contain either of pre-reset or post-reset state. post-reset state isn't helpful in debugging a GPU hang. Ensure that we always collect the pre-reset state by flushing the coredump worker before resetting the GPU. Bug: 264595878 Test: Raced reset debugfs write with trigger_core_dump sysfs write to check that the device is stable and coredump happens before reset. Change-Id: I7a553f8dd156d5dbee2d8008a70545641ed8dbe9 Signed-off-by: Varad Gautam <varadgautam@google.com>
2023-03-31 | pixel_gpu_sscd: Prevent dumping multiple SSCDs when the GPU hangs | Varad Gautam
Add a heuristic to ratelimit SSCD generation for "GPU hang"-type coredumps. Typically when the GPU hangs, this codepath is hit multiple times leading to unnecessary SSCD generation per hang (sometimes > 200 coredumps for a single incident). The heuristic skips SSCD generation depending on: 1. whether there was a "GPU hang" coredump recently within the GPU_HANG_SSCD_TIMEOUT_MS time window. 2. whether there was an unsuccessful GPU reset, which implies the system will end up rebooting soon. Change-Id: I761057aee9c4ff9f32d658c49b99eb162486033b Bug: 264595878 Signed-off-by: Varad Gautam <varadgautam@google.com> Test: b/264595878#comment7
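Illustration (not part of the commit): the two skip conditions can be combined in one small decision function. GPU_HANG_SSCD_TIMEOUT_MS is named in the message, but its value here (5000) is assumed. The `hang_sscd_state` struct and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Name from the commit message; the value is assumed for this sketch. */
#define GPU_HANG_SSCD_TIMEOUT_MS 5000u

/* Hypothetical bookkeeping for "GPU hang" coredumps. */
struct hang_sscd_state {
    bool have_last;        /* has a hang coredump been taken before? */
    uint64_t last_dump_ms; /* when the last hang coredump was taken */
    bool reset_failed;     /* a GPU reset failed, so a reboot is expected soon */
};

/* Returns true if a new "GPU hang" coredump should be generated now. */
bool hang_sscd_should_dump(struct hang_sscd_state *st, uint64_t now_ms)
{
    assert(st != NULL);

    /* Condition 2: after a failed reset the system will reboot anyway. */
    if (st->reset_failed)
        return false;

    /* Condition 1: a hang coredump was already taken inside the window. */
    if (st->have_last && now_ms - st->last_dump_ms < GPU_HANG_SSCD_TIMEOUT_MS)
        return false;

    st->have_last = true;
    st->last_dump_ms = now_ms;
    return true;
}
```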
2023-03-31 | mali_kbase: reset: Add a helper to check GPU reset failure | Varad Gautam
Kbase upstreaming: Pending Change-Id: I867d64897785348d499ad4d9a4f4c95f95e8df85 Signed-off-by: Varad Gautam <varadgautam@google.com> Bug: 264595878
2023-03-29 | Revert "mali_kbase: mem: Prevent vma splits" | Debarshi Dutta
In the original bug, the protected memory imports via Base were ignoring the actual size of the import that came back from the kernel memory import routines. This resulted in errors: when these imports were freed, the incorrect size was passed, so only sub-regions of the original mapped range were unmapped and the GPU and CPU VAs ended up inconsistent. A WAR was added to prevent VMA splits temporarily until a fix was provided for the protected memory size mismatch. With that fix in place, this WAR is no longer necessary. The WAR now causes failures when an application calls mprotect(restrictive) on memory already allocated and mmapped by Vulkan API calls. Vulkan alloc() invokes the cmem_heap_alloc() function, which in the general case allocates some extra memory to fulfil the worst-case alignment requirements. As a result, invoking mprotect on the partial user-provided range always results in VMA splits. For further reference, see this article: https://lwn.net/Articles/182847/ Bug 269535398 This reverts commit 6d1d889156e68493842f5bb18fc9aed74cc57454. Change-Id: Ic5749fab2613d6495fd3669356697ff40bfafcb7
2023-03-28 | GPUCORE-36682 Lock MMU while disabling AS to prevent use after free [tags: android-t-qpr3-beta-3_r0.5, android-t-qpr3-beta-3_r0.4, android-t-qpr3-beta-3_r0.3] | Suzanne Candanedo
*Affects CSF GPUs only, but changes to common code.* During an invalid GPU page fault, kbase will try to flush the GPU cache and disable the faulting address space (AS). There is a small window between flushing of the GPU L2 cache (MMU resumes) and when the AS is disabled where existing jobs on the GPU may access memory for that AS, dirtying the GPU cache. This is a problem as the kctx->as_nr is marked as KBASEP_AS_NR_INVALID and thus no cache maintenance will be performed on the AS of the faulty context when cleaning up the csg_slot and releasing the context. This patch addresses that issue by: 1. locking the AS via a GPU command 2. flushing the cache 3. disabling the AS 4. unlocking the AS This ensures that any jobs remaining on the GPU will not be able to access the memory due to the locked AS. Once the AS is unlocked, any memory access will fail as the AS is now disabled. Change-Id: I5e02face6ca0fa4526576dd70d0261ea3ee69506 (cherry picked from commit 566789dffda3dfec00ecf00f9819e7a515fb2c61) Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/5071 Bug: 274014055
2023-03-27 | GPUCORE-36748 Fix kbase_gpu_mmap() error handling | Suzanne Candanedo
The error recovery path of kbase_gpu_mmap() has been fixed to handle failures in creating GPU mappings, and in particular to handle the case of memory aliases. The new logic doesn't try to teardown GPU mappings, because all MMU functions to insert pages undo their insertion in case of failure. The only use case that needs special attention is memory aliases: only the previous iterations of the loop shall be undone, by using the physical pages which are referenced by the memory alias descriptor. The bug described in GPUCORE-37557 has been fixed too: the GPU VA of the region created for the Base memory alias shall be set to 0, otherwise the kbase_remove_va_region() will be called twice on the same region: the first time to undo the mapping, and the second time when the user space frees the Base memory handle. Change-Id: I018c50c2c9ff0a8f9175d4c74764bf64054a060f (cherry picked from commit b2fdd6abc5b9a2a1c1889e3cdeaf8b54c00a35d8) Provenance: https://code.ipdelivery.arm.com/c/GPU/mali-ddk/+/5072 Bug: 274002431
2023-03-26 | Merge android13-gs-pixel-5.10-tm-qpr3 into android13-gs-pixel-5.10-udc | PixelBot AutoMerger
SBMerger: 516612970 Change-Id: Ic3745d8ba6e262a0f971a80e3e304ce2cc91cc26 Signed-off-by: SecurityBot <android-nexus-securitybot@system.gserviceaccount.com>
2023-03-23 | Powercycle mali to recover from a PM timeout | Varad Gautam
The existing reset flow (kbase_pm_do_reset()) is: 1. Write to SOFT_RESET and wait for irq until timeout. 2. If RESET_COMPLETED irq timed out, write to HARD_RESET and wait for irq until timeout. 3. If RESET_COMPLETED irq timed out, powercycle the GPU via kbase_pm_hw_reset(). If a power transition timed out (ie, kbase_pm_timed_out()), writing to SOFT/HARD_RESET regs is unreliable and can send the GPU into an undefined state (eg, when writing to SOFT/HARD_RESET regs if L2 is transitioning) and prevent recovery. Introduce a RESET_FLAGS_FORCE_PM_HW_RESET flag to allow resetting the GPU via powercycle, which currently only happens when soft/hard reset both fail, and use only this method to reset the GPU from kbase_pm_timed_out(). Note: Originally pushed as pa/Ic57680225, re-merge this patch per go/p22-udc-gfx-rollout kbase upstreaming: WIP: b/243522189#comment23 Change-Id: I5b8ca3b9e49cf355f665c0b56061e06ef3ed9e0b Signed-off-by: Varad Gautam <varadgautam@google.com> Bug: 241217496 Bug: 270305834 Test: (v2) SST ~5700h (b/271438225#comment14) / (v1) SST ~2500h (b/265003962)
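Illustration (not part of the commit): the escalation order and the new forced path can be sketched as a decision function. RESET_FLAGS_FORCE_PM_HW_RESET is named in the message, but its value here is assumed. The enum and function name are hypothetical, and `soft_ok` / `hard_ok` simulate whether the RESET_COMPLETED irq arrives before the timeout.

```c
#include <assert.h>
#include <stdbool.h>

/* Name from the commit message; the bit value is assumed for this sketch. */
#define RESET_FLAGS_FORCE_PM_HW_RESET 0x1u

enum reset_method { RESET_VIA_SOFT, RESET_VIA_HARD, RESET_VIA_POWERCYCLE };

/*
 * Decide which method ends up resetting the GPU.
 * soft_ok / hard_ok: would RESET_COMPLETED arrive in time for that method?
 */
enum reset_method pick_reset_method(unsigned int flags, bool soft_ok, bool hard_ok)
{
    /* After a PM timeout, skip the register writes and powercycle directly. */
    if (flags & RESET_FLAGS_FORCE_PM_HW_RESET)
        return RESET_VIA_POWERCYCLE;

    if (soft_ok)
        return RESET_VIA_SOFT;       /* step 1 succeeded */
    if (hard_ok)
        return RESET_VIA_HARD;       /* step 2 succeeded */
    return RESET_VIA_POWERCYCLE;     /* step 3: last resort */
}
```

With the flag set, the outcome no longer depends on the soft or hard reset results, which is the point of the change: those register writes are never attempted after a power-transition timeout.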
2023-03-23 | mali_pixel: Downgrade invalid region warning to dev_dbg | Jack Diver
A malicious app could cause kernel log storm by intentionally supplying invalid virtual addresses. The message is primarily intended for debugging cache performance, and is not needed in bug reports. Bug: 264990406 Test: Boot to home Test: Malicious test app Change-Id: If9292038f286456dc593ad0dcbc6e0a74e063e5c Signed-off-by: Jack Diver <diverj@google.com>
2023-03-23 | mali_kbase: platform: Perform partition resize and region migration | Jack Diver
Implement demand based SLC partition resizing. Implement region migration into the SLC memory group. Bug: 264990406 Test: Boot to home Test: gfx-bench mh3.1 Change-Id: Ibf763652f3db133066c254b66b1316a49803e54f Signed-off-by: Jack Diver <diverj@google.com>
2023-03-23 | mali_pixel: Add entry point for resizing a memory group | Jack Diver
Add a backdoor entry point to allow the mali_kbase platform integration to resize the GPU SLC memory group, by mutating the underlying partition. Bug: 264990406 Test: Build mali_kbase, mali_pixel Test: Boot to home Change-Id: I8f933625b040d419b9e5676976ea3cf9cde87cec Signed-off-by: Jack Diver <diverj@google.com>
2023-03-23 | mali_kbase: mali_pixel: Add mali_kbase dependency on mali_pixel | Jack Diver
Add a dependency on mali_pixel so that it gets built in the same sandbox as mali_kbase. This enables mali_kbase to access exported symbols from mali_pixel. Bug: 264990406 Test: build mali_kbase, mali_pixel Change-Id: Ibb36df774b2578dab7d7c37ab76834ffbcd66106 Signed-off-by: Jack Diver <diverj@google.com>
2023-03-23 | platform: Implement SLC partition accounting | Jack Diver
Track the per kctx usage and demand of SLC, updating in response to the buffer_liveness_ioctl. Setting SLC PBHA bits and resizing the partition are stubbed for now. Bug: 264990406 Test: Boot to home Test: Manual ioctl call Change-Id: Idfe54f7baad25b9403a69f7269b7c8fc53dedaaa Signed-off-by: Jack Diver <diverj@google.com>
2023-03-23 | mali_kbase: Implement buffer liveness ioctl | Jack Diver
Add SLC platform integration, and plumb custom ioctl through. Bug: 264990406 Test: Boot to home Test: Manual ioctl call Change-Id: I0009cec83f54cfed8e12477c5ebd7aa01cf50cc8 Signed-off-by: Jack Diver <diverj@google.com>
2023-03-23 | mali_kbase: platform: Make kctx platform_data extensible | Jack Diver
Allocate a platform_data struct, which can be extended for more uses than per-UID DVFS metrics. Bug: 264990406 Test: Boot Test: Check DVFS time_in_state reporting Change-Id: Iae17f85e6ece87e5bd8aa6f13c75c6f5504e8436 Signed-off-by: Jack Diver <diverj@google.com>
2023-03-23 | mali_kbase: Add buffer liveness ioctl | Jack Diver
Add an ioctl that userspace can use to inform the kernel of buffer live ranges. The ioctl is currently a stub. Bug: 264990406 Change-Id: Ie36395be5a1e835ed1ed39ba29737f4e51b8deee Signed-off-by: Jack Diver <diverj@google.com>
2023-03-21 | Avoid L2 powerup delay workaround on JM GPUs | Varad Gautam
b/265461971 showed JM devices regressing due to this delay, which is only needed on CSF GPUs. Change-Id: I3c37dc8d344965bec5c8697e84e7355e725513cd Bug: 265461971 Test: Ran launcher-action-suite microbenchmark on raven, perfetto_ft_launcher-frame_dur_p50 reported 1962260 with patch and 3000472.56 without