linux/kernel/power
Kairui Song 3697615914 mm, swap: cleanup swap entry management workflow
The current swap entry allocation/freeing workflow has never had a clear
definition.  This makes it hard to debug or add new optimizations.

This commit introduces a proper definition of how swap entries are
allocated and freed.  Now, most operations are folio based, so they will
never exceed one swap cluster, and we now have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with the newly added sanity checks.  It also makes more
optimizations possible.

Swap entries will be mostly allocated and freed with a folio bound.  The
folio lock will be useful for resolving many swap related races.

Now swap allocation (except hibernation) always starts with a folio in the
swap cache, and gets duped/freed protected by the folio lock:

- folio_alloc_swap() - The only allocation entry point now.
  Context: The folio must be locked.
  This allocates one or a set of continuous swap slots for a folio and
  binds them to the folio by adding the folio to the swap cache. The
  swap slots' swap counts start at zero.

- folio_dup_swap() - Increase the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This increases the ref count of swap entries allocated to a folio.
  Newly allocated swap slots' count has to be increased by this helper
  as the folio got unmapped (and swap entries got installed).

- folio_put_swap() - Decrease the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This decreases the ref count of swap entries allocated to a folio.
  Typically, swapin will decrease the swap count as the folio gets
  installed back and the swap entry gets uninstalled.

  This won't remove the folio from the swap cache or free the
  slot. Lazy freeing of swap cache is helpful for reducing IO.
  There is already a folio_free_swap() for immediate cache reclaim.
  This part could be further optimized later.
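
As an illustration only, the folio-based lifecycle described above can be
sketched as a toy userspace model.  This is not kernel code: every toy_*
name is hypothetical, the folio lock is reduced to a flag, and a single
shared counter stands in for the per-slot swap counts.  The rule that cache
reclaim only succeeds once the count is back to zero is an assumption of
this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model of the folio-based workflow; NOT kernel code. */
struct toy_folio {
	bool locked;        /* stands in for the folio lock */
	bool in_swap_cache; /* slots are bound to the folio while true */
	int nr_slots;       /* contiguous slots bound to this folio */
	int swap_count;     /* one shared count, a simplification */
};

/* Models folio_alloc_swap(): bind slots, swap count starts at zero. */
static int toy_alloc_swap(struct toy_folio *f, int nr_slots)
{
	if (!f->locked || f->in_swap_cache)
		return -1;
	f->in_swap_cache = true;
	f->nr_slots = nr_slots;
	f->swap_count = 0;
	return 0;
}

/* Models folio_dup_swap(): a swap entry got installed somewhere. */
static int toy_dup_swap(struct toy_folio *f)
{
	if (!f->locked || !f->in_swap_cache)
		return -1;
	f->swap_count++;
	return 0;
}

/* Models folio_put_swap(): drops the count, keeps the folio cached. */
static int toy_put_swap(struct toy_folio *f)
{
	if (!f->locked || !f->in_swap_cache || f->swap_count == 0)
		return -1;
	f->swap_count--;
	return 0;
}

/* Models folio_free_swap(): reclaim the cache if nothing uses the slots. */
static bool toy_free_swap(struct toy_folio *f)
{
	if (!f->locked || !f->in_swap_cache || f->swap_count != 0)
		return false;
	f->in_swap_cache = false;
	f->nr_slots = 0;
	return true;
}

/* Walks one swap-out / swap-in cycle; returns 0 if every step behaves. */
static int toy_lifecycle_demo(void)
{
	struct toy_folio f = { .locked = true };

	if (toy_alloc_swap(&f, 4) || f.swap_count != 0)
		return 1;
	if (toy_dup_swap(&f))      /* unmap: swap entry installed */
		return 2;
	if (toy_free_swap(&f))     /* must fail, slots still referenced */
		return 3;
	if (toy_put_swap(&f))      /* swapin: swap entry uninstalled */
		return 4;
	if (!f.in_swap_cache)      /* put alone leaves the folio cached */
		return 5;
	if (!toy_free_swap(&f))    /* explicit reclaim now succeeds */
		return 6;
	return 0;
}
```

In this model, toy_put_swap() never drops the folio from the cache on its
own, which mirrors the lazy freeing described above.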

The above locking constraints could be further relaxed when the swap table
is fully implemented.  Currently dup still needs the caller to lock the
swap entry container (e.g.  PTL), or a concurrent zap may underflow the
swap count.

Some swap users need to interact with swap count without involving folio
(e.g.  forking/zapping the page table or mapping truncate without swapin).
In such cases, the caller has to ensure there is no race condition on
whatever owns the swap count and use the below helpers:

- swap_put_entries_direct() - Decrease the swap count directly.
  Context: The caller must lock whatever is referencing the slots to
  avoid a race.

  Typically the page table zapping or shmem mapping truncate will need
  to free swap slots directly. If a slot is cached (has a folio bound),
  this will also try to release the swap cache.

- swap_dup_entry_direct() - Increase the swap count directly.
  Context: The caller must lock whatever is referencing the entries to
  avoid a race, and the entries must already have a swap count > 1.

  Typically, forking will need to copy the page table and hence needs to
  increase the swap count of the entries in the table. The page table is
  locked while referencing the swap entries, so the entries all have a
  swap count > 1 and can't be freed.
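
As an illustration only, the two direct helpers can be sketched as a toy
userspace model over a small count array.  This is not kernel code: the
toy_* names and TOY_NR_SLOTS are hypothetical, locking is left to the
caller just as the text requires, and the sketch treats "already
referenced" as a nonzero count, which is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_SLOTS 8

/* Toy userspace model of the direct helpers; NOT kernel code. */
struct toy_swap_map {
	unsigned char count[TOY_NR_SLOTS]; /* per-slot swap count */
};

/* Models swap_dup_entry_direct(): the slot must already be referenced. */
static int toy_dup_entry_direct(struct toy_swap_map *m, size_t slot)
{
	if (slot >= TOY_NR_SLOTS || m->count[slot] == 0)
		return -1;
	m->count[slot]++;
	return 0;
}

/*
 * Models swap_put_entries_direct(): drop one reference from nr slots.
 * Returns the number of slots whose count reached zero, or -1 if any
 * slot would underflow (in which case nothing is modified).
 */
static int toy_put_entries_direct(struct toy_swap_map *m,
				  size_t first, size_t nr)
{
	int freed = 0;
	size_t i;

	if (first > TOY_NR_SLOTS || nr > TOY_NR_SLOTS - first)
		return -1;
	for (i = first; i < first + nr; i++)
		if (m->count[i] == 0)
			return -1; /* would underflow */
	for (i = first; i < first + nr; i++)
		if (--m->count[i] == 0)
			freed++;
	return freed;
}

/* Fork duplicates an entry, then both owners zap; returns 0 on success. */
static int toy_fork_zap_demo(void)
{
	struct toy_swap_map m = { .count = { 1, 1, 0 } };

	if (toy_dup_entry_direct(&m, 0))             /* fork: 1 -> 2 */
		return 1;
	if (toy_dup_entry_direct(&m, 2) != -1)       /* unreferenced slot */
		return 2;
	if (toy_put_entries_direct(&m, 0, 2) != 1)   /* zap: slot 1 freed */
		return 3;
	if (toy_put_entries_direct(&m, 0, 2) != -1)  /* slot 1 underflows */
		return 4;
	if (toy_put_entries_direct(&m, 0, 1) != 1)   /* last ref on slot 0 */
		return 5;
	return 0;
}
```

The underflow check in toy_put_entries_direct() corresponds to the
concurrent-zap hazard mentioned earlier: without the caller's lock, two
puts could race on a count of one.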

The hibernation subsystem is a bit different, so two special wrappers are
provided:

- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
  helper.

All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.

By separating the workflows, it will be possible to bind folios more
tightly to the swap cache and get rid of SWAP_HAS_CACHE as a temporary
pin.

This commit should not introduce any behavior change.

[kasong@tencent.com: fix leak, per Chris Mason.  Remove WARN_ON, per Lai Yi]
  Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
[ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris]
  Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Kairui Song <ryncsn@gmail.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Chris Mason <clm@meta.com>
Cc: Lai Yi <yi1.lai@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:56 -08:00
Kconfig PM: QoS: Introduce a CPU system wakeup QoS limit 2025-11-25 19:01:29 +01:00
Makefile PM: EM: Add a skeleton code for netlink notification 2025-10-22 21:44:37 +02:00
autosleep.c PM: sleep: autosleep: don't include 'pm_wakeup.h' directly 2024-12-05 12:14:26 +01:00
console.c PM: console: Fix memory allocation error handling in pm_vt_switch_required() 2025-10-18 14:38:23 +02:00
em_netlink.c PM: EM: Add dump to get-perf-domains in the EM YNL spec 2026-01-09 21:44:46 +01:00
em_netlink.h PM: EM: Implement em_notify_pd_created/updated() 2025-10-22 21:44:38 +02:00
em_netlink_autogen.c PM: EM: Add dump to get-perf-domains in the EM YNL spec 2026-01-09 21:44:46 +01:00
em_netlink_autogen.h PM: EM: Add dump to get-perf-domains in the EM YNL spec 2026-01-09 21:44:46 +01:00
energy_model.c PM: EM: Fix memory leak in em_create_pd() error path 2026-01-08 16:55:21 +01:00
hibernate.c Merge branch 'pm-sleep' 2025-11-28 16:01:13 +01:00
main.c Merge branch 'pm-sleep' 2025-11-28 16:01:13 +01:00
power.h PM: sleep: Add support for wakeup during filesystem sync 2025-11-20 22:29:40 +01:00
poweroff.c
process.c Revert "PM: sleep: Make pm_wakeup_clear() call more clear" 2025-10-23 12:48:04 +02:00
qos.c PM: QoS: Introduce a CPU system wakeup QoS limit 2025-11-25 19:01:29 +01:00
snapshot.c PM: hibernate: Rework message printing in swsusp_save() 2025-10-20 20:43:09 +02:00
suspend.c PM: sleep: Fix suspend_test() at the TEST_CORE level 2025-12-28 13:01:39 +01:00
suspend_test.c
swap.c mm, swap: cleanup swap entry management workflow 2026-01-31 14:22:56 -08:00
user.c PM: sleep: Call pm_sleep_fs_sync() instead of ksys_sync_helper() 2025-11-20 22:29:40 +01:00
wakelock.c PM: wakeup: Delete space in the end of string shown by pm_show_wakelocks() 2025-05-09 15:48:39 +02:00