linux/kernel
Bing Jiao 1aceed565f mm/vmscan: fix demotion targets checks in reclaim/demotion
Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion",
v9.

This patch series addresses two issues in demote_folio_list(),
can_demote(), and next_demotion_node() in reclaim/demotion.

1. demote_folio_list() and can_demote() do not correctly check
   demotion target against cpuset.mems_effective, which will cause (a)
   pages to be demoted to not-allowed nodes and (b) pages fail demotion
   even if the system still has allowed demotion nodes.

   Patch 1 fixes this bug by updating cpuset_node_allowed() and
   mem_cgroup_node_allowed() to return effective_mems, allowing directly
   logic-and operation against demotion targets.

2. next_demotion_node() returns a preferred demotion target, but it
   does not check the node against allowed nodes.

   Patch 2 ensures that next_demotion_node() filters against the allowed
   node mask and selects the closest demotion target to the source node.


This patch (of 2):

Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks against cpuset.mems_effective in reclaim/demotion.

Commit 7d709f49ba ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to can_demote().
However:

  1. It does not apply this check in demote_folio_list(), which leads
     to situations where pages are demoted to nodes that are
     explicitly excluded from the task's cpuset.mems.

  2. It checks only the nodes in the immediate next demotion hierarchy
     and does not check all allowed demotion targets in can_demote().
     This can cause pages to never be demoted if the nodes in the next
     demotion hierarchy are not set in mems_effective.

These bugs break resource isolation provided by cpuset.mems.  This is
visible from userspace because pages can either fail to be demoted
entirely or are demoted to nodes that are not allowed in multi-tier memory
systems.

To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return effective_mems, allowing directly
logic-and operation against demotion targets.  Also update can_demote()
and demote_folio_list() accordingly.

Bug 1 reproduction:
  Assume a system with 4 nodes, where nodes 0-1 are top-tier and
  nodes 2-3 are far-tier memory. All nodes have equal capacity.

  Test script:
    echo 1 > /sys/kernel/mm/numa/demotion_enabled
    mkdir /sys/fs/cgroup/test
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    swapoff -a
    # Expectation: Should respect node 0-2 limit.
    # Observation: Node 3 shows significant allocation (MemFree drops)
    stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1

Bug 2 reproduction:
  Assume a system with 6 nodes, where nodes 0-2 are top-tier,
  node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
  All nodes have equal capacity.

  Test script:
    echo 1 > /sys/kernel/mm/numa/demotion_enabled
    mkdir /sys/fs/cgroup/test
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    swapoff -a
    # Expectation: Pages are demoted to Nodes 4-5
    # Observation: No pages are demoted before oom.
    stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2

Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com
Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com
Fixes: 7d709f49ba ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao <bingjiao@google.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12 15:42:52 -08:00
..
bpf bpf: Reject BPF_MAP_TYPE_INSN_ARRAY in check_reg_const_str() 2026-01-07 19:03:46 -08:00
cgroup mm/vmscan: fix demotion targets checks in reclaim/demotion 2026-02-12 15:42:52 -08:00
configs hung_task: panic when there are more than N hung tasks at the same time 2025-11-12 10:00:14 -08:00
debug
dma dma/pool: Avoid allocating redundant pools 2026-01-14 11:00:00 +01:00
entry
events Fix perf swevent hrtimer deinit regression. 2026-01-11 06:55:27 -10:00
futex Futex changes for v6.19: 2025-12-10 17:21:30 +09:00
gcov gcov: add support for GCC 15 2025-11-09 21:19:44 -08:00
irq treewide: Update email address 2026-01-11 06:09:11 -10:00
kcsan
livepatch livepatching changes for 6.19 2025-12-03 13:46:48 -08:00
liveupdate kho: kho_preserve_vmalloc(): don't return 0 when ENOMEM 2026-01-26 19:03:48 -08:00
locking RCU pull request for v6.19 2025-12-03 12:18:07 -08:00
module kernel: modules: Add SPDX license identifier to kmod.c 2026-01-15 16:58:28 -08:00
power mm, swap: cleanup swap entry management workflow 2026-01-31 14:22:56 -08:00
printk printk fixup for 6.19 rc6 2026-01-16 09:46:59 -08:00
rcu RCU pull request for v6.19 2025-12-03 12:18:07 -08:00
sched sched/deadline: Use ENQUEUE_MOVE to allow priority change 2026-01-15 21:57:53 +01:00
time hrtimer: Fix softirq base check in update_needs_ipi() 2026-01-13 11:04:41 +01:00
trace ftrace: Do not over-allocate ftrace memory 2026-01-15 10:17:53 -05:00
unwind
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.kexec liveupdate: kho: move to kernel/liveupdate 2025-11-27 14:24:33 -08:00
Kconfig.locks
Kconfig.preempt
Makefile liveupdate: kho: move to kernel/liveupdate 2025-11-27 14:24:33 -08:00
acct.c
async.c
audit.c
audit.h
audit_fsnotify.c
audit_tree.c
audit_watch.c
auditfilter.c
auditsc.c
backtracetest.c
bounds.c x86/asm: Remove ANNOTATE_DATA_SPECIAL usage 2025-12-03 16:53:19 +01:00
capability.c
cfi.c
compat.c
configs.c
context_tracking.c
cpu.c cpu: Make atomic hotplug callbacks run with interrupts disabled on UP 2025-12-10 15:49:11 +09:00
cpu_pm.c syscore: Pass context data to callbacks 2025-11-14 10:01:52 +01:00
crash_core.c crash: fix crashkernel resource shrink 2025-11-15 10:52:01 -08:00
crash_core_test.c
crash_dump_dm_crypt.c
crash_reserve.c crash: let architecture decide crash memory export to iomem_resource 2025-11-12 10:00:15 -08:00
cred.c kernel-6.19-rc1.cred 2025-12-01 13:45:41 -08:00
delayacct.c
dma.c
elfcorehdr.c
exec_domain.c
exit.c Significant patch series in this pull request: 2025-12-06 14:01:20 -08:00
exit.h
extable.c
fail_function.c
fork.c Significant patch series in this pull request: 2025-12-06 14:01:20 -08:00
freezer.c
gen_kheaders.sh
groups.c
hung_task.c hung_task: add hung_task_sys_info sysctl to dump sys info on task-hung 2025-11-20 14:03:43 -08:00
iomem.c
irq_work.c
jump_label.c
kallsyms.c kallsyms: Fix wrong "big" kernel symbol type read from procfs 2025-11-24 16:56:24 +01:00
kallsyms_internal.h
kallsyms_selftest.c
kallsyms_selftest.h
kcmp.c
kcov.c
kexec.c
kexec_core.c kernel/kexec: fix IMA when allocation happens in CMA area 2025-12-23 11:23:14 -08:00
kexec_elf.c
kexec_file.c
kexec_internal.h
kheaders.c
kprobes.c
kstack_erase.c sysctl: remove __user qualifier from stack_erasing_sysctl buffer argument 2025-11-27 15:44:53 +01:00
ksyms_common.c
ksysfs.c kexec: move sysfs entries to /sys/kernel/kexec 2025-11-27 14:24:42 -08:00
kthread.c kthread: Warn if mm_struct lacks user_ns in kthread_use_mm() 2025-12-24 21:32:58 +01:00
latencytop.c
module_signature.c
notifier.c
nscommon.c ns: rename is_initial_namespace() 2025-11-11 10:01:31 +01:00
nsproxy.c nsproxy: fix free_nsproxy() and simplify create_new_namespaces() 2025-11-14 13:10:38 +01:00
nstree.c nstree: fix kernel-doc comments for internal functions 2025-11-14 13:10:38 +01:00
padata.c padata: remove __padata_list_init() 2025-11-14 18:15:49 +08:00
panic.c panic: only warn about deprecated panic_print on write access 2026-01-19 12:30:01 -08:00
params.c
pid.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
pid_namespace.c pid: rely on common reference count behavior 2025-11-11 10:01:32 +01:00
pid_sysctl.h
profile.c
ptrace.c
range.c
reboot.c
regset.c
relay.c relay: update relay to use mmap_prepare 2025-11-16 17:28:11 -08:00
resource.c Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()" 2025-11-27 14:24:45 -08:00
resource_kunit.c
rseq.c
scftorture.c
scs.c scs: fix a wrong parameter in __scs_magic 2025-11-12 10:00:13 -08:00
seccomp.c
signal.c signal: Move MMCID exit out of sighand lock 2025-11-25 19:45:40 +01:00
smp.c smp: Introduce a helper function to check for pending IPIs 2025-11-19 18:06:50 +01:00
smpboot.c
smpboot.h
softirq.c
stacktrace.c
static_call.c
static_call_inline.c
stop_machine.c
sys.c
sys_ni.c
sysctl-test.c
sysctl.c sysctl: Wrap do_proc_douintvec with the public function proc_douintvec_conv 2025-11-27 15:45:38 +01:00
task_work.c
taskstats.c
torture.c
tracepoint.c
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c
up.c
user-return-notifier.c
user.c ns: drop custom reference count initialization for initial namespaces 2025-11-11 10:01:32 +01:00
user_namespace.c
utsname.c
utsname_sysctl.c
vhost_task.c
vmcore_info.c vmcoreinfo: make hwerr_data visible for debugging 2026-01-26 19:03:49 -08:00
watch_queue.c watch_queue: Use local kmap in post_one_notification() 2025-11-19 12:17:28 +01:00
watchdog.c powerpc/watchdog: add support for hardlockup_sys_info sysctl 2026-01-14 22:16:22 -08:00
watchdog_buddy.c
watchdog_perf.c
workqueue.c workqueue: Don't rely on wq->rescuer to stop rescuer 2025-11-21 09:45:36 -10:00
workqueue_internal.h