mirror of https://github.com/torvalds/linux.git
Some code paths rely on zero_pfn being fully initialized before
core_initcall. For example, wq_sysfs_init() is a core_initcall
function that eventually results in a call to kernel_execve(), which
causes a page fault followed by an mmput(). If zero_pfn is not
initialized by then, the mm may not get cleaned up properly, resulting
in an error:
    BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
Here is an analysis of the race as seen on a MIPS device. On this
particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
initialized, at which point it becomes PFN 5120:
1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
    kobject_uevent_env+0x7e4/0x7ec
    kset_register+0x68/0x88
    bus_register+0xdc/0x34c
    subsys_virtual_register+0x34/0x78
    wq_sysfs_init+0x1c/0x4c
    do_one_initcall+0x50/0x1a8
    kernel_init_freeable+0x230/0x2c8
    kernel_init+0x10/0x100
    ret_from_kernel_thread+0x14/0x1c
2. kobject_uevent_env() calls call_usermodehelper_exec(), which executes
kernel_execve() asynchronously.
3. Memory allocations in kernel_execve() cause a page fault, bumping the
MM_ANONPAGES RSS counter of the mm:
    add_mm_counter_fast+0xb4/0xc0
    handle_mm_fault+0x6e4/0xea0
    __get_user_pages.part.78+0x190/0x37c
    __get_user_pages_remote+0x128/0x360
    get_arg_page+0x34/0xa0
    copy_string_kernel+0x194/0x2a4
    kernel_execve+0x11c/0x298
    call_usermodehelper_exec_async+0x114/0x194
4. If zero_pfn has not been initialized yet, zap_pte_range() does
not decrement the MM_ANONPAGES RSS counter, and the BUG message is
triggered shortly afterwards when __mmdrop() checks the RSS counters:
    __mmdrop+0x98/0x1d0
    free_bprm+0x44/0x118
    kernel_execve+0x160/0x1d8
    call_usermodehelper_exec_async+0x114/0x194
    ret_from_kernel_thread+0x14/0x1c
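The counter imbalance in steps 3 and 4 can be illustrated with a small
userspace toy. This is only a sketch under stated assumptions, not kernel
code: every toy_* name is hypothetical, a single field stands in for the
MM_ANONPAGES counter, and the reason the teardown path skips the page is
reduced to a boolean rather than modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an mm; only one RSS counter is modelled. */
struct toy_mm {
	long anon_pages;	/* plays the role of MM_ANONPAGES */
};

/* Step 3: the fault path maps an anonymous page and counts it. */
static void toy_fault(struct toy_mm *mm)
{
	mm->anon_pages++;
}

/*
 * Step 4: teardown only uncounts pages it recognizes. The boolean is an
 * assumption of this sketch; it abstracts away why the page is skipped.
 */
static void toy_zap(struct toy_mm *mm, bool page_recognized)
{
	if (!page_recognized)
		return;
	assert(mm->anon_pages > 0);
	mm->anon_pages--;
}

/* Final check: a non-zero counter at drop time is reported as a bug. */
static bool toy_mmdrop_reports_bug(const struct toy_mm *mm)
{
	if (mm->anon_pages == 0)
		return false;
	printf("BUG: Bad rss-counter state type:MM_ANONPAGES val:%ld\n",
	       mm->anon_pages);
	return true;
}

/* One fault, one teardown, one final check. */
static bool toy_exec_lifecycle(bool page_recognized)
{
	struct toy_mm mm = { 0 };

	toy_fault(&mm);
	toy_zap(&mm, page_recognized);
	return toy_mmdrop_reports_bug(&mm);
}
```

Calling toy_exec_lifecycle(true) leaves the counter balanced, while
toy_exec_lifecycle(false) leaves it at 1 and prints a message shaped like
the one in the report.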
To avoid races such as the one described above, run init_zero_pfn() at
early_initcall level. Depending on the architecture, ZERO_PAGE is
either constant or gets initialized even earlier, at paging_init(), so
there is no issue with initializing zero_pfn earlier.
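The effect of moving the initializer to an earlier initcall level can be
sketched in userspace as follows. This is a toy model under stated
assumptions: all toy_* names are hypothetical, only two levels are
modelled, and the calls run synchronously, so the asynchronous usermode
helper from the analysis above is not reproduced. The values 0 and 5120
are the PFNs quoted for the MT7621 device.

```c
#include <assert.h>
#include <stddef.h>

/* Two of the kernel's initcall levels; early runs before core. */
enum toy_level { TOY_EARLY = 0, TOY_CORE = 1 };

typedef void (*toy_initcall_t)(void);

struct toy_entry {
	enum toy_level level;
	toy_initcall_t fn;
};

static unsigned long toy_zero_pfn;	 /* 0 until initialized */
static unsigned long toy_pfn_seen_by_wq; /* what the core initcall observed */

static void toy_init_zero_pfn(void)
{
	toy_zero_pfn = 5120;
}

static void toy_wq_sysfs_init(void)
{
	toy_pfn_seen_by_wq = toy_zero_pfn;
}

/* Run every entry level by level; table order decides within a level. */
static void toy_run_initcalls(const struct toy_entry *tab, size_t n)
{
	for (int level = TOY_EARLY; level <= TOY_CORE; level++) {
		for (size_t i = 0; i < n; i++) {
			if ((int)tab[i].level != level)
				continue;
			assert(tab[i].fn != NULL);
			tab[i].fn();
		}
	}
}

/* Boot once with the zero_pfn initializer registered at the given level. */
static unsigned long toy_boot(enum toy_level zero_pfn_level)
{
	const struct toy_entry tab[] = {
		/* Assumed to come first within the core level in this toy. */
		{ TOY_CORE, toy_wq_sysfs_init },
		{ zero_pfn_level, toy_init_zero_pfn },
	};

	toy_zero_pfn = 0;
	toy_pfn_seen_by_wq = 0;
	toy_run_initcalls(tab, sizeof(tab) / sizeof(tab[0]));
	return toy_pfn_seen_by_wq;
}
```

With the initializer at the core level, toy_boot(TOY_CORE) returns 0
because the other core initcall ran first. With it at the early level,
toy_boot(TOY_EARLY) returns 5120 regardless of the order inside the core
level.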
Link: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com
Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: stable@vger.kernel.org
Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| Name | | |
|---|---|---|
| kasan | ||
| kfence | ||
| Kconfig | ||
| Kconfig.debug | ||
| Makefile | ||
| backing-dev.c | ||
| balloon_compaction.c | ||
| cleancache.c | ||
| cma.c | ||
| cma.h | ||
| cma_debug.c | ||
| compaction.c | ||
| debug.c | ||
| debug_page_ref.c | ||
| debug_vm_pgtable.c | ||
| dmapool.c | ||
| early_ioremap.c | ||
| fadvise.c | ||
| failslab.c | ||
| filemap.c | ||
| frontswap.c | ||
| gup.c | ||
| gup_test.c | ||
| gup_test.h | ||
| highmem.c | ||
| hmm.c | ||
| huge_memory.c | ||
| hugetlb.c | ||
| hugetlb_cgroup.c | ||
| hwpoison-inject.c | ||
| init-mm.c | ||
| internal.h | ||
| interval_tree.c | ||
| ioremap.c | ||
| khugepaged.c | ||
| kmemleak.c | ||
| ksm.c | ||
| list_lru.c | ||
| maccess.c | ||
| madvise.c | ||
| mapping_dirty_helpers.c | ||
| memblock.c | ||
| memcontrol.c | ||
| memfd.c | ||
| memory-failure.c | ||
| memory.c | ||
| memory_hotplug.c | ||
| mempolicy.c | ||
| mempool.c | ||
| memremap.c | ||
| memtest.c | ||
| migrate.c | ||
| mincore.c | ||
| mlock.c | ||
| mm_init.c | ||
| mmap.c | ||
| mmap_lock.c | ||
| mmu_gather.c | ||
| mmu_notifier.c | ||
| mmzone.c | ||
| mprotect.c | ||
| mremap.c | ||
| msync.c | ||
| nommu.c | ||
| oom_kill.c | ||
| page-writeback.c | ||
| page_alloc.c | ||
| page_counter.c | ||
| page_ext.c | ||
| page_idle.c | ||
| page_io.c | ||
| page_isolation.c | ||
| page_owner.c | ||
| page_poison.c | ||
| page_reporting.c | ||
| page_reporting.h | ||
| page_vma_mapped.c | ||
| pagewalk.c | ||
| percpu-internal.h | ||
| percpu-km.c | ||
| percpu-stats.c | ||
| percpu-vm.c | ||
| percpu.c | ||
| pgalloc-track.h | ||
| pgtable-generic.c | ||
| process_vm_access.c | ||
| ptdump.c | ||
| readahead.c | ||
| rmap.c | ||
| rodata_test.c | ||
| shmem.c | ||
| shuffle.c | ||
| shuffle.h | ||
| slab.c | ||
| slab.h | ||
| slab_common.c | ||
| slob.c | ||
| slub.c | ||
| sparse-vmemmap.c | ||
| sparse.c | ||
| swap.c | ||
| swap_cgroup.c | ||
| swap_slots.c | ||
| swap_state.c | ||
| swapfile.c | ||
| truncate.c | ||
| usercopy.c | ||
| userfaultfd.c | ||
| util.c | ||
| vmacache.c | ||
| vmalloc.c | ||
| vmpressure.c | ||
| vmscan.c | ||
| vmstat.c | ||
| workingset.c | ||
| z3fold.c | ||
| zbud.c | ||
| zpool.c | ||
| zsmalloc.c | ||
| zswap.c | ||