linux/mm
Frank van der Linden 7365ff2c8e mm/cma: export total and free number of pages for CMA areas
Patch series "hugetlb/CMA improvements for large systems", v5.

On large systems, we observed some issues with hugetlb and CMA:

1) When specifying a large number of hugetlb boot pages (hugepages= on
   the commandline), the kernel may run out of memory before it even gets
   to HVO.  For example, if you have a 3072G system and want to use 3024
   1G hugetlb pages for VMs, that should leave plenty of space for the
   hypervisor, provided you have the hugetlb vmemmap optimization (HVO)
   enabled.  However, since the vmemmap pages are always allocated first
   and only freed later in boot, you actually run the system out of
   memory before you can do HVO.  This means not getting all the hugetlb
   pages you want, and worse, failing to boot if the system hits an
   allocation failure it can't recover from.

2) There is a system setup where you might want to use hugetlb_cma with
   a large value (say, again, 3024G out of 3072G, as above), and then
   lower it if system usage allows, to make room for non-hugetlb
   processes.  For this, a variation of the problem above applies: the
   kernel runs out of unmovable space to allocate from before boot
   finishes, since your CMA area takes up all the space.

3) CMA wants to use one big contiguous area for allocations, which
   fails if you have the aforementioned 3T system with a gap in the
   middle of physical memory (like the < 40 bits BIOS DMA area seen on
   some AMD systems).  You then won't be able to set up a CMA area for
   one of the NUMA nodes, leading to the loss of half of your hugetlb
   CMA area.

4) Under the scenario mentioned in 2), when trying to grow the number
   of hugetlb pages back after dropping it for a while, new CMA
   allocations may fail occasionally.  This is not unexpected: transient
   references on pages may prevent cma_alloc() from succeeding under
   memory pressure.  However, the hugetlb code then falls back to a
   normal contiguous alloc, which may end up succeeding.  This is not
   always desired behavior.  If you have a large CMA area, the kernel
   has a restricted amount of memory it can do unmovable allocations
   from (a well-known issue).  A normal contiguous alloc may eat further
   into this space.


To resolve these issues, do the following:
* Add hooks to the section init code to do custom initialization of
  memmap pages.  Hugetlb bootmem (memblock) allocated pages can then be
  pre-HVOed.  This avoids allocating a large number of vmemmap pages early
  in boot, only to have them be freed again later, and also avoids running
  out of memory as described under 1).  Using these hooks for hugetlb is
  optional.  It requires moving hugetlb bootmem allocation to an earlier
  spot by the architecture.  This has been enabled on x86.
* hugetlb_cma doesn't care about the CMA area it uses being one large
  contiguous range.  Multiple smaller ranges are fine.  The only
  requirements are that the areas should be on one NUMA node, and
  individual gigantic pages should be allocatable from them.  So,
  implement multi-range support for CMA, avoiding issue 3).
* Introduce a hugetlb_cma_only commandline option.  When hugetlb_cma=
  is also specified, it restricts gigantic page allocations to CMA
  only.
* With hugetlb_cma_only active, it also makes sense to be able to
  pre-allocate gigantic hugetlb pages at boot time from the CMA area(s).
  Add a rudimentary early CMA allocation interface that just grabs a
  piece of memblock-allocated space from the CMA area, which gets marked
  as allocated in the CMA bitmap when the CMA area is initialized.  With
  this, hugepages= can be supported with hugetlb_cma=, making scenario
  2) work (see the sketch after this list).
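For scenario 2), the boot commandline would then look something like
this (a hedged example, using the option names described above):

    default_hugepagesz=1G hugepages=3024 hugetlb_cma=3024G hugetlb_cma_only

A minimal sketch of the early allocation idea, assuming a hypothetical
early_reserved field in struct cma; this illustrates the approach, not
the exact code from the series:

    void __init *cma_reserve_early(struct cma *cma, unsigned long size)
    {
            phys_addr_t base;

            /* Bump-allocate from the memblock-reserved CMA range. */
            if (!cma || cma->early_reserved + size > cma_get_size(cma))
                    return NULL;

            base = cma_get_base(cma) + cma->early_reserved;
            /*
             * Remember how much was handed out, so the corresponding
             * bits can be set in the CMA bitmap once the area is
             * initialized.
             */
            cma->early_reserved += size;

            return phys_to_virt(base);
    }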

Additionally, fix some minor bugs, with one worth mentioning: since
hugetlb gigantic bootmem pages are allocated by memblock, they may span
multiple zones, as memblock doesn't (and mostly can't) know about zones.
This can cause problems.  A hugetlb page spanning multiple zones is bad,
and it's worse with HVO, where the de-HVO step effectively and silently
re-assigns pages to a different zone than originally configured, since
the tail pages all inherit the zone from the first 60 tail pages.  This
condition is not common, but can easily be reproduced using ZONE_MOVABLE.
To fix this, add checks to see if gigantic bootmem pages intersect with
multiple zones, and do not use them if they do, giving them back to the
page allocator instead (a sketch of such a check follows below).
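A minimal sketch of such a check, assuming contiguous zone spans and an
already-initialized memmap for the range; the helper name is
illustrative, not the actual helper from the series:

    /* Return true if the gigantic bootmem page sits entirely in one zone. */
    static bool __init bootmem_page_zone_ok(unsigned long start_pfn,
                                            unsigned long nr_pages)
    {
            struct zone *zone = page_zone(pfn_to_page(start_pfn));

            return zone_spans_pfn(zone, start_pfn + nr_pages - 1);
    }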

The first patch is kind of along for the ride, except that maintaining an
available_count for a CMA area is convenient for the multiple range
support.


This patch (of 27):

In addition to the number of allocations and releases, system management
software may want to know the size of CMA areas, and how many pages are
available in them.  This information is currently not exposed, so export
it in total_pages and available_pages, respectively.

The name 'available_pages' was picked over 'free_pages' because 'free'
implies that the pages are unused.  But they might not be; they just
haven't been allocated through cma_alloc().
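A sketch of the two new sysfs files, modeled on the existing attributes
in mm/cma_sysfs.c; the available_count field is the one introduced by
this patch, and the exact code may differ:

    static ssize_t total_pages_show(struct kobject *kobj,
                                    struct kobj_attribute *attr, char *buf)
    {
            struct cma *cma = cma_from_kobj(kobj);

            return sysfs_emit(buf, "%lu\n", cma->count);
    }
    CMA_ATTR_RO(total_pages);

    static ssize_t available_pages_show(struct kobject *kobj,
                                        struct kobj_attribute *attr, char *buf)
    {
            struct cma *cma = cma_from_kobj(kobj);

            return sysfs_emit(buf, "%lu\n", cma->available_count);
    }
    CMA_ATTR_RO(available_pages);

These show up as /sys/kernel/mm/cma/<cma-name>/total_pages and
/sys/kernel/mm/cma/<cma-name>/available_pages.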

The number of available pages is tracked regardless of CONFIG_CMA_SYSFS,
which allows a few minor shortcuts in the code that avoid bitmap
operations (see the sketch below).
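For example, an allocation request that can't possibly fit can fail fast
under cma->lock without scanning the bitmap.  A sketch of such a
shortcut, with an illustrative helper name:

    static bool cma_available_dec(struct cma *cma, unsigned long count)
    {
            bool ret = false;

            spin_lock_irq(&cma->lock);
            if (count <= cma->available_count) {
                    cma->available_count -= count;
                    ret = true;
            }
            spin_unlock_irq(&cma->lock);

            return ret;     /* false: skip the bitmap scan entirely */
    }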

Link: https://lkml.kernel.org/r/20250228182928.2645936-2-fvdl@google.com
Signed-off-by: Frank van der Linden <fvdl@google.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:06:24 -07:00
damon mm/damon/sysfs-schemes: add files for setting damos_filter->sz_range 2025-03-16 22:06:12 -07:00
kasan kasan: don't call find_vm_area() in a PREEMPT_RT kernel 2025-02-17 22:40:04 -08:00
kfence kfence: skip __GFP_THISNODE allocations on NUMA systems 2025-02-01 03:53:26 -08:00
kmsan dma: kmsan: export kmsan_handle_dma() for modules 2025-03-05 21:36:14 -08:00
Kconfig mm: zbud: remove zbud 2025-03-16 22:06:01 -07:00
Kconfig.debug
Makefile mm: zbud: remove zbud 2025-03-16 22:06:01 -07:00
backing-dev.c
balloon_compaction.c
bootmem_info.c bootmem: stop using page->index 2024-11-07 14:38:07 -08:00
cma.c mm/cma: export total and free number of pages for CMA areas 2025-03-16 22:06:24 -07:00
cma.h mm/cma: export total and free number of pages for CMA areas 2025-03-16 22:06:24 -07:00
cma_debug.c mm/cma: export total and free number of pages for CMA areas 2025-03-16 22:06:24 -07:00
cma_sysfs.c mm/cma: export total and free number of pages for CMA areas 2025-03-16 22:06:24 -07:00
compaction.c NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback 2025-03-05 21:36:15 -08:00
debug.c mm/debug: print vm_refcnt state when dumping the vma 2025-03-16 22:06:20 -07:00
debug_page_alloc.c
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries 2024-09-17 01:07:01 -07:00
dmapool.c
dmapool_test.c
early_ioremap.c mm/early_ioremap: add null pointer checks to prevent NULL-pointer dereference 2025-01-13 22:40:59 -08:00
execmem.c alloc_tag: populate memory for module tags as needed 2024-11-07 14:25:16 -08:00
fadvise.c fdget(), trivial conversions 2024-11-03 01:28:06 -05:00
fail_page_alloc.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
failslab.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
filemap.c filemap: remove redundant folio_test_large check in filemap_free_folio 2025-03-16 22:06:16 -07:00
folio-compat.c mm/writeback: add folio_mark_dirty_lock() 2024-11-05 11:14:32 +01:00
gup.c mm: remove the access_ok() call from gup_fast_fallback() 2025-03-16 22:06:10 -07:00
gup_test.c
gup_test.h
highmem.c
hmm.c
huge_memory.c mm: avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap 2025-03-16 22:06:17 -07:00
hugetlb.c mm/hugetlb: fix surplus pages in dissolve_free_huge_page() 2025-03-16 17:40:23 -07:00
hugetlb_cgroup.c mm/hugetlb-cgroup: convert hugetlb_cgroup_css_offline() to work on folios 2025-01-25 20:22:42 -08:00
hugetlb_vmemmap.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
hugetlb_vmemmap.h
hwpoison-inject.c
init-mm.c mm: replace vm_lock and detached flag with a reference count 2025-03-16 22:06:20 -07:00
internal.h mm: memory-failure: update ttu flag inside unmap_poisoned_folio 2025-03-05 21:36:13 -08:00
interval_tree.c
io-mapping.c
ioremap.c mm/ioremap: pass pgprot_t to ioremap_prot() instead of unsigned long 2025-03-16 22:06:23 -07:00
khugepaged.c The various patchsets are summarized below. Plus of course many 2025-01-26 18:36:23 -08:00
kmemleak.c mm: kmemleak: add support for dumping physical and __percpu object info 2025-03-16 22:06:08 -07:00
ksm.c mm/ksm: handle device-exclusive entries correctly in write_protect_page() 2025-03-16 22:05:58 -07:00
list_lru.c mm/list_lru: fix false warning of negative counter 2024-12-30 17:59:10 -08:00
maccess.c kasan: migrate copy_user_test to kunit 2024-11-11 00:26:44 -08:00
madvise.c mm: allow guard regions in file-backed and read-only mappings 2025-03-16 22:06:14 -07:00
mapping_dirty_helpers.c
memblock.c mm/memblock: add memblock_alloc_or_panic interface 2025-01-25 20:22:38 -08:00
memcontrol-v1.c mm: memcontrol: move memsw charge callbacks to v1 2025-03-16 22:05:55 -07:00
memcontrol-v1.h mm: memcontrol: move memsw charge callbacks to v1 2025-03-16 22:05:55 -07:00
memcontrol.c mm: memcontrol: move memsw charge callbacks to v1 2025-03-16 22:05:55 -07:00
memfd.c mm/memfd: fix spelling and grammatical issues 2025-03-16 22:06:04 -07:00
memory-failure.c mm: memory-failure: update ttu flag inside unmap_poisoned_folio 2025-03-05 21:36:13 -08:00
memory-tiers.c memory tiers: use default_dram_perf_ref_source in log message 2024-09-26 14:01:44 -07:00
memory.c mm/ioremap: pass pgprot_t to ioremap_prot() instead of unsigned long 2025-03-16 22:06:23 -07:00
memory_hotplug.c hwpoison, memory_hotplug: lock folio before unmap hwpoisoned folio 2025-03-05 21:36:13 -08:00
mempolicy.c mm/hugetlb: rename isolate_hugetlb() to folio_isolate_hugetlb() 2025-01-25 20:22:41 -08:00
mempool.c
memremap.c
memtest.c
migrate.c mm: use READ/WRITE_ONCE() for vma->vm_flags on migrate, mprotect 2025-03-16 22:06:09 -07:00
migrate_device.c mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize() 2025-02-17 22:40:02 -08:00
mincore.c
mlock.c mm/mlock: set the correct prev on failure 2024-11-07 14:14:58 -08:00
mm_init.c mm/mm_init.c: use round_up() to calculate usermap size 2025-03-16 22:06:14 -07:00
mm_slot.h
mmap.c mm: make vma cache SLAB_TYPESAFE_BY_RCU 2025-03-16 22:06:21 -07:00
mmap_lock.c mm: mmap_lock: optimize mmap_lock tracepoints 2025-01-13 22:40:34 -08:00
mmu_gather.c mm/mmu_gather: update comment on RCU freeing 2025-03-16 22:06:12 -07:00
mmu_notifier.c
mmzone.c mm: improve code consistency with zonelist_* helper functions 2024-09-01 20:25:55 -07:00
mprotect.c mm: use READ/WRITE_ONCE() for vma->vm_flags on migrate, mprotect 2025-03-16 22:06:09 -07:00
mremap.c mm: clear uffd-wp PTE/PMD state on mremap() 2025-01-12 19:03:37 -08:00
mseal.c mseal: remove can_do_mseal() 2025-01-13 22:40:51 -08:00
msync.c
nommu.c mm: introduce vma_iter_store_attached() to use with attached vmas 2025-03-16 22:06:18 -07:00
numa.c mm/memblock: add memblock_alloc_or_panic interface 2025-01-25 20:22:38 -08:00
numa_emulation.c mm/fake-numa: allow later numa node hotplug 2025-01-25 20:22:29 -08:00
numa_memblks.c mm/fake-numa: allow later numa node hotplug 2025-01-25 20:22:29 -08:00
oom_kill.c mm/oom_kill: fix trivial typo in comment 2025-03-16 22:05:55 -07:00
page-writeback.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
page_alloc.c alloc_tag: uninline code gated by mem_alloc_profiling_key in page allocator 2025-03-16 22:06:03 -07:00
page_counter.c kernel/cgroup: Add "dmem" memory accounting cgroup 2025-01-06 17:24:38 +01:00
page_ext.c
page_frag_cache.c mm/page_alloc: export free_frozen_pages() instead of free_unref_page() 2025-01-13 22:40:31 -08:00
page_idle.c mm/page_idle: handle device-exclusive entries correctly in page_idle_clear_pte_refs_one() 2025-03-16 22:05:59 -07:00
page_io.c mm, swap: clean up device availability check 2025-01-25 20:22:36 -08:00
page_isolation.c mm/hugetlb: wait for hugetlb folios to be freed 2025-03-05 21:36:14 -08:00
page_owner.c
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c mm: use single SWP_DEVICE_EXCLUSIVE entry type 2025-03-16 22:05:58 -07:00
page_vma_mapped.c mm/page_vma_mapped: device-exclusive entries are not migration entries 2025-03-16 22:05:58 -07:00
pagewalk.c mm: pagewalk: add the ability to install PTEs 2024-11-11 00:26:44 -08:00
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c mm, percpu: do not consider sleepable allocations atomic 2025-03-16 22:06:08 -07:00
pgalloc-track.h
pgtable-generic.c mm: add RCU annotation to pte_offset_map(_lock) 2024-12-18 19:04:43 -08:00
process_vm_access.c mm: refactor mm_access() to not return NULL 2024-11-05 16:56:23 -08:00
pt_reclaim.c mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) 2025-01-13 22:40:48 -08:00
ptdump.c
readahead.c The various patchsets are summarized below. Plus of course many 2025-01-26 18:36:23 -08:00
rmap.c mm: avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap 2025-03-16 22:06:17 -07:00
rodata_test.c mm/rodata_test: verify test data is unchanged, rather than non-zero 2025-01-13 22:40:38 -08:00
secretmem.c add a string-to-qstr constructor 2025-01-27 19:25:45 -05:00
shmem.c mm: memcontrol: move memsw charge callbacks to v1 2025-03-16 22:05:55 -07:00
shmem_quota.c
show_mem.c mm/show_mem: use str_yes_no() helper in show_free_areas() 2024-11-07 14:38:08 -08:00
shrinker.c mm: shrinker: avoid memleak in alloc_shrinker_info 2024-10-31 20:27:04 -07:00
shrinker_debug.c saner replacement for debugfs_rename() 2025-01-15 13:14:37 +01:00
shuffle.c
shuffle.h
slab.h mm/slab: fix kernel-doc func param names 2025-01-13 10:22:04 +01:00
slab_common.c mm/slab/kvfree_rcu: Switch to WQ_MEM_RECLAIM wq 2025-03-04 08:51:53 +01:00
slub.c alloc_tag: uninline code gated by mem_alloc_profiling_key in slab allocator 2025-03-16 22:06:03 -07:00
sparse-vmemmap.c mm/memmap: prevent double scanning of memmap by kmemleak 2025-01-25 20:22:30 -08:00
sparse.c mm/memblock: add memblock_alloc_or_panic interface 2025-01-25 20:22:38 -08:00
swap.c mm/filemap: add read support for RWF_DONTCACHE 2025-01-25 20:22:43 -08:00
swap.h mm: fix swap_read_folio_zeromap() for large folios with partial zeromap 2024-09-17 01:07:01 -07:00
swap_cgroup.c mm: memcontrol: fix swap counter leak from offline cgroup 2025-03-16 17:40:24 -07:00
swap_slots.c mm, swap_slots: remove slot cache for freeing path 2025-01-25 20:22:37 -08:00
swap_state.c mm/swap: rename swap_swapcount() to swap_entry_swapped() 2025-03-16 22:06:07 -07:00
swapfile.c mm/swapfile.c: open code cluster_alloc_swap() 2025-03-16 22:06:07 -07:00
truncate.c mm/truncate: don't skip dirty page in folio_unmap_invalidate() 2025-02-21 14:09:47 +01:00
usercopy.c
userfaultfd.c mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail 2025-03-16 22:06:18 -07:00
util.c mm: add comments to do_mmap(), mmap_region() and vm_mmap() 2025-01-13 22:40:59 -08:00
vma.c mm: make vma cache SLAB_TYPESAFE_BY_RCU 2025-03-16 22:06:21 -07:00
vma.h mm: make vma cache SLAB_TYPESAFE_BY_RCU 2025-03-16 22:06:21 -07:00
vma_internal.h mm/vma: move brk() internals to mm/vma.c 2025-01-13 22:40:42 -08:00
vmalloc.c mm: don't skip arch_sync_kernel_mappings() in error paths 2025-03-05 21:36:18 -08:00
vmpressure.c
vmscan.c vmscan, cleanup: add for_each_managed_zone_pgdat macro 2025-03-16 22:06:10 -07:00
vmstat.c vmstat: disable vmstat_work on vmstat_cpu_down_prep() 2025-01-12 19:03:38 -08:00
workingset.c mm/mglru: rework workingset protection 2025-01-25 20:22:39 -08:00
zpdesc.h mm/zsmalloc: introduce __zpdesc_clear/set_zsmalloc() 2025-01-25 20:22:35 -08:00
zpool.c mm: zbud: remove zbud 2025-03-16 22:06:01 -07:00
zsmalloc.c mm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc() 2025-02-01 03:53:23 -08:00
zswap.c mm: zswap: use ATOMIC_LONG_INIT to initialize zswap_stored_pages 2025-03-05 21:36:17 -08:00