Compare commits

...

424 Commits

Author SHA1 Message Date
Linus Torvalds 4a26e7032d Core kernel bug handling infrastructure changes for v6.19:

Merge tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull bug handling infrastructure updates from Ingo Molnar:
 "Core updates:

   - Improve WARN(), which takes vararg printf-like arguments, to work
     with the x86 #UD-based WARN-optimizing infrastructure by hiding the
     format string in the bug_table and replacing the first argument
     with the address of the bug-table entry, while turning the actual
     call into a UD1 instruction (Peter Zijlstra) - see the sketch below

   - Introduce the CONFIG_DEBUG_BUGVERBOSE_DETAILED Kconfig switch (Ingo
     Molnar, s390 support by Heiko Carstens)

  Fixes and cleanups:

   - bugs/s390: Remove private WARN_ON() implementation (Heiko Carstens)

   - <asm/bugs.h>: Make i386 use GENERIC_BUG_RELATIVE_POINTERS (Peter
     Zijlstra)"

* tag 'core-bugs-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  x86/bugs: Make i386 use GENERIC_BUG_RELATIVE_POINTERS
  x86/bug: Fix BUG_FORMAT vs KASLR
  x86_64/bug: Inline the UD1
  x86/bug: Implement WARN_ONCE()
  x86_64/bug: Implement __WARN_printf()
  x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED
  x86/bug: Add BUG_FORMAT basics
  bug: Allow architectures to provide __WARN_printf()
  bug: Implement WARN_ON() using __WARN_FLAGS()
  bug: Add report_bug_entry()
  bug: Add BUG_FORMAT_ARGS infrastructure
  bug: Clean up CONFIG_GENERIC_BUG_RELATIVE_POINTERS
  bug: Add BUG_FORMAT infrastructure
  x86: Rework __bug_table helpers
  bugs/s390: Remove private WARN_ON() implementation
  bugs/core: Reorganize fields in the first line of WARNING output, add ->comm[] output
  bugs/sh: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
  bugs/parisc: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
  bugs/riscv: Concatenate 'cond_str' with '__FILE__' in __BUG_FLAGS(), to extend WARN_ON/BUG_ON output
  bugs/riscv: Pass in 'cond_str' to __BUG_FLAGS()
  ...
2025-12-01 21:33:01 -08:00
Linus Torvalds dcd8637edb Core x86 changes for v6.19:

Merge tag 'x86-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core x86 updates from Ingo Molnar:

 - x86/alternatives: Drop unnecessary test after call to
   alt_replace_call() (Juergen Gross)

 - x86/dumpstack: Prevent KASAN false positive warnings in
   __show_regs() (Tengda Wu)

* tag 'x86-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/dumpstack: Prevent KASAN false positive warnings in __show_regs()
  x86/alternative: Drop not needed test after call of alt_replace_call()
2025-12-01 21:31:02 -08:00
Linus Torvalds e7d81c1ed6 A single fix for an ancient prototype in the math-emu code, by Arnd Bergmann.

Merge tag 'x86-build-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 math-emu fix from Ingo Molnar:
 "A single fix for an ancient prototype in the math-emu code, by Arnd
  Bergmann"

* tag 'x86-build-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/math-emu: Fix div_Xsig() prototype
2025-12-01 21:28:23 -08:00
Linus Torvalds de2f75d55e x86/apic changes for v6.19:

Merge tag 'x86-apic-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 apic updates from Ingo Molnar:

 - x86/apic: Fix the frequency in apic=verbose log output (Julian
   Stecklina)

 - Simplify mp_irqdomain_alloc() slightly (Christophe JAILLET)

* tag 'x86-apic-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic: Fix frequency in apic=verbose log output
  x86/ioapic: Simplify mp_irqdomain_alloc() slightly
2025-12-01 21:26:35 -08:00
Linus Torvalds 6d2c10e889 Scheduler changes for v6.19:

Merge tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Scalability and load-balancing improvements:

   - Enable scheduler feature NEXT_BUDDY (Mel Gorman)

   - Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman)

   - Skip sched_balance_running cmpxchg when balance is not due (Tim
     Chen)

   - Implement generic code for architecture specific sched domain NUMA
     distances (Tim Chen)

   - Optimize the NUMA distances of the sched-domains builds of Intel
     Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim
     Chen)

   - Implement proportional newidle balance: a randomized algorithm that
     runs newidle balancing proportionally to its success rate (Peter
     Zijlstra) - see the sketch below
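
A minimal sketch of the proportional idea (the helper and its counters
are illustrative, not the kernel's implementation; get_random_u32_below()
is the real kernel API for a bounded random draw):

    #include <linux/random.h>   /* get_random_u32_below() */

    /* Illustrative: attempt newidle balance with probability equal to
     * the observed success rate of past attempts. */
    static bool newidle_balance_worthwhile(unsigned int pulled,
                                           unsigned int attempts)
    {
            if (!attempts)
                    return true;    /* no history yet: always attempt */

            return get_random_u32_below(attempts) < pulled;
    }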

  Scheduler infrastructure changes:

   - Implement the 'sched_change' scoped_guard() pattern for the entire
     scheduler (Peter Zijlstra)

   - More broadly utilize the sched_change guard (Peter Zijlstra)

   - Add support to pick functions to take runqueue-flags (Joel
     Fernandes)

   - Provide and use set_need_resched_current() (Peter Zijlstra)

  Fair scheduling enhancements:

   - Forfeit vruntime on yield (Fernand Sieber)

   - Only update stats for allowed CPUs when looking for dst group (Adam
     Li)

  CPU-core scheduling enhancements:

   - Optimize core cookie matching check (Fernand Sieber)

  Deadline scheduler fixes:

   - Only set free_cpus for online runqueues (Doug Berger)

   - Fix dl_server time accounting (Peter Zijlstra)

   - Fix dl_server stop condition (Peter Zijlstra)

  Proxy scheduling fixes:

   - Yield the donor task (Fernand Sieber)

  Fixes and cleanups:

   - Fix do_set_cpus_allowed() locking (Peter Zijlstra)

   - Fix migrate_disable_switch() locking (Peter Zijlstra)

   - Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
     (Hao Jia)

   - Increase sched_tick_remote timeout (Phil Auld)

   - sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth
     Hegde)

   - sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)"

* tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  sched: Provide and use set_need_resched_current()
  sched/fair: Proportional newidle balance
  sched/fair: Small cleanup to update_newidle_cost()
  sched/fair: Small cleanup to sched_balance_newidle()
  sched/fair: Revert max_newidle_lb_cost bump
  sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
  sched/fair: Enable scheduler feature NEXT_BUDDY
  sched: Increase sched_tick_remote timeout
  sched/fair: Have SD_SERIALIZE affect newidle balancing
  sched/fair: Skip sched_balance_running cmpxchg when balance is not due
  sched/deadline: Minor cleanup in select_task_rq_dl()
  sched/deadline: Use cpumask_weight_and() in dl_bw_cpus
  sched/deadline: Document dl_server
  sched/deadline: Fix dl_server stop condition
  sched/deadline: Fix dl_server time accounting
  sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
  sched/eevdf: Fix min_vruntime vs avg_vruntime
  sched/core: Add comment explaining force-idle vruntime snapshots
  sched/core: Optimize core cookie matching check
  sched/proxy: Yield the donor task
  ...
2025-12-01 21:04:45 -08:00
Linus Torvalds 6c26fbe8c9 Performance events changes for v6.19:

Merge tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
 "Callchain support:

   - Add support for deferred user-space stack unwinding for perf,
     enabled on x86. (Peter Zijlstra, Steven Rostedt)

   - unwind_user/x86: Enable frame pointer unwinding on x86 (Josh
     Poimboeuf)

  x86 PMU support and infrastructure:

   - x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra)

   - x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra)

  Intel PMU driver:

   - Large series to prepare for and implement architectural PEBS
     support for Intel platforms such as Clearwater Forest (CWF) and
     Panther Lake (PTL). (Dapeng Mi, Kan Liang)

   - Check dynamic constraints (Kan Liang)

   - Optimize PEBS extended config (Peter Zijlstra)

   - cstates:
      - Remove PC3 support from LunarLake (Zhang Rui)
      - Add Pantherlake support (Zhang Rui)
      - Clearwater Forest support (Zide Chen)

  AMD PMU driver:

   - x86/amd: Check event before enable to avoid GPF (George Kennedy)

  Fixes and cleanups:

   - task_work: Fix NMI race condition (Peter Zijlstra)

   - perf/x86: Fix NULL event access and potential PEBS record loss
     (Dapeng Mi)

   - Misc other fixes and cleanups (Dapeng Mi, Ingo Molnar, Peter
     Zijlstra)"

* tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use
  perf/x86/intel: Optimize PEBS extended config
  perf/x86/intel: Check PEBS dyn_constraints
  perf/x86/intel: Add a check for dynamic constraints
  perf/x86/intel: Add counter group support for arch-PEBS
  perf/x86/intel: Setup PEBS data configuration and enable legacy groups
  perf/x86/intel: Update dyn_constraint base on PEBS event precise level
  perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
  perf/x86/intel: Process arch-PEBS records or record fragments
  perf/x86/intel/ds: Factor out PEBS group processing code to functions
  perf/x86/intel/ds: Factor out PEBS record processing code to functions
  perf/x86/intel: Initialize architectural PEBS
  perf/x86/intel: Correct large PEBS flag check
  perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call
  perf/x86: Fix NULL event access and potential PEBS record loss
  perf/x86: Remove redundant is_x86_event() prototype
  entry,unwind/deferred: Fix unwind_reset_info() placement
  unwind_user/x86: Fix arch=um build
  perf: Support deferred user unwind
  unwind_user/x86: Teach FP unwind about start of function
  ...
2025-12-01 20:42:01 -08:00
Linus Torvalds 63e6995005 objtool updates for v6.19:

Merge tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - klp-build livepatch module generation (Josh Poimboeuf)

   Introduce new objtool features and a klp-build script to generate
   livepatch modules using a source .patch as input.

   This builds on concepts from the longstanding out-of-tree kpatch
   project which began in 2012 and has been used for many years to
   generate livepatch modules for production kernels. However, this is a
   complete rewrite which incorporates hard-earned lessons from 12+
   years of maintaining kpatch.

   Key improvements compared to kpatch-build:

    - Integrated with objtool: Leverages objtool's existing control-flow
      graph analysis to help detect changed functions.

    - Works on vmlinux.o: Supports late-linked objects, making it
      compatible with LTO, IBT, and similar.

    - Simplified code base: ~3k fewer lines of code.

    - Upstream: No more out-of-tree #ifdef hacks, far less cruft.

    - Cleaner internals: Vastly simplified logic for
      symbol/section/reloc inclusion and special section extraction.

    - Robust __LINE__ macro handling: Avoids false positive binary diffs
      caused by the __LINE__ macro by introducing a fix-patch-lines
      script which injects #line directives into the source .patch to
      preserve the original line numbers at compile time (illustrated
      below)
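
To make the __LINE__ handling concrete, a hypothetical injected
directive could look like this (file name, line number and
new_sanity_check() are made up; only the #line semantics are the point):

    if (new_sanity_check(p))        /* added by the .patch */
            return -EINVAL;         /* added by the .patch */
    #line 1042 "kernel/fork.c"      /* injected by fix-patch-lines */
    WARN_ON(p->flags & PF_KTHREAD); /* original code keeps its __LINE__ */

Without the directive, every later use of __LINE__ in the file would
shift, making unchanged functions look modified in the binary diff.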

 - Disassemble code with libopcodes instead of running objdump
   (Alexandre Chartre)

 - Disassembly support (the -d option to objtool) by Alexandre Chartre,
   which can decode various Linux kernel code generation specials such
   as alternatives:

      17ef:  sched_balance_find_dst_group+0x62f                 mov    0x34(%r9),%edx
      17f3:  sched_balance_find_dst_group+0x633               | <alternative.17f3>             | X86_FEATURE_POPCNT
      17f3:  sched_balance_find_dst_group+0x633               | call   0x17f8 <__sw_hweight64> | popcnt %rdi,%rax
      17f8:  sched_balance_find_dst_group+0x638                 cmp    %eax,%edx

   ... jump table alternatives:

      1895:  sched_use_asym_prio+0x5                            test   $0x8,%ch
      1898:  sched_use_asym_prio+0x8                            je     0x18a9 <sched_use_asym_prio+0x19>
      189a:  sched_use_asym_prio+0xa                          | <jump_table.189a>                        | JUMP
      189a:  sched_use_asym_prio+0xa                          | jmp    0x18ae <sched_use_asym_prio+0x1e> | nop2
      189c:  sched_use_asym_prio+0xc                            mov    $0x1,%eax
      18a1:  sched_use_asym_prio+0x11                           and    $0x80,%ecx

   ... exception table alternatives:

    native_read_msr:
      5b80:  native_read_msr+0x0                                                     mov    %edi,%ecx
      5b82:  native_read_msr+0x2                                                   | <ex_table.5b82> | EXCEPTION
      5b82:  native_read_msr+0x2                                                   | rdmsr           | resume at 0x5b84 <native_read_msr+0x4>
      5b84:  native_read_msr+0x4                                                     shl    $0x20,%rdx

    ... x86 feature flag decoding (also see the X86_FEATURE_POPCNT
        example in sched_balance_find_dst_group() above):

      2faaf:  start_thread_common.constprop.0+0x1f                                    jne    0x2fba4 <start_thread_common.constprop.0+0x114>
      2fab5:  start_thread_common.constprop.0+0x25                                  | <alternative.2fab5>                  | X86_FEATURE_ALWAYS                                  | X86_BUG_NULL_SEG
      2fab5:  start_thread_common.constprop.0+0x25                                  | jmp    0x2faba <.altinstr_aux+0x2f4> | jmp    0x4b0 <start_thread_common.constprop.0+0x3f> | nop5
      2faba:  start_thread_common.constprop.0+0x2a                                    mov    $0x2b,%eax

   ... NOP sequence shortening:

      1048e2:  snapshot_write_finalize+0xc2                                            je     0x104917 <snapshot_write_finalize+0xf7>
      1048e4:  snapshot_write_finalize+0xc4                                            nop6
      1048ea:  snapshot_write_finalize+0xca                                            nop11
      1048f5:  snapshot_write_finalize+0xd5                                            nop11
      104900:  snapshot_write_finalize+0xe0                                            mov    %rax,%rcx
      104903:  snapshot_write_finalize+0xe3                                            mov    0x10(%rdx),%rax

   ... and much more.

 - Function validation tracing support (Alexandre Chartre)

 - Various -ffunction-sections fixes (Josh Poimboeuf)

 - Clang AutoFDO (Automated Feedback-Directed Optimizations) support
   (Josh Poimboeuf)

 - Misc fixes and cleanups (Borislav Petkov, Chen Ni, Dylan Hatch, Ingo
   Molnar, John Wang, Josh Poimboeuf, Pankaj Raghav, Peter Zijlstra,
   Thorsten Blum)

* tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
  objtool: Fix segfault on unknown alternatives
  objtool: Build with disassembly can fail when including bdf.h
  objtool: Trim trailing NOPs in alternative
  objtool: Add wide output for disassembly
  objtool: Compact output for alternatives with one instruction
  objtool: Improve naming of group alternatives
  objtool: Add Function to get the name of a CPU feature
  objtool: Provide access to feature and flags of group alternatives
  objtool: Fix address references in alternatives
  objtool: Disassemble jump table alternatives
  objtool: Disassemble exception table alternatives
  objtool: Print addresses with alternative instructions
  objtool: Disassemble group alternatives
  objtool: Print headers for alternatives
  objtool: Preserve alternatives order
  objtool: Add the --disas=<function-pattern> action
  objtool: Do not validate IBT for .return_sites and .call_sites
  objtool: Improve tracing of alternative instructions
  objtool: Add functions to better name alternatives
  objtool: Identify the different types of alternatives
  ...
2025-12-01 20:18:59 -08:00
Linus Torvalds b53440f8e5 Locking updates for v6.19:

Merge tag 'locking-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Mutexes:

   - Redo __mutex_init() to reduce generated code size (Sebastian
     Andrzej Siewior)

  Seqlocks:

   - Introduce scoped_seqlock_read() (Peter Zijlstra) - usage sketched
     below

   - Change thread_group_cputime() to use scoped_seqlock_read() (Oleg
     Nesterov)

   - Change do_task_stat() to use scoped_seqlock_read() (Oleg Nesterov)

   - Change do_io_accounting() to use scoped_seqlock_read() (Oleg
     Nesterov)

   - Fix the incorrect documentation of read_seqbegin_or_lock() /
     need_seqretry() (Oleg Nesterov)

   - Allow KASAN to fail optimizing (Peter Zijlstra)
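
Usage is presumably along these lines, by analogy with the
thread_group_cputime() conversion above (the exact macro arguments and
retry semantics are assumptions, not the documented API):

    /* Sketch: a scoped seqlock read section with automatic retry. */
    u64 utime, stime;

    scoped_seqlock_read(&sig->stats_lock) {
            utime = sig->utime;
            stime = sig->stime;
    }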

  Local lock updates:

   - Fix all kernel-doc warnings (Randy Dunlap)

   - Add the <linux/local_lock*.h> headers to MAINTAINERS (Sebastian
     Andrzej Siewior)

   - Reduce the risk of shadowing via s/l/__l/ and s/tl/__tl/ (Vincent
     Mailhol)

  Lock debugging:

   - spinlock/debug: Fix data-race in do_raw_write_lock (Alexander
     Sverdlin)

  Atomic primitives infrastructure:

   - atomic: Skip alignment check for try_cmpxchg() old arg (Arnd
     Bergmann)

  Rust runtime integration:

   - sync: atomic: Enable generated Atomic<T> usage (Boqun Feng)

   - sync: atomic: Implement Debug for Atomic<Debug> (Boqun Feng)

   - debugfs: Remove Rust native atomics and replace them with Linux
     versions (Boqun Feng)

   - debugfs: Implement Reader for Mutex<T> only when T is Unpin (Boqun
     Feng)

   - lock: guard: Add T: Unpin bound to DerefMut (Daniel Almeida)

   - lock: Pin the inner data (Daniel Almeida)

   - lock: Add a Pin<&mut T> accessor (Daniel Almeida)"

* tag 'locking-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/local_lock: Fix all kernel-doc warnings
  locking/local_lock: s/l/__l/ and s/tl/__tl/ to reduce the risk of shadowing
  locking/local_lock: Add the <linux/local_lock*.h> headers to MAINTAINERS
  locking/mutex: Redo __mutex_init() to reduce generated code size
  rust: debugfs: Replace the usage of Rust native atomics
  rust: sync: atomic: Implement Debug for Atomic<Debug>
  rust: sync: atomic: Make Atomic*Ops pub(crate)
  seqlock: Allow KASAN to fail optimizing
  rust: debugfs: Implement Reader for Mutex<T> only when T is Unpin
  seqlock: Change do_io_accounting() to use scoped_seqlock_read()
  seqlock: Change do_task_stat() to use scoped_seqlock_read()
  seqlock: Change thread_group_cputime() to use scoped_seqlock_read()
  seqlock: Introduce scoped_seqlock_read()
  documentation: seqlock: fix the wrong documentation of read_seqbegin_or_lock/need_seqretry
  atomic: Skip alignment check for try_cmpxchg() old arg
  rust: lock: Add a Pin<&mut T> accessor
  rust: lock: Pin the inner data
  rust: lock: guard: Add T: Unpin bound to DerefMut
  locking/spinlock/debug: Fix data-race in do_raw_write_lock
2025-12-01 19:50:58 -08:00
Linus Torvalds 1b5dd29869 vfs-6.19-rc1.fd_prepare.fs

Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fd prepare updates from Christian Brauner:
 "This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the
  common pattern of get_unused_fd_flags() + create file + fd_install()
  that is used extensively throughout the kernel and currently requires
  cumbersome cleanup paths.

  FD_ADD() - For simple cases where a file is installed immediately:

      fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
      if (fd < 0)
          vfio_device_put_registration(device);
      return fd;

  FD_PREPARE() - For cases requiring access to the fd or file, or
  additional work before publishing:

      FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
      if (fdf.err) {
          fput(sync_file->file);
          return fdf.err;
      }

      data.fence = fd_prepare_fd(fdf);
      if (copy_to_user((void __user *)arg, &data, sizeof(data)))
          return -EFAULT;

      return fd_publish(fdf);

  The primitives are centered around struct fd_prepare. FD_PREPARE()
  encapsulates all allocation and cleanup logic and must be followed by
  a call to fd_publish() which associates the fd with the file and
  installs it into the caller's fdtable. If fd_publish() isn't called,
  both are deallocated automatically. FD_ADD() is a shorthand that does
  fd_publish() immediately and never exposes the struct to the caller.

   I've implemented this in a way that is compatible with the cleanup
   infrastructure while also being usable separately. IOW, it's centered
   around struct fd_prepare which is aliased to class_fd_prepare_t and so
   we can make use of all the basic guard infrastructure"
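
As a mental model only (not the in-tree definition), struct fd_prepare
can be pictured as bundling the reserved fd with the file, with the
publishing step as the single point of commitment:

    /* Simplified model; the real definition also ties into the
     * cleanup/guard machinery via class_fd_prepare_t. */
    struct fd_prepare {
            int          err;   /* negative errno, or 0 */
            int          fd;    /* reserved, not yet visible to userspace */
            struct file *file;  /* fully constructed file */
    };

    /* Model of the committing step: install the file and disarm the
     * automatic deallocation of both fd and file. */
    static inline int fd_publish_model(struct fd_prepare *fdf)
    {
            fd_install(fdf->fd, fdf->file);
            fdf->file = NULL;
            return fdf->fd;
    }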

* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
  io_uring: convert io_create_mock_file() to FD_PREPARE()
  file: convert replace_fd() to FD_PREPARE()
  vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
  tty: convert ptm_open_peer() to FD_ADD()
  ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
  media: convert media_request_alloc() to FD_PREPARE()
  hv: convert mshv_ioctl_create_partition() to FD_ADD()
  gpio: convert linehandle_create() to FD_PREPARE()
  pseries: port papr_rtas_setup_file_interface() to FD_ADD()
  pseries: convert papr_platform_dump_create_handle() to FD_ADD()
  spufs: convert spufs_gang_open() to FD_PREPARE()
  papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
  spufs: convert spufs_context_open() to FD_PREPARE()
  net/socket: convert __sys_accept4_file() to FD_ADD()
  net/socket: convert sock_map_fd() to FD_ADD()
  net/kcm: convert kcm_ioctl() to FD_PREPARE()
  net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
  secretmem: convert memfd_secret() to FD_ADD()
  memfd: convert memfd_create() to FD_ADD()
  bpf: convert bpf_token_create() to FD_PREPARE()
  ...
2025-12-01 17:32:07 -08:00
Linus Torvalds ffbf700df2 vfs-6.19-rc1.autofs

Merge tag 'vfs-6.19-rc1.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull autofs update from Christian Brauner:
 "Prevent futile mount triggers in private mount namespaces.

  Fix a problematic loop in autofs when a mount namespace contains
  autofs mounts that are propagation private and there is no
  namespace-specific automount daemon to handle possible automounting.

  Previously, attempted path resolution would loop until MAXSYMLINKS was
  reached before failing, causing significant noise in the log.

  The fix adds a check in autofs ->d_automount() so that the VFS can
  immediately return EPERM in this case. Since the mount is propagation
  private, EPERM is the most appropriate error code"
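
The shape of the check is presumably along these lines (the predicate is
illustrative; only the ->d_automount() ERR_PTR convention is real):

    /* Sketch: in autofs ->d_automount(), fail fast with EPERM instead
     * of letting path resolution loop until MAXSYMLINKS. */
    if (!autofs_mount_can_succeed(sbi))  /* private mount, no usable daemon */
            return ERR_PTR(-EPERM);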

* tag 'vfs-6.19-rc1.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  autofs: dont trigger mount if it cant succeed
2025-12-01 16:38:21 -08:00
Linus Torvalds d0deeb803c vfs-6.19-rc1.ovl

Merge tag 'vfs-6.19-rc1.ovl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull overlayfs cred guard conversion from Christian Brauner:
 "This converts all of overlayfs to use credential guards, eliminating
  manual credential management throughout the filesystem.

  Credential guard conversion:

   - Convert all of overlayfs to use credential guards, replacing the
     manual ovl_override_creds()/ovl_revert_creds() pattern with scoped
     guards.

     This makes credential handling visually explicit and eliminates a
     class of potential bugs from mismatched override/revert calls.

     (1) Basic credential guard (with_ovl_creds)
     (2) Creator credential guard (ovl_override_creator_creds):

         Introduced a specialized guard for file creation operations
         that handles the two-phase credential override (mounter
         credentials, then fs{g,u}id override). The new pattern is much
         clearer:

         with_ovl_creds(dentry->d_sb) {
                 scoped_class(prepare_creds_ovl, cred, dentry, inode, mode) {
                         if (IS_ERR(cred))
                                 return PTR_ERR(cred);
                         /* creation operations */
                 }
         }

     (3) Copy-up credential guard (ovl_cu_creds):

         Introduced a specialized guard for copy-up operations,
         simplifying the previous struct ovl_cu_creds helper and
         associated functions.

         Ported ovl_copy_up_workdir() and ovl_copy_up_tmpfile() to this
         pattern.

  Cleanups:

   - Remove ovl_revert_creds() after all callers converted to guards

   - Remove struct ovl_cu_creds and associated functions

   - Drop ovl_setup_cred_for_create() after conversion

   - Refactor ovl_fill_super(), ovl_lookup(), ovl_iterate(),
     ovl_rename() for cleaner credential guard scope

   - Introduce struct ovl_renamedata to simplify rename handling

   - Don't override credentials for ovl_check_whiteouts() (unnecessary)

   - Remove unneeded semicolon"
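
To illustrate the conversion, the before/after shape of a typical call
site (do_something() is a stand-in; the helpers and the guard are the
ones named above):

    /* Before: manual pairing; every error path must hit the revert. */
    const struct cred *old_cred = ovl_override_creds(dentry->d_sb);
    err = do_something(dentry);
    ovl_revert_creds(old_cred);

    /* After: the guard reverts automatically at scope exit. */
    with_ovl_creds(dentry->d_sb) {
            err = do_something(dentry);
    }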

* tag 'vfs-6.19-rc1.ovl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (54 commits)
  ovl: remove unneeded semicolon
  ovl: remove struct ovl_cu_creds and associated functions
  ovl: port ovl_copy_up_tmpfile() to cred guard
  ovl: mark *_cu_creds() as unused temporarily
  ovl: port ovl_copy_up_workdir() to cred guard
  ovl: add copy up credential guard
  ovl: drop ovl_setup_cred_for_create()
  ovl: port ovl_create_or_link() to new ovl_override_creator_creds cleanup guard
  ovl: mark ovl_setup_cred_for_create() as unused temporarily
  ovl: reflow ovl_create_or_link()
  ovl: port ovl_create_tmpfile() to new ovl_override_creator_creds cleanup guard
  ovl: add ovl_override_creator_creds cred guard
  ovl: remove ovl_revert_creds()
  ovl: port ovl_fill_super() to cred guard
  ovl: refactor ovl_fill_super()
  ovl: port ovl_lower_positive() to cred guard
  ovl: port ovl_lookup() to cred guard
  ovl: refactor ovl_lookup()
  ovl: port ovl_copyfile() to cred guard
  ovl: port ovl_rename() to cred guard
  ...
2025-12-01 16:31:21 -08:00
Linus Torvalds a8058f8442 vfs-6.19-rc1.directory.locking

Merge tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull directory locking updates from Christian Brauner:
 "This contains the work to add centralized APIs for directory locking
  operations.

  This series is part of a larger effort to change directory operation
  locking to allow multiple concurrent operations in a directory. The
  ultimate goal is to lock the target dentry(s) rather than the whole
  parent directory.

  To help with changing the locking protocol, this series centralizes
  locking and lookup in new helper functions. The helpers establish a
  pattern where it is the dentry that is being locked and unlocked
  (currently the lock is held on dentry->d_parent->d_inode, but that can
  change in the future).

  This also changes vfs_mkdir() to unlock the parent on failure, as well
  as dput()ing the dentry. This allows end_creating() to only require
  the target dentry (which may be IS_ERR() after vfs_mkdir()), not the
  parent"

* tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  nfsd: fix end_creating() conversion
  VFS: introduce end_creating_keep()
  VFS: change vfs_mkdir() to unlock on failure.
  ecryptfs: use new start_creating/start_removing APIs
  Add start_renaming_two_dentries()
  VFS/ovl/smb: introduce start_renaming_dentry()
  VFS/nfsd/ovl: introduce start_renaming() and end_renaming()
  VFS: add start_creating_killable() and start_removing_killable()
  VFS: introduce start_removing_dentry()
  smb/server: use end_removing_noperm for target of smb2_create_link()
  VFS: introduce start_creating_noperm() and start_removing_noperm()
  VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()
  VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating()
  VFS: tidy up do_unlinkat()
  VFS: introduce start_dirop() and end_dirop()
  debugfs: rename end_creating() to debugfs_end_creating()
2025-12-01 16:13:46 -08:00
Linus Torvalds db74a7d02a vfs-6.19-rc1.directory.delegations

Merge tag 'vfs-6.19-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull directory delegations update from Christian Brauner:
 "This contains the work for recall-only directory delegations for
  knfsd.

  Add support for simple, recallable-only directory delegations. This
  was decided at the fall NFS Bakeathon where the NFS client and server
  maintainers discussed how to merge directory delegation support.

  The approach starts with recallable-only delegations for several reasons:

   1. RFC8881 has gaps that are being addressed in RFC8881bis. In
      particular, it requires directory position information for
      CB_NOTIFY callbacks, which is difficult to implement properly
      under Linux. The spec is being extended to allow that information
      to be omitted.

   2. Client-side support for CB_NOTIFY still lags. The client side
      involves heuristics about when to request a delegation.

   3. Early indications show that simple, recallable-only delegations
      can help performance. Anna Schumaker mentioned seeing a
      multi-minute speedup in xfstests runs with them enabled.

  With these changes, userspace can also request a read lease on a
  directory that will be recalled on conflicting accesses. This may be
  useful for applications like Samba. Users can disable leases
  altogether via the fs.leases-enable sysctl if needed.

  VFS changes:

   - Dedicated Type for Delegations

     Introduce struct delegated_inode to track inodes that may have
     delegations that need to be broken. This replaces the previous
     approach of passing raw inode pointers through the delegation
     breaking code paths, providing better type safety and clearer
     semantics for the delegation machinery.

   - Break parent directory delegations in open(..., O_CREAT) codepath

   - Allow mkdir to wait for delegation break on parent

   - Allow rmdir to wait for delegation break on parent

   - Add try_break_deleg calls for parents to vfs_link(), vfs_rename(),
     and vfs_unlink()

   - Make vfs_create(), vfs_mknod(), and vfs_symlink() break delegations
     on parent directory

   - Clean up argument list for vfs_create()

   - Expose delegation support to userland

  Filelock changes:

   - Make lease_alloc() take a flags argument

   - Rework the __break_lease API to use flags

   - Add struct delegated_inode

   - Push the S_ISREG check down to ->setlease handlers

   - Lift the ban on directory leases in generic_setlease

  NFSD changes:

   - Allow filecache to hold S_IFDIR files

   - Allow DELEGRETURN on directories

   - Wire up GET_DIR_DELEGATION handling

  Fixes:

   - Fix kernel-doc warnings in __fcntl_getlease

   - Add needed headers for new struct delegation definition"
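
The parent-directory break presumably reuses the existing
try_break_deleg() retry idiom, now keyed by the new struct
delegated_inode (sketch; the surrounding names are illustrative):

    /* Sketch: break a delegation on the parent before mutating it; a
     * nonzero return sends the caller to the drop-locks, wait for the
     * break, retry path. */
    error = try_break_deleg(dir, delegated);   /* dir: parent inode */
    if (error)
            goto out_unlock;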

* tag 'vfs-6.19-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  vfs: add needed headers for new struct delegation definition
  filelock: __fcntl_getlease: fix kernel-doc warnings
  vfs: expose delegation support to userland
  nfsd: wire up GET_DIR_DELEGATION handling
  nfsd: allow DELEGRETURN on directories
  nfsd: allow filecache to hold S_IFDIR files
  filelock: lift the ban on directory leases in generic_setlease
  vfs: make vfs_symlink break delegations on parent dir
  vfs: make vfs_mknod break delegations on parent directory
  vfs: make vfs_create break delegations on parent directory
  vfs: clean up argument list for vfs_create()
  vfs: break parent dir delegations in open(..., O_CREAT) codepath
  vfs: allow rmdir to wait for delegation break on parent
  vfs: allow mkdir to wait for delegation break on parent
  vfs: add try_break_deleg calls for parents to vfs_{link,rename,unlink}
  filelock: push the S_ISREG check down to ->setlease handlers
  filelock: add struct delegated_inode
  filelock: rework the __break_lease API to use flags
  filelock: make lease_alloc() take a flags argument
2025-12-01 15:34:41 -08:00
Ingo Molnar 6ec33db1aa objtool: Fix segfault on unknown alternatives
So 'objtool --link -d vmlinux.o' gets surprised by this endbr64+endbr64 pattern
in ___bpf_prog_run():

	___bpf_prog_run:
	1e7680:  ___bpf_prog_run+0x0                                                     push   %r12
	1e7682:  ___bpf_prog_run+0x2                                                     mov    %rdi,%r12
	1e7685:  ___bpf_prog_run+0x5                                                     push   %rbp
	1e7686:  ___bpf_prog_run+0x6                                                     xor    %ebp,%ebp
	1e7688:  ___bpf_prog_run+0x8                                                     push   %rbx
	1e7689:  ___bpf_prog_run+0x9                                                     mov    %rsi,%rbx
	1e768c:  ___bpf_prog_run+0xc                                                     movzbl (%rbx),%esi
	1e768f:  ___bpf_prog_run+0xf                                                     movzbl %sil,%edx
	1e7693:  ___bpf_prog_run+0x13                                                    mov    %esi,%eax
	1e7695:  ___bpf_prog_run+0x15                                                    mov    0x0(,%rdx,8),%rdx
	1e769d:  ___bpf_prog_run+0x1d                                                    jmp    0x1e76a2 <__x86_indirect_thunk_rdx>
	1e76a2:  ___bpf_prog_run+0x22                                                    endbr64
	1e76a6:  ___bpf_prog_run+0x26                                                    endbr64
	1e76aa:  ___bpf_prog_run+0x2a                                                    mov    0x4(%rbx),%edx

And it crashes due to blindly dereferencing alt->insn->alt_group.

Bail out on a NULL ->alt_group instead, which produces the following
warning and continues with the disassembly rather than segfaulting:

  .git/O/vmlinux.o: warning: objtool: <alternative.1e769d>: failed to disassemble alternative
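
The fix itself is essentially a defensive check before the dereference
(sketch; alt_name() is assumed, only ->alt_group and the warning text
come from this message):

    /* Skip alternatives whose instructions never got an alt_group. */
    if (!alt->insn->alt_group) {
            WARN("%s: failed to disassemble alternative", alt_name(alt));
            continue;   /* keep disassembling the rest of the object */
    }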

Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-12-01 10:42:27 +01:00
Randy Dunlap 43decb6b62 locking/local_lock: Fix all kernel-doc warnings
Modify kernel-doc comments in local_lock.h to prevent warnings:

  Warning: include/linux/local_lock.h:9 function parameter 'lock' not described in 'local_lock_init'
  Warning: include/linux/local_lock.h:56 function parameter 'lock' not described in 'local_trylock_init'
  Warning: include/linux/local_lock.h:56 expecting prototype for local_lock_init(). Prototype was for local_trylock_init() instead

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251128065925.917917-1-rdunlap@infradead.org
2025-12-01 06:56:16 +01:00
Vincent Mailhol 719e357fc0 locking/local_lock: s/l/__l/ and s/tl/__tl/ to reduce the risk of shadowing
The Linux kernel coding style advises avoiding common variable names
in function-like macros to reduce the risk of namespace collisions.

Throughout local_lock_internal.h, several macros use the rather common
variable names 'l' and 'tl'. This has already resulted in an actual
collision: the __local_lock_acquire() function-like macro currently
shadows the parameter 'l' of the:

  class_##_name##_t class_##_name##_constructor(_type *l)

function factory from <linux/cleanup.h>.

Rename the variable 'l' to '__l' and the variable 'tl' to '__tl'
throughout the file to fix the current namespace collision and
to prevent future ones.
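
A self-contained demonstration of the collision class (names are
illustrative, not the local_lock code): passing a variable named 'l'
into a macro that declares its own 'l' turns the initializer into a
self-initialization.

    struct lock { int owner; };
    static void do_acquire(struct lock *lk) { lk->owner = 1; }

    #define ACQUIRE(lock_ptr)                               \
            do {                                            \
                    struct lock *l = (lock_ptr);            \
                    do_acquire(l);                          \
            } while (0)

    static void lock_constructor(struct lock *l)
    {
            /* Expands to 'struct lock *l = (l);' - the new 'l' is
             * initialized from itself; the parameter is never read. */
            ACQUIRE(l);
    }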

[ bigeasy: Rebase, update all l and tl instances in macros ]

Signed-off-by: Vincent Mailhol <mailhol@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://patch.msgid.link/20251127144140.215722-3-bigeasy@linutronix.de
2025-12-01 06:56:16 +01:00
Sebastian Andrzej Siewior 52ed746147 locking/local_lock: Add the <linux/local_lock*.h> headers to MAINTAINERS
The local_lock_t type was never added to the MAINTAINERS file when it
was introduced.

Add local_lock_t to the locking primitives section.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://patch.msgid.link/20251127144140.215722-2-bigeasy@linutronix.de
2025-12-01 06:56:10 +01:00
Sebastian Andrzej Siewior 51d7a05452 locking/mutex: Redo __mutex_init() to reduce generated code size
mutex_init() invokes __mutex_init(), providing the name of the lock and
a pointer to the lock class. With LOCKDEP enabled this information is
useful, but without LOCKDEP it is not used at all. Passing the lock
class pointer might be considered negligible, but the name of the lock
is passed as well and the string is stored. This wastes storage.

Split __mutex_init() into a _generic() variant doing the initialisation
of the lock and a _lockdep() version which does _generic() plus the
lockdep bits. Restrict the lockdep version to lockdep-enabled builds,
allowing the compiler to remove the unused parameters.
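
A sketch of the split (the in-tree names and the lockdep initializer
details are assumptions based on the description above):

    /* Common initialisation, no lockdep metadata needed. */
    void __mutex_init_generic(struct mutex *lock)
    {
            atomic_long_set(&lock->owner, 0);
            raw_spin_lock_init(&lock->wait_lock);
            INIT_LIST_HEAD(&lock->wait_list);
    }

    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    /* Only lockdep builds ever pass or store the name and class key. */
    void __mutex_init_lockdep(struct mutex *lock, const char *name,
                              struct lock_class_key *key)
    {
            __mutex_init_generic(lock);
            lockdep_init_map_wait(&lock->dep_map, name, key, 0,
                                  LD_WAIT_SLEEP);
    }
    #endif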

This results in the following size reduction:

        text     data       bss        dec  filename
  | 30237599  8161430   1176624   39575653  vmlinux.defconfig
  | 30233269  8149142   1176560   39558971  vmlinux.defconfig.patched
     -4.2KiB   -12KiB

  | 32455099  8471098  12934684   53860881  vmlinux.defconfig.lockdep
  | 32455100  8471098  12934684   53860882  vmlinux.defconfig.patched.lockdep

  | 27152407  7191822   2068040   36412269  vmlinux.defconfig.preempt_rt
  | 27145937  7183630   2067976   36397543  vmlinux.defconfig.patched.preempt_rt
     -6.3KiB    -8KiB

  | 29382020  7505742  13784608   50672370  vmlinux.defconfig.preempt_rt.lockdep
  | 29376229  7505742  13784544   50666515  vmlinux.defconfig.patched.preempt_rt.lockdep
     -5.6KiB

[peterz: folded fix from boqun]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20251125145425.68319-1-boqun.feng@gmail.com
Link: https://patch.msgid.link/20251105142350.Tfeevs2N@linutronix.de
2025-12-01 06:51:57 +01:00
Christian Brauner 0512bf9701
Merge patch series "file: FD_{ADD,PREPARE}()"
Christian Brauner <brauner@kernel.org> says:

This now removes roughly double the code that it adds.

I've been playing with this to allow for moderately flexible usage of
the get_unused_fd_flags() + create file + fd_install() pattern that's
used quite extensively and requires cumbersome cleanup paths.
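
For reference, the pattern being replaced typically looks like this (a
generic sketch; create_some_file() is a stand-in for whatever the
caller allocates):

    fd = get_unused_fd_flags(O_CLOEXEC);
    if (fd < 0)
            return fd;
    file = create_some_file(args);
    if (IS_ERR(file)) {
            put_unused_fd(fd);              /* easy-to-forget cleanup */
            return PTR_ERR(file);
    }
    fd_install(fd, file);
    return fd;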

How callers allocate files is really heterogeneous so it's not really
convenient to fold them into a single class. It's possible to split them
into subclasses like for anon inodes. I think that's not necessarily
nice either. This adds two primitives:

(1) FD_ADD() for the simple cases where a file is just installed:

    fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
    if (fd < 0)
            vfio_device_put_registration(device);
    return fd;

(2) FD_PREPARE(), which captures all the cases where access to the fd
    or file, or additional work before publishing the fd, is needed:

    FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
    if (fdf.err) {
            fput(sync_file->file);
            return fdf.err;
    }

    data.fence = fd_prepare_fd(fdf);
    if (copy_to_user((void __user *)arg, &data, sizeof(data)))
            return -EFAULT;

    return fd_publish(fdf);

I've converted all of the easy cases over to it and it gets rid of an
awful lot of convoluted cleanup logic. There are a bunch of other cases
that can also be converted after a bit of massaging.

It's centered around a simple struct. FD_PREPARE() encapsulates all of
the allocation and cleanup logic and must be followed by a call to
fd_publish(), which associates the fd with the file and installs it into
the caller's fdtable. If fd_publish() isn't called, both are deallocated.
FD_ADD() is a shorthand that does the fd_publish() and never exposes the
struct to the caller. That's often the case when callers don't need
access to anything after installing the fd.

It mandates a specific order, namely that first we allocate the fd and
then instantiate the file. But that shouldn't be a problem; nearly
everyone I've converted used this order anyway.

There's a bunch of additional cases where it would be easy to convert
them to this pattern. For example, the whole sync file stuff in dma
currently returns the containing structure of the file instead of the
file itself even though it's only used to allocate files. Changing that
would make it fall into the FD_PREPARE() pattern easily. I've not done
that work yet.

There's room for extending this in a way where we'd have subclasses for
some particularly often-used patterns, but as I said I'm not even sure
that's worth it.

* patches from https://patch.msgid.link/20251123-work-fd-prepare-v4-0-b6efa1706cfd@kernel.org: (47 commits)
  kvm: convert kvm_vcpu_ioctl_get_stats_fd() to FD_PREPARE()
  kvm: convert kvm_arch_supports_gmem_init_shared() to FD_PREPARE()
  io_uring: convert io_create_mock_file() to FD_PREPARE()
  file: convert replace_fd() to FD_PREPARE()
  vfio: convert vfio_group_ioctl_get_device_fd() to FD_PREPARE()
  tty: convert ptm_open_peer() to FD_PREPARE()
  ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
  media: convert media_request_alloc() to FD_PREPARE()
  hv: convert mshv_ioctl_create_partition() to FD_PREPARE()
  gpio: convert linehandle_create() to FD_PREPARE()
  dma: port sw_sync_ioctl_create_fence() to FD_PREPARE()
  pseries: port papr_rtas_setup_file_interface() to FD_PREPARE()
  pseries: convert papr_platform_dump_create_handle() to FD_PREPARE()
  spufs: convert spufs_gang_open() to FD_PREPARE()
  papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
  spufs: convert spufs_context_open() to FD_PREPARE()
  net/socket: convert __sys_accept4_file() to FD_PREPARE()
  net/socket: convert sock_map_fd() to FD_PREPARE()
  net/sctp: convert sctp_getsockopt_peeloff_common() to FD_PREPARE()
  net/kcm: convert kcm_ioctl() to FD_PREPARE()
  ...

Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-0-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:36 +01:00
Christian Brauner 6fb1022918
io_uring: convert io_create_mock_file() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-45-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:36 +01:00
Christian Brauner 99d4f12f17
file: convert replace_fd() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-44-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:36 +01:00
Christian Brauner 5f3ea1c201
vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-43-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:36 +01:00
Christian Brauner 3fd5edfe1d
tty: convert ptm_open_peer() to FD_ADD()
Christian Brauner <brauner@kernel.org> says:

The fix sent in [1] was squashed into this commit.

Fixes: https://lore.kernel.org/37ac7af5-584f-4768-a462-4d1071c43eaf@sirena.org.uk [1]
Reported-by: Mark Brown <broonie@kernel.org> [1]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> [1]
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-42-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:36 +01:00
Christian Brauner af66279a01
ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-41-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:36 +01:00
Christian Brauner 6f504cbf10
media: convert media_request_alloc() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-40-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:35 +01:00
Christian Brauner c99dc44562
hv: convert mshv_ioctl_create_partition() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-39-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:35 +01:00
Christian Brauner da7e394bf5
gpio: convert linehandle_create() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-38-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:35 +01:00
Christian Brauner 6ae8da4846
pseries: port papr_rtas_setup_file_interface() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-36-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:35 +01:00
Christian Brauner 274d937006
pseries: convert papr_platform_dump_create_handle() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-35-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:35 +01:00
Christian Brauner 0b9d4a6b51
spufs: convert spufs_gang_open() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-34-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:35 +01:00
Christian Brauner 6d3789d347
papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
Fixes a UAF for src_info as well.

Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-33-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:35 +01:00
Christian Brauner 843e7b5c29
spufs: convert spufs_context_open() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-32-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:34 +01:00
Christian Brauner 4667d63872
net/socket: convert __sys_accept4_file() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-31-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:34 +01:00
Christian Brauner 245f0d1c62
net/socket: convert sock_map_fd() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-30-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:34 +01:00
Christian Brauner 0d52d06a19
net/kcm: convert kcm_ioctl() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-28-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:34 +01:00
Christian Brauner fe67b063f6
net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-27-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:34 +01:00
Christian Brauner 910c361f9a
secretmem: convert memfd_secret() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-26-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:34 +01:00
Christian Brauner 1afcbbe5d6
memfd: convert memfd_create() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-25-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:34 +01:00
Christian Brauner 981bec8f69
bpf: convert bpf_token_create() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-24-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:33 +01:00
Christian Brauner 798c2da490
bpf: convert bpf_iter_new_fd() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-23-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:33 +01:00
Christian Brauner f2573685bd
ipc: convert do_mq_open() to FD_ADD()
Christian Brauner <brauner@kernel.org> says:

The fix sent in [1] was squashed into this commit.

Fixes: https://lore.kernel.org/c41de645-8234-465f-a3be-f0385e3a163c@sirena.org.uk [1]
Reported-by: Mark Brown <broonie@kernel.org> [1]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> [1]
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-22-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:33 +01:00
Christian Brauner 1ad7810c6d
exec: convert begin_new_exec() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-21-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:33 +01:00
Christian Brauner 7352c6fce3
af_unix: convert unix_file_open() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-19-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:33 +01:00
Christian Brauner 34dfce523c
dma: convert dma_buf_fd() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-18-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:33 +01:00
Christian Brauner 993f30468e
xfs: convert xfs_open_by_handle() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-17-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:33 +01:00
Christian Brauner 39f6e7581a
userfaultfd: convert new_userfaultfd() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-16-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:33 +01:00
Christian Brauner 14010faa1b
timerfd: convert timerfd_create() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-15-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:32 +01:00
Christian Brauner 5b755da105
signalfd: convert do_signalfd4() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-14-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:32 +01:00
Christian Brauner 360fbf808a
open: convert do_sys_openat2() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-13-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:32 +01:00
Christian Brauner 13dce771bb
eventpoll: convert do_epoll_create() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-12-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:32 +01:00
Christian Brauner 0f4288410c
autofs: convert autofs_dev_ioctl_open_mountpoint() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-11-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:32 +01:00
Christian Brauner 3d8aefd49a
nsfs: convert ns_ioctl() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-10-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:32 +01:00
Christian Brauner 00de6e2448
nsfs: convert open_namespace() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-9-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:32 +01:00
Christian Brauner 7129098f4f
fanotify: convert fanotify_init() to FD_PREPARE()
Christian Brauner <brauner@kernel.org> says:

The fix sent in [1] was squashed into this commit.

Link: https://lore.kernel.org/20251127201618.2115275-1-kuniyu@google.com [1]
Reported-by: syzbot+321168dfa622eda99689@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/6928b121.a70a0220.d98e3.0110.GAE@google.com
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-8-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:31 +01:00
Christian Brauner 05885f4165
namespace: convert fsmount() to FD_PREPARE()
Christian Brauner <brauner@kernel.org> says:

A variant of the fix sent in [1] was squashed into this commit.

Link: https://lore.kernel.org/20251128035149.392402-1-kartikey406@gmail.com [1]
Reported-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: syzbot+94048264da5715c251f9@syzkaller.appspotmail.com
Tested-by: syzbot+94048264da5715c251f9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=94048264da5715c251f9
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-7-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:31 +01:00
Christian Brauner 416b0d1659
namespace: convert open_tree_attr() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-6-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:31 +01:00
Christian Brauner 542a406543
namespace: convert open_tree() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-5-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:31 +01:00
Christian Brauner fbe58faa69
fhandle: convert do_handle_open() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-4-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:31 +01:00
Christian Brauner a5fa9ab846
eventfd: convert do_eventfd() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-3-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:31 +01:00
Christian Brauner 8797dd5600
anon_inodes: convert to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-2-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:31 +01:00
Christian Brauner 011703a9ac
file: add FD_{ADD,PREPARE}()
I've been playing with this to allow for moderately flexible usage of
the get_unused_fd_flags() + create file + fd_install() pattern that's
used quite extensively.

How callers allocate files is really heterogeneous so it's not really
convenient to fold them into a single class. It's possible to split them
into subclasses like for anon inodes. I think that's not necessarily
nice either.

My take is to add two primitives:
(1) FD_ADD() for the simple cases where a file is just installed:

    fd = FD_ADD(O_CLOEXEC, open_file(some, args));
    if (fd >= 0)
            kvm_get_kvm(vcpu->kvm);
    return fd;

(2) FD_PREPARE(), which captures all the cases where access to the fd
    or file, or additional work before publishing the fd, is needed:

    FD_PREPARE(fdf, open_flag, file_open_handle(&path, open_flag));
    if (fdf.err)
            return fdf.err;

    if (copy_to_user(/* something something */))
            return -EFAULT;

    return fd_publish(fdf);

I've converted all of the easy cases over to it and it gets rid of an
awful lot of convoluted cleanup logic.

It's centered around struct fd_prepare. FD_PREPARE() encapsulates all of
the allocation and cleanup logic and must be followed by a call to
fd_publish(), which associates the fd with the file and installs it into
the caller's fdtable. If fd_publish() isn't called, both are deallocated.
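
A rough sketch of how such a struct can drive cleanup on scope exit
(purely illustrative; the field and helper names are assumptions, not
the actual implementation):

    /* Illustrative only -- not the kernel's actual definitions. */
    struct fd_prepare_sketch {
            int err;                /* negative errno, or 0 */
            int fd;                 /* reserved fd, not yet visible */
            struct file *file;
    };

    static inline void fd_prepare_drop(struct fd_prepare_sketch *fdf)
    {
            if (fdf->err)
                    return;                 /* nothing to undo */
            if (fdf->file)
                    fput(fdf->file);
            if (fdf->fd >= 0)
                    put_unused_fd(fdf->fd);
    }

    /* fd_publish() would fd_install() the file and then clear the
     * struct so the scope-exit cleanup above becomes a no-op. */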

It mandates a specific order, namely that first we allocate the fd and
then instantiate the file. But that shouldn't be a problem; nearly
everyone I've converted uses this exact pattern anyway.

There's a bunch of additional cases where it would be easy to convert
them to this pattern. For example, the whole sync file stuff in dma
currently returns the containing structure of the file instead of the
file itself even though it's only used to allocate files. Changing that
would make it fall into the FD_PREPARE() pattern easily. I've not done
that work yet.

There's room for extending this in a way where we'd have subclasses for
some particularly often-used patterns, but as I said I'm not even sure
that's worth it.

Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-1-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:23 +01:00
Chen Ni 2579e21be5
ovl: remove unneeded semicolon
Remove unnecessary semicolons reported by Coccinelle/coccicheck and the
semantic patch at scripts/coccinelle/misc/semicolon.cocci.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Fixes: 7ab96df840 ("VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 11:05:52 +01:00
Jeff Layton 4be9e04ebf
vfs: add needed headers for new struct delegation definition
The definition of struct delegation uses stdint.h integer types. Add the
necessary headers to ensure that always works.

Fixes: 1602bad16d ("vfs: expose delegation support to userland")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 10:55:34 +01:00
Randy Dunlap 01c9c30aae
filelock: __fcntl_getlease: fix kernel-doc warnings
Use the correct function name and add description for the @flavor
parameter to avoid these kernel-doc warnings:

Warning: fs/locks.c:1706 function parameter 'flavor' not described in
 '__fcntl_getlease'
WARNING: fs/locks.c:1706 expecting prototype for fcntl_getlease().
 Prototype was for __fcntl_getlease() instead

Fixes: 1602bad16d ("vfs: expose delegation support to userland")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251128000826.457120-1-rdunlap@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 10:30:41 +01:00
Neil Brown eeec741ee0
nfsd: fix end_creating() conversion
Avoid a double-unlock, as nfs_create_locked() will have unlocked the
parent, and do the dput() manually.
Christian Brauner <brauner@kernel.org> says:

I've taken Neil's proposed fix from [1] and added a commit message.

Fixes: https://lore.kernel.org/202511252132.2c621407-lkp@intel.com [1]
Fixes: bd6ede8a06 ("VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()")
Signed-off-by: Neil Brown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 09:51:16 +01:00
Peter Zijlstra b0a848f4a4 x86/bugs: Make i386 use GENERIC_BUG_RELATIVE_POINTERS
Linus figured less #ifdef is more better and making x86-32 use
GENERIC_BUG_RELATIVE_POINTERS removes one layer of macro magic from
the bug.h bits.

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-11-27 09:32:48 +01:00
Peter Zijlstra d62e4f2b95 x86/bug: Fix BUG_FORMAT vs KASLR
Encoding a relative NULL pointer doesn't work with KASLR. When the
whole kernel image gets shifted, the __bug_table and the target string
get shifted by the same amount, so the relative offset is preserved.

However, when the target is an absolute 0 value and the __bug_table
gets moved about, the end result is a pointer equivalent to
kaslr_offset(), not NULL.

Notably, this will generate SHN_UNDEF relocations, and Ard would
really like to not have those at all.

Use the empty string to denote no-string.
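
In other words, the decode side can look roughly like this (a sketch;
the field name is an assumption):

    /* Sketch: decode a relative string pointer from a bug_entry. */
    static inline const char *bug_format_str(const struct bug_entry *bug)
    {
            const char *fmt = (const char *)&bug->format_disp + bug->format_disp;

            return *fmt ? fmt : NULL;       /* "" means: no string */
    }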

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-11-27 09:32:47 +01:00
Alexandre Chartre 59bfa64082 objtool: Build with disassembly can fail when including bfd.h
Building objtool with disassembly support can fail when including
the bfd.h file:

  In file included from tools/objtool/include/objtool/arch.h:108,
                   from check.c:14:
  /usr/include/bfd.h:35:2: error: #error config.h must be included before this header
     35 | #error config.h must be included before this header
        |  ^~~~~

This check is present in the bfd.h file generated from the binutils
source code, but it is not necessarily present in the bfd.h file
provided in a binutils package (for example, it is not present in
the binutils RPM).

The solution to this issue is to define the PACKAGE macro before
including bfd.h. This is the solution suggested by the binutils
developers in bug 14243, and it is used by other kernel tools
which also use bfd.h (perf and bpf).
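
Concretely, the workaround is just a define ahead of the include (the
macro's value is arbitrary; "objtool" is assumed here):

    /* bfd.h errors out unless PACKAGE (or PACKAGE_VERSION) is defined. */
    #define PACKAGE "objtool"
    #include <bfd.h>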

Fixes: 5995330382 ("objtool: Disassemble code with libopcodes instead of running objdump")
Closes: https://lore.kernel.org/all/3fa261fd-3b46-4cbe-b48d-7503aabc96cb@oracle.com/
Reported-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14243
Link: https://patch.msgid.link/20251126134519.1760889-1-alexandre.chartre@oracle.com
2025-11-27 09:32:46 +01:00
Alexandre Chartre c0a67900dc objtool: Trim trailing NOPs in alternative
When disassembling alternatives replace trailing NOPs with a single
indication of the number of bytes covered with NOPs.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-31-alexandre.chartre@oracle.com
2025-11-24 20:40:48 +01:00
Alexandre Chartre aff95e0d4e objtool: Add wide output for disassembly
Add the --wide option to provide a wide output when disassembling.
With this option, the disassembly of alternatives is displayed
side-by-side instead of one above the other.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-30-alexandre.chartre@oracle.com
2025-11-24 20:40:48 +01:00
Alexandre Chartre 07d70b271a objtool: Compact output for alternatives with one instruction
When disassembling, if an instruction has alternatives which are all
made of a single instruction then print each alternative on a single
line (instruction + description) so that the output is more compact.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-29-alexandre.chartre@oracle.com
2025-11-24 20:40:48 +01:00
Alexandre Chartre 56967b9a77 objtool: Improve naming of group alternatives
Improve the naming of group alternatives by showing the feature name and
flags used by the alternative.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-28-alexandre.chartre@oracle.com
2025-11-24 20:40:48 +01:00
Alexandre Chartre 8308fd0019 objtool: Add function to get the name of a CPU feature
Add a function to get the name of a CPU feature. The function is
architecture dependent and currently only implemented for x86. The
feature names are automatically generated from the cpufeatures.h
include file.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-27-alexandre.chartre@oracle.com
2025-11-24 20:39:47 +01:00
Peter Zijlstra 860238af7a x86_64/bug: Inline the UD1
(Ab)use the static_call infrastructure to convert all:

  call __WARN_trap

instances into the desired:

  ud1 (%edx), %rdi

eliminating the CALL/RET, but more importantly, fixing the
fact that all WARNs will have:

  RIP: 0010:__WARN_trap+0

Basically, by making it a static_call trampoline call, objtool will
collect the callsites, and then the inline rewrite will hit the
special case and replace the code with the magic instruction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.456717741@infradead.org
2025-11-24 20:23:25 +01:00
Peter Zijlstra 11bb4944f0 x86/bug: Implement WARN_ONCE()
Implement WARN_ONCE like WARN using BUGFLAG_ONCE.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.339309119@infradead.org
2025-11-24 20:23:25 +01:00
Peter Zijlstra 5b472b6e5b x86_64/bug: Implement __WARN_printf()
The basic idea is to have __WARN_printf() be a vararg function such
that the compiler can do the optimal calling convention for us. The
function body will be a #UD, and the exception handler then sets up a
va_list from pt_regs.

But because the trap will be in a called function, the bug_entry must
be passed in. Have that be the first argument, with the format tucked
away inside the bug_entry.

The comments should clarify the real fun details.

The big downside is that all WARNs will now show:

 RIP: 0010:__WARN_trap+0

One possible solution is to simply discard the top frame when
unwinding. A follow up patch takes care of this slightly differently
by abusing the x86 static_call implementation.

This changes (with the next patches):

	WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
		  "corrupted preempt_count: %s/%d/0x%x\n",

from:

	cmpl    $2, %ecx	#, _7
        jne     .L1472
	...

  .L1472:
        cmpb    $0, __already_done.11(%rip)
        je      .L1513
	...

  .L1513:
	movb    $1, __already_done.11(%rip)
	movl    1424(%r14), %edx        # _15->pid, _15->pid
        leaq    1912(%r14), %rsi        #, _17
        movq    $.LC43, %rdi    #,
        call    __warn_printk   #
	ud2
  .pushsection __bug_table,"aw"
        2:
        .long 1b - .    # bug_entry::bug_addr
        .long .LC1 - .  # bug_entry::file
        .word 5093      # bug_entry::line
        .word 2313      # bug_entry::flags
        .org 2b + 12
  .popsection
  .pushsection .discard.annotate_insn,"M", @progbits, 8
        .long 1b - .
        .long 8         # ANNOTYPE_REACHABLE
  .popsection

into:

	cmpl    $2, %ecx        #, _7
        jne     .L1442  #,
	...

  .L1442:
        lea (2f)(%rip), %rdi
  1:
  .pushsection __bug_table,"aw"
        2:
        .long 1b - .    # bug_entry::bug_addr
        .long .LC43 - . # bug_entry::format
        .long .LC1 - .  # bug_entry::file
        .word 5093      # bug_entry::line
        .word 2323      # bug_entry::flags
        .org 2b + 16
  .popsection
        movl    1424(%r14), %edx        # _19->pid, _19->pid
        leaq    1912(%r14), %rsi        #, _13
	ud1 (%edx), %rdi

Notably, by pushing everything into the exception handler it can take
care of the ONCE thing.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.213813530@infradead.org
2025-11-24 20:23:05 +01:00
Peter Zijlstra 4f1b701f24 x86/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED
Since we have an explicit format string, use it for the condition string
instead of frobbing it in the file string.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115758.097401406@infradead.org
2025-11-24 20:22:21 +01:00
Peter Zijlstra 0a52d339d3 x86/bug: Add BUG_FORMAT basics
Opt-in to BUG_FORMAT for x86_64, adjust the BUGTABLE helper and, for
now, just store NULL pointers.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.980264454@infradead.org
2025-11-24 20:22:11 +01:00
Alexandre Chartre be5ee60ac5 objtool: Provide access to feature and flags of group alternatives
Each alternative of a group alternative depends on a specific
feature and flags. Provide access to the feature/flags for each
alternative as an attribute (feature) in struct alt_group.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-26-alexandre.chartre@oracle.com
2025-11-21 15:30:14 +01:00
Alexandre Chartre 4aae0d3f77 objtool: Fix address references in alternatives
When using the --disas option, alternatives are disassembled but
address references in non-default alternatives can be incorrect.

The problem is that alternatives are shown as if they were replacing the
original code of the alternative. So if an alternative is referencing
an address inside the alternative then the reference has to be
adjusted to the location of the original code.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-25-alexandre.chartre@oracle.com
2025-11-21 15:30:14 +01:00
Alexandre Chartre 7e017720aa objtool: Disassemble jump table alternatives
When using the --disas option, also disassemble jump tables.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-24-alexandre.chartre@oracle.com
2025-11-21 15:30:14 +01:00
Alexandre Chartre 78df4590c5 objtool: Disassemble exception table alternatives
When using the --disas option, also disassemble exception tables
(EX_TABLE).

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-23-alexandre.chartre@oracle.com
2025-11-21 15:30:14 +01:00
Alexandre Chartre 15e7ad8667 objtool: Print addresses with alternative instructions
All alternatives are disassembled side-by-side when using the --disas
option. However, the address of each instruction is not printed because
instructions from different alternatives are not necessarily aligned.

Change this behavior to print the address of each instruction. Spaces
will appear between instructions from the same alternative when
instructions from different alternatives do not have the same alignment.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-22-alexandre.chartre@oracle.com
2025-11-21 15:30:13 +01:00
Alexandre Chartre a4f1599672 objtool: Disassemble group alternatives
When using the --disas option, disassemble all group alternatives.
Jump tables and exception tables (which are handled as alternatives)
are not disassembled at the moment.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-21-alexandre.chartre@oracle.com
2025-11-21 15:30:13 +01:00
Alexandre Chartre 87343e6642 objtool: Print headers for alternatives
When using the --disas option, objtool doesn't currently disassemble
any alternative. Print a header for each alternative. This identifies
places where alternatives are present but the alternative code itself
is not disassembled yet.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-20-alexandre.chartre@oracle.com
2025-11-21 15:30:13 +01:00
Alexandre Chartre 7ad7a4a720 objtool: Preserve alternatives order
Preserve the order in which alternatives are defined. Currently
objtool stores alternatives in a list in reverse order.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-19-alexandre.chartre@oracle.com
2025-11-21 15:30:12 +01:00
Alexandre Chartre 5f326c8897 objtool: Add the --disas=<function-pattern> action
Add the --disas=<function-pattern> action to disassemble the specified
functions. The function pattern can be a single function name (e.g.
--disas foo to disassemble the function with the name "foo"), or a shell
wildcard pattern (e.g. --disas foo* to disassemble all functions with a
name starting with "foo").
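
An illustrative invocation (note the quoting so the shell does not
expand the wildcard itself):

    $ objtool --disas 'foo*' file.o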

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-18-alexandre.chartre@oracle.com
2025-11-21 15:30:12 +01:00
Alexandre Chartre c3b7d044fc objtool: Do not validate IBT for .return_sites and .call_sites
The .return_sites and .call_sites sections reference text addresses,
but not with the intent to indirect branch to them, so they don't
need to be validated for IBT.

This is useful when running objtool on object files which already
have .return_sites or .call_sites sections, for example to re-run
objtool after it has reported an error or a warning.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-17-alexandre.chartre@oracle.com
2025-11-21 15:30:12 +01:00
Alexandre Chartre 350c7ab857 objtool: Improve tracing of alternative instructions
When tracing function validation, improve the reporting of
alternative instructions by more clearly showing where the different
alternatives begin and end.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-16-alexandre.chartre@oracle.com
2025-11-21 15:30:11 +01:00
Alexandre Chartre 9b580accac objtool: Add functions to better name alternatives
Add the disas_alt_name() and disas_alt_type_name() functions to
provide a name and a type name for an alternative. This will be used
to better name alternatives when tracing their validation.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-15-alexandre.chartre@oracle.com
2025-11-21 15:30:11 +01:00
Alexandre Chartre d490aa2197 objtool: Identify the different types of alternatives
Alternative code, including jump tables and exception tables, is
represented with the same struct alternative structure. But there is no
obvious way to identify whether the struct represents alternative
instructions, a jump table or an exception table.

So add a type to struct alternative to clearly identify the type of
alternative.
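
A type tag of roughly this shape would do (a sketch; the enumerator
names are assumptions):

    /* Tag the kind of code a struct alternative represents. */
    enum alternative_type {
            ALT_TYPE_INSTRUCTION,   /* alternative instructions */
            ALT_TYPE_JUMP_TABLE,    /* switch/jump table */
            ALT_TYPE_EX_TABLE,      /* exception table (EX_TABLE) */
    };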

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-14-alexandre.chartre@oracle.com
2025-11-21 15:30:11 +01:00
Alexandre Chartre 26a453fb56 objtool: Improve register reporting during function validation
When tracing function validation, instruction state changes can
report changes involving registers. These registers are reported
with the name "r<num>" (e.g. "r3"). Print the CPU specific register
name instead of a generic name (e.g. print "rbx" instead of "r3"
on x86).

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-13-alexandre.chartre@oracle.com
2025-11-21 15:30:10 +01:00
Alexandre Chartre fcb268b47a objtool: Trace instruction state changes during function validation
During function validation, objtool maintains a per-instruction state,
in particular to track call frame information. When tracing validation,
print any instruction state changes.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-12-alexandre.chartre@oracle.com
2025-11-21 15:30:10 +01:00
Alexandre Chartre 70589843b3 objtool: Add option to trace function validation
Add an option to trace the validation of specified functions and
report information about it. Functions are specified with the --trace
option, which can be a single function name (e.g. --trace foo to
trace the function with the name "foo"), or a shell wildcard
pattern (e.g. --trace foo* to trace all functions with a name
starting with "foo").

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-11-alexandre.chartre@oracle.com
2025-11-21 15:30:09 +01:00
Alexandre Chartre de0248fbbf objtool: Record symbol name max length
Keep track of the maximum length of symbol names. This will help with
formatting the code flow between different functions.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-10-alexandre.chartre@oracle.com
2025-11-21 15:30:09 +01:00
Alexandre Chartre a0e5bf9fd6 objtool: Extract code to validate instruction from the validate branch loop
The code to validate a branch loops through all instructions of the
branch and validates each instruction. Move the code that validates a
single instruction to a separate function.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-9-alexandre.chartre@oracle.com
2025-11-21 15:30:08 +01:00
Alexandre Chartre 0bb080ba64 objtool: Disassemble instruction on warning or backtrace
When an instruction warning (WARN_INSN) or backtrace (BT_INSN) is issued,
disassemble the instruction to provide more context.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-8-alexandre.chartre@oracle.com
2025-11-21 15:30:08 +01:00
Alexandre Chartre d4e13c2149 objtool: Store instruction disassembly result
When disassembling an instruction store the result instead of directly
printing it.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-7-alexandre.chartre@oracle.com
2025-11-21 15:30:08 +01:00
Alexandre Chartre 5d859dff26 objtool: Print symbol during disassembly
Print symbols referenced during disassembly instead of just printing
raw addresses. Also handle address relocation.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-6-alexandre.chartre@oracle.com
2025-11-21 15:30:07 +01:00
Alexandre Chartre f348a44c10 tool build: Remove annoying newline in build output
Remove the newline which is printed during feature discovery
when nothing else is printed.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-5-alexandre.chartre@oracle.com
2025-11-21 15:30:07 +01:00
Alexandre Chartre 5995330382 objtool: Disassemble code with libopcodes instead of running objdump
objtool executes the objdump command to disassemble code. Use libopcodes
instead to have more control over the disassembly scope and output.
If libopcodes is not present then objtool is built without disassembly
support.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-4-alexandre.chartre@oracle.com
2025-11-21 15:30:07 +01:00
Alexandre Chartre 1013f2e37b objtool: Create disassembly context
Create a structure to store information for disassembling functions.
For now, it is just a wrapper around an objtool file.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-3-alexandre.chartre@oracle.com
2025-11-21 15:30:06 +01:00
Alexandre Chartre 55d2a473f3 objtool: Move disassembly functions to a separate file
objtool disassembles functions which have warnings. Move the code
to do that to a dedicated file. The code is just moved; it is not
changed.

Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/20251121095340.464045-2-alexandre.chartre@oracle.com
2025-11-21 15:30:06 +01:00
Peter Zijlstra b9b2c455f4 bug: Allow architectures to provide __WARN_printf()
In addition to providing __WARN_FLAGS(), allow an architecture to also
provide __WARN_printf().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.807154591@infradead.org
2025-11-21 11:21:32 +01:00
Peter Zijlstra 3fd45b871f bug: Implement WARN_ON() using __WARN_FLAGS()
This completes 3bc3c9c3ab ("bugs/core: Pass down the condition
string of WARN_ON_ONCE(cond) warnings to __WARN_FLAGS()") and makes
WARN_ON() and WARN_ON_ONCE() behaviour consistent.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.690999560@infradead.org
2025-11-21 11:21:32 +01:00
Peter Zijlstra 7d2c27a0ec bug: Add report_bug_entry()
Add a report_bug() variant where the bug_entry is already known. This
is useful when the exception instruction is not instantiated per site
but instead has a single instance. In such a case the bug_entry
address might be passed along in a known register or something.
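
The variant's signature presumably mirrors report_bug() with the
__bug_table lookup dropped (an assumption, sketched here):

    /* Sketch: like report_bug(), but skip the __bug_table search. */
    enum bug_trap_type report_bug_entry(struct bug_entry *bug,
                                        struct pt_regs *regs);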

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.575795595@infradead.org
2025-11-21 11:21:31 +01:00
Peter Zijlstra 5c47b7f3d1 bug: Add BUG_FORMAT_ARGS infrastructure
Add BUG_FORMAT_ARGS; when an architecture is able to provide a va_list
given pt_regs, use this to print format arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.457339417@infradead.org
2025-11-21 11:21:31 +01:00
Peter Zijlstra 30b82568b0 bug: Clean up CONFIG_GENERIC_BUG_RELATIVE_POINTERS
Three repeated CONFIG_GENERIC_BUG_RELATIVE_POINTERS #ifdefs right
after one another yield unreadable code. Add a helper.
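
The helper can hide the relative-vs-absolute distinction in one place,
along these lines (a sketch with an assumed name):

    /* Sketch: decode a bug_entry pointer field either way. */
    #ifdef CONFIG_GENERIC_BUG_RELATIVE_POINTERS
    # define __bug_ptr(field)       ((void *)&(field) + (field))
    #else
    # define __bug_ptr(field)       ((void *)(unsigned long)(field))
    #endif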

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.341703850@infradead.org
2025-11-21 11:21:31 +01:00
Peter Zijlstra d292dbb564 bug: Add BUG_FORMAT infrastructure
Add BUG_FORMAT, an architecture opt-in feature that allows adding the
WARN_printf() format string to the bug_entry table.
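
With relative pointers, the extended entry would have roughly this
shape (a sketch; field names are assumptions, but the layout matches
the __bug_table asm shown in the x86_64/bug patch above):

    struct bug_entry {
            signed int      bug_addr_disp;
    #ifdef CONFIG_BUG_FORMAT
            signed int      format_disp;    /* WARN printf format */
    #endif
            signed int      file_disp;
            unsigned short  line;
            unsigned short  flags;
    };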

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.223371452@infradead.org
2025-11-21 11:21:30 +01:00
Peter Zijlstra 1be1fac648 x86: Rework __bug_table helpers
Rework the __bug_table helpers such that extension becomes easier.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251110115757.111187573@infradead.org
2025-11-21 11:21:30 +01:00
Peter Zijlstra 2ace527183 Merge branch 'objtool/core'
Bring in the UDB and objtool data annotations to avoid conflicts while further extending the bug exceptions.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2025-11-21 11:21:20 +01:00
Josh Poimboeuf 11991999a2 Revert "objtool: Warn on functions with ambiguous -ffunction-sections section names"
This reverts commit 9c7dc1dd89.

The check-function-names.sh script now provides the function name
checking functionality for all architectures, making the objtool check
redundant.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/c7d549d4de8bd1490d106b99630eea5efc69a4dd.1763669451.git.jpoimboe@kernel.org
2025-11-21 10:04:10 +01:00
Josh Poimboeuf 93863f3f85 kbuild: Check for functions with ambiguous -ffunction-sections section names
Commit 9c7dc1dd89 ("objtool: Warn on functions with ambiguous
-ffunction-sections section names") only works for drivers which are
compiled on architectures supported by objtool.

Make a script to perform the same check for all architectures.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/a6a49644a34964f7e02f3a8ce43af03e72817180.1763669451.git.jpoimboe@kernel.org
2025-11-21 10:04:10 +01:00
Josh Poimboeuf 3186333713 tty: synclink_gt: Fix namespace collision and startup() section placement with -ffunction-sections
When compiled with -ffunction-sections (e.g., for LTO, livepatch, dead
code elimination, AutoFDO, or Propeller), the startup() function gets
compiled into the .text.startup section (or in some cases
.text.startup.constprop.0 or .text.startup.isra.0).

However, the .text.startup and .text.startup.* sections are also used by
the compiler for __attribute__((constructor)) code.

This naming conflict causes the vmlinux linker script to wrongly place
startup() function code in .init.text, which gets freed during boot.

Some builds have a mix of objects, both with and without
-ffunction-sections, so it's not possible for the linker script to
disambiguate with #ifdef CONFIG_FUNCTION_SECTIONS or similar.  This
means that "startup" unfortunately needs to be prohibited as a function
name.
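
For reference, the vmlinux linker script claims .text.startup for init
text, roughly (simplified from include/asm-generic/vmlinux.lds.h):

    .init.text : {
            *(.init.text .init.text.*)
            *(.text.startup)        /* constructor code, freed at boot */
    }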

Rename startup() to startup_hw().  For consistency, also rename its
shutdown() counterpart to shutdown_hw().

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/f0ee750f35c878172cc09916a0724b74e62eadc2.1763669451.git.jpoimboe@kernel.org
2025-11-21 10:04:10 +01:00
Josh Poimboeuf 845c09e474 tty: amiserial: Fix namespace collision and startup() section placement with -ffunction-sections
When compiled with -ffunction-sections (e.g., for LTO, livepatch, dead
code elimination, AutoFDO, or Propeller), the startup() function gets
compiled into the .text.startup section (or in some cases
.text.startup.constprop.0 or .text.startup.isra.0).

However, the .text.startup and .text.startup.* sections are also used by
the compiler for __attribute__((constructor)) code.

This naming conflict causes the vmlinux linker script to wrongly place
startup() function code in .init.text, which gets freed during boot.

Some builds have a mix of objects, both with and without
-ffunction-sections, so it's not possible for the linker script to
disambiguate with #ifdef CONFIG_FUNCTION_SECTIONS or similar.  This
means that "startup" unfortunately needs to be prohibited as a function
name.

Rename startup() to rs_startup().  For consistency, also rename its
shutdown() counterpart to rs_shutdown().

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/9e56afff5268b0b12b99a8aa9bf244d6ebdcdf47.1763669451.git.jpoimboe@kernel.org
2025-11-21 10:04:09 +01:00
Josh Poimboeuf 2c715c9de2 media: atomisp: gc2235: Fix namespace collision and startup() section placement with -ffunction-sections
When compiled with -ffunction-sections (e.g., for LTO, livepatch, dead
code elimination, AutoFDO, or Propeller), the startup() function gets
compiled into the .text.startup section (or in some cases
.text.startup.constprop.0 or .text.startup.isra.0).

However, the .text.startup and .text.startup.* sections are also used by
the compiler for __attribute__((constructor)) code.

This naming conflict causes the vmlinux linker script to wrongly place
startup() function code in .init.text, which gets freed during boot.

Some builds have a mix of objects, both with and without
-ffunction-sections, so it's not possible for the linker script to
disambiguate with #ifdef CONFIG_FUNCTION_SECTIONS or similar.  This
means that "startup" unfortunately needs to be prohibited as a function
name.

Rename startup() to gc2235_startup().

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/d28103a6edf7beceb5e3c6fa24e49dbad1350389.1763669451.git.jpoimboe@kernel.org
2025-11-21 10:04:09 +01:00
Josh Poimboeuf da6202139a serial: icom: Fix namespace collision and startup() section placement with -ffunction-sections
When compiled with -ffunction-sections (e.g., for LTO, livepatch, dead
code elimination, AutoFDO, or Propeller), the startup() function gets
compiled into the .text.startup section (or in some cases
.text.startup.constprop.0 or .text.startup.isra.0).

However, the .text.startup and .text.startup.* sections are also used by
the compiler for __attribute__((constructor)) code.

This naming conflict causes the vmlinux linker script to wrongly place
startup() function code in .init.text, which gets freed during boot.

Some builds have a mix of objects, both with and without
-ffunction-sections, so it's not possible for the linker script to
disambiguate with #ifdef CONFIG_FUNCTION_SECTIONS or similar.  This
means that "startup" unfortunately needs to be prohibited as a function
name.

Rename startup() to icom_startup().  For consistency, also rename its
shutdown() counterpart to icom_shutdown().

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/1aee9ef69f9d40405676712b34f0c397706e7023.1763669451.git.jpoimboe@kernel.org
2025-11-21 10:04:09 +01:00
Josh Poimboeuf 106f11d43b objtool: Remove second pass of .cold function correlation
The .cold function parent/child correlation logic has two passes: one in
read_symbols() and one in add_jump_destinations().

The second pass was added with commit cd77849a69 ("objtool: Fix GCC 8
cold subfunction detection for aliased functions") to ensure that if the
parent symbol had aliases then the canonical symbol was chosen as the
parent.

That solution was rather clunky, not to mention incomplete due to the
existence of alternatives and switch tables.  Now that we have
sym->alias, the canonical alias fix can be done much more simply in the
first pass, making the second pass obsolete.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/bdab245a38000a5407f663a031f39e14c67a43d4.1763671318.git.jpoimboe@kernel.org
2025-11-21 10:04:08 +01:00
Josh Poimboeuf a91a61b290 objtool: Skip non-canonical aliased symbols in add_jump_table_alts()
If a symbol has aliases, make add_jump_table_alts() skip the
non-canonical ones to avoid any surprises.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/169aa17564b9aadb74897945ea74ac2eb70c5b13.1763671318.git.jpoimboe@kernel.org
2025-11-21 10:04:08 +01:00
Josh Poimboeuf 9205a322cf objtool: Return canonical symbol when aliases exist in symbol finding helpers
When symbol alias ambiguity exists in the symbol finding helper
functions, return the canonical sym->alias, as that's the one which gets
used by validate_branch() and elsewhere.

This doesn't fix any known issues, just makes the symbol alias behavior
more robust.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/450470a4897706af77453ad333e18af5ebab653c.1763671318.git.jpoimboe@kernel.org
2025-11-21 10:04:08 +01:00
Josh Poimboeuf 16f366c5a6 objtool: Don't alias undefined symbols
Objtool is mistakenly aliasing all undefined symbols.  That's obviously
wrong, though it has no consequence since objtool happens to only use
sym->alias for defined symbols.  Fix it regardless.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/bc401173a7717757eee672fc1ca5a20451d77b86.1763671318.git.jpoimboe@kernel.org
2025-11-21 10:04:08 +01:00
Josh Poimboeuf 2c2acca2ea objtool: Fix .cold function detection for duplicate symbols
The objtool .cold child/parent correlation is done in two phases: first
in elf_add_symbol() and later in add_jump_destinations().

The first phase is rather crude and can pick the wrong parent if there
are duplicates with the same name.

The second phase usually fixes that, but only if the parent has a direct
jump to the child.  It does *not* work if the only branch from the
parent to the child is an alternative or jump table entry.

Make the first phase more robust by looking for the parent in the same
STT_FILE as the child.

Fixes the following objtool warnings in an AutoFDO build with a large
CLANG_AUTOFDO_PROFILE profile:

  vmlinux.o: warning: objtool: rdev_add_key() falls through to next function rdev_add_key.cold()
  vmlinux.o: warning: objtool: rdev_set_default_key() falls through to next function rdev_set_default_key.cold()

Fixes: 13810435b9 ("objtool: Support GCC 8's cold subfunctions")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/82c7b52e40efa75dd10e1c550cc75c1ce10ac2c9.1763671318.git.jpoimboe@kernel.org
2025-11-21 10:04:07 +01:00
Josh Poimboeuf 024020e2b6 objtool: Support Clang AUTOFDO .cold functions
AutoFDO enables -fsplit-machine-functions which can move the cold parts
of a function to a <func>.cold symbol in a .text.split.<func> section.

Unlike GCC, the Clang <func>.cold symbols are not marked STT_FUNC.  This
confuses objtool in several ways, resulting in warnings like the
following:

  vmlinux.o: warning: objtool: apply_retpolines.cold+0xfc: unsupported instruction in callable function
  vmlinux.o: warning: objtool: machine_check_poll.cold+0x2e: unsupported instruction in callable function
  vmlinux.o: warning: objtool: free_deferred_objects.cold+0x1f: relocation to !ENDBR: free_deferred_objects.cold+0x26
  vmlinux.o: warning: objtool: rpm_idle.cold+0xe0: relocation to !ENDBR: rpm_idle.cold+0xe7
  vmlinux.o: warning: objtool: tcp_rcv_state_process.cold+0x1c: relocation to !ENDBR: tcp_rcv_state_process.cold+0x23

Fix it by marking the .cold symbols as STT_FUNC.
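
A hedged sketch of the shape of the fix; the exact condition in the
actual patch may differ:

  /*
   * Promote Clang's split .cold symbols so objtool validates them
   * like any other callable function.
   */
  size_t len = strlen(sym->name);

  if (sym->type == STT_NOTYPE && len > 5 &&
      !strcmp(sym->name + len - 5, ".cold"))
          sym->type = STT_FUNC;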

Fixes: 2fd65f7afd ("AutoFDO: Enable machine function split optimization for AutoFDO")
Closes: https://lore.kernel.org/20251103215244.2080638-2-xur@google.com
Reported-by: Rong Xu <xur@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rong Xu <xur@google.com>
Tested-by: Rong Xu <xur@google.com>
Link: https://patch.msgid.link/20a67326f04b2a361c031b56d58e8a803b3c5893.1763671318.git.jpoimboe@kernel.org
2025-11-21 10:04:07 +01:00
Peter Zijlstra c04507ac50 sched: Provide and use set_need_resched_current()
set_tsk_need_resched(current) requires set_preempt_need_resched(current) to
work correctly outside of the scheduler.

Provide set_need_resched_current() which wraps this correctly and replace
all the open coded instances.
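
A minimal sketch of such a wrapper, built only from existing helpers
(the in-tree version may add further assertions):

  static __always_inline void set_need_resched_current(void)
  {
          /* Assumes we run on the CPU whose current we are poking. */
          lockdep_assert_irqs_disabled();
          set_tsk_need_resched(current);
          set_preempt_need_resched();
  }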

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251116174750.665769842@linutronix.de
2025-11-20 22:26:09 +01:00
Christian Brauner 101bf15887
Merge patch series "ovl: convert copyup credential override to cred guard"
Christian Brauner <brauner@kernel.org> says:

This simplifies the copyup specific credential override.

The current code is centered around a helper struct ovl_cu_creds and is
a bit convoluted. We can simplify this by using a cred guard. This will
also allow us to remove the helper struct and associated functions.

* patches from https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-0-ea3fb15cf427@kernel.org:
  ovl: remove struct ovl_cu_creds and associated functions
  ovl: port ovl_copy_up_tmpfile() to cred guard
  ovl: mark *_cu_creds() as unused temporarily
  ovl: port ovl_copy_up_workdir() to cred guard
  ovl: add copy up credential guard

Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-0-ea3fb15cf427@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:27 +01:00
Christian Brauner c0fb968656
Merge patch series "ovl: convert creation credential override to cred guard"
Christian Brauner <brauner@kernel.org> says:

This cleans up the creation specific credential override.

The current code to override credentials for creation operations is
pretty difficult to understand as we override the credentials twice:

(1) override with the mounter's credentials
(2) copy the mounter's credentials and override the fs{g,u}id with the inode {u,g}id

And then we elide the revert_creds() because it would be an idempotent
revert. That elision doesn't buy us anything anymore though, because
it's all reference-count-free anyway.

The fact that this is done in a function while the revert happens in
the original override makes this a lot to grasp.

By introducing a cleanup guard for the creation case we can make this a
lot easier to understand and visually prominent:

with_ovl_creds(dentry->d_sb) {
	scoped_class(prepare_creds_ovl, cred, dentry, inode, mode) {
		if (IS_ERR(cred))
			return PTR_ERR(cred);

		ovl_path_upper(dentry->d_parent, &realparentpath);

		/* more stuff you want to do */
	}
}

I think this is a big improvement over what we have now.

* patches from https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-0-bd1c97a36d7b@kernel.org:
  ovl: drop ovl_setup_cred_for_create()
  ovl: port ovl_create_or_link() to new ovl_override_creator_creds cleanup guard
  ovl: mark ovl_setup_cred_for_create() as unused temporarily
  ovl: reflow ovl_create_or_link()
  ovl: port ovl_create_tmpfile() to new ovl_override_creator_creds cleanup guard
  ovl: add ovl_override_creator_creds cred guard

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-0-bd1c97a36d7b@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:27 +01:00
Christian Brauner 2c42b6ce4a
ovl: remove struct ovl_cu_creds and associated functions
Now that we have this all ported to a cred guard remove the struct and
the associated helpers.

Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-5-ea3fb15cf427@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:27 +01:00
Christian Brauner 72f098f0dd
ovl: port ovl_copy_up_tmpfile() to cred guard
Remove the complicated struct ovl_cu_creds dance and use our new copy up
cred guard.

Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-4-ea3fb15cf427@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:27 +01:00
Christian Brauner 643b8a2c0a
ovl: mark *_cu_creds() as unused temporarily
They will become unused in the next patch and we'll drop them, together
with the struct, once the conversion is finished. This keeps the changes
small and reviewable.

Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-3-ea3fb15cf427@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:27 +01:00
Christian Brauner bdba9c79c8
ovl: port ovl_copy_up_workdir() to cred guard
Remove the complicated struct ovl_cu_creds dance and use our new copy up
cred guard.

Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-2-ea3fb15cf427@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:27 +01:00
Christian Brauner 81b77b5b0a
ovl: add copy up credential guard
Add a credential guard for copy up. This will allow us to drop struct
ovl_cu_creds and simplify the code.

Link: https://patch.msgid.link/20251114-work-ovl-cred-guard-copyup-v1-1-ea3fb15cf427@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:27 +01:00
Christian Brauner 89a11f004f
ovl: drop ovl_setup_cred_for_create()
It is now unused and can be removed.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-6-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:26 +01:00
Christian Brauner e566bff963
ovl: port ovl_create_or_link() to new ovl_override_creator_creds cleanup guard
This clearly indicates the double-credential override and makes the code
a lot easier to grasp with one glance.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-5-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:26 +01:00
Christian Brauner 8a227c2766
ovl: mark ovl_setup_cred_for_create() as unused temporarily
The function will become unused in the next patch.
We'll remove it in later patches to keep the diff legible.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-4-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:26 +01:00
Christian Brauner d6ef072d09
ovl: reflow ovl_create_or_link()
Reflow the creation routine in preparation of porting it to a guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-3-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:26 +01:00
Christian Brauner 8d7fc461e4
ovl: port ovl_create_tmpfile() to new ovl_override_creator_creds cleanup guard
This clearly indicates the double-credential override and makes the code
a lot easier to grasp with one glance.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-2-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:26 +01:00
Christian Brauner f37b334728
ovl: add ovl_override_creator_creds cred guard
The current code to override credentials for creation operations is
pretty difficult to understand. We effectively override the credentials
twice:

(1) override with the mounter's credentials
(2) copy the mounter's credentials and override the fs{g,u}id with the inode {u,g}id

And then we elide the revert because it would be an idempotent revert.
That elision doesn't buy us anything anymore though, because I've made
it all work without any reference counting; all it does is mix the two
credential overrides together.

We can use a cleanup guard to clarify the creation codepaths and make
them easier to understand.

This just introduces the cleanup guard keeping the patch reviewable.
We'll convert the caller in follow-up patches and then drop the
duplicated code.
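
A hedged sketch of what such a guard could look like on top of
cleanup.h; ovl_setup_creator_creds() is a hypothetical init helper that
returns the credentials to revert to:

  DEFINE_CLASS(ovl_override_creator_creds, const struct cred *,
               revert_creds(_T),                              /* exit */
               ovl_setup_creator_creds(dentry, inode, mode),  /* init */
               struct dentry *dentry, struct inode *inode, umode_t mode)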

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-prepare-v2-1-bd1c97a36d7b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:26 +01:00
Christian Brauner 5c06bc9f06
Merge patch series "ovl: convert to cred guard"
Christian Brauner <brauner@kernel.org> says:

This adds an overlayfs specific extension of the cred guard
infrastructure I introduced. This allows all of overlayfs to be ported
to cred guards. I refactored a few functions to reduce the scope of the
cred guard. I think this is beneficial as it's visually very easy to
grasp the scope in one go. Lightly tested.

* patches from https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-0-b31603935724@kernel.org: (42 commits)
  ovl: remove ovl_revert_creds()
  ovl: port ovl_fill_super() to cred guard
  ovl: refactor ovl_fill_super()
  ovl: port ovl_lower_positive() to cred guard
  ovl: port ovl_lookup() to cred guard
  ovl: refactor ovl_lookup()
  ovl: port ovl_copyfile() to cred guard
  ovl: port ovl_rename() to cred guard
  ovl: refactor ovl_rename()
  ovl: introduce struct ovl_renamedata
  ovl: port ovl_listxattr() to cred guard
  ovl: port ovl_xattr_get() to cred guard
  ovl: port ovl_xattr_set() to cred guard
  ovl: port ovl_nlink_end() to cred guard
  ovl: port ovl_nlink_start() to cred guard
  ovl: port ovl_check_empty_dir() to cred guard
  ovl: port ovl_dir_llseek() to cred guard
  ovl: refactor ovl_iterate() and port to cred guard
  ovl: don't override credentials for ovl_check_whiteouts()
  ovl: port ovl_maybe_lookup_lowerdata() to cred guard
  ...

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-0-b31603935724@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:25 +01:00
Christian Brauner 850e32512a
ovl: remove ovl_revert_creds()
The wrapper isn't needed anymore. Overlayfs completely relies on its
cleanup guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-42-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:25 +01:00
Christian Brauner 217e78d1b7
ovl: port ovl_fill_super() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-41-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:25 +01:00
Christian Brauner fc95cda673
ovl: refactor ovl_fill_super()
Split the core into a separate helper in preparation of converting the
caller to the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-40-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:25 +01:00
Christian Brauner db7cfe8783
ovl: port ovl_lower_positive() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-39-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:25 +01:00
Christian Brauner 6b6ef7d16f
ovl: port ovl_lookup() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-38-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:25 +01:00
Christian Brauner 15da486ad3
ovl: refactor ovl_lookup()
Split the core into a separate helper in preparation of converting the
caller to the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-37-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:25 +01:00
Christian Brauner 14d35fda5b
ovl: port ovl_copyfile() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-36-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:24 +01:00
Christian Brauner ca0c657f25
ovl: port ovl_rename() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-35-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:24 +01:00
Christian Brauner a1da840198
ovl: refactor ovl_rename()
Extract the code that runs under overridden credentials into a separate
ovl_rename_upper() helper function, and the code that runs before/after
into ovl_rename_start/end(). Error handling is simplified: the helpers
return errors directly instead of using goto labels.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-34-b31603935724@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:24 +01:00
Christian Brauner fb9f31fe9f
ovl: introduce struct ovl_renamedata
Add a struct ovl_renamedata to group rename-related state that was
previously stored in local variables. Embed struct renamedata directly,
aligning with the VFS.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-33-b31603935724@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:24 +01:00
Christian Brauner 0b5800172c
ovl: port ovl_listxattr() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-32-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:24 +01:00
Christian Brauner ae64b54185
ovl: port ovl_xattr_get() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-31-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:24 +01:00
Christian Brauner d605301726
ovl: port ovl_xattr_set() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-30-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:24 +01:00
Christian Brauner 9e5ec68f3a
ovl: port ovl_nlink_end() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-29-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:23 +01:00
Christian Brauner 062c5b48d2
ovl: port ovl_nlink_start() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-28-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:23 +01:00
Christian Brauner 67bc75e6f4
ovl: port ovl_check_empty_dir() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-27-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:23 +01:00
Christian Brauner 5517646e14
ovl: port ovl_dir_llseek() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-26-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:23 +01:00
Christian Brauner d25e4b739f
ovl: refactor ovl_iterate() and port to cred guard
Factor out ovl_iterate_merged() and move some code into
ovl_iterate_real() for easier use of the scoped ovl cred guard.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-25-b31603935724@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:23 +01:00
Christian Brauner 198d182288
ovl: don't override credentials for ovl_check_whiteouts()
The function is only called when rdd->dentry is non-NULL:

if (!err && rdd->first_maybe_whiteout && rdd->dentry)
    err = ovl_check_whiteouts(realpath, rdd);

| Caller                        | Sets rdd->dentry? | Can call ovl_check_whiteouts()? |
|-------------------------------|-------------------|---------------------------------|
| ovl_dir_read_merged()         | ✓ Yes (line 430)  | ✓ YES                           |
| ovl_dir_read_impure()         | ✗ No              | ✗ NO                            |
| ovl_check_d_type_supported()  | ✗ No              | ✗ NO                            |
| ovl_workdir_cleanup_recurse() | ✗ No              | ✗ NO                            |
| ovl_indexdir_cleanup()        | ✗ No              | ✗ NO                            |

VFS layer (.iterate_shared file operation)
  → ovl_iterate()
      [CRED OVERRIDE]
      → ovl_cache_get()
          → ovl_dir_read_merged()
              → ovl_dir_read()
                  → ovl_check_whiteouts()
      [CRED REVERT]

ovl_unlink()
  → ovl_do_remove()
      → ovl_check_empty_dir()
          [CRED OVERRIDE]
          → ovl_dir_read_merged()
              → ovl_dir_read()
                  → ovl_check_whiteouts()
          [CRED REVERT]

ovl_rename()
  → ovl_check_empty_dir()
      [CRED OVERRIDE]
      → ovl_dir_read_merged()
          → ovl_dir_read()
              → ovl_check_whiteouts()
      [CRED REVERT]

All valid callchains already override credentials, so drop the override.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-24-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:23 +01:00
Christian Brauner cb3c8cbaed
ovl: port ovl_maybe_lookup_lowerdata() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-23-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:23 +01:00
Christian Brauner b1c47b3abc
ovl: port ovl_maybe_validate_verity() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-22-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:22 +01:00
Christian Brauner 4975e683c2
ovl: port ovl_fileattr_get() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-21-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:22 +01:00
Christian Brauner af1d5d62f3
ovl: port ovl_fileattr_set() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-20-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:22 +01:00
Christian Brauner a3860a808f
ovl: port ovl_fiemap() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-19-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:22 +01:00
Christian Brauner 8e9698d6e4
ovl: port ovl_set_or_remove_acl() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-18-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:22 +01:00
Christian Brauner 71ac28fbcd
ovl: port do_ovl_get_acl() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-17-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:22 +01:00
Christian Brauner 47eba7f7fd
ovl: port ovl_get_link() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-16-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:22 +01:00
Christian Brauner d81999b40b
ovl: port ovl_permission() to cred guard
Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-15-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:21 +01:00
Christian Brauner 81707ae827
ovl: port ovl_getattr() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-14-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:21 +01:00
Christian Brauner 7aedfa5a52
ovl: port ovl_setattr() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-13-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:21 +01:00
Christian Brauner 9763970984
ovl: port ovl_flush() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-12-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:21 +01:00
Christian Brauner 8e8f4df93c
ovl: port ovl_fadvise() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-11-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:21 +01:00
Christian Brauner 2468017783
ovl: port ovl_fallocate() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-10-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:21 +01:00
Christian Brauner 07a891c346
ovl: port ovl_fsync() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-9-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:21 +01:00
Christian Brauner 1fc4bc77c7
ovl: port ovl_llseek() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-8-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:20 +01:00
Christian Brauner b27ebb3d4b
ovl: port ovl_open_realfile() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-7-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:20 +01:00
Christian Brauner 5f51dfe768
ovl: port ovl_create_tmpfile() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-6-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:20 +01:00
Christian Brauner 8368eb837e
ovl: port ovl_do_remove() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-5-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:20 +01:00
Christian Brauner ff4f6e4689
ovl: port ovl_set_link_redirect() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-4-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:20 +01:00
Christian Brauner 8c9531edcf
ovl: port ovl_create_or_link() to cred guard
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-3-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:20 +01:00
Christian Brauner 87809f12e0
ovl: port ovl_copy_up_flags() to cred guards
Use the scoped ovl cred guard.

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-2-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:20 +01:00
Christian Brauner 6f5c84162a
ovl: add override_creds cleanup guard extension for overlayfs
Overlayfs plucks the relevant creds from the superblock. Extend the
override_creds cleanup class I added into an override_creds_ovl class
which uses ovl_override_creds() as its initialization helper. Add
with_ovl_creds() based on this new class.
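
A hedged sketch of the shape of this extension; the macro body is
illustrative, not the actual patch:

  DEFINE_CLASS(override_creds_ovl, const struct cred *,
               revert_creds(_T), ovl_override_creds(sb),
               struct super_block *sb)

  #define with_ovl_creds(sb) \
          scoped_class(override_creds_ovl, __cred, sb)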

Link: https://patch.msgid.link/20251117-work-ovl-cred-guard-v4-1-b31603935724@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:58:19 +01:00
Christian Brauner 658d1322fa
Merge branch 'vfs-6.19.directory.locking' into base.vfs-6.19.ovl
Bring in the directory locking changes as they touch overlayfs in a
pretty substantial way and we are about to change the credential
override semantics quite substantially as well.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:56:47 +01:00
Christian Brauner 2b21a6204d
Merge branch 'kbuild-6.19.fms.extension'
Bring in the shared branch with the kbuild tree to enable
'-fms-extensions' for 6.19. The overlayfs cred guard work
depends on this.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 21:56:17 +01:00
Ian Kent 922a6f34c1
autofs: don't trigger mount if it can't succeed
If a mount namespace contains autofs mounts, and they are propagation
private, and there is no namespace-specific automount daemon to handle
possible automounting, then attempted path resolution will loop until
MAXSYMLINKS is reached before failing, causing quite a bit of noise in
the log.

Add a check for this in autofs ->d_automount() so that the VFS can
immediately return an error in this case. Since the mount is propagation
private, an EPERM return seems most appropriate.
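
A hedged sketch of the idea, not the actual patch; autofs_cant_succeed()
is a hypothetical stand-in for the real daemon/propagation checks:

  static struct vfsmount *autofs_d_automount(struct path *path)
  {
          struct autofs_sb_info *sbi = autofs_sbi(path->dentry->d_sb);

          /* No daemon will ever answer and nothing can propagate in. */
          if (autofs_cant_succeed(sbi, path))     /* hypothetical */
                  return ERR_PTR(-EPERM);

          /* ... normal automount handling elided ... */
          return NULL;                            /* nothing to mount */
  }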

Suggested-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://patch.msgid.link/20251118024631.10854-2-raven@themaw.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-19 11:14:02 +01:00
Josh Poimboeuf 2092007aa3 objtool/klp: Only enable --checksum when needed
With CONFIG_KLP_BUILD enabled, checksums are only needed during a
klp-build run.  There's no need to enable them for normal kernel builds.

This also has the benefit of softening the xxhash dependency.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Link: https://patch.msgid.link/edbb1ca215e4926e02edb493b68b9d6d063e902f.1762990139.git.jpoimboe@kernel.org
2025-11-18 09:59:26 +01:00
Josh Poimboeuf ee0b48faba objtool: Set minimum xxhash version to 0.8
XXH3 is only supported starting with xxhash 0.8.  Enforce that.
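
For reference, xxhash.h encodes its version as XXH_VERSION_NUMBER
(0.8.0 becomes 800), so a build-time gate could look like this hedged
sketch:

  #include <xxhash.h>

  #if XXH_VERSION_NUMBER < 800
  # error "objtool --checksum needs libxxhash >= 0.8 for XXH3"
  #endif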

Fixes: 0d83da43b1 ("objtool/klp: Add --checksum option to generate per-function checksums")
Closes: https://lore.kernel.org/SN6PR02MB41579B83CD295C9FEE40EED6D4FCA@SN6PR02MB4157.namprd02.prod.outlook.com
Reported-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Link: https://patch.msgid.link/7227c94692a3a51840278744c7af31b4797c6b96.1762990139.git.jpoimboe@kernel.org
2025-11-18 09:59:25 +01:00
Peter Zijlstra 33cf66d883 sched/fair: Proportional newidle balance
Add a randomized algorithm that runs newidle balancing proportional to
its success rate.

This improves schbench significantly:

 6.18-rc4:			2.22 Mrps/s
 6.18-rc4+revert:		2.04 Mrps/s
 6.18-rc4+revert+random:	2.18 Mrps/s

Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%:

 6.17:			-6%
 6.17+revert:		 0%
 6.17+revert+random:	-1%

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com
Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
2025-11-17 17:13:16 +01:00
Peter Zijlstra 08d473dd87 sched/fair: Small cleanup to update_newidle_cost()
Simplify code by adding a few variables.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://patch.msgid.link/20251107161739.655208666@infradead.org
2025-11-17 17:13:15 +01:00
Peter Zijlstra e78e70dbf6 sched/fair: Small cleanup to sched_balance_newidle()
Pull out the !sd check to simplify code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://patch.msgid.link/20251107161739.525916173@infradead.org
2025-11-17 17:13:15 +01:00
Peter Zijlstra d206fbad93 sched/fair: Revert max_newidle_lb_cost bump
Many people reported regressions on their database workloads due to:

  155213a2ae ("sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails")

For instance Adam Li reported a 6% regression on SpecJBB.

Conversely this will regress schbench again; on my machine from 2.22
Mrps/s down to 2.04 Mrps/s.

Reported-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Reported-by: Adam Li <adamli@os.amperecomputing.com>
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Chris Mason <clm@meta.com>
Link: https://lkml.kernel.org/r/20250626144017.1510594-2-clm@fb.com
Link: https://lkml.kernel.org/r/006c9df2-b691-47f1-82e6-e233c3f91faf@oracle.com
Link: https://patch.msgid.link/20251107161739.406147760@infradead.org
2025-11-17 17:13:15 +01:00
Mel Gorman e837456fdc sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.

Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In other cases, it's
throttled based on the accumulated runtime of the waker so there is a
chance that some batched wakeups have been issued before preemption.

For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
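
In the style of check_preempt_wakeup_fair(), the rule above roughly
corresponds to this hedged sketch, where se/pse are the waker's and
wakee's sched entities (illustrative, not the actual patch):

  /* Preempt only when the wakee is eligible and runs sooner. */
  if (entity_eligible(cfs_rq, pse) &&
      (s64)(pse->deadline - se->deadline) < 0)
          goto preempt;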

While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench, because they are on opposite ends of
the spectrum -- one prefers throughput by avoiding preemption and the
other relies on preemption.

First is the dbench throughput data; even though it is a poor metric,
it is the default one. The test machine is a 2-socket machine and the
backing filesystem is XFS as a lot of the IO work is dispatched to kernel
threads. It's important to note that these results are not representative
across all machines, especially Zen machines, as different bottlenecks
are exposed on different machines and filesystems.

dbench4 Throughput (misleading but traditional)
                            6.18-rc1               6.18-rc1
                             vanilla   sched-preemptnext-v5
Hmean     1       1268.80 (   0.00%)     1269.74 (   0.07%)
Hmean     4       3971.74 (   0.00%)     3950.59 (  -0.53%)
Hmean     7       5548.23 (   0.00%)     5420.08 (  -2.31%)
Hmean     12      7310.86 (   0.00%)     7165.57 (  -1.99%)
Hmean     21      8874.53 (   0.00%)     9149.04 (   3.09%)
Hmean     30      9361.93 (   0.00%)    10530.04 (  12.48%)
Hmean     48      9540.14 (   0.00%)    11820.40 (  23.90%)
Hmean     79      9208.74 (   0.00%)    12193.79 (  32.42%)
Hmean     110     8573.12 (   0.00%)    11933.72 (  39.20%)
Hmean     141     7791.33 (   0.00%)    11273.90 (  44.70%)
Hmean     160     7666.60 (   0.00%)    10768.72 (  40.46%)

As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.

dbench4 Loadfile Execution Time
                             6.18-rc1               6.18-rc1
                              vanilla   sched-preemptnext-v5
Amean      1         14.62 (   0.00%)       14.69 (  -0.46%)
Amean      4         18.76 (   0.00%)       18.85 (  -0.45%)
Amean      7         23.71 (   0.00%)       24.38 (  -2.82%)
Amean      12        31.25 (   0.00%)       31.87 (  -1.97%)
Amean      21        45.12 (   0.00%)       43.69 (   3.16%)
Amean      30        61.07 (   0.00%)       54.33 (  11.03%)
Amean      48        95.91 (   0.00%)       77.22 (  19.49%)
Amean      79       163.38 (   0.00%)      123.08 (  24.66%)
Amean      110      243.91 (   0.00%)      175.11 (  28.21%)
Amean      141      343.47 (   0.00%)      239.10 (  30.39%)
Amean      160      401.15 (   0.00%)      283.73 (  29.27%)
Stddev     1          0.52 (   0.00%)        0.51 (   2.45%)
Stddev     4          1.36 (   0.00%)        1.30 (   4.04%)
Stddev     7          1.88 (   0.00%)        1.87 (   0.72%)
Stddev     12         3.06 (   0.00%)        2.45 (  19.83%)
Stddev     21         5.78 (   0.00%)        3.87 (  33.06%)
Stddev     30         9.85 (   0.00%)        5.25 (  46.76%)
Stddev     48        22.31 (   0.00%)        8.64 (  61.27%)
Stddev     79        35.96 (   0.00%)       18.07 (  49.76%)
Stddev     110       59.04 (   0.00%)       30.93 (  47.61%)
Stddev     141       85.38 (   0.00%)       40.93 (  52.06%)
Stddev     160       96.38 (   0.00%)       39.72 (  58.79%)

That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern, so the next report tracks how many
milliseconds it takes for all clients to complete a workfile. This one
is tricky because dbench makes no effort to synchronise clients, so the
durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes, but it's a matter of opinion whether that
counts as an evasion of inconvenient results.

dbench4 All Clients Loadfile Execution Time
                             6.18-rc1               6.18-rc1
                              vanilla   sched-preemptnext-v5
Amean      1         15.06 (   0.00%)       15.07 (  -0.03%)
Amean      4        603.81 (   0.00%)      524.29 (  13.17%)
Amean      7        855.32 (   0.00%)     1331.07 ( -55.62%)
Amean      12      1890.02 (   0.00%)     2323.97 ( -22.96%)
Amean      21      3195.23 (   0.00%)     2009.29 (  37.12%)
Amean      30     13919.53 (   0.00%)     4579.44 (  67.10%)
Amean      48     25246.07 (   0.00%)     5705.46 (  77.40%)
Amean      79     29701.84 (   0.00%)    15509.26 (  47.78%)
Amean      110    22803.03 (   0.00%)    23782.08 (  -4.29%)
Amean      141    36356.07 (   0.00%)    25074.20 (  31.03%)
Amean      160    17046.71 (   0.00%)    13247.62 (  22.29%)
Stddev     1          0.47 (   0.00%)        0.49 (  -3.74%)
Stddev     4        395.24 (   0.00%)      254.18 (  35.69%)
Stddev     7        467.24 (   0.00%)      764.42 ( -63.60%)
Stddev     12      1071.43 (   0.00%)     1395.90 ( -30.28%)
Stddev     21      1694.50 (   0.00%)     1204.89 (  28.89%)
Stddev     30      7945.63 (   0.00%)     2552.59 (  67.87%)
Stddev     48     14339.51 (   0.00%)     3227.55 (  77.49%)
Stddev     79     16620.91 (   0.00%)     8422.15 (  49.33%)
Stddev     110    12912.15 (   0.00%)    13560.95 (  -5.02%)
Stddev     141    20700.13 (   0.00%)    14544.51 (  29.74%)
Stddev     160     9079.16 (   0.00%)     7400.69 (  18.49%)

This is more of a mixed bag but it at least shows that fairness
is not crippled.

The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench. The WF_SYNC
behaviour is important for these workloads and is why the WF_SYNC
changes are not a separate patch.

hackbench-process-pipes
                          6.18-rc1             6.18-rc1
                             vanilla   sched-preemptnext-v5
Amean     1        0.2657 (   0.00%)      0.2150 (  19.07%)
Amean     4        0.6107 (   0.00%)      0.6060 (   0.76%)
Amean     7        0.7923 (   0.00%)      0.7440 (   6.10%)
Amean     12       1.1500 (   0.00%)      1.1263 (   2.06%)
Amean     21       1.7950 (   0.00%)      1.7987 (  -0.20%)
Amean     30       2.3207 (   0.00%)      2.5053 (  -7.96%)
Amean     48       3.5023 (   0.00%)      3.9197 ( -11.92%)
Amean     79       4.8093 (   0.00%)      5.2247 (  -8.64%)
Amean     110      6.1160 (   0.00%)      6.6650 (  -8.98%)
Amean     141      7.4763 (   0.00%)      7.8973 (  -5.63%)
Amean     172      8.9560 (   0.00%)      9.3593 (  -4.50%)
Amean     203     10.4783 (   0.00%)     10.8347 (  -3.40%)
Amean     234     12.4977 (   0.00%)     13.0177 (  -4.16%)
Amean     265     14.7003 (   0.00%)     15.5630 (  -5.87%)
Amean     296     16.1007 (   0.00%)     17.4023 (  -8.08%)

Processes using pipes are impacted but the variance (not presented) indicates
it's close to noise and the results are not always reproducible. If executed
across multiple reboots, it may show neutral or small gains so the worst
measured results are presented.

Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.

hackbench-process-sockets
                          6.18-rc1             6.18-rc1
                             vanilla   sched-preemptnext-v2
Amean     1        0.3073 (   0.00%)      0.3263 (  -6.18%)
Amean     4        0.7863 (   0.00%)      0.7930 (  -0.85%)
Amean     7        1.3670 (   0.00%)      1.3537 (   0.98%)
Amean     12       2.1337 (   0.00%)      2.1903 (  -2.66%)
Amean     21       3.4683 (   0.00%)      3.4940 (  -0.74%)
Amean     30       4.7247 (   0.00%)      4.8853 (  -3.40%)
Amean     48       7.6097 (   0.00%)      7.8197 (  -2.76%)
Amean     79      14.7957 (   0.00%)     16.1000 (  -8.82%)
Amean     110     21.3413 (   0.00%)     21.9997 (  -3.08%)
Amean     141     29.0503 (   0.00%)     29.0353 (   0.05%)
Amean     172     36.4660 (   0.00%)     36.1433 (   0.88%)
Amean     203     39.7177 (   0.00%)     40.5910 (  -2.20%)
Amean     234     42.1120 (   0.00%)     43.5527 (  -3.42%)
Amean     265     45.7830 (   0.00%)     50.0560 (  -9.33%)
Amean     296     50.7043 (   0.00%)     54.3657 (  -7.22%)

As schbench has been mentioned in numerous bugs recently, the results
are interesting. A test case that represents the default schbench
behaviour is:

schbench Wakeup Latency (usec)
                                       6.18.0-rc1             6.18.0-rc1
                                          vanilla   sched-preemptnext-v5
Amean     Wakeup-50th-80          7.17 (   0.00%)        6.00 (  16.28%)
Amean     Wakeup-90th-80         46.56 (   0.00%)       19.78 (  57.52%)
Amean     Wakeup-99th-80        119.61 (   0.00%)       89.94 (  24.80%)
Amean     Wakeup-99.9th-80     3193.78 (   0.00%)      328.22 (  89.72%)

schbench Requests Per Second (ops/sec)
                                  6.18.0-rc1             6.18.0-rc1
                                     vanilla   sched-preemptnext-v5
Hmean     RPS-20th-80     8900.91 (   0.00%)     9176.78 (   3.10%)
Hmean     RPS-50th-80     8987.41 (   0.00%)     9217.89 (   2.56%)
Hmean     RPS-90th-80     9123.73 (   0.00%)     9273.25 (   1.64%)
Hmean     RPS-max-80      9193.50 (   0.00%)     9301.47 (   1.17%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
2025-11-17 17:13:15 +01:00
Mel Gorman aceccac58a sched/fair: Enable scheduler feature NEXT_BUDDY
The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data but also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get
good results.

NEXT_BUDDY has been disabled since commit 0ec9fab3d1 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd4
("sched/fair: Commit to EEVDF"). The reasoning is not clear, but as
vruntime spread is mentioned, the expectation is that NEXT_BUDDY had an
impact on overall fairness. It was not noted why LAST_BUDDY was removed,
but it is assumed that it's very difficult to reason about what
LAST_BUDDY's correct and effective behaviour should be while still
respecting EEVDF's goals. Peter Zijlstra noted during review:

	I think I was just struggling to make sense of things and figured
	less is more and axed it.

	I have vague memories trying to work through the dynamics of
	a wakeup-stack and the EEVDF latency requirements and getting
	a head-ache.

NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker.
Enable NEXT_BUDDY as a preparation patch to document that the decision
to ignore the current implementation is deliberate. While not presented,
the results were at best neutral and often much more variable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-2-mgorman@techsingularity.net
2025-11-17 17:13:15 +01:00
Phil Auld aaab6bb54a sched: Increase sched_tick_remote timeout
Increase the sched_tick_remote WARN_ON timeout to remove false
positives due to temporarily busy housekeeping (HK) CPUs. The
suggestion was 30 seconds to catch really stuck remote tick processing
but not trigger it too easily.
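
A hedged sketch of the relaxed check, where delta is the measured gap
since the last remote tick:

  /* Warn only when remote tick processing is stuck for a long time. */
  WARN_ON_ONCE(delta > 30ULL * NSEC_PER_SEC);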

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20250911161300.437944-1-pauld@redhat.com
2025-11-17 17:13:15 +01:00
Peter Zijlstra 522fb20fbd sched/fair: Have SD_SERIALIZE affect newidle balancing
Also serialize the possibly much more frequent newidle balancing for
the 'expensive' domains that have SD_SERIALIZE set.

Initial benchmarking by K Prateek and Tim showed no negative effect.

Split out from the larger patch moving sched_balance_running around
for ease of bisect and such.

Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Seconded-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/df068896-82f9-458d-8fff-5a2f654e8ffd@amd.com
Link: https://patch.msgid.link/6fed119b723c71552943bfe5798c93851b30a361.1762800251.git.tim.c.chen@linux.intel.com

2025-11-17 17:13:09 +01:00
Tim Chen 3324b2180c sched/fair: Skip sched_balance_running cmpxchg when balance is not due
The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
only one NUMA load balancing operation to run system-wide at a time.

Currently, each sched group leader directly under NUMA domain attempts
to acquire the global sched_balance_running flag via cmpxchg() before
checking whether load balancing is due or whether it is the designated
load balancer for that NUMA domain. On systems with a large number
of cores, this causes significant cache contention on the shared
sched_balance_running flag.

This patch reduces unnecessary cmpxchg() operations by first checking,
via should_we_balance(), that this CPU is the designated balancer for
the NUMA domain and that the balance interval has expired, before
trying to acquire sched_balance_running to load balance a NUMA domain.
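
A hedged sketch of the resulting ordering in sched_balance_domains()
(shapes illustrative):

  /* Cheap, per-domain checks first... */
  if (!should_we_balance(&env))
          goto next;                      /* not the designated CPU */

  if (!time_after_eq(jiffies, sd->last_balance + interval))
          goto next;                      /* balance not due yet */

  /* ...and the contended global cmpxchg last. */
  if (need_serialize &&
      atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
          goto next;                      /* someone else is balancing */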

On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
running an OLTP workload, 7.8% of total CPU cycles were previously spent
in sched_balance_domain() contending on sched_balance_running before
this change.

         : 104              static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
         : 105              {
         : 106              return arch_cmpxchg(&v->counter, old, new);
    0.00 :   ffffffff81326e6c:       xor    %eax,%eax
    0.00 :   ffffffff81326e6e:       mov    $0x1,%ecx
    0.00 :   ffffffff81326e73:       lock cmpxchg %ecx,0x2394195(%rip)        # ffffffff836bb010 <sched_balance_running>
         : 110              sched_balance_domains():
         : 12234            if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
   99.39 :   ffffffff81326e7b:       test   %eax,%eax
    0.00 :   ffffffff81326e7d:       jne    ffffffff81326e99 <sched_balance_domains+0x209>
         : 12238            if (time_after_eq(jiffies, sd->last_balance + interval)) {
    0.00 :   ffffffff81326e7f:       mov    0x14e2b3a(%rip),%rax        # ffffffff828099c0 <jiffies_64>
    0.00 :   ffffffff81326e86:       sub    0x48(%r14),%rax
    0.00 :   ffffffff81326e8a:       cmp    %rdx,%rax

After applying this fix, sched_balance_domain() is gone from the profile
and there is a 5% throughput improvement.

[peterz: made it so that redo retains the 'lock' and split out the
         CPU_NEWLY_IDLE change to a separate patch]
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/6fed119b723c71552943bfe5798c93851b30a361.1762800251.git.tim.c.chen@linux.intel.com
2025-11-17 17:12:00 +01:00
Christian Brauner 523ac76880
Merge patch series "Create and use APIs to centralise locking for directory ops."
NeilBrown <neilb@ownmail.net> says:

This series is the next part of my effort to change directory-op
locking to allow multiple concurrent ops in a directory.  Ultimately we
will (in my plan) lock the target dentry(s) rather than the whole
parent directory.

To help with changing the locking protocol, this series centralises
locking and lookup in some helpers.  The various helpers are introduced
and then used in the same patch - roughly one patch per helper though
with various exceptions.

I haven't introduced these helpers into the various filesystems that
Al's tree-in-dcache series is changing.  That series introduces and
uses similar helpers tuned to the specific needs of that set of
filesystems.  Ultimately all the helpers will use the same backends
which can then be adjusted when it is time to change the locking
protocol.

One change that deserves highlighting is in patch 13 where vfs_mkdir()
is changed to unlock the parent on failure, as well as the current
behaviour of dput()ing the dentry on failure.  Once this change is in
place, the final step of both create and remove sequences only
requires the target dentry, not the parent.  So e.g.  end_creating() is
only given the dentry (which may be IS_ERR() after vfs_mkdir()).  This
helps establish the pattern that it is the dentry that is being locked
and unlocked (the lock is currently held on dentry->d_parent->d_inode,
but that can change).

* patches from https://patch.msgid.link/20251113002050.676694-1-neilb@ownmail.net:
  VFS: introduce end_creating_keep()
  VFS: change vfs_mkdir() to unlock on failure.
  ecryptfs: use new start_creating/start_removing APIs
  Add start_renaming_two_dentries()
  VFS/ovl/smb: introduce start_renaming_dentry()
  VFS/nfsd/ovl: introduce start_renaming() and end_renaming()
  VFS: add start_creating_killable() and start_removing_killable()
  VFS: introduce start_removing_dentry()
  smb/server: use end_removing_noperm for the target of smb2_create_link()
  VFS: introduce start_creating_noperm() and start_removing_noperm()
  VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()
  VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating()
  VFS: tidy up do_unlinkat()
  VFS: introduce start_dirop() and end_dirop()
  debugfs: rename end_creating() to debugfs_end_creating()

Link: https://patch.msgid.link/20251113002050.676694-1-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:16:04 +01:00
NeilBrown cf296b294c
VFS: introduce end_creating_keep()
Occasionally the caller of end_creating() wants to keep using the dentry.
Rather than requiring them to dget() the dentry (when not an error)
before calling end_creating(), provide end_creating_keep() which does
this.

cachefiles and overlayfs make use of this.
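
A minimal sketch of the helper's shape (the in-tree version may differ
in details):

  static inline struct dentry *end_creating_keep(struct dentry *dentry)
  {
          if (!IS_ERR(dentry))
                  dget(dentry);           /* caller keeps a reference */
          end_creating(dentry);
          return dentry;
  }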

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-16-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:58 +01:00
NeilBrown fe497f0759
VFS: change vfs_mkdir() to unlock on failure.
vfs_mkdir() already drops the reference to the dentry on failure but it
leaves the parent locked.
This complicates end_creating() which needs to unlock the parent even
though the dentry is no longer available.

If we change vfs_mkdir() to unlock on failure as well as releasing the
dentry, we can remove the "parent" arg from end_creating() and simplify
the rules for calling it.
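
A hedged sketch of the resulting caller pattern; argument shapes are
illustrative:

  dentry = start_creating(idmap, parent, &name);  /* locks the parent */
  if (IS_ERR(dentry))
          return PTR_ERR(dentry);

  /* On failure this now unlocks and drops the dentry by itself. */
  dentry = vfs_mkdir(idmap, d_inode(parent), dentry, mode);

  err = PTR_ERR_OR_ZERO(dentry);
  end_creating(dentry);                   /* copes with IS_ERR() */
  return err;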

Note that cachefiles_get_directory() can choose to substitute an error
instead of actually calling vfs_mkdir(), for fault injection.  In that
case it needs to call end_creating(), just as vfs_mkdir() now does on
error.

ovl_create_real() will now unlock on error.  So the conditional
end_creating() after the call is removed, and end_creating() is called
internally on error.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-15-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:58 +01:00
NeilBrown f046fbb4d8
ecryptfs: use new start_creating/start_removing APIs
This requires the addition of start_creating_dentry() which is given the
dentry which has already been found, and asks for it to be locked and
its parent validated.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-14-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:58 +01:00
NeilBrown 833d2b3a07
Add start_renaming_two_dentries()
A few callers want to lock for a rename and already have both dentries.
Also debugfs does want to perform a lookup but doesn't want permission
checking, so start_renaming_dentry() cannot be used.

This patch introduces start_renaming_two_dentries() which is given both
dentries.  debugfs performs one lookup itself.  As it will only continue
with a negative dentry and as those cannot be renamed or unlinked, it is
safe to do the lookup before getting the rename locks.

Overlayfs uses start_renaming_two_dentries() in three places and selinux
uses it twice in sel_make_policy_nodes().

In sel_make_policy_nodes() we now lock for rename twice instead of just
once so the combined operation is no longer atomic w.r.t the parent
directory locks.  As selinux_state.policy_mutex is held across the whole
operation this does not open up any interesting races.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-13-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:58 +01:00
NeilBrown ac50950ca1
VFS/ovl/smb: introduce start_renaming_dentry()
Several callers perform a rename on a dentry they already have, and only
require lookup for the target name.  This includes smb/server and a few
different places in overlayfs.

start_renaming_dentry() performs the required lookup and takes the
required lock using lock_rename_child().

It is used in three places in overlayfs and in ksmbd_vfs_rename().

In the ksmbd case, the parent of the source is not important - the
source must be renamed from wherever it is.  So start_renaming_dentry()
allows rd->old_parent to be NULL and only checks it if it is non-NULL.
On success rd->old_parent will be the parent of old_dentry with an extra
reference taken.  Other start_renaming function also now take the extra
reference and end_renaming() now drops this reference as well.

ovl_lookup_temp(), ovl_parent_lock(), and ovl_parent_unlock() are
all removed as they are no longer needed.

OVL_TEMPNAME_SIZE and ovl_tempname() are now declared in overlayfs.h so
that ovl_check_rename_whiteout() can access them.

ovl_copy_up_workdir() now always cleans up on error.

Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-12-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:57 +01:00
NeilBrown 5c87527299
VFS/nfsd/ovl: introduce start_renaming() and end_renaming()
start_renaming() combines name lookup and locking to prepare for rename.
It is used when two names need to be looked up as in nfsd and overlayfs -
cases where one or both dentries are already available will be handled
separately.

__start_renaming() avoids the inode_permission check and hash
calculation and is suitable after filename_parentat() in do_renameat2().
It subsumes quite a bit of code from that function.

start_renaming() does calculate the hash and check X permission and is
suitable elsewhere:
- nfsd_rename()
- ovl_rename()

In ovl, ovl_do_rename_rd() is factored out of ovl_do_rename(), which
itself will be gone by the end of the series.

Acked-by: Chuck Lever <chuck.lever@oracle.com> (for nfsd parts)
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: NeilBrown <neil@brown.name>

--
Changes since v3:
 - added missing dput() in ovl_rename when "whiteout" is non-NULL.

Changes since v2:
 - in __start_renaming() some labels have been renamed, and err
   is always set before a "goto out_foo" rather than passing the
   error in a dentry*.
 - ovl_do_rename() changed to call the new ovl_do_rename_rd() rather
   than keeping duplicate code
 - code around ovl_cleanup() call in ovl_rename() restructured.

Link: https://patch.msgid.link/20251113002050.676694-11-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:57 +01:00
NeilBrown ff7c4ea11a
VFS: add start_creating_killable() and start_removing_killable()
These are similar to start_creating() and start_removing(), but allow a
fatal signal to abort waiting for the lock.

They are used in btrfs for subvol creation and removal.

btrfs_may_create() no longer needs IS_DEADDIR() and
start_creating_killable() includes that check.
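
A sketch of the killable variant, assuming it mirrors start_creating()
and fails when a fatal signal arrives while waiting for the lock:

  dentry = start_creating_killable(idmap, parent, &name);
  if (IS_ERR(dentry))     /* e.g. on a fatal signal, or via the IS_DEADDIR check */
          return PTR_ERR(dentry);
  /* ... create the subvolume ... */
  end_creating(dentry);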

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-10-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:57 +01:00
NeilBrown 7bb1eb45e4
VFS: introduce start_removing_dentry()
start_removing_dentry() is similar to start_removing() but instead of
providing a name for lookup, the target dentry is given.

start_removing_dentry() checks that the dentry is still hashed and in
the parent, and if so it locks and increases the refcount so that
end_removing() can be used to finish the operation.
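
A sketch of the intended use (the signature is assumed):

  err = start_removing_dentry(parent, victim);
  if (err)
          return err;     /* victim was unhashed or moved elsewhere */
  /* victim is locked and holds an extra reference here */
  err = vfs_unlink(idmap, d_inode(parent), victim, NULL);
  end_removing(victim);   /* unlock and drop that reference */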

This is used in cachefiles, overlayfs, smb/server, and apparmor.

There will be other users including ecryptfs.

As start_removing_dentry() takes an extra reference to the dentry (to be
put by end_removing()), there is no need to explicitly take an extra
reference to stop d_delete() from using dentry_unlink_inode() to negate
the dentry - as in cachefiles_delete_object() and ksmbd_vfs_unlink().

cachefiles_bury_object() now gets an extra ref to the victim, which it
drops.  As it includes the needed end_removing() calls, the caller
doesn't need them.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-9-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:57 +01:00
NeilBrown 1ead2213dd
smb/server: use end_removing_noperm for the target of smb2_create_link()
Sometimes smb2_create_link() needs to remove the target before creating
the link.
It uses ksmbd_vfs_kern_locked(), and is the only user of that interface.

To match the new naming, that function is changed to
ksmbd_vfs_kern_start_removing(), and related functions or flags are also
renamed.

The lock actually happens in ksmbd_vfs_path_lookup() and that is changed
to use start_removing_noperm() - permission to perform lookup in the
parent was already checked in vfs_path_parent_lookup().

Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-8-neilb@ownmail.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:56 +01:00
NeilBrown c9ba789dad
VFS: introduce start_creating_noperm() and start_removing_noperm()
xfs, fuse, ipc/mqueue need variants of start_creating or start_removing
which do not check permissions.
This patch adds _noperm versions of these functions.
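
A sketch of the no-permission-check pair, assuming they otherwise
mirror start_creating() and start_removing():

  /* same pattern, minus the permission check on the parent */
  dentry = start_creating_noperm(parent, &name);
  ...
  end_creating(dentry);

  victim = start_removing_noperm(parent, &name);
  ...
  end_removing(victim);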

Note that do_mq_open() was only calling mntget() so it could call
path_put() - it didn't really need an extra reference on the mnt.
Now it doesn't call mntget() and uses end_creating() which does
the dput() half of path_put().

Also mq_unlink() previously passed
   d_inode(dentry->d_parent)
as the dir inode to vfs_unlink().  This is after locking
   d_inode(mnt->mnt_root)
These two inodes are the same, but normally calls use the textual
parent.
So I've changed the vfs_unlink() call to be given d_inode(mnt->mnt_root).

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>

--
changes since v2:
 - dir arg passed to vfs_unlink() in mq_unlink() changed to match
   the dir passed to lookup_noperm()
 - restore assignment to path->mnt even though the mntget() is removed.

Link: https://patch.msgid.link/20251113002050.676694-7-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:56 +01:00
NeilBrown bd6ede8a06
VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()
start_removing() is similar to start_creating() but will only return a
positive dentry with the expectation that it will be removed.  This is
used by nfsd, cachefiles, and overlayfs.  They are changed to also use
end_removing() to terminate the action begun by start_removing().  This
is a simple alias for end_dirop().

Apart from changes to the error paths, as we no longer need to unlock on
a lookup error, an effect on callers is that they don't need to test if
the found dentry is positive or negative - they can be sure it is
positive.
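
A sketch of the caller-side pattern (the signature is assumed to mirror
start_creating()):

  victim = start_removing(idmap, parent, &name);
  if (IS_ERR(victim))
          return PTR_ERR(victim); /* lookup error, or the name was negative */
  /* victim is known positive and the lock is held here */
  err = vfs_rmdir(idmap, d_inode(parent), victim);
  end_removing(victim);           /* the simple alias for end_dirop() */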

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-6-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:56 +01:00
NeilBrown 7ab96df840
VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating()
start_creating() is similar to simple_start_creating() but is not so
simple.
It takes a qstr for the name, includes permission checking, and does NOT
report an error if the name already exists, returning a positive dentry
instead.

This is currently used by nfsd, cachefiles, and overlayfs.

end_creating() is called after the dentry has been used.
end_creating() drops the reference to the dentry as it is generally no
longer needed.  This is exactly the first section of end_creating_path()
so that function is changed to call the new end_creating().

These calls help encapsulate locking rules so that directory locking can
be changed.

Occasionally this change means that the parent lock is held for a
shorter period of time, for example in cachefiles_commit_tmpfile().
As this function now unlocks after an unlink and before the following
lookup, it is possible that the lookup could again find a positive
dentry, so a while loop is introduced there.
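
A sketch of that loop, under the assumptions above (names and arguments
are illustrative, not from the patch):

  for (;;) {
          dentry = start_creating(idmap, dir, &name);
          if (IS_ERR(dentry))
                  return PTR_ERR(dentry);
          if (d_is_negative(dentry))
                  break;          /* fresh negative dentry, proceed */
          /* raced: the name turned positive again; remove and retry */
          vfs_unlink(idmap, d_inode(dir), dentry, NULL);
          end_creating(dentry);
  }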

In overlayfs the ovl_lookup_temp() function has ovl_tempname()
split out to be used in ovl_start_creating_temp().  The other use
of ovl_lookup_temp() is preparing for a rename.  When rename handling
is updated, ovl_lookup_temp() will be removed.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-5-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:56 +01:00
NeilBrown 3661a78874
VFS: tidy up do_unlinkat()
The simplification of locking in the previous patch opens up some room
for tidying up do_unlinkat():

- change all "exit" labels to describe what will happen at the label.
- always goto an exit label on an error - unwrap the "if (!IS_ERR())" branch.
- Move the "slashes" handling inline, but mark it as unlikely()
- simplify use of the "inode" variable - we no longer need to test for NULL.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-4-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:56 +01:00
NeilBrown 4037d966f0
VFS: introduce start_dirop() and end_dirop()
The fact that directory operations (create,remove,rename) are protected
by a lock on the parent is known widely throughout the kernel.
In order to change this - to instead lock the target dentry - it is
best to centralise this knowledge so it can be changed in one place.

This patch introduces start_dirop() which is local to VFS code.
It performs the required locking for create and remove.  Rename
will be handled separately.

Various functions with names like start_creating() or start_removing_path(),
some of which already exist, will export this functionality beyond the VFS.

end_dirop() is the partner of start_dirop().  It drops the lock and
releases the reference on the dentry.
It *is* exported so that various end_creating etc functions can be inline.

As vfs_mkdir() drops the dentry on error we cannot use end_dirop() as
that won't unlock when the dentry IS_ERR().  For now we need an explicit
unlock when dentry IS_ERR().  I hope to change vfs_mkdir() to unlock
when it drops a dentry so that explicit unlock can go away.

end_dirop() can always be called on the result of start_dirop(), but not
after vfs_mkdir().  After a vfs_mkdir() we still may need the explicit
unlock as seen in end_creating_path().
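
A sketch of the central pattern (VFS-internal; the exact signature is
assumed):

  dentry = start_dirop(parent, &name, lookup_flags);
  if (IS_ERR(dentry))
          return PTR_ERR(dentry); /* nothing is locked on lookup failure */
  /* ... create or remove under the held parent lock ... */
  end_dirop(dentry);              /* unlock the parent and dput(dentry) */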

As well as adding start_dirop() and end_dirop()
this patch uses them in:
 - simple_start_creating (which requires sharing lookup_noperm_common()
        with libfs.c)
 - start_removing_path / start_removing_user_path_at
 - filename_create / end_creating_path()
 - do_rmdir(), do_unlinkat()

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-3-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:56 +01:00
NeilBrown 8b45b9a882
debugfs: rename end_creating() to debugfs_end_creating()
By not using the generic end_creating() name here we are free to use it
more globally for a more generic function.
This should have been done when start_creating() was renamed.

For consistency, also rename failed_creating().

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://patch.msgid.link/20251113002050.676694-2-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14 13:15:55 +01:00
Josh Poimboeuf 9c7dc1dd89 objtool: Warn on functions with ambiguous -ffunction-sections section names
When compiled with -ffunction-sections, a function named startup() will
be placed in .text.startup.  However, .text.startup is also used by the
compiler for functions with __attribute__((constructor)).

That creates an ambiguity for the vmlinux linker script, which needs to
differentiate those two cases.

Similar naming conflicts exist for functions named exit(), split(),
unlikely(), hot() and unknown().
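
An illustration of the ambiguity (not from the patch itself): with
-ffunction-sections both of these land in a .text.startup section,
which the linker script cannot tell apart:

  void startup(void)              /* -ffunction-sections: .text.startup */
  {
  }

  __attribute__((constructor))
  static void init_hook(void)     /* compiler also uses .text.startup */
  {
  }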

One potential solution would be to use '#ifdef CC_USING_FUNCTION_SECTIONS'
to create two distinct implementations of the TEXT_MAIN macro.  However,
-ffunction-sections can be (and is) enabled or disabled on a per-object
basis (for example via ccflags-y or AUTOFDO_PROFILE).

So the recently unified TEXT_MAIN macro (commit 1ba9f89794
("vmlinux.lds: Unify TEXT_MAIN, DATA_MAIN, and related macros")) is
necessary.  This means there's no way for the linker script to
disambiguate things.

Instead, use objtool to warn on any function names whose resulting
section names might create ambiguity when the kernel is compiled (in
whole or in part) with -ffunction-sections.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: live-patching@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/65fedea974fe14be487c8867a0b8d0e4a294ce1e.1762991150.git.jpoimboe@kernel.org
2025-11-13 08:03:10 +01:00
Josh Poimboeuf 0330b7fbbf drivers/xen/xenbus: Fix namespace collision and split() section placement with AutoFDO
When compiling the kernel with -ffunction-sections enabled, the split()
function gets compiled into the .text.split section.  In some cases it
can even be cloned into .text.split.constprop.0 or .text.split.isra.0.

However, .text.split.* is already reserved for use by the Clang
-fsplit-machine-functions flag, which is used by AutoFDO.  That may
place part of a function's code in a .text.split.<func> section.

This naming conflict causes the vmlinux linker script to wrongly place
split() with other .text.split.* code, rather than where it belongs with
regular text.

Fix it by renaming split() to split_strings().

Fixes: 6568f14cb5 ("vmlinux.lds: Exclude .text.startup and .text.exit from TEXT_MAIN")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: live-patching@vger.kernel.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/92a194234a0f757765e275b288bb1a7236c2c35c.1762991150.git.jpoimboe@kernel.org
2025-11-13 08:03:10 +01:00
Josh Poimboeuf 56255fa968 media: atomisp: Fix namespace collision and startup() section placement with -ffunction-sections
When compiling the kernel with -ffunction-sections (e.g., for LTO,
livepatch, dead code elimination, AutoFDO, or Propeller), the startup()
function gets compiled into the .text.startup section.  In some cases it
can even be cloned into .text.startup.constprop.0 or
.text.startup.isra.0.

However, the .text.startup and .text.startup.* section names are already
reserved for use by the compiler for __attribute__((constructor)) code.

This naming conflict causes the vmlinux linker script to wrongly place
startup() function code in .init.text, which gets freed during boot.

Fix that by renaming startup() to ov2722_startup().

Fixes: 6568f14cb5 ("vmlinux.lds: Exclude .text.startup and .text.exit from TEXT_MAIN")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: live-patching@vger.kernel.org
Cc: Hans de Goede <hansg@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/bf8cd823a3f11f64cc82167913be5013c72afa57.1762991150.git.jpoimboe@kernel.org
2025-11-13 08:03:09 +01:00
Josh Poimboeuf f6a8919d61 vmlinux.lds: Fix TEXT_MAIN to include .text.start and friends
Since:

  6568f14cb5 ("vmlinux.lds: Exclude .text.startup and .text.exit from TEXT_MAIN")

the TEXT_MAIN macro uses a series of patterns to prevent the
.text.startup[.*] and .text.exit[.*] sections from getting
linked into the vmlinux runtime .text.

That commit is a tad too aggressive: it also inadvertently filters out
valid runtime text sections like .text.start and
.text.start.constprop.0, which can be generated for a function named
start() when -ffunction-sections is enabled.

As a result, those sections become orphans when building with
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION for arm:

  arm-linux-gnueabi-ld: warning: orphan section `.text.start.constprop.0' from `drivers/usb/host/sl811-hcd.o' being placed in section `.text.start.constprop.0'
  arm-linux-gnueabi-ld: warning: orphan section `.text.start.constprop.0' from `drivers/media/dvb-frontends/drxk_hard.o' being placed in section `.text.start.constprop.0'
  arm-linux-gnueabi-ld: warning: orphan section `.text.start' from `drivers/media/dvb-frontends/stv0910.o' being placed in section `.text.start'
  arm-linux-gnueabi-ld: warning: orphan section `.text.start.constprop.0' from `drivers/media/pci/ddbridge/ddbridge-sx8.o' being placed in section `.text.start.constprop.0'

Fix that by explicitly adding the partial "substring" sections (.text.s,
.text.st, .text.sta, etc) and their cloned derivatives.

While this unfortunately means that TEXT_MAIN continues to grow,
these changes are ultimately necessary for proper support of
-ffunction-sections.

Fixes: 6568f14cb5 ("vmlinux.lds: Exclude .text.startup and .text.exit from TEXT_MAIN")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: live-patching@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://patch.msgid.link/cd588144e63df901a656b06b566855019c4a931d.1762991150.git.jpoimboe@kernel.org
Closes: https://lore.kernel.org/oe-kbuild-all/202511040812.DFGedJiy-lkp@intel.com/
2025-11-13 08:03:09 +01:00
Ingo Molnar d851f2b2b2 Linux 6.18-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmkRH1seHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGUCgH/j+fMbEg618ajVS2
 SWdAXZKEDVtCqN6bq9VT3g3rwk/zNgvppjMdCBqyXFpjvkGGIxlZnNgiTVuTLzvR
 cjl0c5C1a38lJ+DzmLjTF1TJ3t0CcA/8l2iWKu3Dm1ch2yuxm5ZcM2b9ujBholf7
 pYd7jZ7JhVm5eXD7U5X03AkZPUWAIx/Nip37cO7RLGzlkRSGLB7OXq3TB2u4e2ti
 gDpP4O+cgOqSuS71Hz0/8T6KIVQ9IZ/qzANWAYeHZD2DQwI3OZXI1WRnc1iw401o
 QaMaV21NirKwAANKetvbj7FgtmpdfQs/7FA+yR7YW2ARTpkc1EXrxgMZ6NuphGKE
 kYQo55g=
 =QaZ2
 -----END PGP SIGNATURE-----

Merge tag 'v6.18-rc5' into objtool/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-11-13 07:58:43 +01:00
Boqun Feng f74cf399e0 rust: debugfs: Replace the usage of Rust native atomics
Rust native atomics are not allowed in the kernel due to the mismatch
between their memory model and the Linux kernel memory model, hence
remove the usage of Rust native atomics in debugfs.

Reviewed-by: Matthew Maurer <mmaurer@google.com>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Tested-by: David Gow <davidgow@google.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://patch.msgid.link/20251022035324.70785-4-boqun.feng@gmail.com
2025-11-12 08:56:42 -08:00
Boqun Feng 013f912eb5 rust: sync: atomic: Implement Debug for Atomic<Debug>
If `Atomic<T>` is `Debug` then it's a `debugfs::Writer`, therefore make
it so since 1) debugfs needs to support `Atomic<T>` and 2) it's rather
trivial to implement `Debug` for `Atomic<Debug>`.

Tested-by: David Gow <davidgow@google.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://patch.msgid.link/20251022035324.70785-3-boqun.feng@gmail.com
2025-11-12 08:56:41 -08:00
Boqun Feng 14e9a18b07 rust: sync: atomic: Make Atomic*Ops pub(crate)
In order to write code over a generic Atomic<T> we need to make
Atomic*Ops public so that functions like `.load()` and `.store()` are
available. Make these pub(crate) to begin with so that usage in the
kernel crate is supported.

Tested-by: David Gow <davidgow@google.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://patch.msgid.link/20251022035324.70785-2-boqun.feng@gmail.com
2025-11-12 08:56:38 -08:00
Ingo Molnar 9929dffce5 perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use
The following commit introduced a build failure on x86-32:

  d21954c8a0 ("perf/x86/intel: Process arch-PEBS records or record fragments")

  ...

  arch/x86/events/intel/ds.c:2983:24: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]

The forced type conversions to 'u64' and 'void *' are not 32-bit clean,
but they are also entirely unnecessary: ->pebs_vaddr is 'void *' already,
and integer-compatible pointer arithmetic will work just fine on it.

Fix & simplify the code.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: d21954c8a0 ("perf/x86/intel: Process arch-PEBS records or record fragments")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Link: https://patch.msgid.link/20251029102136.61364-10-dapeng1.mi@linux.intel.com
2025-11-12 12:12:28 +01:00
Christian Brauner 76c63ff12e
Merge patch series "vfs: recall-only directory delegations for knfsd"
Jeff Layton <jlayton@kernel.org> says:

At the fall NFS Bakeathon last week, the NFS client and server
maintainers had a discussion about how to merge support for directory
delegations. We decided to start with just merging support for simple,
recallable-only directory delegation support, for a number of reasons:

1/ RFC8881 has some gaps in coverage that we are hoping to have
addressed in RFC8881bis. In particular, it's written such that CB_NOTIFY
callbacks require directory position information. That will be hard to
do properly under Linux, so we're planning to extend the spec to allow
that information to be omitted.

2/ client-side support for CB_NOTIFY still lags a bit. The client side
is tricky, as it involves heuristics about when to request a delegation.

3/ we have some early indication that simple, recallable-only
delegations can help performance in some cases. Anna mentioned seeing a
multi-minute speedup in xfstests runs with them enabled. This needs more
investigation, but it's promising and seems like enough justification to
merge support.

This patchset is quite similar to the set I initially posted back in
early 2024. We've merged some GET_DIR_DELEGATION handling patches
since then, but the VFS layer support is basically the same.

One thing that I want to make clear is that with this patchset, userspace
can request a read lease on a directory that will be recalled on
conflicting accesses. I saw no reason to prevent this, and I think it may
be something useful for applications like Samba.

As always, users can disable leases altogether via the fs.leases-enable
sysctl if this is an issue, but I wanted to point this out in case
anyone sees footguns here.

* patches from https://patch.msgid.link/20251111-dir-deleg-ro-v6-0-52f3feebb2f2@kernel.org:
  vfs: expose delegation support to userland
  nfsd: wire up GET_DIR_DELEGATION handling
  nfsd: allow DELEGRETURN on directories
  nfsd: allow filecache to hold S_IFDIR files
  filelock: lift the ban on directory leases in generic_setlease
  vfs: make vfs_symlink break delegations on parent dir
  vfs: make vfs_mknod break delegations on parent directory
  vfs: make vfs_create break delegations on parent directory
  vfs: clean up argument list for vfs_create()
  vfs: break parent dir delegations in open(..., O_CREAT) codepath
  vfs: allow rmdir to wait for delegation break on parent
  vfs: allow mkdir to wait for delegation break on parent
  vfs: add try_break_deleg calls for parents to vfs_{link,rename,unlink}
  filelock: push the S_ISREG check down to ->setlease handlers
  filelock: add struct delegated_inode
  filelock: rework the __break_lease API to use flags
  filelock: make lease_alloc() take a flags argument

Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-0-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:46 +01:00
Jeff Layton 1602bad16d
vfs: expose delegation support to userland
Now that support for recallable directory delegations is available,
expose this functionality to userland with new F_SETDELEG and F_GETDELEG
commands for fcntl().

Note that this also allows userland to request a FL_DELEG type lease on
files. Userland applications that do so will get signalled when there
are metadata changes in addition to just data changes (which is a
limitation of FL_LEASE leases).

These commands accept a new "struct delegation" argument that contains a
flags field for future expansion.
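
A hedged userspace sketch; the struct layout, flag values and how the
lease type is encoded are assumptions, not taken from this log:

  struct delegation deleg = {
          .flags = 0,     /* reserved for future expansion */
  };

  /* request a recallable read delegation on an open directory */
  if (fcntl(dirfd, F_SETDELEG, &deleg) == -1)
          perror("F_SETDELEG");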

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-17-52f3feebb2f2@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:37 +01:00
Jeff Layton 8b99f6a8c1
nfsd: wire up GET_DIR_DELEGATION handling
Add a new routine for acquiring a read delegation on a directory. These
are recallable-only delegations with no support for CB_NOTIFY. That will
be added in a later phase.

Since the same CB_RECALL/DELEGRETURN infrastructure is used for regular
and directory delegations, a normal nfs4_delegation is used to represent
a directory delegation.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-16-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:37 +01:00
Jeff Layton 80c8afddc8
nfsd: allow DELEGRETURN on directories
As Trond pointed out: "...provided that the presented stateid is
actually valid, it is also sufficient to uniquely identify the file to
which it is associated (see RFC8881 Section 8.2.4), so the filehandle
should be considered mostly irrelevant for operations like DELEGRETURN."

Don't ask fh_verify to filter on file type.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-15-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:37 +01:00
Jeff Layton 544a0ee152
nfsd: allow filecache to hold S_IFDIR files
The filecache infrastructure will only handle S_IFREG files at the
moment. Directory delegations will require adding support for opening
S_IFDIR inodes.

Plumb a "type" argument into nfsd_file_do_acquire() and have all of the
existing callers set it to S_IFREG. Add a new nfsd_file_acquire_dir()
wrapper that nfsd can call to request a nfsd_file that holds a directory
open.

For now, there is no need for a fsnotify_mark for directories, as
CB_NOTIFY is not yet supported. Change nfsd_file_do_acquire() to avoid
allocating one for non-S_IFREG inodes.

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-14-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:36 +01:00
Jeff Layton d0eab9fc10
filelock: lift the ban on directory leases in generic_setlease
With the addition of the try_break_lease calls in directory changing
operations, allow generic_setlease to hand them out. Write leases on
directories are never allowed however, so continue to reject them.

For now, there is no API for requesting delegations from userland, so
ensure that userland is prevented from acquiring a lease on a directory.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-13-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:36 +01:00
Jeff Layton 92bf53577f
vfs: make vfs_symlink break delegations on parent dir
In order to add directory delegation support, we must break delegations
on the parent on any change to the directory.

Add a delegated_inode parameter to vfs_symlink() and have it break the
delegation. do_symlinkat() can then wait on the delegation break before
proceeding.
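
A sketch of the do_symlinkat() side, modelled on the long-standing
delegation-break pattern used for vfs_unlink(); the series wraps the
inode pointer in struct delegated_inode, so the details differ:

  retry:
          error = vfs_symlink(idmap, dir, dentry, oldname, &delegated_inode);
          ...
          if (delegated_inode) {
                  /* wait for the lease holder, then redo the operation */
                  error = break_deleg_wait(&delegated_inode);
                  if (!error)
                          goto retry;
          }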

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-12-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:36 +01:00
Jeff Layton e8960c1b2e
vfs: make vfs_mknod break delegations on parent directory
In order to add directory delegation support, we need to break
delegations on the parent whenever there is going to be a change in the
directory.

Add a new delegated_inode pointer to vfs_mknod() and have the
appropriate callers wait when there is an outstanding delegation. All
other callers just set the pointer to NULL.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-11-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:36 +01:00
Jeff Layton c826229c6a
vfs: make vfs_create break delegations on parent directory
In order to add directory delegation support, we need to break
delegations on the parent whenever there is going to be a change in the
directory.

Add a delegated_inode parameter to vfs_create. Most callers are
converted to pass in NULL, but do_mknodat() is changed to wait for a
delegation break if there is one.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-10-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:36 +01:00
Jeff Layton 85bbffcad7
vfs: clean up argument list for vfs_create()
As Neil points out:

"I would be in favour of dropping the "dir" arg because it is always
d_inode(dentry->d_parent) which is stable."

...and...

"Also *every* caller of vfs_create() passes ".excl = true".  So maybe we
don't need that arg at all."

Drop both arguments from vfs_create() and fix up the callers.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-9-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:36 +01:00
Jeff Layton 134796f43a
vfs: break parent dir delegations in open(..., O_CREAT) codepath
In order to add directory delegation support, we need to break
delegations on the parent whenever there is going to be a change in the
directory.

Add a delegated_inode parameter to lookup_open and have it break the
delegation. Then, open_last_lookups can wait for the delegation break
and retry the call to lookup_open once it's done.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-8-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:35 +01:00
Jeff Layton 4fa76319cd
vfs: allow rmdir to wait for delegation break on parent
In order to add directory delegation support, we need to break
delegations on the parent whenever there is going to be a change in the
directory.

Add a delegated_inode struct to vfs_rmdir() and populate that
pointer with the parent inode if it's non-NULL. Most existing in-kernel
callers pass in a NULL pointer.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-7-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:35 +01:00
Jeff Layton e12d203b8c
vfs: allow mkdir to wait for delegation break on parent
In order to add directory delegation support, we need to break
delegations on the parent whenever there is going to be a change in the
directory.

Add a new delegated_inode parameter to vfs_mkdir. All of the existing
callers set that to NULL for now, except for do_mkdirat which will
properly block until the lease is gone.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-6-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:35 +01:00
Jeff Layton b46ebf9a76
vfs: add try_break_deleg calls for parents to vfs_{link,rename,unlink}
In order to add directory delegation support, we need to break
delegations on the parent whenever there is going to be a change in the
directory.

vfs_link, vfs_unlink, and vfs_rename all have existing delegation break
handling for the children in the rename. Add the necessary calls for
breaking delegations in the parent(s) as well.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-5-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:35 +01:00
Jeff Layton e6d28ebc17
filelock: push the S_ISREG check down to ->setlease handlers
When nfsd starts requesting directory delegations, setlease handlers may
see requests for leases on directories. Push the !S_ISREG check down
into the non-trivial setlease handlers, so we can selectively enable
them where they're supported.

FUSE is special: It's the only filesystem that supports atomic_open and
allows kernel-internal leases. atomic_open is issued when the VFS
doesn't know the state of the dentry being opened. If the file doesn't
exist, it may be created, in which case the dir lease should be broken.

The existing kernel-internal lease implementation has no provision for
this. Ensure that we don't allow directory leases by default going
forward by explicitly disabling them there.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-4-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:35 +01:00
Jeff Layton 6976ed2dd0
filelock: add struct delegated_inode
The current API requires a pointer to an inode pointer. It's easy for
callers to get this wrong. Add a new delegated_inode structure and use
that to pass back any inode that needs to be waited on.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-3-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:34 +01:00
Jeff Layton 4be9f3cc58
filelock: rework the __break_lease API to use flags
Currently __break_lease takes both a type and an openmode. With the
addition of directory leases, that makes less sense. Declare a set of
LEASE_BREAK_* flags that can be used to control how lease breaks work
instead of requiring a type and an openmode.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-2-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:34 +01:00
Jeff Layton 6fc5f2b19e
filelock: make lease_alloc() take a flags argument
__break_lease() currently overrides the flc_flags field in the lease
after allocating it. A forthcoming patch will add the ability to request
a FL_DELEG type lease.

Instead of overriding the flags field, add a flags argument to
lease_alloc() and lease_init() so it's set correctly after allocating.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-1-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12 09:38:31 +01:00
Shrikanth Hegde 65177ea9f6 sched/deadline: Minor cleanup in select_task_rq_dl()
In select_task_rq_dl() there is only one goto statement, so there is no
need for it.

No functional changes.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20251014100342.978936-2-sshegde@linux.ibm.com
2025-11-11 17:27:55 +01:00
Shrikanth Hegde b4bfacd392 sched/deadline: Use cpumask_weight_and() in dl_bw_cpus
cpumask_subset(a,b) followed by cpumask_weight(a) gives the same result
as cpumask_weight_and(a,b), and a for_each_cpu_and(a,b) loop used only
to count CPUs can likewise be replaced by cpumask_weight_and(a,b).

No functional change. It could save a few cycles since cpumask_weight_and()
is more efficient. Plus one less stack variable.
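
A sketch of the transformation (the surrounding details of dl_bw_cpus()
are assumed):

  /* before */
  int i, cpus = 0;
  if (cpumask_subset(rd->span, cpu_active_mask))
          return cpumask_weight(rd->span);
  for_each_cpu_and(i, rd->span, cpu_active_mask)
          cpus++;
  return cpus;

  /* after */
  return cpumask_weight_and(rd->span, cpu_active_mask);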

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20251014100342.978936-3-sshegde@linux.ibm.com
2025-11-11 17:27:55 +01:00
Peter Zijlstra 2614069c59 sched/deadline: Document dl_server
Place the notes that resulted from going through the dl_server code in a
comment.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-11-11 17:27:50 +01:00
Peter Zijlstra f5a538c07d sched/deadline: Fix dl_server stop condition
Gabriel reported that the dl_server doesn't stop as expected.

The problem was found to be the fact that idle time and fair runtime are
treated equally. Both will count towards dl_server runtime and push the
activation forwards when it is in the zero-laxity wait state.

Notably:

  dl_server_update_idle()
    update_curr_dl_se()
      if (dl_defer && dl_throttled && dl_runtime_exceeded())
        hrtimer_try_to_cancel(); // stop timer
	replenish_dl_new_period()
	  deadline = now + dl_deadline; // fwd period
	  runtime = dl_runtime;
        start_dl_timer(); // restart timer

And while we do want idle time accounted towards the *current* activation of
the dl_server -- after all, a fair task could've run if we had any -- we don't
necessarily want idle time to cause or push forward an activation.

Introduce dl_defer_idle to make this distinction. It will be set once idle
time has pushed the activation forward; once set, idle time will only be
allowed to consume runtime, not push the activation. This will then cause
dl_server_timer() to fire, which will stop the dl_server.

Any non-idle time accounting during this phase will clear dl_defer_idle, so
only a full period of idle will cause the dl_server to stop.

Reported-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251101000057.GA2184199@noisy.programming.kicks-ass.net
2025-11-11 12:33:39 +01:00
Peter Zijlstra e636ffb9e3 sched/deadline: Fix dl_server time accounting
The dl_server time accounting code is a little odd. The normal scheduler
pattern is to update curr before doing something, such that the old state is
fully accounted before changing state.

Notably, the dl_server_timer() needs to propagate the current time accounting
since the current task could be run by the dl_server and thus this can affect
dl_se->runtime. Similarly for dl_server_start().

And since the (deferred) dl_server wants idle time accounted, rework
sched_idle_class time accounting to be more like all the others.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251020141130.GJ3245006@noisy.programming.kicks-ass.net
2025-11-11 12:33:38 +01:00
Hao Jia e40cea333e sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()
Since commit d4c64207b8 ("sched: Cleanup the sched_change NOCLOCK usage"),
update_rq_clock() is called in do_set_cpus_allowed() -> sched_change_begin()
to update the rq clock. This results in a duplicate call update_rq_clock()
in __set_cpus_allowed_ptr_locked().

While holding the rq lock and before calling do_set_cpus_allowed(),
there is nothing that depends on an updated rq_clock.

Therefore, remove the redundant update_rq_clock() in
__set_cpus_allowed_ptr_locked() to avoid the warning about double
rq clock updates.

Fixes: d4c64207b8 ("sched: Cleanup the sched_change NOCLOCK usage")
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20251029093655.31252-1-jiahao.kernel@gmail.com
2025-11-11 12:33:38 +01:00
Peter Zijlstra 79f3f9bedd sched/eevdf: Fix min_vruntime vs avg_vruntime
Basically, from the constraint that the sum of lag is zero, you can
infer that the 0-lag point is the weighted average of the individual
vruntime, which is what we're trying to compute:

        \Sum w_i * v_i
  avg = --------------
           \Sum w_i

Now, since vruntime takes the whole u64 (worse, it wraps), this
multiplication term in the numerator is not something we can compute;
instead we do the min_vruntime (v0 henceforth) thing like:

  v_i = (v_i - v0) + v0

This does two things:
 - it keeps the key: (v_i - v0) 'small';
 - it creates a relative 0-point in the modular space.

If you do that substitution and work it all out, you end up with:

        \Sum w_i * (v_i - v0)
  avg = --------------------- + v0
              \Sum w_i

Since you cannot very well track a ratio like that (and not suffer
terrible numerical problems) we simply track the numerator and
denominator individually and only perform the division when strictly
needed.

Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator
lives in cfs_rq->avg_load.

The one extra 'funny' is that these numbers track the entities in the
tree, and current is typically outside of the tree, so avg_vruntime()
adds current when needed before doing the division.

(vruntime_eligible() elides the division by cross-wise multiplication)

Anyway, as mentioned above, we currently use the CFS era min_vruntime
for this purpose. However, this thing can only move forward, while the
above avg can in fact move backward (when a non-eligible task leaves,
the average becomes smaller); this can cause trouble when through
happenstance (or construction) these values drift far enough apart to
wreck the game.

Replace cfs_rq::min_vruntime with cfs_rq::zero_vruntime which is kept
near/at avg_vruntime, following its motion.

The down-side is that this requires computing the avg more often.

Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251106111741.GC4068168@noisy.programming.kicks-ass.net
Cc: stable@vger.kernel.org
2025-11-11 12:33:38 +01:00
Peter Zijlstra 9359d9785d sched/core: Add comment explaining force-idle vruntime snapshots
I always end up having to re-read these emails every time I look at
this code. And a future patch is going to change this story a little.
This means it is past time to stick them in a comment so it can be
modified and stay current.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200506143506.GH5298@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20200515103844.GG2978@hirez.programming.kicks-ass.net
Link: https://patch.msgid.link/20251106111603.GB4068168@noisy.programming.kicks-ass.net
2025-11-11 12:33:37 +01:00
Fernand Sieber 7f829bde94 sched/core: Optimize core cookie matching check
Early return true if the core cookie matches. This avoids the SMT mask
loop to check for an idle core, which might be more expensive on wide
platforms.

Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://patch.msgid.link/20251105152538.470586-1-sieberf@amazon.com
2025-11-11 12:33:37 +01:00
Fernand Sieber 127b90315c sched/proxy: Yield the donor task
When executing a task in proxy context, handle yields as if they were
requested by the donor task. This matches the traditional PI semantics
of yield() as well.

This avoids scenarios like the proxy task yielding, pick-next-task
selecting the same previously blocked donor, running the proxy task
again, etc.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202510211205.1e0f5223-lkp@intel.com
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251106104022.195157-1-sieberf@amazon.com
2025-11-11 12:33:36 +01:00
Borislav Petkov (AMD) 249092174c tools/objtool: Copy the __cleanup unused variable fix for older clang
Copy from

  54da6a0924 ("locking: Introduce __cleanup() based infrastructure")

the bits which mark the variable with a cleanup attribute unused so that my
clang 15 can dispose of it properly instead of warning that it is unused,
which then fails the build due to -Werror.

Suggested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20251031114919.GBaQSiPxZrziOs3RCW@fat_crate.local
2025-11-10 12:46:08 +01:00
Arnd Bergmann 780813d701 x86/math-emu: Fix div_Xsig() prototype
The third argument of div_Xsig() is the output of the division, but is marked
'const', which means the compiler is not expecting it to be updated and may
generate bad code around the call. clang-21 now warns about the pattern since
an uninitialized variable is passed into two 'const' arguments by reference:

  arch/x86/math-emu/poly_atan.c:93:28: error: variable 'argSignif' is uninitialized \
  when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer]
     93 |         div_Xsig(&Numer, &Denom, &argSignif);
        |                                   ^~~~~~~~~
  arch/x86/math-emu/poly_l2.c:195:29: error: variable 'argSignif' is uninitialized \
  when passed as a const pointer argument here [-Werror,-Wuninitialized-const-pointer]
    195 |                 div_Xsig(&Numer, &Denom, &argSignif);
        |                                           ^~~~~~~~~

The implementation is in assembly, so the problem has gone unnoticed since the
code was added in the linux-1.1 days. Remove the 'const' marker here.
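
The fix, illustrated on the prototype (the qualifiers on the other two
arguments are assumed):

  /* before: the output argument is wrongly marked const */
  asmlinkage void div_Xsig(Xsig *x1, const Xsig *x2, const Xsig *dest);

  /* after: the asm implementation writes through the third argument */
  asmlinkage void div_Xsig(Xsig *x1, const Xsig *x2, Xsig *dest);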

Fixes: e19a1bdb835c ("Import 1.1.38")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20250807205334.123231-1-arnd@kernel.org
2025-11-09 21:01:08 +01:00
Julian Stecklina ed4f9638d9 x86/apic: Fix frequency in apic=verbose log output
When apic=verbose is specified, the LAPIC timer calibration prints its results
to the console. At least while debugging virtualization code, the CPU and bus
frequencies are printed incorrectly.

Specifically, for a 1.7 GHz CPU with 1 GHz bus frequency and HZ=1000,
the log includes a superfluous 0 after the period:

  ..... calibration result: 999978
  ..... CPU clock speed is 1696.0783 MHz.
  ..... host bus clock speed is 999.0978 MHz.

Looking at the code, this only worked as intended for HZ=100. After the fix,
the correct frequency is printed:

  ..... calibration result: 999828
  ..... CPU clock speed is 1696.507 MHz.
  ..... host bus clock speed is 999.828 MHz.

There is no functional change to the LAPIC calibration here, beyond the
printing format changes.

  [ bp: - Massage commit message
        - Figures it should apply this patch about ~4 years later
        - Massage it into the current code ]

Suggested-by: Markus Napierkowski <markus.napierkowski@cyberus-technology.de>
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20211030142148.143261-1-js@alien8.de
2025-11-07 17:48:14 +01:00
Peter Zijlstra 2093d8cf80 perf/x86/intel: Optimize PEBS extended config
Similar to enable_acr_event, avoid the branch.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-11-07 15:08:23 +01:00
Peter Zijlstra 02da693f66 perf/x86/intel: Check PEBS dyn_constraints
Handle the interaction between ("perf/x86/intel: Update dyn_constraint
base on PEBS event precise level") and ("perf/x86/intel: Add a check
for dynamic constraints").

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-11-07 15:08:23 +01:00
Kan Liang bd24f9beed perf/x86/intel: Add a check for dynamic constraints
The current event scheduler has a limit: if the counter constraint of an
event is not a subset of any other counter constraint with an equal or
higher weight, the counters may not be fully utilized.

To workaround it, the commit bc1738f6ee ("perf, x86: Fix event
scheduler for constraints with overlapping counters") introduced an
overlap flag, which is hardcoded to the event constraint that may
trigger the limit. It only works for static constraints.

Many features on and after Intel PMON v6 require dynamic constraints. An
event constraint is decided by both static and dynamic constraints at
runtime. See commit 4dfe3232cc ("perf/x86: Add dynamic constraint").
The dynamic constraints are from CPUID enumeration. It's impossible to
hardcode it in advance. It's not practical to set the overlap flag to all
events. It's harmful to the scheduler.

For the existing Intel platforms, the dynamic constraints don't trigger
the limit. A real fix is not required.

However, for virtualization, a VMM may give a weird CPUID enumeration to
a guest. It's impossible to anticipate what the weird enumeration is. A
check is introduced, which can list the possible breakage if a weird
enumeration is used.

Check the dynamic constraints enumerated for normal, branch counters
logging, and auto-counter reload.
Check both PEBS and non-PEBS constraints.

Closes: https://lore.kernel.org/lkml/20250416195610.GC38216@noisy.programming.kicks-ass.net/
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20250512175542.2000708-1-kan.liang@linux.intel.com
2025-11-07 15:08:23 +01:00
Dapeng Mi bb5f13df3c perf/x86/intel: Add counter group support for arch-PEBS
Based on the previous adaptive PEBS counter snapshot support, add counter
group support for architectural PEBS. Since arch-PEBS shares the same
counter group layout with adaptive PEBS, directly reuse
__setup_pebs_counter_group() helper to process arch-PEBS counter group.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-13-dapeng1.mi@linux.intel.com
2025-11-07 15:08:22 +01:00
Dapeng Mi 52448a0a73 perf/x86/intel: Setup PEBS data configuration and enable legacy groups
Different from legacy PEBS, arch-PEBS provides per-counter PEBS data
configuration by programming the IA32_PMC_GPx/FXx_CFG_C MSRs.

This patch obtains the PEBS data configuration from the event attribute,
then writes the PEBS data configuration to the IA32_PMC_GPx/FXx_CFG_C
MSRs and enables the corresponding PEBS groups.

Please note this patch only enables XMM SIMD regs sampling for
arch-PEBS; sampling of the other SIMD regs (OPMASK/YMM/ZMM) on arch-PEBS
will be supported after PMI-based SIMD regs (OPMASK/YMM/ZMM) sampling is
supported.

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-12-dapeng1.mi@linux.intel.com
2025-11-07 15:08:22 +01:00
Dapeng Mi e89c5d1f29 perf/x86/intel: Update dyn_constraint base on PEBS event precise level
arch-PEBS provides CPUIDs to enumerate which counters support PEBS
sampling and precise distribution PEBS sampling. Thus PEBS constraints
should be dynamically configured based on these counter and precise
distribution bitmaps instead of being defined statically.

Update the event dyn_constraint based on the PEBS event precise level.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-11-dapeng1.mi@linux.intel.com
2025-11-07 15:08:22 +01:00
Dapeng Mi 2721e8da2d perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
Arch-PEBS introduces a new MSR, IA32_PEBS_BASE, to store the arch-PEBS
buffer's physical address. This patch allocates the arch-PEBS buffer and
then initializes the IA32_PEBS_BASE MSR with the buffer's physical
address.

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-10-dapeng1.mi@linux.intel.com
2025-11-07 15:08:22 +01:00
Dapeng Mi d21954c8a0 perf/x86/intel: Process arch-PEBS records or record fragments
A significant difference from adaptive PEBS is that an arch-PEBS record
supports fragments, which means an arch-PEBS record could be split into
several independent fragments, each of which has its own arch-PEBS
header.

This patch defines the architectural PEBS record layout structures and
adds helpers to process arch-PEBS records or fragments. Only legacy PEBS
groups like the basic, GPR, XMM and LBR groups are supported in this
patch; capturing the newly added YMM/ZMM/OPMASK vector registers will be
supported in the future.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-9-dapeng1.mi@linux.intel.com
2025-11-07 15:08:21 +01:00
Dapeng Mi 167cde7dc9 perf/x86/intel/ds: Factor out PEBS group processing code to functions
Adaptive PEBS and arch-PEBS share lots of the same code to process these
PEBS groups, like the basic, GPR and meminfo groups. Extract this shared
code into generic functions to avoid duplicated code.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-8-dapeng1.mi@linux.intel.com
2025-11-07 15:08:21 +01:00
Dapeng Mi 8807d92270 perf/x86/intel/ds: Factor out PEBS record processing code to functions
Besides some PEBS record layout differences, arch-PEBS can share most of
the PEBS record processing code with adaptive PEBS. Thus, factor out this
common processing code into independent inline functions, so it can be
reused by the subsequent arch-PEBS handler.

Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-7-dapeng1.mi@linux.intel.com
2025-11-07 15:08:21 +01:00
Dapeng Mi d243d0bb64 perf/x86/intel: Initialize architectural PEBS
arch-PEBS leverages the CPUID.23H.4/5 sub-leaves to enumerate arch-PEBS
supported capabilities and counters bitmap. This patch parses these 2
sub-leaves and initializes arch-PEBS capabilities and corresponding
structures.

Since the IA32_PEBS_ENABLE and MSR_PEBS_DATA_CFG MSRs no longer exist
for arch-PEBS, arch-PEBS doesn't need to manipulate these MSRs. Thus add
a simple pair of __intel_pmu_pebs_enable/disable() callbacks for
arch-PEBS.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-6-dapeng1.mi@linux.intel.com
2025-11-07 15:08:20 +01:00
Dapeng Mi 5e4e355ae7 perf/x86/intel: Correct large PEBS flag check
The current large PEBS flag check only checks whether sample_regs_user
contains unsupported GPRs, but doesn't check whether sample_regs_intr
contains unsupported GPRs.

Currently, PEBS hardware supports sampling all perf-supported GPRs, so
the missing check doesn't cause a real issue. But that won't be true any
more once subsequent patches add support for sampling the SSP register:
SSP sampling is not supported by adaptive PEBS hardware and will only be
supported starting with arch-PEBS hardware. So correct this issue.
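
A sketch of the corrected check; the unsupported-register mask is a
placeholder here, not the real constant:

	#include <linux/perf_event.h>

	#define PEBS_UNSUPPORTED_REGS	(~0ULL << 32)	/* placeholder mask */

	static bool large_pebs_regs_ok(struct perf_event *event)
	{
		if (event->attr.sample_regs_user & PEBS_UNSUPPORTED_REGS)
			return false;

		/* previously missing: validate the intr regs set too */
		if (event->attr.sample_regs_intr & PEBS_UNSUPPORTED_REGS)
			return false;

		return true;
	}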

Fixes: a47ba4d77e ("perf/x86: Enable free running PEBS for REGS_USER/INTR")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-5-dapeng1.mi@linux.intel.com
2025-11-07 15:08:20 +01:00
Dapeng Mi ee98b8bfc7 perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call
Use the x86_pmu_drain_pebs static call instead of calling the
x86_pmu.drain_pebs function pointer.
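
For illustration, a sketch of the static-call pattern involved (setup
plus call site; the surrounding function names are assumed):

	#include <linux/static_call.h>

	DEFINE_STATIC_CALL_NULL(x86_pmu_drain_pebs, *x86_pmu.drain_pebs);

	static void __init wire_up_drain_pebs(void)
	{
		/* bind once, at PMU init time */
		static_call_update(x86_pmu_drain_pebs, x86_pmu.drain_pebs);
	}

	static void handle_pebs_pmi(struct pt_regs *regs,
				    struct perf_sample_data *data)
	{
		/* patched direct call instead of an indirect one */
		static_call(x86_pmu_drain_pebs)(regs, data);
	}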

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-4-dapeng1.mi@linux.intel.com
2025-11-07 15:08:20 +01:00
Dapeng Mi 7e772a93eb perf/x86: Fix NULL event access and potential PEBS record loss
When intel_pmu_drain_pebs_icl() is called to drain PEBS records,
perf_event_overflow() may be called to process the last PEBS record.

However, perf_event_overflow() can trigger interrupt throttling and stop
all events of the group, as the call chain below shows.

perf_event_overflow()
  -> __perf_event_overflow()
    ->__perf_event_account_interrupt()
      -> perf_event_throttle_group()
        -> perf_event_throttle()
          -> event->pmu->stop()
            -> x86_pmu_stop()

The side effect of stopping the events is that all corresponding event
pointers in cpuc->events[] array are cleared to NULL.

Assume there are two PEBS events (event a and event b) in a group. When
intel_pmu_drain_pebs_icl() calls perf_event_overflow() to process the
last PEBS record of PEBS event a, interrupt throttle is triggered and
all pointers of event a and event b are cleared to NULL. Then
intel_pmu_drain_pebs_icl() tries to process the last PEBS record of
event b and encounters NULL pointer access.

To avoid this issue, move the clearing of cpuc->events[] from
x86_pmu_stop() to x86_pmu_del(). This is safe since cpuc->active_mask or
cpuc->pebs_enabled is always checked before accessing the event pointer
in cpuc->events[].
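
A sketch of the resulting shape (bodies elided; not the literal patch):

	static void x86_pmu_stop(struct perf_event *event, int flags)
	{
		/*
		 * Stop the counter but keep cpuc->events[hwc->idx] populated,
		 * so an in-flight PEBS drain can still find the event.
		 */
	}

	static void x86_pmu_del(struct perf_event *event, int flags)
	{
		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

		/* safe here: the event is fully removed from the PMU */
		cpuc->events[event->hw.idx] = NULL;
	}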

Closes: https://lore.kernel.org/oe-lkp/202507042103.a15d2923-lkp@intel.com
Fixes: 9734e25fbf ("perf: Fix the throttle logic for a group")
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-3-dapeng1.mi@linux.intel.com
2025-11-07 15:08:19 +01:00
Dapeng Mi c7f69dc073 perf/x86: Remove redundant is_x86_event() prototype
Two is_x86_event() prototypes are defined in perf_event.h. Remove the
redundant one.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251029102136.61364-2-dapeng1.mi@linux.intel.com
2025-11-07 15:08:19 +01:00
Peter Zijlstra cf76553aaa entry,unwind/deferred: Fix unwind_reset_info() placement
Stephen reported that on KASAN builds he's seeing:

vmlinux.o: warning: objtool: user_exc_vmm_communication+0x15a: call to __kasan_check_read() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_debug_user+0x182: call to __kasan_check_read() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_int3+0x123: call to __kasan_check_read() leaves .noinstr.text section
vmlinux.o: warning: objtool: noist_exc_machine_check+0x17a: call to __kasan_check_read() leaves .noinstr.text section
vmlinux.o: warning: objtool: fred_exc_machine_check+0x17e: call to __kasan_check_read() leaves .noinstr.text section

This turns out to be caused by atomic ops in unwind_reset_info() that
have explicit instrumentation. Place unwind_reset_info() inside the
preceding instrumentation_begin() section.
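
The shape of the fix, sketched (the surrounding function is simplified
and exit_work() is a stand-in for the existing work there):

	#include <linux/instrumentation.h>

	static __always_inline void exit_to_user_mode(void)
	{
		instrumentation_begin();
		exit_work();		/* stand-in for the existing work */
		unwind_reset_info();	/* uses instrumented atomics */
		instrumentation_end();

		/* strictly noinstr code from here on */
	}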

Fixes: c6439bfaab ("Merge tag 'trace-deferred-unwind-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251105100014.GY4068168@noisy.programming.kicks-ass.net
2025-11-05 13:57:32 +01:00
Josh Poimboeuf 6568f14cb5 vmlinux.lds: Exclude .text.startup and .text.exit from TEXT_MAIN
An ftrace warning was reported in ftrace_init_ool_stub():

   WARNING: arch/powerpc/kernel/trace/ftrace.c:234 at ftrace_init_ool_stub+0x188/0x3f4, CPU#0: swapper/0

The problem is that the linker script is placing .text.startup in .text
rather than in .init.text, due to an inadvertent match of the TEXT_MAIN
'.text.[0-9a-zA-Z_]*' pattern.

This bug existed for some configurations before, but is only now coming
to light due to the TEXT_MAIN macro unification in commit 1ba9f89794
("vmlinux.lds: Unify TEXT_MAIN, DATA_MAIN, and related macros").

The .text.startup section consists of constructors which are used by
KASAN, KCSAN, and GCOV.  The constructors are only called during boot,
so .text.startup is supposed to match the INIT_TEXT pattern so it can be
placed in .init.text and freed after init.  But since INIT_TEXT comes
*after* TEXT_MAIN in the linker script, TEXT_MAIN needs to manually
exclude .text.startup.

Update TEXT_MAIN to exclude .text.startup (and its .text.startup.*
variant from -ffunction-sections), along with .text.exit and
.text.exit.* which should match EXIT_TEXT.

Specifically, use a series of more specific glob patterns to match
generic .text.* sections (for -ffunction-sections) while explicitly
excluding .text.startup[.*] and .text.exit[.*].

Also update INIT_TEXT and EXIT_TEXT to explicitly match their
-ffunction-sections variants (.text.startup.* and .text.exit.*).

Fixes: 1ba9f89794 ("vmlinux.lds: Unify TEXT_MAIN, DATA_MAIN, and related macros")
Closes: https://lore.kernel.org/72469502-ca37-4287-90b9-a751cecc498c@linux.ibm.com
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Debugged-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Link: https://patch.msgid.link/07f74b4e5c43872572b7def30f2eac45f28675d9.1761872421.git.jpoimboe@kernel.org
2025-10-31 11:19:21 +01:00
Chen Ni 5eccd32239 objtool: Remove unneeded semicolon
Remove unnecessary semicolons reported by Coccinelle/coccicheck and the
semantic patch at scripts/coccinelle/misc/semicolon.cocci.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://patch.msgid.link/20251020020916.1070369-1-nichen@iscas.ac.cn
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-30 08:29:46 -07:00
Thorsten Blum 0ccf30fc64 x86/smpboot: Mark native_play_dead() as __noreturn
native_play_dead() ends by calling the non-returning function
hlt_play_dead() and therefore also never returns.

The !CONFIG_HOTPLUG_CPU stub version of native_play_dead()
unconditionally calls BUG() and does not return either.

Add the __noreturn attribute to both function definitions and their
declaration to document this behavior and to potentially improve
compiler optimizations.

Remove the obsolete comment, and add native_play_dead() to objtool's
list of __noreturn functions.
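
As a brief, hedged illustration of the annotation (bodies elided; not
the literal smpboot.c code):

	#ifdef CONFIG_HOTPLUG_CPU
	void __noreturn native_play_dead(void)
	{
		/* ... */
		hlt_play_dead();	/* never returns */
	}
	#else
	static inline void __noreturn native_play_dead(void)
	{
		BUG();
	}
	#endif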

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20251027155107.183136-1-thorsten.blum@linux.dev
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-30 08:29:41 -07:00
Peter Zijlstra aa7387e79a unwind_user/x86: Fix arch=um build
Add CONFIG_HAVE_UNWIND_USER_FP guards to make sure this code
doesn't break arch=um builds.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Closes: https://lore.kernel.org/oe-kbuild-all/202510291919.FFGyU7nq-lkp@intel.com/
2025-10-30 09:43:14 +01:00
Tengda Wu ced37e9cea x86/dumpstack: Prevent KASAN false positive warnings in __show_regs()
When triggering a stack dump via sysrq (echo t > /proc/sysrq-trigger),
KASAN may report false-positive out-of-bounds access:

  BUG: KASAN: out-of-bounds in __show_regs+0x4b/0x340
  Call Trace:
    dump_stack_lvl
    print_address_description.constprop.0
    print_report
    __show_regs
    show_trace_log_lvl
    sched_show_task
    show_state_filter
    sysrq_handle_showstate
    __handle_sysrq
    write_sysrq_trigger
    proc_reg_write
    vfs_write
    ksys_write
    do_syscall_64
    entry_SYSCALL_64_after_hwframe

The issue occurs as follows:

  Task A (walk other tasks' stacks)           Task B (running)
  1. echo t > /proc/sysrq-trigger
  show_trace_log_lvl
    regs = unwind_get_entry_regs()
    show_regs_if_on_stack(regs)
                                              2. The stack value pointed by
                                                 `regs` keeps changing, and
                                                 so are the tags in its
                                                 KASAN shadow region.
      __show_regs(regs)
        regs->ax, regs->bx, ...
          3. hit KASAN redzones, OOB

When task A walks task B's stack without suspending it, the continuous changes
in task B's stack (and corresponding KASAN shadow tags) may cause task A to
hit KASAN redzones when accessing obsolete values on the stack, resulting in
false positive reports.

Simply stopping the task before unwinding is not a viable fix, as it
would alter the very state we intend to inspect. This is especially true
when diagnosing misbehaving tasks (e.g., in a hard lockup), where
stopping might fail or hide the root cause by changing the call stack.

Therefore, fix this by disabling KASAN checks during asynchronous stack
unwinding, which is identified by the unwinding task not matching the
current task (task != current).
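
A sketch of that fix using the kasan_disable_current()/kasan_enable_current()
helpers (the wrapper function name here is hypothetical):

	#include <linux/kasan.h>
	#include <linux/sched.h>

	static void show_regs_async_safe(struct task_struct *task,
					 struct pt_regs *regs)
	{
		/* another task's live stack may hold stale values */
		if (task != current)
			kasan_disable_current();

		__show_regs(regs, SHOW_REGS_ALL, KERN_DEFAULT);

		if (task != current)
			kasan_enable_current();
	}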

  [ bp: Align arguments on function's opening brace. ]

Fixes: 3b3fa11bc7 ("x86/dumpstack: Print any pt_regs found on the stack")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://patch.msgid.link/all/20251023090632.269121-1-wutengda@huaweicloud.com
2025-10-29 13:07:21 +01:00
Peter Zijlstra c69993ecdd perf: Support deferred user unwind
Add support for deferred userspace unwind to perf.

Where perf currently relies on in-place stack unwinding; from NMI
context and all that. This moves the userspace part of the unwind to
right before the return-to-userspace.

This has two distinct benefits; the biggest is that it moves the unwind
to a faultable context. It becomes possible to fault in debug info
(.eh_frame, SFrame etc.) that might not otherwise be readily available.
And secondly, it de-duplicates the user callchain where multiple samples
happen during the same kernel entry.

To facilitate this the perf interface is extended with a new record
type:

  PERF_RECORD_CALLCHAIN_DEFERRED

and two new attribute flags:

  perf_event_attr::defer_callchain - to request the user unwind be deferred
  perf_event_attr::defer_output    - to request PERF_RECORD_CALLCHAIN_DEFERRED records

The existing PERF_RECORD_SAMPLE callchain section gets a new
context type:

  PERF_CONTEXT_USER_DEFERRED

After which will come a single entry, denoting the 'cookie' of the
deferred callchain that should be attached here, matching the 'cookie'
field of the above mentioned PERF_RECORD_CALLCHAIN_DEFERRED.

The 'defer_callchain' flag is expected on all events with
PERF_SAMPLE_CALLCHAIN. The 'defer_output' flag is expected on the event
responsible for collecting side-band events (like mmap, comm etc.).
Setting 'defer_output' on multiple events will get you duplicated
PERF_RECORD_CALLCHAIN_DEFERRED records.
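
A userspace sketch of wiring up the new flags (the field placement in
perf_event_attr is assumed from the description above):

	#include <linux/perf_event.h>
	#include <string.h>

	static void setup_attr(struct perf_event_attr *attr, int sideband)
	{
		memset(attr, 0, sizeof(*attr));
		attr->size = sizeof(*attr);
		attr->sample_type = PERF_SAMPLE_CALLCHAIN;
		attr->defer_callchain = 1;		/* on all sampling events */
		attr->defer_output = sideband ? 1 : 0;	/* on exactly one event */
	}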

Based on earlier patches by Josh and Steven.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251023150002.GR4067720@noisy.programming.kicks-ass.net
2025-10-29 10:29:58 +01:00
Peter Zijlstra ae25884ad7 unwind_user/x86: Teach FP unwind about start of function
When userspace is interrupted at the start of a function, before we
get a chance to complete the frame, unwind will miss one caller.

X86 has a uprobe specific fixup for this, add bits to the generic
unwinder to support this.

Suggested-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251024145156.GM4068168@noisy.programming.kicks-ass.net
2025-10-29 10:29:58 +01:00
Josh Poimboeuf 49cf34c081 unwind_user/x86: Enable frame pointer unwinding on x86
Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20250827193828.347397433@kernel.org
2025-10-29 10:29:58 +01:00
Peter Zijlstra c79dd946e3 unwind: Implement compat fp unwind
It is important to be able to unwind compat tasks too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20250924080119.613695709@infradead.org
2025-10-29 10:29:57 +01:00
Peter Zijlstra 5578534e4b unwind: Simplify unwind_user_next_fp() alignment check
2^log_2(n) == n
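
In other words, for power-of-two sizes the alignment test reduces to a
simple mask check, e.g.:

	#include <linux/types.h>
	#include <linux/align.h>

	static bool fp_aligned(unsigned long fp)
	{
		/* 2^log2(n) == n, so testing against (n - 1) suffices */
		return IS_ALIGNED(fp, sizeof(unsigned long));
	}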

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080119.497867836@infradead.org
2025-10-29 10:29:57 +01:00
Peter Zijlstra 639214f65b unwind: Make unwind_task_info::unwind_mask consistent
The unwind_task_info::unwind_mask was manipulated using a mixture of:

  regular store
  WRITE_ONCE()
  try_cmpxchg()
  set_bit()
  atomic_long_*()

Clean up and make it consistently atomic_long_t.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20250924080119.384384486@infradead.org
2025-10-29 10:29:57 +01:00
Peter Zijlstra 42b9138f81 unwind: Simplify unwind_user_faultable()
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20250924080119.271671514@infradead.org
2025-10-29 10:29:56 +01:00
Peter Zijlstra 1e74829f36 unwind: Clarify calling context
The get_cookie() function hard relies on IRQs being disabled, but this
isn't immediately obvious when reading the function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080119.122507632@infradead.org
2025-10-29 10:29:56 +01:00
Peter Zijlstra a38a64712e unwind: Fix unwind_deferred_request() vs NMI
task_work_add(TWA_RESUME) isn't NMI-safe; use TWA_NMI_CURRENT when
used from NMI context.
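
Sketched, the NMI-aware call looks like this (queue_unwind_work() is a
hypothetical wrapper):

	#include <linux/task_work.h>
	#include <linux/preempt.h>
	#include <linux/sched.h>

	static int queue_unwind_work(struct callback_head *work)
	{
		/* TWA_RESUME must not be used from NMI context */
		return task_work_add(current, work,
				     in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME);
	}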

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080119.005422353@infradead.org
2025-10-29 10:29:56 +01:00
Peter Zijlstra ae577ea0bc unwind: Add comment to unwind_deferred_task_exit()
Explain why unwind_deferred_task_exit() exists and what its constraints are.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080118.893367437@infradead.org
2025-10-29 10:29:55 +01:00
Peter Zijlstra 52a1ec718b unwind: Simplify unwind_reset_info()
Invert the condition of the first if and make it an early exit to
reduce the indent level for the rest of the function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080118.777916262@infradead.org
2025-10-29 10:29:55 +01:00
Peter Zijlstra b1164c7d11 unwind: Add required include files
To be self-sufficient, the file needs to include linux/types.h, which
provides things like u32/u64 and struct callback_head.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080118.665787071@infradead.org
2025-10-29 10:29:55 +01:00
Peter Zijlstra c31b9d2f58 unwind: Shorten lines
There are some exceptionally long lines that cause ugly wrapping.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080118.545274393@infradead.org
2025-10-29 10:29:54 +01:00
Peter Zijlstra ef1ea98c8f task_work: Fix NMI race condition
__schedule()
  // disable irqs
      <NMI>
	  task_work_add(current, work, TWA_NMI_CURRENT);
      </NMI>
  // current = next;
  // enable irqs
      <IRQ>
	  task_work_set_notify_irq()
	  test_and_set_tsk_thread_flag(current,
                                       TIF_NOTIFY_RESUME); // wrong task!
      </IRQ>
  // original task skips task work on its next return to user (or exit!)

Fixes: 466e4d801c ("task_work: Add TWA_NMI_CURRENT as an additional notify mode.")
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20250924080118.425949403@infradead.org
2025-10-29 10:29:54 +01:00
Zhang Rui 34976eaf5f perf/x86/intel/cstate: Add Pantherlake support
Like Lunarlake, Pantherlake supports CC1/CC6/CC7 and PC2/PC6/PC10.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20251023223754.1743928-4-zide.chen@intel.com
2025-10-29 10:29:54 +01:00
Zhang Rui 4ba45f041a perf/x86/intel/cstate: Remove PC3 support from LunarLake
LunarLake doesn't support Package C3. Remove the PC3 residency counter
support from LunarLake.

Fixes: 26579860fb ("perf/x86/intel/cstate: Add Lunarlake support")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20251023223754.1743928-3-zide.chen@intel.com
2025-10-29 10:29:54 +01:00
Zide Chen e39b82f6cb perf/x86/intel/cstate: Add Clearwater Forest support
Clearwater Forest is based on the Darkmont Atom microarchitecture.
From the perspective of C-state residency profiling, it supports the
same residency counters as Sierra Forest: CC1/CC6, PC2/PC6, and MC6.

Please note that the C1E residency counter can only be read via PMT,
not MSR. Therefore, tools relying on the perf_event framework cannot
access the C1E residency.

Signed-off-by: Zhenyu Wang <zhenyuw.linux@gmail.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20251023223754.1743928-2-zide.chen@intel.com
2025-10-29 10:29:53 +01:00
Peter Zijlstra 977b9a0054 Merge branch 'linus/master' into sched/core, to resolve conflict
Conflicts:
	kernel/sched/ext.c

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-10-29 08:42:28 +01:00
Peter Zijlstra af13e5e437 sched: Fix the do_set_cpus_allowed() locking fix
Commit abfc01077d ("sched: Fix do_set_cpus_allowed() locking")
overlooked that __balance_push_cpu_stop() calls select_fallback_rq()
with rq->lock held. This means that set_cpus_allowed_force() will
recursively take rq->lock and the machine locks up.

Run select_fallback_rq() earlier, without holding rq->lock. This opens
up a race window where a task could get migrated out from under us, but
that is harmless, we want the task migrated.

select_fallback_rq() itself will not be subject to concurrency as it
will be fully serialized by p->pi_lock, so there is no chance of
set_cpus_allowed_force() getting called with different arguments and
selecting different fallback CPUs for one task.

Fixes: abfc01077d ("sched: Fix do_set_cpus_allowed() locking")
Reported-by: Jan Polensky <japo@linux.ibm.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jan Polensky <japo@linux.ibm.com>
Closes: https://lore.kernel.org/oe-lkp/202510271206.24495a68-lkp@intel.com
Link: https://patch.msgid.link/20251027110133.GI3245006@noisy.programming.kicks-ass.net
2025-10-28 15:00:48 +01:00
Peter Zijlstra b94d45b6bb seqlock: Allow KASAN to fail optimizing
Some KASAN builds are failing to properly optimize this code --
luckily we don't care about code quality for KASAN builds, so just
exclude it.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Closes: https://lore.kernel.org/oe-kbuild-all/202510251641.idrNXhv5-lkp@intel.com/
2025-10-28 09:58:57 +01:00
Josh Poimboeuf f6af8690d1 perf build: Fix perf build issues with fixdep
Commit a808a2b35f ("tools build: Fix fixdep dependencies") broke the
perf build ("make -C tools/perf") by introducing two inadvertent
conflicts:

  1) tools/build/Makefile includes tools/build/Makefile.include, which
     defines a phony 'fixdep' target.  This conflicts with the $(FIXDEP)
     file target in tools/build/Makefile when OUTPUT is empty, causing
     make to report duplicate recipes for the same target.

  2) The FIXDEP variable in tools/build/Makefile conflicts with the
     previously existing one in tools/perf/Makefile.perf.

Remove the unnecessary include of tools/build/Makefile.include from
tools/build/Makefile, and rename the FIXDEP variable in
tools/perf/Makefile.perf to FIXDEP_BUILT.

Fixes: a808a2b35f ("tools build: Fix fixdep dependencies")
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Link: https://patch.msgid.link/8881bc3321bd9fa58802e4f36286eefe3667806b.1760992391.git.jpoimboe@kernel.org
2025-10-23 09:53:49 +02:00
Josh Poimboeuf 9025688bf6 module: Fix device table module aliases
Commit 6717e8f91d ("kbuild: Remove 'kmod_' prefix from
__KBUILD_MODNAME") inadvertently broke module alias generation for
modules which rely on MODULE_DEVICE_TABLE().

It removed the "kmod_" prefix from __KBUILD_MODNAME, which caused
MODULE_DEVICE_TABLE() to generate a symbol name which no longer matched
the format expected by handle_moddevtable() in scripts/mod/file2alias.c.

As a result, modpost failed to find the device tables, leading to
missing module aliases.

Fix this by explicitly adding the "kmod_" string within the
MODULE_DEVICE_TABLE() macro itself, restoring the symbol name to the
format expected by file2alias.c.

Fixes: 6717e8f91d ("kbuild: Remove 'kmod_' prefix from __KBUILD_MODNAME")
Reported-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Mark Brown <broonie@kernel.org>
Reported-by: Cosmin Tanislav <demonsingur@gmail.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Cosmin Tanislav <demonsingur@gmail.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Mark Brown <broonie@kernel.org>
Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Tested-by: Chen-Yu Tsai <wenst@chromium.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Link: https://patch.msgid.link/e52ee3edf32874da645a9e037a7d77c69893a22a.1760982784.git.jpoimboe@kernel.org
2025-10-22 15:21:55 +02:00
Boqun Feng 37d0472c8a rust: debugfs: Implement Reader for Mutex<T> only when T is Unpin
Since we are going to make `Mutex<T>` structurally pin the data (i.e.
`T`), the `.lock()` function will only return a `Guard` that can
dereference into a mutable reference to `T` when `T` is `Unpin`, so
restrict the `Reader` impl block of `Mutex<T>` accordingly.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patch.msgid.link/20251022034237.70431-1-boqun.feng@gmail.com
2025-10-22 15:21:51 +02:00
Borislav Petkov (AMD) da247eff96 objtool/klp: Add the debian-based package name of xxhash to the hint
Add the debian package name for the devel version of the xxHash package
"libxxhash-dev".

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251017194732.7713-1-bp@kernel.org
2025-10-22 13:51:11 +02:00
Oleg Nesterov 795aab353d seqlock: Change do_io_accounting() to use scoped_seqlock_read()
To simplify the code and make it more readable.

[peterz: change to new interface]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-21 12:31:57 +02:00
Oleg Nesterov b76f72bea2 seqlock: Change do_task_stat() to use scoped_seqlock_read()
To simplify the code and make it more readable.

[peterz: change to new interface]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-21 12:31:57 +02:00
Oleg Nesterov 488f48b326 seqlock: Change thread_group_cputime() to use scoped_seqlock_read()
To simplify the code and make it more readable.

While at it, change thread_group_cputime() to use __for_each_thread(sig).

[peterz: update to new interface]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-21 12:31:57 +02:00
Peter Zijlstra cc39f3872c seqlock: Introduce scoped_seqlock_read()
The read_seqbegin/need_seqretry/done_seqretry API is cumbersome and
error prone. With the new helper the "typical" code like

	int seq, nextseq;
	unsigned long flags;

	nextseq = 0;
	do {
		seq = nextseq;
		flags = read_seqbegin_or_lock_irqsave(&seqlock, &seq);

		// read-side critical section

		nextseq = 1;
	} while (need_seqretry(&seqlock, seq));
	done_seqretry_irqrestore(&seqlock, seq, flags);

can be rewritten as

	scoped_seqlock_read (&seqlock, ss_lock_irqsave) {
		// read-side critical section
	}

Original idea by Oleg Nesterov; with contributions from Linus.

Originally-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-21 12:31:57 +02:00
Oleg Nesterov 28a0ee3119 documentation: seqlock: fix the wrong documentation of read_seqbegin_or_lock/need_seqretry
The comments and pseudo code in Documentation/locking/seqlock.rst are wrong:

	int seq = 0;
	do {
		read_seqbegin_or_lock(&foo_seqlock, &seq);

		/* ... [[read-side critical section]] ... */

	} while (need_seqretry(&foo_seqlock, seq));

read_seqbegin_or_lock() always returns with an even "seq" and need_seqretry()
doesn't change this counter. This means that seq is always even and thus the
locking pass is simply impossible.

IOW, "_or_lock" has no effect and this code doesn't differ from

	do {
		seq = read_seqbegin(&foo_seqlock);

		/* ... [[read-side critical section]] ... */

	} while (read_seqretry(&foo_seqlock, seq));

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-21 12:31:56 +02:00
Arnd Bergmann 44472d1b83 atomic: Skip alignment check for try_cmpxchg() old arg
The 'old' argument in atomic_try_cmpxchg() and related functions is a
pointer to a normal non-atomic integer, which is not required to be
naturally aligned, unlike the atomic_t/atomic64_t types themselves.

In order to add an alignment check with CONFIG_DEBUG_ATOMIC into the
normal instrument_atomic_read_write() helper, change this check to use
the non-atomic instrument_read_write(), the same way that was done
earlier for try_cmpxchg() in commit ec570320b0 ("locking/atomic:
Correct (cmp)xchg() instrumentation").

This prevents warnings on m68k calling the 32-bit atomic_try_cmpxchg()
with 16-bit aligned arguments as well as several more architectures
including x86-32 when calling atomic64_try_cmpxchg() with 32-bit
aligned u64 arguments.

Reported-by: Finn Thain <fthain@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/cover.1757810729.git.fthain@linux-m68k.org/
2025-10-21 12:31:56 +02:00
Daniel Almeida 66f1ea83d9 rust: lock: Add a Pin<&mut T> accessor
In order for callers to be able to access the inner T safely if T:
!Unpin, there needs to be a way to get a Pin<&mut T>. Add this accessor
and a corresponding example to tell users how it works.

This requires the pin projection functionality [1] for better ergonomics.

[boqun: Apply Daniel's fix to the code example, add the reference to pin
projection patch and remove out-of-date part in the commit log]

Suggested-by: Benno Lossin <lossin@kernel.org>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Benno Lossin <lossin@kernel.org>
Link: https://github.com/Rust-for-Linux/linux/issues/1181
Link: https://lore.kernel.org/rust-for-linux/20250912174148.373530-1-lossin@kernel.org/ [1]
2025-10-21 12:31:56 +02:00
Daniel Almeida 2497a7116f rust: lock: Pin the inner data
In preparation for supporting Lock<T> where T is pinned, the first thing
that needs to be done is to structurally pin the 'data' member. This
switches the 't' parameter in Lock<T>::new() to take in an impl
PinInit<T> instead of a plain T. This in turn uses the blanket
implementation "impl PinInit<T> for T".

Subsequent patches will touch on Guard<T>.

Suggested-by: Benno Lossin <lossin@kernel.org>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Benno Lossin <lossin@kernel.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://github.com/Rust-for-Linux/linux/issues/1181
2025-10-21 12:31:55 +02:00
Daniel Almeida da123f0ee4 rust: lock: guard: Add T: Unpin bound to DerefMut
A core property of pinned types is not handing out a mutable reference
to the inner data in safe code, as that would trivially allow the data
to be moved.

Enforce this condition by adding a bound on lock::Guard's DerefMut
implementation, so that it's only implemented for pinning-agnostic
types.

Suggested-by: Benno Lossin <lossin@kernel.org>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Benno Lossin <lossin@kernel.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://github.com/Rust-for-Linux/linux/issues/1181
2025-10-21 12:31:55 +02:00
Alexander Sverdlin c14ecb555c locking/spinlock/debug: Fix data-race in do_raw_write_lock
KCSAN reports:

BUG: KCSAN: data-race in do_raw_write_lock / do_raw_write_lock

write (marked) to 0xffff800009cf504c of 4 bytes by task 1102 on cpu 1:
 do_raw_write_lock+0x120/0x204
 _raw_write_lock_irq
 do_exit
 call_usermodehelper_exec_async
 ret_from_fork

read to 0xffff800009cf504c of 4 bytes by task 1103 on cpu 0:
 do_raw_write_lock+0x88/0x204
 _raw_write_lock_irq
 do_exit
 call_usermodehelper_exec_async
 ret_from_fork

value changed: 0xffffffff -> 0x00000001

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 1103 Comm: kworker/u4:1 6.1.111

Commit 1a365e8223 ("locking/spinlock/debug: Fix various data races") has
addressed most of these races, but seems to be inconsistent/incomplete.

From do_raw_write_lock() only the debug_write_lock_after() part has been
converted to WRITE_ONCE(), but not the debug_write_lock_before() part.
Do it now.

Fixes: 1a365e8223 ("locking/spinlock/debug: Fix various data races")
Reported-by: Adrian Freihofer <adrian.freihofer@siemens.com>
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
2025-10-21 12:31:55 +02:00
Christophe JAILLET 27d2afa3b4 x86/ioapic: Simplify mp_irqdomain_alloc() slightly
The IRQ return value of irq_find_mapping() is only tested
for existence, not used for anything else.

So, this call can be replaced by a slightly simpler
irq_resolve_mapping() call, which reduces generated
code size a bit (x86-64 allmodconfig):

   text	   data	    bss	    dec	    hex	filename
  82142	  38633	  18048	 138823	  21e47	arch/x86/kernel/apic/io_apic.o.before
  81932	  38633	  18048	 138613	  21d75	arch/x86/kernel/apic/io_apic.o.after

[ mingo: Fixed & simplified the changelog ]

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Link: https://patch.msgid.link/cb3a4968538637aac3a5ae4f5ecc4f5eb43376ea.1760861877.git.christophe.jaillet@wanadoo.fr
2025-10-21 08:47:33 +02:00
Peter Zijlstra 73cbcfe255 sched/topology,x86: Fix build warning
A compile warning slipped through:

   arch/x86/kernel/smpboot.c:548:5: warning: no previous prototype for function 'arch_sched_node_distance' [-Wmissing-prototypes]

Fixes: 4d6dd05d07 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 13:01:15 +02:00
Peter Zijlstra 00a155c691 Merge branch 'objtool/core' of https://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux
This series introduces new objtool features and a klp-build script to
generate livepatch modules using a source .patch as input.

This builds on concepts from the longstanding out-of-tree kpatch [1]
project which began in 2012 and has been used for many years to generate
livepatch modules for production kernels.  However, this is a complete
rewrite which incorporates hard-earned lessons from 12+ years of
maintaining kpatch.

Key improvements compared to kpatch-build:

  - Integrated with objtool: Leverages objtool's existing control-flow
    graph analysis to help detect changed functions.

  - Works on vmlinux.o: Supports late-linked objects, making it
    compatible with LTO, IBT, and similar.

  - Simplified code base: ~3k fewer lines of code.

  - Upstream: No more out-of-tree #ifdef hacks, far less cruft.

  - Cleaner internals: Vastly simplified logic for symbol/section/reloc
    inclusion and special section extraction.

  - Robust __LINE__ macro handling: Avoids false positive binary diffs
    caused by the __LINE__ macro by introducing a fix-patch-lines script
    which injects #line directives into the source .patch to preserve
    the original line numbers at compile time.

The primary user interface is the klp-build script which does the
following:

  - Builds an original kernel with -ffunction-sections and
    -fdata-sections, plus objtool function checksumming.

  - Applies the .patch file and rebuilds the kernel using the same
    options.

  - Runs 'objtool klp diff' to detect changed functions and generate
    intermediate binary diff objects.

  - Builds a kernel module which links the diff objects with some
    livepatch module init code (scripts/livepatch/init.c).

  - Finalizes the livepatch module (aka work around linker wreckage)
    using 'objtool klp post-link'.

I've tested with a variety of patches on defconfig and Fedora-config
kernels with both GCC and Clang.
2025-10-16 11:38:19 +02:00
Peter Zijlstra 4c95380701 sched/ext: Fold balance_scx() into pick_task_scx()
With pick_task() having an rf argument, it is possible to do the
lock-break there and get rid of the weird balance/pick_task hack.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Joel Fernandes 50653216e4 sched: Add support to pick functions to take rf
Some pick functions, like the internal pick_next_task_fair(), already
take rf but some others don't. We need this for scx's server pick
function. Prepare for this by having all pick functions accept it.

[peterz: - added RETRY_TASK handling
         - removed pick_next_task_fair indirection]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Peter Zijlstra 1e900f415c sched: Detect per-class runqueue changes
Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then
enables easy tracking of which runqueues are modified over a
lock-break.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:55 +02:00
Peter Zijlstra 73ec89a1ce sched: Mandate shared flags for sched_change
Shrikanth noted that the sched_change pattern relies on using shared
flags.

Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra d4c64207b8 sched: Cleanup the sched_change NOCLOCK usage
Teach the sched_change pattern how to do update_rq_clock(); this
allows for some simplifications / cleanups.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra 5892cbd85d sched: Match __task_rq_{,un}lock()
In preparation to adding more rules to __task_rq_lock(), such that
__task_rq_unlock() will no longer be equivalent to rq_unlock(),
make sure every __task_rq_lock() is matched by a __task_rq_unlock()
and vice-versa.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:54 +02:00
Peter Zijlstra 46a177fb01 sched: Add locking comments to sched_class methods
'Document' the locking context the various sched_class methods are
called under.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra 650952d3fb sched: Make __do_set_cpus_allowed() use the sched_change pattern
Now that do_set_cpus_allowed() holds all the regular locks, convert it
to use the sched_change pattern helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra b079d93796 sched: Rename do_set_cpus_allowed()
Hopefully saner naming.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:53 +02:00
Peter Zijlstra abfc01077d sched: Fix do_set_cpus_allowed() locking
All callers of do_set_cpus_allowed() only take p->pi_lock, which is
not sufficient to actually change the cpumask. Again, this is mostly
ok in these cases, but it results in unnecessarily complicated
reasoning.

Furthermore, there is no reason whatsoever to not just take all the
required locks, so do just that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra 942b8db965 sched: Fix migrate_disable_switch() locking
For some reason migrate_disable_switch() was more complicated than it
needs to be, resulting in mind bending locking of dubious quality.

Recognise that migrate_disable_switch() must be called before a
context switch, but any place before that switch is equally good.
Since the current place results in troubled locking, simply move the
thing before taking rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra 6455ad5346 sched: Move sched_class::prio_changed() into the change pattern
Move sched_class::prio_changed() into the change pattern.

And while there, extend it with sched_class::get_prio() in order to
fix the deadline situation.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:52 +02:00
Peter Zijlstra 1ae5f5dfe5 sched: Cleanup sched_delayed handling for class switches
Use the new sched_class::switching_from() method to dequeue delayed
tasks before switching to another class.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra 637b068282 sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern
Add {DE,EN}QUEUE_CLASS and fold the sched_class::switch* methods into
the change pattern. This completes and makes the pattern more
symmetric.

This changes the order of callbacks slightly:

  OLD                              NEW
				|
				|  switching_from()
  dequeue_task();		|  dequeue_task()
  put_prev_task();		|  put_prev_task()
				|  switched_from()
				|
  ... change task ...		|  ... change task ...
				|
  switching_to();		|  switching_to()
  enqueue_task();		|  enqueue_task()
  set_next_task();		|  set_next_task()
  prev_class->switched_from()	|
  switched_to()			|  switched_to()
				|

Notably, it moves the switched_from() callback right after the
dequeue/put. Existing implementations don't appear to be affected by
this change in location -- specifically the task isn't enqueued on the
class in question in either location.

Make (CLASS)^(SAVE|MOVE), because there is nothing to save-restore
when changing scheduling classes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra 5e42d4c123 sched/deadline: Prepare for switched_from() change
Prepare for the sched_class::switch*() methods getting folded into the
change pattern. As a result of that, the location of switched_from
will change slightly. SCHED_DEADLINE is affected by this change in
location:

  OLD                              NEW
				|
				|  switching_from()
  dequeue_task();		|  dequeue_task()
  put_prev_task();		|  put_prev_task()
				|  switched_from()
				|
  ... change task ...		|  ... change task ...
				|
  switching_to();		|  switching_to()
  enqueue_task();		|  enqueue_task()
  set_next_task();		|  set_next_task()
  prev_class->switched_from()	|
  switched_to()			|  switched_to()
				|

Notably, where switched_from() was called *after* the change to the
task, it will get called before it. Specifically, switched_from_dl()
uses dl_task(p) which uses p->prio; which is changed when switching
class (it might be the reason to switch class in case of PI).

When switched_from_dl() gets called, the task will have left the
deadline class and dl_task() must be false, while when doing
dequeue_dl_entity() the task must be a dl_task(), otherwise we'd have
called a different dequeue method.

Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:51 +02:00
Peter Zijlstra 376f8963bb sched: Re-arrange the {EN,DE}QUEUE flags
Ensure the matched flags are in the low word while the unmatched flags
go into the second word.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:50 +02:00
Peter Zijlstra e9139f765a sched: Employ sched_change guards
As proposed a long while ago -- and half done by scx -- wrap the
scheduler's 'change' pattern in a guard helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-10-16 11:13:50 +02:00
Adam Li 82d6e01a06 sched/fair: Only update stats for allowed CPUs when looking for dst group
Load imbalance is observed when the workload frequently forks new threads.
Due to CPU affinity, the workload can run on CPU 0-7 in the first
group, and only on CPU 8-11 in the second group. CPU 12-15 are always idle.

{ 0 1 2 3 4 5 6 7 } {8 9 10 11 12 13 14 15}
  * * * * * * * *    * * *  *

When looking for a dst group for newly forked threads,
update_sg_wakeup_stats() often reports that the second group has more
idle CPUs than the first group. The scheduler thinks the second group is
less busy and then selects the least busy CPUs among CPUs 8-11.
Therefore CPUs 8-11 can be crowded with newly forked threads while, at
the same time, CPUs 0-7 can be idle.

A task may not be able to use all the CPUs in a sched group due to CPU
affinity. Only update sched group statistics for the allowed CPUs.

Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:50 +02:00
Tim Chen 4d6dd05d07 sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode
It is possible for Granite Rapids (GNR) and Clearwater Forest
(CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3)
is enabled, each die will become a separate NUMA node in the package
with different distances between dies within the same package.

For example, on GNR, we see the following numa distances for a 2 socket
system with 3 dies per socket:

    package 1       package2
	----------------
	|               |
    ---------       ---------
    |   0   |       |   3   |
    ---------       ---------
	|               |
    ---------       ---------
    |   1   |       |   4   |
    ---------       ---------
	|               |
    ---------       ---------
    |   2   |       |   5   |
    ---------       ---------
	|               |
	----------------

node distances:
node     0    1    2    3    4    5
0:   	10   15   17   21   28   26
1:   	15   10   15   23   26   23
2:   	17   15   10   26   23   21
3:   	21   28   26   10   15   17
4:   	23   26   23   15   10   15
5:   	26   23   21   17   15   10

The node distances above led to 2 problems:

1. Asymmetric routes taken between nodes in different packages led to
an asymmetric scheduler domain perspective depending on which node you
are on.  The current scheduler code failed to build domains properly
with asymmetric distances.

2. Multiple remote distances to the respective tiles on the remote
package create too many levels of domain hierarchy, grouping different
nodes between remote packages.

For example, the above GNR topology led to the NUMA domains below:

Sched domains from the perspective of a CPU in node 0, where the number
in bracket represent node number.

NUMA-level 1    [0,1] [2]
NUMA-level 2    [0,1,2] [3]
NUMA-level 3    [0,1,2,3] [5]
NUMA-level 4    [0,1,2,3,5] [4]

Sched domains from the perspective of a CPU in node 4
NUMA-level 1    [4] [3,5]
NUMA-level 2    [3,4,5] [0,2]
NUMA-level 3    [0,2,3,4,5] [1]

Scheduler group peers for load balancing from the perspective of CPU 0
and CPU 4 are different.  An improper task could be chosen for load
balancing between groups such as [0,2,3,4,5] [1].  Ideally, nodes 0 or
2, which are in the same package as node 1, should be chosen first.  But
instead, tasks in the remote package nodes 3, 4 and 5 could be chosen
with equal chance, which can lead to excessive remote package migrations
and load imbalance between packages.  We should not group partial remote
nodes and local nodes together.

Simplify the remote distances for CWF and GNR for the purpose of
building sched domains, which maintains symmetry and leads to a more
reasonable load balance hierarchy.

The sched domains from the perspective of a CPU in node 0 NUMA-level 1
is now
NUMA-level 1    [0,1] [2]
NUMA-level 2    [0,1,2] [3,4,5]

The sched domains from the perspective of a CPU in node 4 NUMA-level 1
is now
NUMA-level 1    [4] [3,5]
NUMA-level 2    [3,4,5] [0,1,2]

We have the same balancing perspective from node 0 or node 4.  Loads are
now balanced equally between packages.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
2025-10-16 11:13:50 +02:00
Tim Chen 06f2c90885 sched: Create architecture specific sched domain distances
Allow architecture specific sched domain NUMA distances that are
modified from actual NUMA node distances for the purpose of building
NUMA sched domains.

Keep the actual NUMA distances separately if modified distances are
used for building sched domains. Such distances are still needed, as
NUMA balancing benefits from finding the NUMA nodes that are actually
closer to a task's numa_group.

Consolidate the recording of unique NUMA distances in an array to
sched_record_numa_dist() so the function can be reused to record NUMA
distances when the NUMA distance metric is changed.

No functional change, and no additional distance array is allocated if
no arch-specific NUMA distances are defined.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
2025-10-16 11:13:49 +02:00
Doug Berger 382748c05e sched/deadline: only set free_cpus for online runqueues
Commit 16b269436b ("sched/deadline: Modify cpudl::free_cpus
to reflect rd->online") introduced the cpudl_set/clear_freecpu
functions to allow the cpudl::free_cpus mask to be manipulated
by the deadline scheduler class rq_on/offline callbacks so the
mask would also reflect this state.

Commit 9659e1eeee ("sched/deadline: Remove cpu_active_mask
from cpudl_find()") removed the check of the cpu_active_mask to
save some processing on the premise that the cpudl::free_cpus
mask already reflected the runqueue online state.

Unfortunately, there are cases where it is possible for the
cpudl_clear function to set the free_cpus bit for a CPU when the
deadline runqueue is offline. When this occurs while a CPU is
connected to the default root domain the flag may retain the bad
state after the CPU has been unplugged. Later, a different CPU
that is transitioning through the default root domain may push a
deadline task to the powered down CPU when cpudl_find sees its
free_cpus bit is set. If this happens the task will not have the
opportunity to run.

One example is outlined here:
https://lore.kernel.org/lkml/20250110233010.2339521-1-opendmb@gmail.com

Another occurs when the last deadline task is migrated from a
CPU that has an offlined runqueue. The dequeue_task member of
the deadline scheduler class will eventually call cpudl_clear
and set the free_cpus bit for the CPU.

This commit modifies the cpudl_clear function to be aware of the
online state of the deadline runqueue so that the free_cpus mask
can be updated appropriately.

It is no longer necessary to manage the mask outside of the
cpudl_set/clear functions so the cpudl_set/clear_freecpu
functions are removed. In addition, since the free_cpus mask is
now only updated under the cpudl lock the code was changed to
use the non-atomic __cpumask functions.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:49 +02:00
Fernand Sieber 79104becf4 sched/fair: Forfeit vruntime on yield
If a task yields, the scheduler may decide to pick it again. The task in
turn may decide to yield immediately or shortly after, leading to a tight
loop of yields.

If there's another runnable task at this point, the deadline will be
increased by the slice at each loop iteration. This can cause the
deadline to run away pretty quickly, with subsequent elevated run delays
later on as the task doesn't get picked again. The reason the scheduler
can pick the same task again and again despite its increasing deadline
is that it may be the only eligible task at that point.

Fix this by making the task forfeit its remaining vruntime and pushing
the deadline one slice ahead. This implements yield behavior more
authentically.

We limit the forfeiting to eligible tasks. This is because core
scheduling prefers running ineligible tasks rather than force idling. As
such, without this condition, we could end up in a yield loop which
makes the vruntime increase rapidly, leading to anomalous run delays
later down the line.

Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling  policy")
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250401123622.584018-1-sieberf@amazon.com
Link: https://lore.kernel.org/r/20250911095113.203439-1-sieberf@amazon.com
Link: https://lore.kernel.org/r/20250916140228.452231-1-sieberf@amazon.com
2025-10-16 11:13:49 +02:00
Peter Zijlstra 45e1dccc06 x86/insn: Simplify for_each_insn_prefix()
Use the new-found freedom of allowing variable declarations inside
for() to simplify the for_each_insn_prefix() iterator so that it no
longer needs an external temporary.
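
Illustrative shape only (not the real macro body): the index now lives
inside the for() statement itself:

	/* iterate over the instruction's prefix bytes */
	#define for_each_insn_prefix(insn, prefix)			\
		for (int __i = 0;					\
		     __i < (insn)->prefixes.nbytes &&			\
		     ((prefix) = (insn)->prefixes.bytes[__i]) != 0;	\
		     __i++)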

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:48 +02:00
Peter Zijlstra 8a5c6cbfe4 x86/insn,uprobes,alternative: Unify insn_is_nop()
Both uprobes and alternatives have insn_is_nop() variants, unify them
and make sure insn_is_nop() works for both x86_64 and i386.

Specifically, uprobe must not compare userspace instructions to kernel
nops as that does not work right in the compat case.

For the uprobe case we therefore must recognise common 32-bit and 64-bit
nops. Because uprobe will consume the instruction as a nop, it must not
mistakenly claim a non-nop instruction to be a nop. E.g. 'REX.b3 NOP' is
'xchg %r8,%rax' - not a nop.

For the kernel case similar constraints apply; it is used to optimize
NOPs by replacing strings of short(er) nops with longer nops. It must
not claim an instruction is a nop if it really isn't. Not recognising a
nop is non-fatal.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:47 +02:00
George Kennedy 866cf36bfe perf/x86/amd: Check event before enable to avoid GPF
On AMD machines cpuc->events[idx] can become NULL in a subtle race
condition with NMI->throttle->x86_pmu_stop().

Check the event for NULL in amd_pmu_enable_all() before enabling it to
avoid a GPF. This appears to be an AMD-only issue.
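
A sketch of the guard (loop shape and helper names approximated from the
x86 PMU code, not the literal patch):

	static void amd_pmu_enable_all(int added)
	{
		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
		int idx;

		for (idx = 0; idx < x86_pmu.num_counters; idx++) {
			/* may be NULLed by NMI->throttle->x86_pmu_stop() */
			if (!cpuc->events[idx])
				continue;

			amd_pmu_enable_event(cpuc->events[idx]);
		}
	}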

Syzkaller reported a GPF in amd_pmu_enable_all.

INFO: NMI handler (perf_event_nmi_handler) took too long to run: 13.143
    msecs
Oops: general protection fault, probably for non-canonical address
    0xdffffc0000000034: 0000  PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7]
CPU: 0 UID: 0 PID: 328415 Comm: repro_36674776 Not tainted 6.12.0-rc1-syzk
RIP: 0010:x86_pmu_enable_event (arch/x86/events/perf_event.h:1195
    arch/x86/events/core.c:1430)
RSP: 0018:ffff888118009d60 EFLAGS: 00010012
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000034 RSI: 0000000000000000 RDI: 00000000000001a0
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
R13: ffff88811802a440 R14: ffff88811802a240 R15: ffff8881132d8601
FS:  00007f097dfaa700(0000) GS:ffff888118000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000200001c0 CR3: 0000000103d56000 CR4: 00000000000006f0
Call Trace:
 <IRQ>
amd_pmu_enable_all (arch/x86/events/amd/core.c:760 (discriminator 2))
x86_pmu_enable (arch/x86/events/core.c:1360)
event_sched_out (kernel/events/core.c:1191 kernel/events/core.c:1186
    kernel/events/core.c:2346)
__perf_remove_from_context (kernel/events/core.c:2435)
event_function (kernel/events/core.c:259)
remote_function (kernel/events/core.c:92 (discriminator 1)
    kernel/events/core.c:72 (discriminator 1))
__flush_smp_call_function_queue (./arch/x86/include/asm/jump_label.h:27
    ./include/linux/jump_label.h:207 ./include/trace/events/csd.h:64
    kernel/smp.c:135 kernel/smp.c:540)
__sysvec_call_function_single (./arch/x86/include/asm/jump_label.h:27
    ./include/linux/jump_label.h:207
    ./arch/x86/include/asm/trace/irq_vectors.h:99 arch/x86/kernel/smp.c:272)
sysvec_call_function_single (arch/x86/kernel/smp.c:266 (discriminator 47)
    arch/x86/kernel/smp.c:266 (discriminator 47))
 </IRQ>

Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-16 11:13:47 +02:00
Josh Poimboeuf b9976fa464 livepatch: Introduce source code helpers for livepatch modules
Add some helper macros which can be used by livepatch source .patch
files to register callbacks, convert static calls to regular calls where
needed, and patch syscalls.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:19 -07:00
Josh Poimboeuf 78be9facfb livepatch/klp-build: Add --show-first-changed option to show function divergence
Add a --show-first-changed option to identify where changed functions
begin to diverge:

  - Parse 'objtool klp diff' output to find changed functions.

  - Run objtool again on each object with --debug-checksum=<funcs>.

  - Diff the per-instruction checksum debug output to locate the first
    differing instruction.

This can be useful for quickly determining where and why a function
changed.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:19 -07:00
Josh Poimboeuf 2c2f0b8626 livepatch/klp-build: Add --debug option to show cloning decisions
Add a --debug option which gets passed to "objtool klp diff" to enable
debug output related to cloning decisions.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:19 -07:00
Josh Poimboeuf 24ebfcd65a livepatch/klp-build: Introduce klp-build script for generating livepatch modules
Add a klp-build script which automates the generation of a livepatch
module from a source .patch file by performing the following steps:

  - Builds an original kernel with -ffunction-sections and
    -fdata-sections, plus objtool function checksumming.

  - Applies the .patch file and rebuilds the kernel using the same
    options.

  - Runs 'objtool klp diff' to detect changed functions and generate
    intermediate binary diff objects.

  - Builds a kernel module which links the diff objects with some
    livepatch module init code (scripts/livepatch/init.c).

  - Finalizes the livepatch module (aka work around linker wreckage)
    using 'objtool klp post-link'.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:19 -07:00
Josh Poimboeuf 59adee07b5 livepatch/klp-build: Add stub init code for livepatch modules
Add a module initialization stub which can be linked with binary diff
objects to produce a livepatch module.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:19 -07:00
Josh Poimboeuf abaf1f42dd livepatch/klp-build: Introduce fix-patch-lines script to avoid __LINE__ diff noise
The __LINE__ macro creates challenges for binary diffing.  When a .patch
file adds or removes lines, it shifts the line numbers for all code
below it.

This can cause the code generation of functions using __LINE__ to change
due to the line number constant being embedded in a MOV instruction,
despite there being no semantic difference.

Avoid such false positives by adding a fix-patch-lines script which can
be used to insert a #line directive in each patch hunk affecting the
line numbering.  This script will be used by klp-build, which will be
introduced in a subsequent patch.
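
To illustrate the problem and the workaround (file name and line
numbers hypothetical):

	/* __LINE__ is folded into the object code as an immediate, so
	 * inserting one unrelated line above this point changes the
	 * function's bytes with no semantic difference: */
	pr_err("bad state at %s:%d\n", __FILE__, __LINE__);

	/* fix-patch-lines pins the numbering back to the original
	 * source after each hunk with a directive like: */
	#line 8511 "kernel/sched/core.c"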

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:19 -07:00
Josh Poimboeuf f2c356d1d0 kbuild,objtool: Defer objtool validation step for CONFIG_KLP_BUILD
In preparation for klp-build, defer objtool validation for
CONFIG_KLP_BUILD kernels until the final pre-link archive (e.g.,
vmlinux.o, module-foo.o) is built.  This will simplify the process of
generating livepatch modules.

Delayed objtool is generally preferred anyway, and is already standard
for IBT and LTO.  Eventually the per-translation-unit mode will be
phased out.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:19 -07:00
Josh Poimboeuf 7ae60ff0b7 livepatch: Add CONFIG_KLP_BUILD
In preparation for introducing klp-build, add a new CONFIG_KLP_BUILD
option.  The initial version will only be supported on x86-64.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:18 -07:00
Josh Poimboeuf 164c9201e1 objtool: Add base objtool support for livepatch modules
In preparation for klp-build, enable "classic" objtool to work on
livepatch modules:

  - Avoid duplicate symbol/section warnings for prefix symbols and the
    .static_call_sites and __mcount_loc sections which may have already
    been extracted by klp diff.

  - Add __klp_funcs to the IBT function pointer section whitelist.

  - Prevent KLP symbols from getting incorrectly classified as cold
    subfunctions.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:18 -07:00
Josh Poimboeuf 2058f6d166 objtool: Refactor prefix symbol creation code
The prefix symbol creation code currently ignores all errors, presumably
because some functions don't have the leading NOPs.

Shuffle the code around a bit, improve the error handling and document
why some errors are ignored.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:18 -07:00
Josh Poimboeuf ebe864b553 objtool/klp: Add post-link subcommand to finalize livepatch modules
Livepatch needs some ELF magic which linkers don't like:

  - Two relocation sections (.rela*, .klp.rela*) for the same text
    section.

  - Use of SHN_LIVEPATCH to mark livepatch symbols.

Unfortunately linkers tend to mangle such things.  To work around that,
klp diff generates a linker-compliant intermediate binary which encodes
the relevant KLP section/reloc/symbol metadata.

After module linking, the .ko then needs to be converted to an actual
livepatch module.  Introduce a new klp post-link subcommand to do so.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:18 -07:00
Josh Poimboeuf 7c2575a640 objtool/klp: Add --debug option to show cloning decisions
Add a --debug option to klp diff which prints cloning decisions and an
indented dependency tree for all cloned symbols and relocations.  This
helps visualize which symbols and relocations were included and why.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:18 -07:00
Josh Poimboeuf dd590d4d57 objtool/klp: Introduce klp diff subcommand for diffing object files
Add a new klp diff subcommand which performs a binary diff between two
object files and extracts changed functions into a new object which can
then be linked into a livepatch module.

This builds on concepts from the longstanding out-of-tree kpatch [1]
project which began in 2012 and has been used for many years to generate
livepatch modules for production kernels.  However, this is a complete
rewrite which incorporates hard-earned lessons from 12+ years of
maintaining kpatch.

Key improvements compared to kpatch-build:

  - Integrated with objtool: Leverages objtool's existing control-flow
    graph analysis to help detect changed functions.

  - Works on vmlinux.o: Supports late-linked objects, making it
    compatible with LTO, IBT, and similar.

  - Simplified code base: ~3k fewer lines of code.

  - Upstream: No more out-of-tree #ifdef hacks, far less cruft.

  - Cleaner internals: Vastly simplified logic for symbol/section/reloc
    inclusion and special section extraction.

  - Robust __LINE__ macro handling: Avoids false positive binary diffs
    caused by the __LINE__ macro by introducing a fix-patch-lines script
    (coming in a later patch) which injects #line directives into the
    source .patch to preserve the original line numbers at compile time.

Note the end result of this subcommand is not yet functionally complete.
Livepatch needs some ELF magic which linkers don't like:

  - Two relocation sections (.rela*, .klp.rela*) for the same text
    section.

  - Use of SHN_LIVEPATCH to mark livepatch symbols.

Unfortunately linkers tend to mangle such things.  To work around that,
klp diff generates a linker-compliant intermediate binary which encodes
the relevant KLP section/reloc/symbol metadata.

After module linking, a klp post-link step (coming soon) will clean up
the mess and convert the linked .ko into a fully compliant livepatch
module.

Note this subcommand requires the diffed binaries to have been compiled
with -ffunction-sections and -fdata-sections, and processed with
'objtool --checksum'.  Those constraints will be handled by a klp-build
script introduced in a later patch.

Without '-ffunction-sections -fdata-sections', reliable object diffing
would be infeasible due to toolchain limitations:

  - For intra-file+intra-section references, the compiler might
    occasionally generate hard-coded instruction offsets instead of
    relocations.

  - Section-symbol-based references can be ambiguous:

    - Overlapping or zero-length symbols create ambiguity as to which
      symbol is being referenced.

    - A reference to the end of a symbol (e.g., checking array bounds)
      can be misinterpreted as a reference to the next symbol, or vice
      versa.

A potential future alternative to '-ffunction-sections -fdata-sections'
would be to introduce a toolchain option that forces symbol-based
(non-section) relocations.
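
As a concrete illustration of the end-of-symbol ambiguity (layout
hypothetical):

	static int table[4];	/* occupies sec+0x00 .. sec+0x0f */
	static int other;	/* starts at sec+0x10 */

	/* A bounds check such as 'p < table + 4' relocates against
	 * sec+0x10, which a section-based relocation cannot distinguish
	 * from a reference to 'other'. */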

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:18 -07:00
Josh Poimboeuf a3493b3338 objtool/klp: Add --debug-checksum=<funcs> to show per-instruction checksums
Add a --debug-checksum=<funcs> option to the check subcommand to print
the calculated checksum of each instruction in the given functions.

This is useful for determining where two versions of a function begin to
diverge.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:18 -07:00
Josh Poimboeuf 0d83da43b1 objtool/klp: Add --checksum option to generate per-function checksums
In preparation for the objtool klp diff subcommand, add a command-line
option to generate a unique checksum for each function.  This will
enable detection of functions which have changed between two versions of
an object file.
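
A minimal sketch of the idea, assuming a crc32() helper and a
hypothetical per-symbol 'checksum' field:

	static void checksum_update(struct symbol *func,
				    const void *bytes, unsigned int len)
	{
		/* Fold each decoded instruction into the function's
		 * running checksum; matching checksums across two builds
		 * mean the function is unchanged. */
		func->checksum = crc32(func->checksum, bytes, len);
	}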

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:17 -07:00
Josh Poimboeuf f6b740ef5f objtool: Unify STACK_FRAME_NON_STANDARD entry sizes
The C implementation of STACK_FRAME_NON_STANDARD emits 8-byte entries,
whereas the asm version's entries are only 4 bytes.

Make them consistent by converting the asm version to 8-byte entries.

This is much easier than converting the C version to 4 bytes, which
would require awkwardly putting inline asm in a dummy function in order
to pass the 'func' pointer to the asm.
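
For reference, the C variant stores a full function pointer, roughly
like this (sketch of the existing macro, not of this patch):

	#define STACK_FRAME_NON_STANDARD(func)				\
		static void *__func_stack_frame_non_standard_##func __used \
			__section(".discard.func_stack_frame_non_standard") = func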

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:17 -07:00
Josh Poimboeuf aca282ab7e x86/asm: Annotate special section entries
In preparation for the objtool klp diff subcommand, add annotations for
special section entries.  This will enable objtool to determine the size
and location of the entries and to extract them when needed.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:17 -07:00
Josh Poimboeuf 58f36a5756 objtool: Add ANNOTATE_DATA_SPECIAL
In preparation for the objtool klp diff subcommand, add an
ANNOTATE_DATA_SPECIAL macro which annotates special section entries so
that objtool can determine their size and location and extract them
when needed.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:50:16 -07:00
Josh Poimboeuf d2c60bde1c objtool: Move ANNOTATE* macros to annotate.h
In preparation for using the objtool annotation macros in higher-level
objtool.h macros like UNWIND_HINT, move them to their own file.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:49:20 -07:00
Josh Poimboeuf 3b92486fa1 objtool: Add annotype() helper
... for reading annotation types.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:49 -07:00
Josh Poimboeuf 03c19a99ee objtool: Add elf_create_file()
Add interface to enable the creation of a new ELF file.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:49 -07:00
Josh Poimboeuf 2c05ca0262 objtool: Add elf_create_reloc() and elf_init_reloc()
elf_create_rela_section() is quite limited in that it requires the
caller to know how many relocations need to be allocated up front.

In preparation for the objtool klp diff subcommand, allow an arbitrary
number of relocations to be created and initialized on demand after
section creation.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:49 -07:00
Josh Poimboeuf 431dbabf2d objtool: Add elf_create_data()
In preparation for the objtool klp diff subcommand, refactor
elf_add_string() by adding a new elf_add_data() helper which allows the
adding of arbitrary data to a section.

Make both interfaces global so they can be used by the upcoming klp diff
code.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:48 -07:00
Josh Poimboeuf 243e963853 objtool: Generalize elf_create_section()
In preparation for the objtool klp diff subcommand, broaden the
elf_create_section() interface to give callers more control and reduce
duplication of some subtle setup logic.

While at it, make elf_create_rela_section() global so sections can be
created by the upcoming klp diff code.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:48 -07:00
Josh Poimboeuf dd2c29aafd objtool: Generalize elf_create_symbol()
In preparation for the objtool klp diff subcommand, broaden the
elf_create_symbol() interface to give callers more control and reduce
duplication of some subtle setup logic.

While at it, make elf_create_symbol() and elf_create_section_symbol()
global so symbols can be created by the upcoming klp diff code.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:48 -07:00
Josh Poimboeuf 02cf323a7e objtool: Simplify special symbol handling in elf_update_symbol()
!sym->sec isn't actually a thing: even STT_UNDEF and other special
symbol types belong to NULL section 0.

Simplify the initialization of 'shndx' accordingly.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:48 -07:00
Josh Poimboeuf a05de0a772 objtool: Refactor add_jump_destinations()
The add_jump_destinations() logic is a bit weird and convoluted after
being incrementally tweaked over the years.  Refactor it to hopefully be
more logical and straightforward.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:48 -07:00
Josh Poimboeuf 935c0b6a05 objtool: Reindent check_options[]
Bring the cmdline check_options[] array back into vertical alignment for
better readability.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:48 -07:00
Josh Poimboeuf 2b91479776 objtool: Resurrect --backup option
The --backup option was removed with the following commit:

  aa8b3e64fd ("objtool: Create backup on error and print args")

... which tied the backup functionality to --verbose, and only for
warnings/errors.

It's a bit inelegant and out of scope to tie that to --verbose.

Bring back the old --backup option, but with the new behavior: only on
warnings/errors, and print the args to make it easier to recreate.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:48 -07:00
Josh Poimboeuf 56754f0f46 objtool: Rename --Werror to --werror
The objtool --Werror option name is stylistically inconsistent: halfway
between GCC's single-dash capitalized -Werror and objtool's double-dash
--lowercase convention, making it unnecessarily hard to remember.

Make the 'W' lower case (--werror) for consistency with objtool's other
options.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:48 -07:00
Josh Poimboeuf 48f1bbaf26 objtool: Avoid emptying lists for duplicate sections
When a to-be-created section already exists, there's no point in
emptying the various lists if their respective sections already exist.
In fact it's better to leave them intact as they might get used later.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:47 -07:00
Josh Poimboeuf a040ab73df objtool: Simplify reloc offset calculation in unwind_read_hints()
Simplify the relocation offset calculation in unwind_read_hints(),
similar to other conversions which have already been done.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:47 -07:00
Josh Poimboeuf a1526bcfcb objtool: Mark prefix functions
In preparation for the objtool klp diff subcommand, introduce a flag to
identify __pfx_*() and __cfi_*() functions in advance so they don't need
to be manually identified every time a check is needed.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:47 -07:00
Josh Poimboeuf c9e9b85d41 objtool: Fix weak symbol hole detection for .cold functions
When ignore_unreachable_insn() looks for weak function holes which jump
to their .cold functions, it assumes the parent function comes before
the corresponding .cold function in the symbol table.  That's not
necessarily the case with -ffunction-sections.

Mark all the holes beforehand (including .cold functions) so the
ordering of the discovery doesn't matter.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:47 -07:00
Josh Poimboeuf 4ea029389b objtool: Mark .cold subfunctions
Introduce a flag to identify .cold subfunctions so they can be detected
more easily and quickly.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:46:46 -07:00
Josh Poimboeuf 25eac74b6b objtool: Add section/symbol type helpers
Add some helper macros to improve readability.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:25 -07:00
Josh Poimboeuf 96eceff331 objtool: Convert elf iterator macros to use 'struct elf'
'struct objtool_file' is specific to the check code and doesn't belong
in the elf code which is supposed to be objtool_file-agnostic.  Convert
the elf iterator macros to use 'struct elf' instead.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:25 -07:00
Josh Poimboeuf 72e4b6b44e objtool: Remove .parainstructions reference
The .parainstructions section no longer exists since the following
commit:

  60bc276b12 ("x86/paravirt: Switch mixed paravirt/alternative calls to alternatives").

Remove the reference to it.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:24 -07:00
Josh Poimboeuf 31eca25f3a objtool: Clean up compiler flag usage
KBUILD_HOSTCFLAGS and KBUILD_HOSTLDFLAGS aren't defined when objtool is
built standalone.  Also, the EXTRA_WARNINGS flags are rather arbitrary.

Make things simpler and more consistent by specifying compiler flags
explicitly and tweaking the warnings.  Also make a few code tweaks to
make the new warnings happy.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:24 -07:00
Josh Poimboeuf 34244f784c objtool: Const string cleanup
Use 'const char *' where applicable.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:24 -07:00
Josh Poimboeuf 3e4b5f66cf objtool: Check for missing annotation entries in read_annotate()
Add a sanity check to make sure none of the relocations for the
.discard.annotate_insn section have gone missing.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:24 -07:00
Josh Poimboeuf 4cdee7888f objtool: Fix "unexpected end of section" warning for alternatives
Due to the short circuiting logic in next_insn_to_validate(), control
flow may silently transition from .altinstr_replacement to .text without
a corresponding nested call to validate_branch().

As a result the validate_branch() 'sec' variable doesn't get
reinitialized, which can trigger a confusing "unexpected end of section"
warning which blames .altinstr_replacement rather than the offending
fallthrough function.

Fix that by not caching the section.  There's no point in doing that
anyway.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:24 -07:00
Josh Poimboeuf 68245893cf objtool: Fix __pa_symbol() relocation handling
__pa_symbol() generates a relocation which refers to a physical address.
Convert it to back its virtual form before calculating the addend.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:24 -07:00
Josh Poimboeuf 41d24d7858 objtool: Fix x86 addend calculation
On x86, arch_dest_reloc_offset() hardcodes the addend adjustment to
four, but the actual adjustment depends on the relocation type.  Fix
that.
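
A hedged sketch of a type-dependent adjustment (the relocation type
names are standard x86-64 ELF types; the exact mapping here is
illustrative):

	static s64 reloc_addend_adjustment(unsigned int type)
	{
		switch (type) {
		case R_X86_64_PC32:
		case R_X86_64_PLT32:
			/* PC-relative: biased by the 4-byte immediate. */
			return 4;
		default:
			return 0;
		}
	}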

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:24 -07:00
Josh Poimboeuf 72567c630d objtool: Fix weak symbol detection
find_symbol_hole_containing() fails to find a symbol hole (aka stripped
weak symbol) if its section has no symbols before the hole.  This breaks
weak symbol detection if -ffunction-sections is enabled.

Fix that by allowing the interval tree to contain section symbols, which
are always at offset zero for a given section.

Fixes a bunch of (-ffunction-sections) warnings like:

  vmlinux.o: warning: objtool: .text.__x64_sys_io_setup+0x10: unreachable instruction

Fixes: 4adb236867 ("objtool: Ignore extra-symbol code")
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:23 -07:00
Josh Poimboeuf c2a3e7af31 objtool: Fix interval tree insertion for zero-length symbols
Zero-length symbols get inserted in the wrong spot.  Fix that.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:23 -07:00
Josh Poimboeuf 81cf39be35 objtool: Add empty symbols to the symbol tree again
The following commit

  5da6aea375 ("objtool: Fix find_{symbol,func}_containing()")

fixed the issue where overlapping symbols weren't getting sorted
properly in the symbol tree.  Therefore the workaround to skip adding
empty symbols from the following commit

  a2e38dffcd ("objtool: Don't add empty symbols to the rbtree")

is no longer needed.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:23 -07:00
Josh Poimboeuf 4ac2ba35f6 objtool: Remove error handling boilerplate
Up to a certain point in objtool's execution, all errors are fatal and
return -1.  When propagating such errors, just return -1 directly
instead of trying to propagate the original return code.  This helps
make the code more compact and the behavior more explicit.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:23 -07:00
Josh Poimboeuf 2bb23cbf3f objtool: Propagate elf_truncate_section() error in elf_write()
Properly check and propagate the return value of elf_truncate_section()
to avoid silent failures.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:23 -07:00
Josh Poimboeuf 9ebb662fab objtool: Fix broken error handling in read_symbols()
The free(sym) call in the read_symbols() error path is fundamentally
broken: 'sym' doesn't point to any allocated block.  If triggered,
things would go from bad to worse.

Remove the free() and simplify the error paths.  Freeing memory isn't
necessary here anyway, these are fatal errors which lead to an immediate
exit().

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:23 -07:00
Josh Poimboeuf 07e1c3fd86 objtool: Make find_symbol_containing() less arbitrary
In the rare case of overlapping symbols, find_symbol_containing() just
returns the first one it finds.  Make it slightly less arbitrary by
returning the smallest symbol with size > 0.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:23 -07:00
Josh Poimboeuf b37491d72b interval_tree: Fix ITSTATIC usage for *_subtree_search()
For consistency with the other function templates, change
_subtree_search_*() to use the user-supplied ITSTATIC rather than the
hard-coded 'static'.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:22 -07:00
Josh Poimboeuf 9b7eacac22 interval_tree: Sync interval_tree_generic.h with tools
The following commit made an improvement to interval_tree_generic.h, but
didn't sync it to the tools copy:

  1981128578 ("lib/interval_tree: skip the check before go to the right subtree")

Sync it, and add it to objtool's sync-check.sh so they are more likely
to stay in sync going forward.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:22 -07:00
Josh Poimboeuf 3049fc4b5f x86/alternative: Refactor INT3 call emulation selftest
The INT3 call emulation selftest is a bit fragile as it relies on the
compiler not inserting any extra instructions before the
int3_selftest_ip() definition.

Also, the int3_selftest_ip() symbol overlaps with the int3_selftest()
symbol, which can confuse objtool.

Fix those issues by slightly reworking the functionality and moving
int3_selftest_ip() to a separate asm function.  While at it, improve the
naming.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:22 -07:00
Josh Poimboeuf 4109043bff modpost: Ignore unresolved section bounds symbols
In preparation for klp-build livepatch module creation tooling,
suppress warnings for unresolved references to linker-generated
__start_* and __stop_* section bounds symbols.

These symbols are expected to be undefined when modpost runs, as they're
created later by the linker.

Cc: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:22 -07:00
Josh Poimboeuf 6717e8f91d kbuild: Remove 'kmod_' prefix from __KBUILD_MODNAME
In preparation for the objtool klp diff subcommand, remove the arbitrary
'kmod_' prefix from __KBUILD_MODNAME and instead add it explicitly in
the __initcall_id() macro.

This change supports the standardization of "unique" symbol naming by
ensuring the non-unique portion of the name comes before the unique
part.  That will enable objtool to properly correlate symbols across
builds.

Cc: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:22 -07:00
Josh Poimboeuf c2d420796a elfnote: Change ELFNOTE() to use __UNIQUE_ID()
In preparation for the objtool klp diff subcommand, replace the custom
unique symbol name generation in ELFNOTE() with __UNIQUE_ID().

This standardizes the naming format for all "unique" symbols, which will
allow objtool to properly correlate them.  Note this also removes the
"one ELF note per line" limitation.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:22 -07:00
Josh Poimboeuf 9f14f1f918 compiler.h: Make addressable symbols less of an eyesore
Avoid underscore overload by changing:

  __UNIQUE_ID___addressable_loops_per_jiffy_868

to the following:

  __UNIQUE_ID_addressable_loops_per_jiffy_868

This matches the format used by other __UNIQUE_ID()-generated symbols
and improves readability for those who stare at ELF symbol table dumps.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:21 -07:00
Josh Poimboeuf afb026b6d3 compiler: Tweak __UNIQUE_ID() naming
In preparation for the objtool klp diff subcommand, add an underscore
between the name and the counter.  This will make it possible for
objtool to distinguish between the non-unique and unique parts of the
symbol name so it can properly correlate the symbols.
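
A hedged sketch of the change (the real macro is in
include/linux/compiler.h; shown simplified):

	/* Before: name and counter run together, e.g. __UNIQUE_ID_foo42 */
	#define __UNIQUE_ID(prefix) \
		__PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)

	/* After: a separator makes the unique suffix parseable,
	 * e.g. __UNIQUE_ID_foo_42 */
	#define __UNIQUE_ID(prefix) \
		__PASTE(__PASTE(__UNIQUE_ID_, prefix), __PASTE(_, __COUNTER__))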

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:21 -07:00
Josh Poimboeuf 122679ebf9 x86/kprobes: Remove STACK_FRAME_NON_STANDARD annotation
Since commit 877b145f0f ("x86/kprobes: Move trampoline code into
RODATA"), the optprobe template code is no longer analyzed by objtool so
it doesn't need to be ignored.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:21 -07:00
Josh Poimboeuf bf770d6d20 x86/module: Improve relocation error messages
Add the section number and reloc index to relocation error messages to
help find the faulty relocation.

Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:21 -07:00
Josh Poimboeuf 1ba9f89794 vmlinux.lds: Unify TEXT_MAIN, DATA_MAIN, and related macros
TEXT_MAIN, DATA_MAIN and friends are defined differently depending on
whether certain config options enable -ffunction-sections and/or
-fdata-sections.

There's no technical reason for that beyond voodoo coding.  Keeping the
separate implementations adds unnecessary complexity, fragments the
logic, and increases the risk of subtle bugs.

Unify the macros by using the same input section patterns across all
configs.

This is a prerequisite for the upcoming livepatch klp-build tooling
which will manually enable -ffunction-sections and -fdata-sections via
KCFLAGS.

Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:21 -07:00
Josh Poimboeuf 68e71067ec s390/vmlinux.lds.S: Prevent thunk functions from getting placed with normal text
The s390 indirect thunks are placed in the .text.__s390_indirect_jump_*
sections.

Certain config options which enable -ffunction-sections have a custom
version of the TEXT_TEXT macro:

  .text.[0-9a-zA-Z_]*

That unintentionally matches the thunk sections, causing them to get
grouped with normal text rather than being handled by their intended
rule later in the script:

  *(.text.*_indirect_*)

Fix that by adding another period to the thunk section names, following
the kernel's general convention for distinguishing code-generated text
sections from compiler-generated ones.

Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:21 -07:00
Dylan Hatch be8374a5ba objtool: Fix standalone --hacks=jump_label
The objtool command line 'objtool --hacks=jump_label foo.o' on its own
should be expected to rewrite jump labels to NOPs. This means the
add_special_section_alts() code path needs to run when only this option
is provided.

This is mainly relevant in certain debugging situations, but could
potentially also fix kernel builds in which objtool is run with
--hacks=jump_label but without --orc, --stackval, --uaccess, or
--hacks=noinstr.

Fixes: de6fbcedf5 ("objtool: Read special sections with alts only when specific options are selected")
Signed-off-by: Dylan Hatch <dylanbhatch@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:21 -07:00
Pankaj Raghav ff5c046648 scripts/faddr2line: Fix "Argument list too long" error
The run_readelf() function reads the entire output of readelf into a
single shell variable. For large object files with extensive debug
information, the size of this variable can exceed the system's
command-line argument length limit.

When this variable is subsequently passed to sed via `echo "${out}"`, it
triggers an "Argument list too long" error, causing the script to fail.

Fix this by redirecting the output of readelf to a temporary file
instead of a variable. The sed commands are then modified to read from
this file, avoiding the argument length limitation entirely.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:20 -07:00
Pankaj Raghav 6b4679fcbf scripts/faddr2line: Use /usr/bin/env bash for portability
The shebang `#!/bin/bash` assumes a fixed path for the bash interpreter.
This path does not exist on some systems, such as NixOS, causing the
script to fail.

Replace `/bin/bash` with the more portable `#!/usr/bin/env bash`.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:20 -07:00
John Wang 567f9c428f scripts/faddr2line: Set LANG=C to enforce ASCII output
Force tools like readelf to use the POSIX/C locale by exporting LANG=C.
This ensures ASCII-only output and avoids locale-specific
characters (e.g., UTF-8 symbols or translated strings), which could
break text-processing utilities like sed in the script.

Signed-off-by: John Wang <wangzq.jn@gmail.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:20 -07:00
Josh Poimboeuf a808a2b35f tools build: Fix fixdep dependencies
The tools version of fixdep has broken dependencies.  It doesn't get
rebuilt if the host compiler or headers change.

Build fixdep with the tools kbuild infrastructure, so fixdep runs on
itself.  Due to the recursive dependency, its dependency file is
incomplete the very first time it gets built.  In that case build it a
second time to achieve fixdep inception.

Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:20 -07:00
Chen Ni 2e985fdb7e objtool: Remove unneeded semicolon
Remove unnecessary semicolons reported by Coccinelle/coccicheck and the
semantic patch at scripts/coccinelle/misc/semicolon.cocci.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2025-10-14 14:45:20 -07:00
Peter Zijlstra 044f721ccd objtool/x86: Fix NOP decode
For x86_64 the kernel consistently uses 2 instructions for all NOPs:

  90       - NOP
  0f 1f /0 - NOPL

Notably:

 - REP NOP is PAUSE, not a NOP instruction.

 - 0f {0c...0f} is reserved space,
   except for 0f 0d /1, which is PREFETCHW, not a NOP.

 - 0f {19,1c...1f} is reserved space,
   except for 0f 1f /0, which is NOPL.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-14 13:43:11 +02:00
Peter Zijlstra 76e1851a1b objtool/x86: Add UDB support
Per commit 85a2d4a890 ("x86,ibt: Use UDB instead of 0xEA"), make
sure objtool also recognises UDB as a #UD instruction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-10-14 13:43:11 +02:00
Peter Zijlstra c5df4e1ab8 objtool/x86: Remove 0xea hack
This was properly fixed in the decoder by commit 4b626015e1 ("x86/insn:
Stop decoding i64 instructions in x86-64 mode at opcode").

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-10-14 13:43:10 +02:00
Juergen Gross ad74016b91 x86/alternative: Drop not needed test after call of alt_replace_call()
alt_replace_call() will never return a negative value, so the test for
a negative return value can be dropped.

This makes it possible to switch the return type of alt_replace_call()
and the type of insn_buff_sz to unsigned int.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-14 10:38:11 +02:00
Ingo Molnar a53d0cf7f1 Merge commit 'linus' into core/bugs, to resolve conflicts
Resolve conflicts with this commit that was developed in parallel
during the merge window:

 8c8efa93db ("x86/bug: Add ARCH_WARN_ASM macro for BUG/WARN asm code sharing with Rust")

 Conflicts:
	arch/riscv/include/asm/bug.h
	arch/x86/include/asm/bug.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-08-05 11:15:34 +02:00
Heiko Carstens ed845c363d bugs/s390: Remove private WARN_ON() implementation
Besides an odd __builtin_constant_p() optimization the s390 specific
WARN_ON() implementation is identical to the generic variant.
Drop the s390 variant in favor of the generic variant.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org> # Rebased ancestor commits
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20250617135042.1878068-2-hca@linux.ibm.com
2025-07-28 08:07:07 +02:00
Ingo Molnar 28ea295f94 bugs/core: Reorganize fields in the first line of WARNING output, add ->comm[] output
With the introduction of the condition string as part of the 'file'
string output of kernel warnings, the first line has become a bit
harder to read:

   WARNING: CPU: 0 PID: 0 at [ptr == 0 && 1] kernel/sched/core.c:8511 sched_init+0x20/0x410

Re-order the fields by importance (higher to lower), make the 'at' meaningful
again, and add '->comm[]' output which is often more valuable than a PID.

Also, remove the 'PID' prefix - in combination with comm it's clear what it is.

These changes make the output only slightly longer:

   WARNING: [ptr == 0 && 1] kernel/sched/core.c:8511 at sched_init+0x20/0x410 CPU#0: swapper/0

While adding more information and making it better organized.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org> # Rebased ancestor commits
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/r/20250515124644.2958810-16-mingo@kernel.org
2025-07-28 08:06:55 +02:00
Ingo Molnar be2ba2fef1 bugs/sh: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
Extend WARN_ON and BUG_ON style output from:

  WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:8511 sched_init+0x20/0x410

to:

  WARNING: CPU: 0 PID: 0 at [idx < 0 && ptr] kernel/sched/core.c:8511 sched_init+0x20/0x410

Note that the output will be further reorganized later in this series.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org> # Rebased ancestor commits
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: linux-arch@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Link: https://lore.kernel.org/r/20250515124644.2958810-15-mingo@kernel.org
2025-07-28 08:06:43 +02:00
Ingo Molnar f40484925b bugs/parisc: Concatenate 'cond_str' with '__FILE__' in __WARN_FLAGS(), to extend WARN_ON/BUG_ON output
Extend WARN_ON and BUG_ON style output from:

  WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:8511 sched_init+0x20/0x410

to:

  WARNING: CPU: 0 PID: 0 at [idx < 0 && ptr] kernel/sched/core.c:8511 sched_init+0x20/0x410

Note that the output will be further reorganized later in this series.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org> # Rebased ancestor commits
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Link: https://lore.kernel.org/r/20250515124644.2958810-14-mingo@kernel.org
2025-07-28 08:06:27 +02:00
Ingo Molnar bb39faa71d bugs/riscv: Concatenate 'cond_str' with '__FILE__' in __BUG_FLAGS(), to extend WARN_ON/BUG_ON output
Extend WARN_ON and BUG_ON style output from:

  WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:8511 sched_init+0x20/0x410

to:

  WARNING: CPU: 0 PID: 0 at [idx < 0 && ptr] kernel/sched/core.c:8511 sched_init+0x20/0x410

Note that the output will be further reorganized later in this series.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org> # Rebased ancestor commits
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Link: https://lore.kernel.org/r/20250515124644.2958810-13-mingo@kernel.org
2025-07-28 08:06:06 +02:00
Ingo Molnar 7e8c292692 bugs/riscv: Pass in 'cond_str' to __BUG_FLAGS()
Pass in the condition string from __WARN_FLAGS() to __BUG_FLAGS(),
but don't use it yet.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org> # Rebased ancestor commits
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Link: https://lore.kernel.org/r/20250515124644.2958810-12-mingo@kernel.org
2025-07-28 08:05:49 +02:00
Heiko Carstens 6584ff203a bugs/s390: Use 'cond_str' in __EMIT_BUG()
The simple thing would be to add the string as an assembly immediate
input operand. Some older gcc variants cannot handle strings as
immediate input operands for inline assemblies. Doing so may result in
compile errors.

Rewrite the s390 generic bug support very similar to arm64 and
loongarch, and get rid of all input operands to fix this.

  [ peterz: backmerge fix and massage changelog ]

  [ bp: clang integrated assembler concatenates only .ascii strings:
    https://lore.kernel.org/r/202507020528.N0LtekXt-lkp@intel.com ]

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org> # Fixed the tags section
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: linux-s390@vger.kernel.org
Link: https://lore.kernel.org/r/20250520133927.7932C19-hca@linux.ibm.com
Link: https://lore.kernel.org/r/20250617135042.1878068-3-hca@linux.ibm.com
2025-07-28 08:02:43 +02:00
Ingo Molnar 7ce0f693cb bugs/s390: Pass in 'cond_str' to __EMIT_BUG()
Pass in the condition string from __WARN_FLAGS(), but do not
concatenate it with __FILE__, because it results in s390
assembler build errors that are beyond my s390-asm-fu.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: <linux-arch@vger.kernel.org>
Link: https://lore.kernel.org/r/20250515124644.2958810-11-mingo@kernel.org
2025-07-28 08:01:56 +02:00
Ingo Molnar d6b894cbfa bugs/LoongArch: Concatenate 'cond_str' with '__FILE__' in __BUG_ENTRY(), to extend WARN_ON/BUG_ON output
Extend WARN_ON and BUG_ON style output from:

  WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:8511 sched_init+0x20/0x410

to:

  WARNING: CPU: 0 PID: 0 at [idx < 0 && ptr] kernel/sched/core.c:8511 sched_init+0x20/0x410

Note that the output will be further reorganized later in this series.

[ peterz: backmerge fix from Nathan ]

Fixed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org> # Cleaned up tags section
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/r/20250515124644.2958810-10-mingo@kernel.org
Link: https://lore.kernel.org/r/20250616-loongarch-fix-warn-cond-llvm-ias-v1-1-6c6d90bb4466@kernel.org
2025-07-28 08:01:23 +02:00
Ingo Molnar 66e94df0dd bugs/LoongArch: Pass in 'cond_str' to __BUG_ENTRY()
Pass in the condition string from __WARN_FLAGS(), but don't use it yet.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/r/20250515124644.2958810-9-mingo@kernel.org
2025-06-13 10:25:32 +02:00
Ingo Molnar 1284579a7f bugs/powerpc: Concatenate 'cond_str' with '__FILE__' in BUG_ENTRY(), to extend WARN_ON/BUG_ON output
Extend WARN_ON and BUG_ON style output from:

  WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:8511 sched_init+0x20/0x410

to:

  WARNING: CPU: 0 PID: 0 at [idx < 0 && ptr] kernel/sched/core.c:8511 sched_init+0x20/0x410

Note that the output will be further reorganized later in this series.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linux-arch@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lore.kernel.org/r/20250515124644.2958810-8-mingo@kernel.org
2025-06-13 10:25:32 +02:00
Ingo Molnar 1c59c2b284 bugs/powerpc: Pass in 'cond_str' to BUG_ENTRY()
Pass in the condition string from __WARN_FLAGS(), WARN_ON()
and BUG_ON(), but don't use it yet.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linux-arch@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lore.kernel.org/r/20250515124644.2958810-7-mingo@kernel.org
2025-06-13 10:25:32 +02:00
Ingo Molnar 48ede5be5c bugs/x86: Augment warnings output by concatenating 'cond_str' with the regular __FILE__ string in _BUG_FLAGS()
This allows the reuse of the UD2 based 'struct bug_entry' low-overhead
_BUG_FLAGS() implementation and string-printing backend, without
having to add a new field.

An example:

If we have the following WARN_ON_ONCE() in kernel/sched/core.c:

	WARN_ON_ONCE(idx < 0 && ptr);

Then previously _BUG_FLAGS() would store this string in bug_entry::file:

	"kernel/sched/core.c"

After this patch, it would store and print:

	"[idx < 0 && ptr] kernel/sched/core.c"

Which is an extended string that will be printed in warnings.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/r/20250515124644.2958810-6-mingo@kernel.org
2025-06-13 10:25:32 +02:00
Ingo Molnar 407b9076c1 bugs/x86: Extend _BUG_FLAGS() with the 'cond_str' parameter
Just pass down the parameter, don't do anything with it yet.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/r/20250515124644.2958810-5-mingo@kernel.org
2025-06-13 10:25:32 +02:00
Ingo Molnar 687fac9d1b bugs/core: Introduce the CONFIG_DEBUG_BUGVERBOSE_DETAILED Kconfig switch
Allow configurability of the inclusion of more detailed
WARN_ON() strings, to be implemented in subsequent
commits.

Since the full cost will be around 100K more memory on
an x86 defconfig, disable it by default.

Provide the WARN_CONDITION_STR() macro to allow the conditional
passing of extra strings to lower level BUG/WARN handlers.
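
A minimal sketch of the resulting switch, assuming this shape:

	#ifdef CONFIG_DEBUG_BUGVERBOSE_DETAILED
	# define WARN_CONDITION_STR(cond_str)	cond_str
	#else
	# define WARN_CONDITION_STR(cond_str)	""
	#endif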

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/r/20250515124644.2958810-4-mingo@kernel.org
2025-06-13 10:25:29 +02:00
Ingo Molnar 3bc3c9c3ab bugs/core: Pass down the condition string of WARN_ON_ONCE(cond) warnings to __WARN_FLAGS()
Doing this will allow architecture code to store and print out
this information as part of the WARN_ON and BUG_ON facilities.

The format of the string is '[condition]', for example:

  WARN_ON_ONCE(idx < 0 && ptr);

Will get the '[idx < 0 && ptr]' string literal passed down as 'cond_str'
in __WARN_FLAGS().
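
A hedged sketch of the call-site stringification (macro shape
approximate):

	#define WARN_ON_ONCE(condition) ({				\
		int __ret_warn_on = !!(condition);			\
		if (unlikely(__ret_warn_on))				\
			__WARN_FLAGS("[" #condition "]",		\
				     BUGFLAG_ONCE |			\
				     BUGFLAG_TAINT(TAINT_WARN));	\
		unlikely(__ret_warn_on);				\
	})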

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/r/20250515124644.2958810-3-mingo@kernel.org
2025-06-13 10:20:52 +02:00
Ingo Molnar aec58b4851 bugs/core: Extend __WARN_FLAGS() with the 'cond_str' parameter
Push the new parameter down into every architecture that defines __WARN_FLAGS():

  arm64
  loongarch
  parisc
  powerpc
  riscv
  s390
  sh
  x86

Don't pass anything substantial down yet, just propagate the
new parameter with empty strings, without generating it or
using it.

( The string is never NULL, so it can be concatenated at the
  preprocessor level. )

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/r/20250515124644.2958810-2-mingo@kernel.org
2025-06-13 10:20:52 +02:00
290 changed files with 14543 additions and 5360 deletions


@ -1309,3 +1309,16 @@ a different length, use
vfs_parse_fs_qstr(fc, key, &QSTR_LEN(value, len))
instead.
---
**mandatory**
vfs_mkdir() now returns a dentry - the one returned by ->mkdir(). If
that dentry is different from the dentry passed in, including if it is
an IS_ERR() dentry pointer, the original dentry is dput().
When vfs_mkdir() returns an error, and so both dputs() the original
dentry and doesn't provide a replacement, it also unlocks the parent.
Consequently the return value from vfs_mkdir() can be passed to
end_creating() and the parent will be unlocked precisely when necessary.
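
A hedged sketch of the resulting calling convention (variable names are
illustrative; per the note, an ERR_PTR() return can be fed straight to
end_creating()):

	struct dentry *de = vfs_mkdir(idmap, dir, dentry, mode);

	/* on error, 'dentry' was already dput() and the parent unlocked */
	end_creating(de);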

View File

@ -220,13 +220,14 @@ Read path, three categories:
according to a passed marker. This is used to avoid lockless readers
starvation (too much retry loops) in case of a sharp spike in write
activity. First, a lockless read is tried (even marker passed). If
that trial fails (odd sequence counter is returned, which is used as
the next iteration marker), the lockless read is transformed to a
full locking read and no retry loop is necessary::
that trial fails (the sequence counter doesn't match), the marker is
made odd for the next iteration and the lockless read is transformed
into a full locking read; no retry loop is necessary, for example::
/* marker; even initialization */
int seq = 0;
int seq = 1;
do {
seq++; /* 2 on the 1st/lockless path, otherwise odd */
read_seqbegin_or_lock(&foo_seqlock, &seq);
/* ... [[read-side critical section]] ... */
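
Completed with the retry/cleanup helpers, the documented pattern reads
roughly as follows (foo_seqlock is the example lock from above)::

	/* marker; odd initialization */
	int seq = 1;

	do {
		seq++; /* 2 on the 1st/lockless path, otherwise odd */
		read_seqbegin_or_lock(&foo_seqlock, &seq);

		/* ... [[read-side critical section]] ... */

	} while (need_seqretry(&foo_seqlock, seq));
	done_seqretry(&foo_seqlock, seq);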

View File

@ -14459,10 +14459,11 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching.g
F: Documentation/ABI/testing/sysfs-kernel-livepatch
F: Documentation/livepatch/
F: arch/powerpc/include/asm/livepatch.h
F: include/linux/livepatch.h
F: include/linux/livepatch*.h
F: kernel/livepatch/
F: kernel/module/livepatch.c
F: samples/livepatch/
F: scripts/livepatch/
F: tools/testing/selftests/livepatch/
LLC (802.2)
@ -14536,6 +14537,7 @@ S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core
F: Documentation/locking/
F: arch/*/include/asm/spinlock*.h
F: include/linux/local_lock*.h
F: include/linux/lockdep*.h
F: include/linux/mutex*.h
F: include/linux/rwlock*.h

View File

@ -19,7 +19,7 @@
unreachable(); \
} while (0)
#define __WARN_FLAGS(flags) __BUG_FLAGS(BUGFLAG_WARNING|(flags))
#define __WARN_FLAGS(cond_str, flags) __BUG_FLAGS(BUGFLAG_WARNING|(flags))
#define HAVE_ARCH_BUG

View File

@ -11,7 +11,7 @@
#else
#define __BUGVERBOSE_LOCATION(file, line) \
.pushsection .rodata.str, "aMS", @progbits, 1; \
10002: .string file; \
10002: .ascii file "\0"; \
.popsection; \
\
.long 10002b - .; \
@ -20,39 +20,38 @@
#endif
#ifndef CONFIG_GENERIC_BUG
#define __BUG_ENTRY(flags)
#define __BUG_ENTRY(cond_str, flags)
#else
#define __BUG_ENTRY(flags) \
#define __BUG_ENTRY(cond_str, flags) \
.pushsection __bug_table, "aw"; \
.align 2; \
10000: .long 10001f - .; \
_BUGVERBOSE_LOCATION(__FILE__, __LINE__) \
_BUGVERBOSE_LOCATION(WARN_CONDITION_STR(cond_str) __FILE__, __LINE__) \
.short flags; \
.popsection; \
10001:
#endif
#define ASM_BUG_FLAGS(flags) \
__BUG_ENTRY(flags) \
#define ASM_BUG_FLAGS(cond_str, flags) \
__BUG_ENTRY(cond_str, flags) \
break BRK_BUG;
#define ASM_BUG() ASM_BUG_FLAGS(0)
#define ASM_BUG() ASM_BUG_FLAGS("", 0)
#define __BUG_FLAGS(flags, extra) \
asm_inline volatile (__stringify(ASM_BUG_FLAGS(flags)) \
extra);
#define __BUG_FLAGS(cond_str, flags, extra) \
asm_inline volatile (__stringify(ASM_BUG_FLAGS(cond_str, flags)) extra);
#define __WARN_FLAGS(flags) \
#define __WARN_FLAGS(cond_str, flags) \
do { \
instrumentation_begin(); \
__BUG_FLAGS(BUGFLAG_WARNING|(flags), ANNOTATE_REACHABLE(10001b));\
__BUG_FLAGS(cond_str, BUGFLAG_WARNING|(flags), ANNOTATE_REACHABLE(10001b));\
instrumentation_end(); \
} while (0)
#define BUG() \
do { \
instrumentation_begin(); \
__BUG_FLAGS(0, ""); \
__BUG_FLAGS("", 0, ""); \
unreachable(); \
} while (0)

View File

@ -50,7 +50,7 @@
#endif
#ifdef CONFIG_DEBUG_BUGVERBOSE
#define __WARN_FLAGS(flags) \
#define __WARN_FLAGS(cond_str, flags) \
do { \
asm volatile("\n" \
"1:\t" PARISC_BUG_BREAK_ASM "\n" \
@ -61,12 +61,12 @@
"\t.short %1, %2\n" \
"\t.blockz %3-2*4-2*2\n" \
"\t.popsection" \
: : "i" (__FILE__), "i" (__LINE__), \
: : "i" (WARN_CONDITION_STR(cond_str) __FILE__), "i" (__LINE__), \
"i" (BUGFLAG_WARNING|(flags)), \
"i" (sizeof(struct bug_entry)) ); \
} while(0)
#else
#define __WARN_FLAGS(flags) \
#define __WARN_FLAGS(cond_str, flags) \
do { \
asm volatile("\n" \
"1:\t" PARISC_BUG_BREAK_ASM "\n" \

View File

@ -51,11 +51,11 @@
".previous\n"
#endif
#define BUG_ENTRY(insn, flags, ...) \
#define BUG_ENTRY(cond_str, insn, flags, ...) \
__asm__ __volatile__( \
"1: " insn "\n" \
_EMIT_BUG_ENTRY \
: : "i" (__FILE__), "i" (__LINE__), \
: : "i" (WARN_CONDITION_STR(cond_str) __FILE__), "i" (__LINE__), \
"i" (flags), \
"i" (sizeof(struct bug_entry)), \
##__VA_ARGS__)
@ -67,12 +67,12 @@
*/
#define BUG() do { \
BUG_ENTRY("twi 31, 0, 0", 0); \
BUG_ENTRY("", "twi 31, 0, 0", 0); \
unreachable(); \
} while (0)
#define HAVE_ARCH_BUG
#define __WARN_FLAGS(flags) BUG_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | (flags))
#define __WARN_FLAGS(cond_str, flags) BUG_ENTRY(cond_str, "twi 31, 0, 0", BUGFLAG_WARNING | (flags))
#ifdef CONFIG_PPC64
#define BUG_ON(x) do { \
@ -80,7 +80,7 @@
if (x) \
BUG(); \
} else { \
BUG_ENTRY(PPC_TLNEI " %4, 0", 0, "r" ((__force long)(x))); \
BUG_ENTRY(#x, PPC_TLNEI " %4, 0", 0, "r" ((__force long)(x))); \
} \
} while (0)
@ -90,7 +90,7 @@
if (__ret_warn_on) \
__WARN(); \
} else { \
BUG_ENTRY(PPC_TLNEI " %4, 0", \
BUG_ENTRY(#x, PPC_TLNEI " %4, 0", \
BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN), \
"r" (__ret_warn_on)); \
} \

View File

@ -267,22 +267,11 @@ spufs_mkdir(struct inode *dir, struct dentry *dentry, unsigned int flags,
static int spufs_context_open(const struct path *path)
{
int ret;
struct file *filp;
ret = get_unused_fd_flags(0);
if (ret < 0)
return ret;
filp = dentry_open(path, O_RDONLY, current_cred());
if (IS_ERR(filp)) {
put_unused_fd(ret);
return PTR_ERR(filp);
}
filp->f_op = &spufs_context_fops;
fd_install(ret, filp);
return ret;
FD_PREPARE(fdf, 0, dentry_open(path, O_RDONLY, current_cred()));
if (fdf.err)
return fdf.err;
fd_prepare_file(fdf)->f_op = &spufs_context_fops;
return fd_publish(fdf);
}
static struct spu_context *
@ -508,26 +497,15 @@ static const struct file_operations spufs_gang_fops = {
static int spufs_gang_open(const struct path *path)
{
int ret;
struct file *filp;
ret = get_unused_fd_flags(0);
if (ret < 0)
return ret;
/*
* get references for dget and mntget, will be released
* in error path of *_open().
*/
filp = dentry_open(path, O_RDONLY, current_cred());
if (IS_ERR(filp)) {
put_unused_fd(ret);
return PTR_ERR(filp);
}
filp->f_op = &spufs_gang_fops;
fd_install(ret, filp);
return ret;
FD_PREPARE(fdf, 0, dentry_open(path, O_RDONLY, current_cred()));
if (fdf.err)
return fdf.err;
fd_prepare_file(fdf)->f_op = &spufs_gang_fops;
return fd_publish(fdf);
}
static int spufs_create_gang(struct inode *inode,

View File

@ -479,10 +479,7 @@ static const struct file_operations papr_hvpipe_handle_ops = {
static int papr_hvpipe_dev_create_handle(u32 srcID)
{
struct hvpipe_source_info *src_info;
struct file *file;
long err;
int fd;
struct hvpipe_source_info *src_info __free(kfree) = NULL;
spin_lock(&hvpipe_src_list_lock);
/*
@ -506,20 +503,13 @@ static int papr_hvpipe_dev_create_handle(u32 srcID)
src_info->tsk = current;
init_waitqueue_head(&src_info->recv_wqh);
fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC);
if (fd < 0) {
err = fd;
goto free_buf;
}
file = anon_inode_getfile("[papr-hvpipe]",
&papr_hvpipe_handle_ops, (void *)src_info,
O_RDWR);
if (IS_ERR(file)) {
err = PTR_ERR(file);
goto free_fd;
}
FD_PREPARE(fdf, O_RDONLY | O_CLOEXEC,
anon_inode_getfile("[papr-hvpipe]", &papr_hvpipe_handle_ops,
(void *)src_info, O_RDWR));
if (fdf.err)
return fdf.err;
retain_and_null_ptr(src_info);
spin_lock(&hvpipe_src_list_lock);
/*
* If two processes are executing ioctl() for the same
@ -528,22 +518,11 @@ static int papr_hvpipe_dev_create_handle(u32 srcID)
*/
if (hvpipe_find_source(srcID)) {
spin_unlock(&hvpipe_src_list_lock);
err = -EALREADY;
goto free_file;
return -EALREADY;
}
list_add(&src_info->list, &hvpipe_src_list);
spin_unlock(&hvpipe_src_list_lock);
fd_install(fd, file);
return fd;
free_file:
fput(file);
free_fd:
put_unused_fd(fd);
free_buf:
kfree(src_info);
return err;
return fd_publish(fdf);
}
/*

View File

@ -303,8 +303,6 @@ static long papr_platform_dump_create_handle(u64 dump_tag)
{
struct ibm_platform_dump_params *params;
u64 param_dump_tag;
struct file *file;
long err;
int fd;
/*
@ -334,34 +332,22 @@ static long papr_platform_dump_create_handle(u64 dump_tag)
params->dump_tag_lo = (u32)(dump_tag & 0x00000000ffffffffULL);
params->status = RTAS_IBM_PLATFORM_DUMP_START;
fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC);
if (fd < 0) {
err = fd;
goto free_area;
}
file = anon_inode_getfile_fmode("[papr-platform-dump]",
fd = FD_ADD(O_RDONLY | O_CLOEXEC,
anon_inode_getfile_fmode("[papr-platform-dump]",
&papr_platform_dump_handle_ops,
(void *)params, O_RDONLY,
FMODE_LSEEK | FMODE_PREAD);
if (IS_ERR(file)) {
err = PTR_ERR(file);
goto put_fd;
FMODE_LSEEK | FMODE_PREAD));
if (fd < 0) {
rtas_work_area_free(params->work_area);
kfree(params);
return fd;
}
fd_install(fd, file);
list_add(&params->list, &platform_dump_list);
pr_info("%s (%d) initiated platform dump for dump tag %llu\n",
current->comm, current->pid, dump_tag);
return fd;
put_fd:
put_unused_fd(fd);
free_area:
rtas_work_area_free(params->work_area);
kfree(params);
return err;
}
/*

View File

@ -205,35 +205,18 @@ long papr_rtas_setup_file_interface(struct papr_rtas_sequence *seq,
char *name)
{
const struct papr_rtas_blob *blob;
struct file *file;
long ret;
int fd;
blob = papr_rtas_retrieve(seq);
if (IS_ERR(blob))
return PTR_ERR(blob);
fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC);
if (fd < 0) {
ret = fd;
goto free_blob;
}
file = anon_inode_getfile_fmode(name, fops, (void *)blob,
O_RDONLY, FMODE_LSEEK | FMODE_PREAD);
if (IS_ERR(file)) {
ret = PTR_ERR(file);
goto put_fd;
}
fd_install(fd, file);
return fd;
put_fd:
put_unused_fd(fd);
free_blob:
fd = FD_ADD(O_RDONLY | O_CLOEXEC,
anon_inode_getfile_fmode(name, fops, (void *)blob, O_RDONLY,
FMODE_LSEEK | FMODE_PREAD));
if (fd < 0)
papr_rtas_blob_free(blob);
return ret;
return fd;
}
/*

View File

@ -60,28 +60,28 @@ typedef u32 bug_insn_t;
".org 2b + " size "\n\t" \
".popsection" \
#define __BUG_FLAGS(flags) \
#define __BUG_FLAGS(cond_str, flags) \
do { \
__asm__ __volatile__ ( \
ARCH_WARN_ASM("%0", "%1", "%2", "%3") \
: \
: "i" (__FILE__), "i" (__LINE__), \
: "i" (WARN_CONDITION_STR(cond_str) __FILE__), "i" (__LINE__), \
"i" (flags), \
"i" (sizeof(struct bug_entry))); \
} while (0)
#else /* CONFIG_GENERIC_BUG */
#define __BUG_FLAGS(flags) do { \
#define __BUG_FLAGS(cond_str, flags) do { \
__asm__ __volatile__ ("ebreak\n"); \
} while (0)
#endif /* CONFIG_GENERIC_BUG */
#define BUG() do { \
__BUG_FLAGS(0); \
__BUG_FLAGS("", 0); \
unreachable(); \
} while (0)
#define __WARN_FLAGS(flags) __BUG_FLAGS(BUGFLAG_WARNING|(flags))
#define __WARN_FLAGS(cond_str, flags) __BUG_FLAGS(cond_str, BUGFLAG_WARNING|(flags))
#define ARCH_WARN_REACHABLE

View File

@ -2,69 +2,55 @@
#ifndef _ASM_S390_BUG_H
#define _ASM_S390_BUG_H
#include <linux/compiler.h>
#include <linux/stringify.h>
#ifdef CONFIG_BUG
#ifndef CONFIG_DEBUG_BUGVERBOSE
#define _BUGVERBOSE_LOCATION(file, line)
#else
#define __BUGVERBOSE_LOCATION(file, line) \
.pushsection .rodata.str, "aMS", @progbits, 1; \
10002: .ascii file "\0"; \
.popsection; \
\
.long 10002b - .; \
.short line;
#define _BUGVERBOSE_LOCATION(file, line) __BUGVERBOSE_LOCATION(file, line)
#endif
#ifdef CONFIG_DEBUG_BUGVERBOSE
#ifndef CONFIG_GENERIC_BUG
#define __BUG_ENTRY(cond_str, flags)
#else
#define __BUG_ENTRY(cond_str, flags) \
.pushsection __bug_table, "aw"; \
.align 4; \
10000: .long 10001f - .; \
_BUGVERBOSE_LOCATION(WARN_CONDITION_STR(cond_str) __FILE__, __LINE__) \
.short flags; \
.popsection; \
10001:
#endif
#define __EMIT_BUG(x) do { \
asm_inline volatile( \
"0: mc 0,0\n" \
".section .rodata.str,\"aMS\",@progbits,1\n" \
"1: .asciz \""__FILE__"\"\n" \
".previous\n" \
".section __bug_table,\"aw\"\n" \
"2: .long 0b-.\n" \
" .long 1b-.\n" \
" .short %0,%1\n" \
" .org 2b+%2\n" \
".previous\n" \
: : "i" (__LINE__), \
"i" (x), \
"i" (sizeof(struct bug_entry))); \
#define ASM_BUG_FLAGS(cond_str, flags) \
__BUG_ENTRY(cond_str, flags) \
mc 0,0
#define ASM_BUG() ASM_BUG_FLAGS("", 0)
#define __BUG_FLAGS(cond_str, flags) \
asm_inline volatile(__stringify(ASM_BUG_FLAGS(cond_str, flags)));
#define __WARN_FLAGS(cond_str, flags) \
do { \
__BUG_FLAGS(cond_str, BUGFLAG_WARNING|(flags)); \
} while (0)
#else /* CONFIG_DEBUG_BUGVERBOSE */
#define __EMIT_BUG(x) do { \
asm_inline volatile( \
"0: mc 0,0\n" \
".section __bug_table,\"aw\"\n" \
"1: .long 0b-.\n" \
" .short %0\n" \
" .org 1b+%1\n" \
".previous\n" \
: : "i" (x), \
"i" (sizeof(struct bug_entry))); \
} while (0)
#endif /* CONFIG_DEBUG_BUGVERBOSE */
#define BUG() do { \
__EMIT_BUG(0); \
#define BUG() \
do { \
__BUG_FLAGS("", 0); \
unreachable(); \
} while (0)
#define __WARN_FLAGS(flags) do { \
__EMIT_BUG(BUGFLAG_WARNING|(flags)); \
} while (0)
#define WARN_ON(x) ({ \
int __ret_warn_on = !!(x); \
if (__builtin_constant_p(__ret_warn_on)) { \
if (__ret_warn_on) \
__WARN(); \
} else { \
if (unlikely(__ret_warn_on)) \
__WARN(); \
} \
unlikely(__ret_warn_on); \
})
#define HAVE_ARCH_BUG
#define HAVE_ARCH_WARN_ON
#endif /* CONFIG_BUG */
#include <asm-generic/bug.h>

View File

@ -19,7 +19,7 @@
#ifdef CONFIG_EXPOLINE_EXTERN
SYM_CODE_START(\name)
#else
.pushsection .text.\name,"axG",@progbits,\name,comdat
.pushsection .text..\name,"axG",@progbits,\name,comdat
.globl \name
.hidden \name
.type \name,@function

View File

@ -51,7 +51,7 @@ SECTIONS
IRQENTRY_TEXT
SOFTIRQENTRY_TEXT
FTRACE_HOTPATCH_TRAMPOLINES_TEXT
*(.text.*_indirect_*)
*(.text..*_indirect_*)
*(.gnu.warning)
. = ALIGN(PAGE_SIZE);
_etext = .; /* End of text section */

View File

@ -199,8 +199,7 @@ static void pfault_interrupt(struct ext_code ext_code,
* return to userspace schedule() to block.
*/
__set_current_state(TASK_UNINTERRUPTIBLE);
set_tsk_need_resched(tsk);
set_preempt_need_resched();
set_need_resched_current();
}
}
out:

View File

@ -52,14 +52,14 @@ do { \
unreachable(); \
} while (0)
#define __WARN_FLAGS(flags) \
#define __WARN_FLAGS(cond_str, flags) \
do { \
__asm__ __volatile__ ( \
"1:\t.short %O0\n" \
_EMIT_BUG_ENTRY \
: \
: "n" (TRAPA_BUG_OPCODE), \
"i" (__FILE__), \
"i" (WARN_CONDITION_STR(cond_str) __FILE__), \
"i" (__LINE__), \
"i" (BUGFLAG_WARNING|(flags)), \
"i" (sizeof(struct bug_entry))); \

View File

@ -261,6 +261,7 @@ config X86
select HAVE_FUNCTION_ERROR_INJECTION
select HAVE_KRETPROBES
select HAVE_RETHOOK
select HAVE_KLP_BUILD if X86_64
select HAVE_LIVEPATCH if X86_64
select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_MOD_ARCH_SPECIFIC
@ -297,6 +298,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
@ -379,7 +381,7 @@ config GENERIC_CSUM
config GENERIC_BUG
def_bool y
depends on BUG
select GENERIC_BUG_RELATIVE_POINTERS if X86_64
select GENERIC_BUG_RELATIVE_POINTERS
config GENERIC_BUG_RELATIVE_POINTERS
bool

View File

@ -29,11 +29,10 @@
bool insn_has_rep_prefix(struct insn *insn)
{
insn_byte_t p;
int i;
insn_get_prefixes(insn);
for_each_insn_prefix(insn, i, p) {
for_each_insn_prefix(insn, p) {
if (p == 0xf2 || p == 0xf3)
return true;
}

View File

@ -36,7 +36,7 @@ $(patsubst %.o,$(obj)/%.o,$(lib-y)): OBJECT_FILES_NON_STANDARD := y
# relocations, even if other objtool actions are being deferred.
#
$(pi-objs): objtool-enabled = 1
$(pi-objs): objtool-args = $(if $(delay-objtool),,$(objtool-args-y)) --noabs
$(pi-objs): objtool-args = $(if $(delay-objtool),--dry-run,$(objtool-args-y)) --noabs
#
# Confine the startup code by prefixing all symbols with __pi_ (for position

View File

@ -32,6 +32,14 @@ SYM_FUNC_END(write_ibpb)
/* For KVM */
EXPORT_SYMBOL_GPL(write_ibpb);
SYM_FUNC_START(__WARN_trap)
ANNOTATE_NOENDBR
ANNOTATE_REACHABLE
ud1 (%edx), %_ASM_ARG1
RET
SYM_FUNC_END(__WARN_trap)
EXPORT_SYMBOL(__WARN_trap)
.popsection
/*

View File

@ -763,6 +763,11 @@ static void amd_pmu_enable_all(int added)
if (!test_bit(idx, cpuc->active_mask))
continue;
/*
* FIXME: cpuc->events[idx] can become NULL in a subtle race
* condition with NMI->throttle->x86_pmu_stop().
*/
if (cpuc->events[idx])
amd_pmu_enable_event(cpuc->events[idx]);
}
}

View File

@ -554,14 +554,22 @@ static inline int precise_br_compat(struct perf_event *event)
return m == b;
}
int x86_pmu_max_precise(void)
int x86_pmu_max_precise(struct pmu *pmu)
{
int precise = 0;
/* Support for constant skid */
if (x86_pmu.pebs_active && !x86_pmu.pebs_broken) {
/* arch PEBS */
if (x86_pmu.arch_pebs) {
precise = 2;
if (hybrid(pmu, arch_pebs_cap).pdists)
precise++;
return precise;
}
/* legacy PEBS - support for constant skid */
precise++;
/* Support for IP fixup */
if (x86_pmu.lbr_nr || x86_pmu.intel_cap.pebs_format >= 2)
precise++;
@ -569,13 +577,14 @@ int x86_pmu_max_precise(void)
if (x86_pmu.pebs_prec_dist)
precise++;
}
return precise;
}
int x86_pmu_hw_config(struct perf_event *event)
{
if (event->attr.precise_ip) {
int precise = x86_pmu_max_precise();
int precise = x86_pmu_max_precise(event->pmu);
if (event->attr.precise_ip > precise)
return -EOPNOTSUPP;
@ -1344,6 +1353,7 @@ static void x86_pmu_enable(struct pmu *pmu)
hwc->state |= PERF_HES_ARCH;
x86_pmu_stop(event, PERF_EF_UPDATE);
cpuc->events[hwc->idx] = NULL;
}
/*
@ -1365,6 +1375,7 @@ static void x86_pmu_enable(struct pmu *pmu)
* if cpuc->enabled = 0, then no wrmsr as
* per x86_pmu_enable_event()
*/
cpuc->events[hwc->idx] = event;
x86_pmu_start(event, PERF_EF_RELOAD);
}
cpuc->n_added = 0;
@ -1531,7 +1542,6 @@ static void x86_pmu_start(struct perf_event *event, int flags)
event->hw.state = 0;
cpuc->events[idx] = event;
__set_bit(idx, cpuc->active_mask);
static_call(x86_pmu_enable)(event);
perf_event_update_userpage(event);
@ -1610,7 +1620,6 @@ void x86_pmu_stop(struct perf_event *event, int flags)
if (test_bit(hwc->idx, cpuc->active_mask)) {
static_call(x86_pmu_disable)(event);
__clear_bit(hwc->idx, cpuc->active_mask);
cpuc->events[hwc->idx] = NULL;
WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
hwc->state |= PERF_HES_STOPPED;
}
@ -1648,6 +1657,7 @@ static void x86_pmu_del(struct perf_event *event, int flags)
* Not a TXN, therefore cleanup properly.
*/
x86_pmu_stop(event, PERF_EF_UPDATE);
cpuc->events[event->hw.idx] = NULL;
for (i = 0; i < cpuc->n_events; i++) {
if (event == cpuc->event_list[i])
@ -2629,7 +2639,9 @@ static ssize_t max_precise_show(struct device *cdev,
struct device_attribute *attr,
char *buf)
{
return snprintf(buf, PAGE_SIZE, "%d\n", x86_pmu_max_precise());
struct pmu *pmu = dev_get_drvdata(cdev);
return snprintf(buf, PAGE_SIZE, "%d\n", x86_pmu_max_precise(pmu));
}
static DEVICE_ATTR_RO(max_precise);
@ -2845,46 +2857,6 @@ static unsigned long get_segment_base(unsigned int segment)
return get_desc_base(desc);
}
#ifdef CONFIG_UPROBES
/*
* Heuristic-based check if uprobe is installed at the function entry.
*
* Under assumption of user code being compiled with frame pointers,
* `push %rbp/%ebp` is a good indicator that we indeed are.
*
* Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
* If we get this wrong, captured stack trace might have one extra bogus
* entry, but the rest of stack trace will still be meaningful.
*/
static bool is_uprobe_at_func_entry(struct pt_regs *regs)
{
struct arch_uprobe *auprobe;
if (!current->utask)
return false;
auprobe = current->utask->auprobe;
if (!auprobe)
return false;
/* push %rbp/%ebp */
if (auprobe->insn[0] == 0x55)
return true;
/* endbr64 (64-bit only) */
if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
return true;
return false;
}
#else
static bool is_uprobe_at_func_entry(struct pt_regs *regs)
{
return false;
}
#endif /* CONFIG_UPROBES */
#ifdef CONFIG_IA32_EMULATION
#include <linux/compat.h>

View File

@ -2563,6 +2563,44 @@ static void intel_pmu_disable_fixed(struct perf_event *event)
cpuc->fixed_ctrl_val &= ~mask;
}
static inline void __intel_pmu_update_event_ext(int idx, u64 ext)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
u32 msr;
if (idx < INTEL_PMC_IDX_FIXED) {
msr = MSR_IA32_PMC_V6_GP0_CFG_C +
x86_pmu.addr_offset(idx, false);
} else {
msr = MSR_IA32_PMC_V6_FX0_CFG_C +
x86_pmu.addr_offset(idx - INTEL_PMC_IDX_FIXED, false);
}
cpuc->cfg_c_val[idx] = ext;
wrmsrq(msr, ext);
}
static void intel_pmu_disable_event_ext(struct perf_event *event)
{
/*
* Only clear the CFG_C MSR for PEBS counter group events;
* this prevents the HW counter's value from being added into
* other PEBS records incorrectly after PEBS counter
* group events are disabled.
*
* For other events it's unnecessary to clear CFG_C MSRs,
* since CFG_C doesn't take effect while the counter is in
* the disabled state. That helps to reduce the WRMSR overhead
* in context switches.
*/
if (!is_pebs_counter_event_group(event))
return;
__intel_pmu_update_event_ext(event->hw.idx, 0);
}
DEFINE_STATIC_CALL_NULL(intel_pmu_disable_event_ext, intel_pmu_disable_event_ext);
static void intel_pmu_disable_event(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
@ -2571,9 +2609,12 @@ static void intel_pmu_disable_event(struct perf_event *event)
switch (idx) {
case 0 ... INTEL_PMC_IDX_FIXED - 1:
intel_clear_masks(event, idx);
static_call_cond(intel_pmu_disable_event_ext)(event);
x86_pmu_disable_event(event);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
static_call_cond(intel_pmu_disable_event_ext)(event);
fallthrough;
case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
intel_pmu_disable_fixed(event);
break;
@ -2940,6 +2981,79 @@ static void intel_pmu_enable_acr(struct perf_event *event)
DEFINE_STATIC_CALL_NULL(intel_pmu_enable_acr_event, intel_pmu_enable_acr);
static void intel_pmu_enable_event_ext(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
union arch_pebs_index old, new;
struct arch_pebs_cap cap;
u64 ext = 0;
cap = hybrid(cpuc->pmu, arch_pebs_cap);
if (event->attr.precise_ip) {
u64 pebs_data_cfg = intel_get_arch_pebs_data_config(event);
ext |= ARCH_PEBS_EN;
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)
ext |= (-hwc->sample_period) & ARCH_PEBS_RELOAD;
if (pebs_data_cfg && cap.caps) {
if (pebs_data_cfg & PEBS_DATACFG_MEMINFO)
ext |= ARCH_PEBS_AUX & cap.caps;
if (pebs_data_cfg & PEBS_DATACFG_GP)
ext |= ARCH_PEBS_GPR & cap.caps;
if (pebs_data_cfg & PEBS_DATACFG_XMMS)
ext |= ARCH_PEBS_VECR_XMM & cap.caps;
if (pebs_data_cfg & PEBS_DATACFG_LBRS)
ext |= ARCH_PEBS_LBR & cap.caps;
if (pebs_data_cfg &
(PEBS_DATACFG_CNTR_MASK << PEBS_DATACFG_CNTR_SHIFT))
ext |= ARCH_PEBS_CNTR_GP & cap.caps;
if (pebs_data_cfg &
(PEBS_DATACFG_FIX_MASK << PEBS_DATACFG_FIX_SHIFT))
ext |= ARCH_PEBS_CNTR_FIXED & cap.caps;
if (pebs_data_cfg & PEBS_DATACFG_METRICS)
ext |= ARCH_PEBS_CNTR_METRICS & cap.caps;
}
if (cpuc->n_pebs == cpuc->n_large_pebs)
new.thresh = ARCH_PEBS_THRESH_MULTI;
else
new.thresh = ARCH_PEBS_THRESH_SINGLE;
rdmsrq(MSR_IA32_PEBS_INDEX, old.whole);
if (new.thresh != old.thresh || !old.en) {
if (old.thresh == ARCH_PEBS_THRESH_MULTI && old.wr > 0) {
/*
* Large PEBS was enabled.
* Drain PEBS buffer before applying the single PEBS.
*/
intel_pmu_drain_pebs_buffer();
} else {
new.wr = 0;
new.full = 0;
new.en = 1;
wrmsrq(MSR_IA32_PEBS_INDEX, new.whole);
}
}
}
if (is_pebs_counter_event_group(event))
ext |= ARCH_PEBS_CNTR_ALLOW;
if (cpuc->cfg_c_val[hwc->idx] != ext)
__intel_pmu_update_event_ext(hwc->idx, ext);
}
DEFINE_STATIC_CALL_NULL(intel_pmu_enable_event_ext, intel_pmu_enable_event_ext);
static void intel_pmu_enable_event(struct perf_event *event)
{
u64 enable_mask = ARCH_PERFMON_EVENTSEL_ENABLE;
@ -2955,10 +3069,12 @@ static void intel_pmu_enable_event(struct perf_event *event)
enable_mask |= ARCH_PERFMON_EVENTSEL_BR_CNTR;
intel_set_masks(event, idx);
static_call_cond(intel_pmu_enable_acr_event)(event);
static_call_cond(intel_pmu_enable_event_ext)(event);
__x86_pmu_enable_event(hwc, enable_mask);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
static_call_cond(intel_pmu_enable_acr_event)(event);
static_call_cond(intel_pmu_enable_event_ext)(event);
fallthrough;
case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
intel_pmu_enable_fixed(event);
@ -3215,6 +3331,19 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
status &= ~GLOBAL_STATUS_PERF_METRICS_OVF_BIT;
}
/*
* Arch PEBS sets bit 54 in the global status register
*/
if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
(unsigned long *)&status)) {
handled++;
static_call(x86_pmu_drain_pebs)(regs, &data);
if (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS] &&
is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS]))
status &= ~GLOBAL_STATUS_PERF_METRICS_OVF_BIT;
}
/*
* Intel PT
*/
@ -3269,7 +3398,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
* The PEBS buffer has to be drained before handling the A-PMI
*/
if (is_pebs_counter_event_group(event))
x86_pmu.drain_pebs(regs, &data);
static_call(x86_pmu_drain_pebs)(regs, &data);
last_period = event->hw.last_period;
@ -4029,7 +4158,9 @@ static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
if (!event->attr.exclude_kernel)
flags &= ~PERF_SAMPLE_REGS_USER;
if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
flags &= ~PERF_SAMPLE_REGS_USER;
if (event->attr.sample_regs_intr & ~PEBS_GP_REGS)
flags &= ~PERF_SAMPLE_REGS_INTR;
return flags;
}
@ -4204,6 +4335,20 @@ static bool intel_pmu_is_acr_group(struct perf_event *event)
return false;
}
static inline bool intel_pmu_has_pebs_counter_group(struct pmu *pmu)
{
u64 caps;
if (x86_pmu.intel_cap.pebs_format >= 6 && x86_pmu.intel_cap.pebs_baseline)
return true;
caps = hybrid(pmu, arch_pebs_cap).caps;
if (x86_pmu.arch_pebs && (caps & ARCH_PEBS_CNTR_MASK))
return true;
return false;
}
static inline void intel_pmu_set_acr_cntr_constr(struct perf_event *event,
u64 *cause_mask, int *num)
{
@ -4237,6 +4382,8 @@ static int intel_pmu_hw_config(struct perf_event *event)
}
if (event->attr.precise_ip) {
struct arch_pebs_cap pebs_cap = hybrid(event->pmu, arch_pebs_cap);
if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_FIXED_VLBR_EVENT)
return -EINVAL;
@ -4250,6 +4397,15 @@ static int intel_pmu_hw_config(struct perf_event *event)
}
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
if (x86_pmu.arch_pebs) {
u64 cntr_mask = hybrid(event->pmu, intel_ctrl) &
~GLOBAL_CTRL_EN_PERF_METRICS;
u64 pebs_mask = event->attr.precise_ip >= 3 ?
pebs_cap.pdists : pebs_cap.counters;
if (cntr_mask != pebs_mask)
event->hw.dyn_constraint &= pebs_mask;
}
}
if (needs_branch_stack(event)) {
@ -4341,8 +4497,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
}
if ((event->attr.sample_type & PERF_SAMPLE_READ) &&
(x86_pmu.intel_cap.pebs_format >= 6) &&
x86_pmu.intel_cap.pebs_baseline &&
intel_pmu_has_pebs_counter_group(event->pmu) &&
is_sampling_event(event) &&
event->attr.precise_ip)
event->group_leader->hw.flags |= PERF_X86_EVENT_PEBS_CNTR;
@ -5212,7 +5367,13 @@ int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu)
static int intel_pmu_cpu_prepare(int cpu)
{
return intel_cpuc_prepare(&per_cpu(cpu_hw_events, cpu), cpu);
int ret;
ret = intel_cpuc_prepare(&per_cpu(cpu_hw_events, cpu), cpu);
if (ret)
return ret;
return alloc_arch_pebs_buf_on_cpu(cpu);
}
static void flip_smm_bit(void *data)
@ -5257,6 +5418,163 @@ static void intel_pmu_check_event_constraints(struct event_constraint *event_con
u64 fixed_cntr_mask,
u64 intel_ctrl);
enum dyn_constr_type {
DYN_CONSTR_NONE,
DYN_CONSTR_BR_CNTR,
DYN_CONSTR_ACR_CNTR,
DYN_CONSTR_ACR_CAUSE,
DYN_CONSTR_PEBS,
DYN_CONSTR_PDIST,
DYN_CONSTR_MAX,
};
static const char * const dyn_constr_type_name[] = {
[DYN_CONSTR_NONE] = "a normal event",
[DYN_CONSTR_BR_CNTR] = "a branch counter logging event",
[DYN_CONSTR_ACR_CNTR] = "an auto-counter reload event",
[DYN_CONSTR_ACR_CAUSE] = "an auto-counter reload cause event",
[DYN_CONSTR_PEBS] = "a PEBS event",
[DYN_CONSTR_PDIST] = "a PEBS PDIST event",
};
static void __intel_pmu_check_dyn_constr(struct event_constraint *constr,
enum dyn_constr_type type, u64 mask)
{
struct event_constraint *c1, *c2;
int new_weight, check_weight;
u64 new_mask, check_mask;
for_each_event_constraint(c1, constr) {
new_mask = c1->idxmsk64 & mask;
new_weight = hweight64(new_mask);
/* ignore topdown perf metrics event */
if (c1->idxmsk64 & INTEL_PMC_MSK_TOPDOWN)
continue;
if (!new_weight && fls64(c1->idxmsk64) < INTEL_PMC_IDX_FIXED) {
pr_info("The event 0x%llx is not supported as %s.\n",
c1->code, dyn_constr_type_name[type]);
}
if (new_weight <= 1)
continue;
for_each_event_constraint(c2, c1 + 1) {
bool check_fail = false;
check_mask = c2->idxmsk64 & mask;
check_weight = hweight64(check_mask);
if (c2->idxmsk64 & INTEL_PMC_MSK_TOPDOWN ||
!check_weight)
continue;
/* The same constraints or no overlap */
if (new_mask == check_mask ||
(new_mask ^ check_mask) == (new_mask | check_mask))
continue;
/*
* A scheduler issue may be triggered in the following cases.
* - Two overlap constraints have the same weight.
* E.g., A constraints: 0x3, B constraints: 0x6
* event counter failure case
* B PMC[2:1] 1
* A PMC[1:0] 0
* A PMC[1:0] FAIL
* - Two overlap constraints have different weight.
* The constraint has a low weight, but has high last bit.
* E.g., A constraints: 0x7, B constraints: 0xC
* event counter failure case
* B PMC[3:2] 2
* A PMC[2:0] 0
* A PMC[2:0] 1
* A PMC[2:0] FAIL
*/
if (new_weight == check_weight) {
check_fail = true;
} else if (new_weight < check_weight) {
if ((new_mask | check_mask) != check_mask &&
fls64(new_mask) > fls64(check_mask))
check_fail = true;
} else {
if ((new_mask | check_mask) != new_mask &&
fls64(new_mask) < fls64(check_mask))
check_fail = true;
}
if (check_fail) {
pr_info("The two events 0x%llx and 0x%llx may not be "
"fully scheduled under some circumstances as "
"%s.\n",
c1->code, c2->code, dyn_constr_type_name[type]);
}
}
}
}
static void intel_pmu_check_dyn_constr(struct pmu *pmu,
struct event_constraint *constr,
u64 cntr_mask)
{
enum dyn_constr_type i;
u64 mask;
for (i = DYN_CONSTR_NONE; i < DYN_CONSTR_MAX; i++) {
mask = 0;
switch (i) {
case DYN_CONSTR_NONE:
mask = cntr_mask;
break;
case DYN_CONSTR_BR_CNTR:
if (x86_pmu.flags & PMU_FL_BR_CNTR)
mask = x86_pmu.lbr_counters;
break;
case DYN_CONSTR_ACR_CNTR:
mask = hybrid(pmu, acr_cntr_mask64) & GENMASK_ULL(INTEL_PMC_MAX_GENERIC - 1, 0);
break;
case DYN_CONSTR_ACR_CAUSE:
if (hybrid(pmu, acr_cntr_mask64) == hybrid(pmu, acr_cause_mask64))
continue;
mask = hybrid(pmu, acr_cause_mask64) & GENMASK_ULL(INTEL_PMC_MAX_GENERIC - 1, 0);
break;
case DYN_CONSTR_PEBS:
if (x86_pmu.arch_pebs)
mask = hybrid(pmu, arch_pebs_cap).counters;
break;
case DYN_CONSTR_PDIST:
if (x86_pmu.arch_pebs)
mask = hybrid(pmu, arch_pebs_cap).pdists;
break;
default:
pr_warn("Unsupported dynamic constraint type %d\n", i);
}
if (mask)
__intel_pmu_check_dyn_constr(constr, i, mask);
}
}
static void intel_pmu_check_event_constraints_all(struct pmu *pmu)
{
struct event_constraint *event_constraints = hybrid(pmu, event_constraints);
struct event_constraint *pebs_constraints = hybrid(pmu, pebs_constraints);
u64 cntr_mask = hybrid(pmu, cntr_mask64);
u64 fixed_cntr_mask = hybrid(pmu, fixed_cntr_mask64);
u64 intel_ctrl = hybrid(pmu, intel_ctrl);
intel_pmu_check_event_constraints(event_constraints, cntr_mask,
fixed_cntr_mask, intel_ctrl);
if (event_constraints)
intel_pmu_check_dyn_constr(pmu, event_constraints, cntr_mask);
if (pebs_constraints)
intel_pmu_check_dyn_constr(pmu, pebs_constraints, cntr_mask);
}
static void intel_pmu_check_extra_regs(struct extra_reg *extra_regs);
static inline bool intel_pmu_broken_perf_cap(void)
@ -5269,34 +5587,89 @@ static inline bool intel_pmu_broken_perf_cap(void)
return false;
}
static inline void __intel_update_pmu_caps(struct pmu *pmu)
{
struct pmu *dest_pmu = pmu ? pmu : x86_get_pmu(smp_processor_id());
if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
}
static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
{
u64 caps = hybrid(pmu, arch_pebs_cap).caps;
x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
if (caps & ARCH_PEBS_LBR)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_BRANCH_STACK;
if (caps & ARCH_PEBS_CNTR_MASK)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_READ;
if (!(caps & ARCH_PEBS_AUX))
x86_pmu.large_pebs_flags &= ~PERF_SAMPLE_DATA_SRC;
if (!(caps & ARCH_PEBS_GPR)) {
x86_pmu.large_pebs_flags &=
~(PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER);
}
}
#define counter_mask(_gp, _fixed) ((_gp) | ((u64)(_fixed) << INTEL_PMC_IDX_FIXED))
static void update_pmu_cap(struct pmu *pmu)
{
unsigned int cntr, fixed_cntr, ecx, edx;
union cpuid35_eax eax;
union cpuid35_ebx ebx;
unsigned int eax, ebx, ecx, edx;
union cpuid35_eax eax_0;
union cpuid35_ebx ebx_0;
u64 cntrs_mask = 0;
u64 pebs_mask = 0;
u64 pdists_mask = 0;
cpuid(ARCH_PERFMON_EXT_LEAF, &eax.full, &ebx.full, &ecx, &edx);
cpuid(ARCH_PERFMON_EXT_LEAF, &eax_0.full, &ebx_0.full, &ecx, &edx);
if (ebx.split.umask2)
if (ebx_0.split.umask2)
hybrid(pmu, config_mask) |= ARCH_PERFMON_EVENTSEL_UMASK2;
if (ebx.split.eq)
if (ebx_0.split.eq)
hybrid(pmu, config_mask) |= ARCH_PERFMON_EVENTSEL_EQ;
if (eax.split.cntr_subleaf) {
if (eax_0.split.cntr_subleaf) {
cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF,
&cntr, &fixed_cntr, &ecx, &edx);
hybrid(pmu, cntr_mask64) = cntr;
hybrid(pmu, fixed_cntr_mask64) = fixed_cntr;
&eax, &ebx, &ecx, &edx);
hybrid(pmu, cntr_mask64) = eax;
hybrid(pmu, fixed_cntr_mask64) = ebx;
cntrs_mask = counter_mask(eax, ebx);
}
if (eax.split.acr_subleaf) {
if (eax_0.split.acr_subleaf) {
cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_ACR_LEAF,
&cntr, &fixed_cntr, &ecx, &edx);
&eax, &ebx, &ecx, &edx);
/* The mask of the counters which can be reloaded */
hybrid(pmu, acr_cntr_mask64) = cntr | ((u64)fixed_cntr << INTEL_PMC_IDX_FIXED);
hybrid(pmu, acr_cntr_mask64) = counter_mask(eax, ebx);
/* The mask of the counters which can cause a reload of reloadable counters */
hybrid(pmu, acr_cause_mask64) = ecx | ((u64)edx << INTEL_PMC_IDX_FIXED);
hybrid(pmu, acr_cause_mask64) = counter_mask(ecx, edx);
}
/* Bits[5:4] should be set simultaneously if arch-PEBS is supported */
if (eax_0.split.pebs_caps_subleaf && eax_0.split.pebs_cnts_subleaf) {
cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_PEBS_CAP_LEAF,
&eax, &ebx, &ecx, &edx);
hybrid(pmu, arch_pebs_cap).caps = (u64)ebx << 32;
cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_PEBS_COUNTER_LEAF,
&eax, &ebx, &ecx, &edx);
pebs_mask = counter_mask(eax, ecx);
pdists_mask = counter_mask(ebx, edx);
hybrid(pmu, arch_pebs_cap).counters = pebs_mask;
hybrid(pmu, arch_pebs_cap).pdists = pdists_mask;
if (WARN_ON((pebs_mask | pdists_mask) & ~cntrs_mask)) {
x86_pmu.arch_pebs = 0;
} else {
__intel_update_pmu_caps(pmu);
__intel_update_large_pebs_flags(pmu);
}
} else {
WARN_ON(x86_pmu.arch_pebs == 1);
x86_pmu.arch_pebs = 0;
}
if (!intel_pmu_broken_perf_cap()) {
@ -5319,10 +5692,7 @@ static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
else
pmu->intel_ctrl &= ~GLOBAL_CTRL_EN_PERF_METRICS;
intel_pmu_check_event_constraints(pmu->event_constraints,
pmu->cntr_mask64,
pmu->fixed_cntr_mask64,
pmu->intel_ctrl);
intel_pmu_check_event_constraints_all(&pmu->pmu);
intel_pmu_check_extra_regs(pmu->extra_regs);
}
@ -5418,6 +5788,7 @@ static void intel_pmu_cpu_starting(int cpu)
return;
init_debug_store_on_cpu(cpu);
init_arch_pebs_on_cpu(cpu);
/*
* Deal with CPUs that don't clear their LBRs on power-up, and that may
* even boot with LBRs enabled.
@ -5456,6 +5827,8 @@ static void intel_pmu_cpu_starting(int cpu)
}
}
__intel_update_pmu_caps(cpuc->pmu);
if (!cpuc->shared_regs)
return;
@ -5515,6 +5888,7 @@ static void free_excl_cntrs(struct cpu_hw_events *cpuc)
static void intel_pmu_cpu_dying(int cpu)
{
fini_debug_store_on_cpu(cpu);
fini_arch_pebs_on_cpu(cpu);
}
void intel_cpuc_finish(struct cpu_hw_events *cpuc)
@ -5535,6 +5909,7 @@ static void intel_pmu_cpu_dead(int cpu)
{
struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
release_arch_pebs_buf_on_cpu(cpu);
intel_cpuc_finish(cpuc);
if (is_hybrid() && cpuc->pmu)
@ -6250,7 +6625,7 @@ tsx_is_visible(struct kobject *kobj, struct attribute *attr, int i)
static umode_t
pebs_is_visible(struct kobject *kobj, struct attribute *attr, int i)
{
return x86_pmu.ds_pebs ? attr->mode : 0;
return intel_pmu_has_pebs() ? attr->mode : 0;
}
static umode_t
@ -6940,8 +7315,11 @@ __init int intel_pmu_init(void)
* Many features on and after V6, e.g. Arch PEBS and ACR,
* require dynamic constraints.
*/
if (version >= 6)
if (version >= 6) {
x86_pmu.flags |= PMU_FL_DYN_CONSTRAINT;
x86_pmu.late_setup = intel_pmu_late_setup;
}
/*
* Install the hw-cache-events table:
*/
@ -7727,6 +8105,14 @@ __init int intel_pmu_init(void)
if (!is_hybrid() && boot_cpu_has(X86_FEATURE_ARCH_PERFMON_EXT))
update_pmu_cap(NULL);
if (x86_pmu.arch_pebs) {
static_call_update(intel_pmu_disable_event_ext,
intel_pmu_disable_event_ext);
static_call_update(intel_pmu_enable_event_ext,
intel_pmu_enable_event_ext);
pr_cont("Architectural PEBS, ");
}
intel_pmu_check_counters_mask(&x86_pmu.cntr_mask64,
&x86_pmu.fixed_cntr_mask64,
&x86_pmu.intel_ctrl);
@ -7735,10 +8121,8 @@ __init int intel_pmu_init(void)
if (x86_pmu.intel_cap.anythread_deprecated)
x86_pmu.format_attrs = intel_arch_formats_attr;
intel_pmu_check_event_constraints(x86_pmu.event_constraints,
x86_pmu.cntr_mask64,
x86_pmu.fixed_cntr_mask64,
x86_pmu.intel_ctrl);
intel_pmu_check_event_constraints_all(NULL);
/*
* Access LBR MSR may cause #GP under certain circumstances.
* Check all LBR MSR here.

View File

@ -41,7 +41,7 @@
* MSR_CORE_C1_RES: CORE C1 Residency Counter
* perf code: 0x00
* Available model: SLM,AMT,GLM,CNL,ICX,TNT,ADL,RPL
* MTL,SRF,GRR,ARL,LNL
* MTL,SRF,GRR,ARL,LNL,PTL
* Scope: Core (each processor core has a MSR)
* MSR_CORE_C3_RESIDENCY: CORE C3 Residency Counter
* perf code: 0x01
@ -53,31 +53,32 @@
* Available model: SLM,AMT,NHM,WSM,SNB,IVB,HSW,BDW,
* SKL,KNL,GLM,CNL,KBL,CML,ICL,ICX,
* TGL,TNT,RKL,ADL,RPL,SPR,MTL,SRF,
* GRR,ARL,LNL
* GRR,ARL,LNL,PTL
* Scope: Core
* MSR_CORE_C7_RESIDENCY: CORE C7 Residency Counter
* perf code: 0x03
* Available model: SNB,IVB,HSW,BDW,SKL,CNL,KBL,CML,
* ICL,TGL,RKL,ADL,RPL,MTL,ARL,LNL
* ICL,TGL,RKL,ADL,RPL,MTL,ARL,LNL,
* PTL
* Scope: Core
* MSR_PKG_C2_RESIDENCY: Package C2 Residency Counter.
* perf code: 0x00
* Available model: SNB,IVB,HSW,BDW,SKL,KNL,GLM,CNL,
* KBL,CML,ICL,ICX,TGL,TNT,RKL,ADL,
* RPL,SPR,MTL,ARL,LNL,SRF
* RPL,SPR,MTL,ARL,LNL,SRF,PTL
* Scope: Package (physical package)
* MSR_PKG_C3_RESIDENCY: Package C3 Residency Counter.
* perf code: 0x01
* Available model: NHM,WSM,SNB,IVB,HSW,BDW,SKL,KNL,
* GLM,CNL,KBL,CML,ICL,TGL,TNT,RKL,
* ADL,RPL,MTL,ARL,LNL
* ADL,RPL,MTL,ARL
* Scope: Package (physical package)
* MSR_PKG_C6_RESIDENCY: Package C6 Residency Counter.
* perf code: 0x02
* Available model: SLM,AMT,NHM,WSM,SNB,IVB,HSW,BDW,
* SKL,KNL,GLM,CNL,KBL,CML,ICL,ICX,
* TGL,TNT,RKL,ADL,RPL,SPR,MTL,SRF,
* ARL,LNL
* ARL,LNL,PTL
* Scope: Package (physical package)
* MSR_PKG_C7_RESIDENCY: Package C7 Residency Counter.
* perf code: 0x03
@ -96,7 +97,7 @@
* MSR_PKG_C10_RESIDENCY: Package C10 Residency Counter.
* perf code: 0x06
* Available model: HSW ULT,KBL,GLM,CNL,CML,ICL,TGL,
* TNT,RKL,ADL,RPL,MTL,ARL,LNL
* TNT,RKL,ADL,RPL,MTL,ARL,LNL,PTL
* Scope: Package (physical package)
* MSR_MODULE_C6_RES_MS: Module C6 Residency Counter.
* perf code: 0x00
@ -522,7 +523,6 @@ static const struct cstate_model lnl_cstates __initconst = {
BIT(PERF_CSTATE_CORE_C7_RES),
.pkg_events = BIT(PERF_CSTATE_PKG_C2_RES) |
BIT(PERF_CSTATE_PKG_C3_RES) |
BIT(PERF_CSTATE_PKG_C6_RES) |
BIT(PERF_CSTATE_PKG_C10_RES),
};
@ -628,6 +628,7 @@ static const struct x86_cpu_id intel_cstates_match[] __initconst = {
X86_MATCH_VFM(INTEL_ATOM_GRACEMONT, &adl_cstates),
X86_MATCH_VFM(INTEL_ATOM_CRESTMONT_X, &srf_cstates),
X86_MATCH_VFM(INTEL_ATOM_CRESTMONT, &grr_cstates),
X86_MATCH_VFM(INTEL_ATOM_DARKMONT_X, &srf_cstates),
X86_MATCH_VFM(INTEL_ICELAKE_L, &icl_cstates),
X86_MATCH_VFM(INTEL_ICELAKE, &icl_cstates),
@ -652,6 +653,7 @@ static const struct x86_cpu_id intel_cstates_match[] __initconst = {
X86_MATCH_VFM(INTEL_ARROWLAKE_H, &adl_cstates),
X86_MATCH_VFM(INTEL_ARROWLAKE_U, &adl_cstates),
X86_MATCH_VFM(INTEL_LUNARLAKE_M, &lnl_cstates),
X86_MATCH_VFM(INTEL_PANTHERLAKE_L, &lnl_cstates),
{ },
};
MODULE_DEVICE_TABLE(x86cpu, intel_cstates_match);

View File

@ -626,13 +626,18 @@ static int alloc_pebs_buffer(int cpu)
int max, node = cpu_to_node(cpu);
void *buffer, *insn_buff, *cea;
if (!x86_pmu.ds_pebs)
if (!intel_pmu_has_pebs())
return 0;
buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
if (unlikely(!buffer))
return -ENOMEM;
if (x86_pmu.arch_pebs) {
hwev->pebs_vaddr = buffer;
return 0;
}
/*
* HSW+ already provides us the eventing ip; no need to allocate this
* buffer then.
@ -645,7 +650,7 @@ static int alloc_pebs_buffer(int cpu)
}
per_cpu(insn_buffer, cpu) = insn_buff;
}
hwev->ds_pebs_vaddr = buffer;
hwev->pebs_vaddr = buffer;
/* Update the cpu entry area mapping */
cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
ds->pebs_buffer_base = (unsigned long) cea;
@ -661,17 +666,20 @@ static void release_pebs_buffer(int cpu)
struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
void *cea;
if (!x86_pmu.ds_pebs)
if (!intel_pmu_has_pebs())
return;
if (x86_pmu.ds_pebs) {
kfree(per_cpu(insn_buffer, cpu));
per_cpu(insn_buffer, cpu) = NULL;
/* Clear the fixmap */
cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
ds_clear_cea(cea, x86_pmu.pebs_buffer_size);
dsfree_pages(hwev->ds_pebs_vaddr, x86_pmu.pebs_buffer_size);
hwev->ds_pebs_vaddr = NULL;
}
dsfree_pages(hwev->pebs_vaddr, x86_pmu.pebs_buffer_size);
hwev->pebs_vaddr = NULL;
}
static int alloc_bts_buffer(int cpu)
@ -824,6 +832,56 @@ void reserve_ds_buffers(void)
}
}
inline int alloc_arch_pebs_buf_on_cpu(int cpu)
{
if (!x86_pmu.arch_pebs)
return 0;
return alloc_pebs_buffer(cpu);
}
inline void release_arch_pebs_buf_on_cpu(int cpu)
{
if (!x86_pmu.arch_pebs)
return;
release_pebs_buffer(cpu);
}
void init_arch_pebs_on_cpu(int cpu)
{
struct cpu_hw_events *cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
u64 arch_pebs_base;
if (!x86_pmu.arch_pebs)
return;
if (!cpuc->pebs_vaddr) {
WARN(1, "Failed to allocate PEBS buffer on CPU %d\n", cpu);
x86_pmu.pebs_active = 0;
return;
}
/*
* 4KB-aligned pointer to the output buffer
* (__alloc_pages_node() returns a page-aligned address).
* Buffer size = 4KB * 2^SIZE, in a physically contiguous
* buffer (__alloc_pages_node() with order).
*/
arch_pebs_base = virt_to_phys(cpuc->pebs_vaddr) | PEBS_BUFFER_SHIFT;
wrmsr_on_cpu(cpu, MSR_IA32_PEBS_BASE, (u32)arch_pebs_base,
(u32)(arch_pebs_base >> 32));
x86_pmu.pebs_active = 1;
}
inline void fini_arch_pebs_on_cpu(int cpu)
{
if (!x86_pmu.arch_pebs)
return;
wrmsr_on_cpu(cpu, MSR_IA32_PEBS_BASE, 0, 0);
}
/*
* BTS
*/
@ -1471,6 +1529,25 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
}
}
u64 intel_get_arch_pebs_data_config(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
u64 pebs_data_cfg = 0;
u64 cntr_mask;
if (WARN_ON(event->hw.idx < 0 || event->hw.idx >= X86_PMC_IDX_MAX))
return 0;
pebs_data_cfg |= pebs_update_adaptive_cfg(event);
cntr_mask = (PEBS_DATACFG_CNTR_MASK << PEBS_DATACFG_CNTR_SHIFT) |
(PEBS_DATACFG_FIX_MASK << PEBS_DATACFG_FIX_SHIFT) |
PEBS_DATACFG_CNTR | PEBS_DATACFG_METRICS;
pebs_data_cfg |= cpuc->pebs_data_cfg & cntr_mask;
return pebs_data_cfg;
}
void intel_pmu_pebs_add(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@ -1532,6 +1609,15 @@ static inline void intel_pmu_drain_large_pebs(struct cpu_hw_events *cpuc)
intel_pmu_drain_pebs_buffer();
}
static void __intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
cpuc->pebs_enabled |= 1ULL << hwc->idx;
}
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@ -1540,9 +1626,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
struct debug_store *ds = cpuc->ds;
unsigned int idx = hwc->idx;
hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
cpuc->pebs_enabled |= 1ULL << hwc->idx;
__intel_pmu_pebs_enable(event);
if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) && (x86_pmu.version < 5))
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
@ -1604,14 +1688,22 @@ void intel_pmu_pebs_del(struct perf_event *event)
pebs_update_state(needed_cb, cpuc, event, false);
}
void intel_pmu_pebs_disable(struct perf_event *event)
static void __intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
intel_pmu_drain_large_pebs(cpuc);
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
}
void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
__intel_pmu_pebs_disable(event);
if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) &&
(x86_pmu.version < 5))
@ -1623,8 +1715,6 @@ void intel_pmu_pebs_disable(struct perf_event *event)
if (cpuc->enabled)
wrmsrq(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
}
void intel_pmu_pebs_enable_all(void)
@ -2060,6 +2150,90 @@ static inline void __setup_pebs_counter_group(struct cpu_hw_events *cpuc,
#define PEBS_LATENCY_MASK 0xffff
static inline void __setup_perf_sample_data(struct perf_event *event,
struct pt_regs *iregs,
struct perf_sample_data *data)
{
perf_sample_data_init(data, 0, event->hw.last_period);
/*
* We must however always use iregs for the unwinder to stay sane; the
* record BP,SP,IP can point into thin air when the record is from a
* previous PMI context or an (I)RET happened between the record and
* PMI.
*/
perf_sample_save_callchain(data, event, iregs);
}
static inline void __setup_pebs_basic_group(struct perf_event *event,
struct pt_regs *regs,
struct perf_sample_data *data,
u64 sample_type, u64 ip,
u64 tsc, u16 retire)
{
/* The ip in basic is EventingIP */
set_linear_ip(regs, ip);
regs->flags = PERF_EFLAGS_EXACT;
setup_pebs_time(event, data, tsc);
if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
data->weight.var3_w = retire;
}
static inline void __setup_pebs_gpr_group(struct perf_event *event,
struct pt_regs *regs,
struct pebs_gprs *gprs,
u64 sample_type)
{
if (event->attr.precise_ip < 2) {
set_linear_ip(regs, gprs->ip);
regs->flags &= ~PERF_EFLAGS_EXACT;
}
if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
adaptive_pebs_save_regs(regs, gprs);
}
static inline void __setup_pebs_meminfo_group(struct perf_event *event,
struct perf_sample_data *data,
u64 sample_type, u64 latency,
u16 instr_latency, u64 address,
u64 aux, u64 tsx_tuning, u64 ax)
{
if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
u64 tsx_latency = intel_get_tsx_weight(tsx_tuning);
data->weight.var2_w = instr_latency;
/*
* Although meminfo::latency is defined as a u64,
* only the lower 32 bits include the valid data
* in practice on Ice Lake and earlier platforms.
*/
if (sample_type & PERF_SAMPLE_WEIGHT)
data->weight.full = latency ?: tsx_latency;
else
data->weight.var1_dw = (u32)latency ?: tsx_latency;
data->sample_flags |= PERF_SAMPLE_WEIGHT_TYPE;
}
if (sample_type & PERF_SAMPLE_DATA_SRC) {
data->data_src.val = get_data_src(event, aux);
data->sample_flags |= PERF_SAMPLE_DATA_SRC;
}
if (sample_type & PERF_SAMPLE_ADDR_TYPE) {
data->addr = address;
data->sample_flags |= PERF_SAMPLE_ADDR;
}
if (sample_type & PERF_SAMPLE_TRANSACTION) {
data->txn = intel_get_tsx_transaction(tsx_tuning, ax);
data->sample_flags |= PERF_SAMPLE_TRANSACTION;
}
}
/*
* With adaptive PEBS the layout depends on what fields are configured.
*/
@ -2069,12 +2243,14 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
struct pt_regs *regs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
u64 sample_type = event->attr.sample_type;
struct pebs_basic *basic = __pebs;
void *next_record = basic + 1;
u64 sample_type, format_group;
struct pebs_meminfo *meminfo = NULL;
struct pebs_gprs *gprs = NULL;
struct x86_perf_regs *perf_regs;
u64 format_group;
u16 retire;
if (basic == NULL)
return;
@ -2082,31 +2258,17 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
sample_type = event->attr.sample_type;
format_group = basic->format_group;
perf_sample_data_init(data, 0, event->hw.last_period);
setup_pebs_time(event, data, basic->tsc);
/*
* We must however always use iregs for the unwinder to stay sane; the
* record BP,SP,IP can point into thin air when the record is from a
* previous PMI context or an (I)RET happened between the record and
* PMI.
*/
perf_sample_save_callchain(data, event, iregs);
__setup_perf_sample_data(event, iregs, data);
*regs = *iregs;
/* The ip in basic is EventingIP */
set_linear_ip(regs, basic->ip);
regs->flags = PERF_EFLAGS_EXACT;
if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT) {
if (x86_pmu.flags & PMU_FL_RETIRE_LATENCY)
data->weight.var3_w = basic->retire_latency;
else
data->weight.var3_w = 0;
}
/* basic group */
retire = x86_pmu.flags & PMU_FL_RETIRE_LATENCY ?
basic->retire_latency : 0;
__setup_pebs_basic_group(event, regs, data, sample_type,
basic->ip, basic->tsc, retire);
/*
* The record for MEMINFO is in front of GP
@ -2122,54 +2284,20 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
gprs = next_record;
next_record = gprs + 1;
if (event->attr.precise_ip < 2) {
set_linear_ip(regs, gprs->ip);
regs->flags &= ~PERF_EFLAGS_EXACT;
}
if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
adaptive_pebs_save_regs(regs, gprs);
__setup_pebs_gpr_group(event, regs, gprs, sample_type);
}
if (format_group & PEBS_DATACFG_MEMINFO) {
if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
u64 latency = x86_pmu.flags & PMU_FL_INSTR_LATENCY ?
meminfo->cache_latency : meminfo->mem_latency;
u64 instr_latency = x86_pmu.flags & PMU_FL_INSTR_LATENCY ?
meminfo->instr_latency : 0;
u64 ax = gprs ? gprs->ax : 0;
if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
data->weight.var2_w = meminfo->instr_latency;
/*
* Although meminfo::latency is defined as a u64,
* only the lower 32 bits include the valid data
* in practice on Ice Lake and earlier platforms.
*/
if (sample_type & PERF_SAMPLE_WEIGHT) {
data->weight.full = latency ?:
intel_get_tsx_weight(meminfo->tsx_tuning);
} else {
data->weight.var1_dw = (u32)latency ?:
intel_get_tsx_weight(meminfo->tsx_tuning);
}
data->sample_flags |= PERF_SAMPLE_WEIGHT_TYPE;
}
if (sample_type & PERF_SAMPLE_DATA_SRC) {
data->data_src.val = get_data_src(event, meminfo->aux);
data->sample_flags |= PERF_SAMPLE_DATA_SRC;
}
if (sample_type & PERF_SAMPLE_ADDR_TYPE) {
data->addr = meminfo->address;
data->sample_flags |= PERF_SAMPLE_ADDR;
}
if (sample_type & PERF_SAMPLE_TRANSACTION) {
data->txn = intel_get_tsx_transaction(meminfo->tsx_tuning,
gprs ? gprs->ax : 0);
data->sample_flags |= PERF_SAMPLE_TRANSACTION;
}
__setup_pebs_meminfo_group(event, data, sample_type, latency,
instr_latency, meminfo->address,
meminfo->aux, meminfo->tsx_tuning,
ax);
}
if (format_group & PEBS_DATACFG_XMMS) {
@ -2220,6 +2348,135 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
format_group);
}
static inline bool arch_pebs_record_continued(struct arch_pebs_header *header)
{
/* Continue bit or null PEBS record indicates fragment follows. */
return header->cont || !(header->format & GENMASK_ULL(63, 16));
}
static void setup_arch_pebs_sample_data(struct perf_event *event,
struct pt_regs *iregs,
void *__pebs,
struct perf_sample_data *data,
struct pt_regs *regs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
u64 sample_type = event->attr.sample_type;
struct arch_pebs_header *header = NULL;
struct arch_pebs_aux *meminfo = NULL;
struct arch_pebs_gprs *gprs = NULL;
struct x86_perf_regs *perf_regs;
void *next_record;
void *at = __pebs;
if (at == NULL)
return;
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
__setup_perf_sample_data(event, iregs, data);
*regs = *iregs;
again:
header = at;
next_record = at + sizeof(struct arch_pebs_header);
if (header->basic) {
struct arch_pebs_basic *basic = next_record;
u16 retire = 0;
next_record = basic + 1;
if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
retire = basic->valid ? basic->retire : 0;
__setup_pebs_basic_group(event, regs, data, sample_type,
basic->ip, basic->tsc, retire);
}
/*
* The record for MEMINFO is in front of GP
* But PERF_SAMPLE_TRANSACTION needs gprs->ax.
* Save the pointer here but process later.
*/
if (header->aux) {
meminfo = next_record;
next_record = meminfo + 1;
}
if (header->gpr) {
gprs = next_record;
next_record = gprs + 1;
__setup_pebs_gpr_group(event, regs,
(struct pebs_gprs *)gprs,
sample_type);
}
if (header->aux) {
u64 ax = gprs ? gprs->ax : 0;
__setup_pebs_meminfo_group(event, data, sample_type,
meminfo->cache_latency,
meminfo->instr_latency,
meminfo->address, meminfo->aux,
meminfo->tsx_tuning, ax);
}
if (header->xmm) {
struct pebs_xmm *xmm;
next_record += sizeof(struct arch_pebs_xer_header);
xmm = next_record;
perf_regs->xmm_regs = xmm->xmm;
next_record = xmm + 1;
}
if (header->lbr) {
struct arch_pebs_lbr_header *lbr_header = next_record;
struct lbr_entry *lbr;
int num_lbr;
next_record = lbr_header + 1;
lbr = next_record;
num_lbr = header->lbr == ARCH_PEBS_LBR_NUM_VAR ?
lbr_header->depth :
header->lbr * ARCH_PEBS_BASE_LBR_ENTRIES;
next_record += num_lbr * sizeof(struct lbr_entry);
if (has_branch_stack(event)) {
intel_pmu_store_pebs_lbrs(lbr);
intel_pmu_lbr_save_brstack(data, cpuc, event);
}
}
if (header->cntr) {
struct arch_pebs_cntr_header *cntr = next_record;
unsigned int nr;
next_record += sizeof(struct arch_pebs_cntr_header);
if (is_pebs_counter_event_group(event)) {
__setup_pebs_counter_group(cpuc, event,
(struct pebs_cntr_header *)cntr, next_record);
data->sample_flags |= PERF_SAMPLE_READ;
}
nr = hweight32(cntr->cntr) + hweight32(cntr->fixed);
if (cntr->metrics == INTEL_CNTR_METRICS)
nr += 2;
next_record += nr * sizeof(u64);
}
/* Parse the following fragments, if any. */
if (arch_pebs_record_continued(header)) {
at = at + header->size;
goto again;
}
}
static inline void *
get_next_pebs_record_by_bit(void *base, void *top, int bit)
{
@ -2602,6 +2859,57 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
}
}
static __always_inline void
__intel_pmu_handle_pebs_record(struct pt_regs *iregs,
struct pt_regs *regs,
struct perf_sample_data *data,
void *at, u64 pebs_status,
short *counts, void **last,
setup_fn setup_sample)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_event *event;
int bit;
for_each_set_bit(bit, (unsigned long *)&pebs_status, X86_PMC_IDX_MAX) {
event = cpuc->events[bit];
if (WARN_ON_ONCE(!event) ||
WARN_ON_ONCE(!event->attr.precise_ip))
continue;
if (counts[bit]++) {
__intel_pmu_pebs_event(event, iregs, regs, data,
last[bit], setup_sample);
}
last[bit] = at;
}
}
static __always_inline void
__intel_pmu_handle_last_pebs_record(struct pt_regs *iregs,
struct pt_regs *regs,
struct perf_sample_data *data,
u64 mask, short *counts, void **last,
setup_fn setup_sample)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_event *event;
int bit;
for_each_set_bit(bit, (unsigned long *)&mask, X86_PMC_IDX_MAX) {
if (!counts[bit])
continue;
event = cpuc->events[bit];
__intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
counts[bit], setup_sample);
}
}
static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
{
short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
@ -2611,9 +2919,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
struct x86_perf_regs perf_regs;
struct pt_regs *regs = &perf_regs.regs;
struct pebs_basic *basic;
struct perf_event *event;
void *base, *at, *top;
int bit;
u64 mask;
if (!x86_pmu.pebs_active)
@ -2626,6 +2932,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
mask = hybrid(cpuc->pmu, pebs_events_mask) |
(hybrid(cpuc->pmu, fixed_cntr_mask64) << INTEL_PMC_IDX_FIXED);
mask &= cpuc->pebs_enabled;
if (unlikely(base >= top)) {
intel_pmu_pebs_event_update_no_drain(cpuc, mask);
@ -2643,38 +2950,114 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
if (basic->format_size != cpuc->pebs_record_size)
continue;
pebs_status = basic->applicable_counters & cpuc->pebs_enabled & mask;
for_each_set_bit(bit, (unsigned long *)&pebs_status, X86_PMC_IDX_MAX) {
event = cpuc->events[bit];
if (WARN_ON_ONCE(!event) ||
WARN_ON_ONCE(!event->attr.precise_ip))
continue;
if (counts[bit]++) {
__intel_pmu_pebs_event(event, iregs, regs, data, last[bit],
pebs_status = mask & basic->applicable_counters;
__intel_pmu_handle_pebs_record(iregs, regs, data, at,
pebs_status, counts, last,
setup_pebs_adaptive_sample_data);
}
last[bit] = at;
}
__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
setup_pebs_adaptive_sample_data);
}
static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
struct perf_sample_data *data)
{
short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
union arch_pebs_index index;
struct x86_perf_regs perf_regs;
struct pt_regs *regs = &perf_regs.regs;
void *base, *at, *top;
u64 mask;
rdmsrq(MSR_IA32_PEBS_INDEX, index.whole);
if (unlikely(!index.wr)) {
intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
return;
}
for_each_set_bit(bit, (unsigned long *)&mask, X86_PMC_IDX_MAX) {
if (!counts[bit])
base = cpuc->pebs_vaddr;
top = cpuc->pebs_vaddr + (index.wr << ARCH_PEBS_INDEX_WR_SHIFT);
index.wr = 0;
index.full = 0;
index.en = 1;
if (cpuc->n_pebs == cpuc->n_large_pebs)
index.thresh = ARCH_PEBS_THRESH_MULTI;
else
index.thresh = ARCH_PEBS_THRESH_SINGLE;
wrmsrq(MSR_IA32_PEBS_INDEX, index.whole);
mask = hybrid(cpuc->pmu, arch_pebs_cap).counters & cpuc->pebs_enabled;
if (!iregs)
iregs = &dummy_iregs;
/* Process all but the last event for each counter. */
for (at = base; at < top;) {
struct arch_pebs_header *header;
struct arch_pebs_basic *basic;
u64 pebs_status;
header = at;
if (WARN_ON_ONCE(!header->size))
break;
/* 1st fragment or single record must have basic group */
if (!header->basic) {
at += header->size;
continue;
event = cpuc->events[bit];
__intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
counts[bit], setup_pebs_adaptive_sample_data);
}
basic = at + sizeof(struct arch_pebs_header);
pebs_status = mask & basic->applicable_counters;
__intel_pmu_handle_pebs_record(iregs, regs, data, at,
pebs_status, counts, last,
setup_arch_pebs_sample_data);
/* Skip non-last fragments */
while (arch_pebs_record_continued(header)) {
if (!header->size)
break;
at += header->size;
header = at;
}
/* Skip last fragment or the single record */
at += header->size;
}
__intel_pmu_handle_last_pebs_record(iregs, regs, data, mask,
counts, last,
setup_arch_pebs_sample_data);
}
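/*
 * A note on the arithmetic above: MSR_IA32_PEBS_INDEX reports the write
 * pointer in 16-byte units, so the drain converts it to a byte offset
 * before walking the buffer. A minimal sketch of that conversion (the
 * helper name is an assumption, not part of this patch):
 */
static inline u64 arch_pebs_wr_offset(union arch_pebs_index idx)
{
	return (u64)idx.wr << ARCH_PEBS_INDEX_WR_SHIFT;	/* 16-byte units */
}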
static void __init intel_arch_pebs_init(void)
{
/*
* Current hybrid platforms either support arch-PEBS on all core
* types or on none of them, so it is sufficient to set the
* x86_pmu.arch_pebs flag if the boot CPU supports arch-PEBS.
*/
x86_pmu.arch_pebs = 1;
x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
x86_pmu.drain_pebs = intel_pmu_drain_arch_pebs;
x86_pmu.pebs_capable = ~0ULL;
x86_pmu.flags |= PMU_FL_PEBS_ALL;
x86_pmu.pebs_enable = __intel_pmu_pebs_enable;
x86_pmu.pebs_disable = __intel_pmu_pebs_disable;
}
/*
* PEBS probe and setup
*/
void __init intel_pebs_init(void)
static void __init intel_ds_pebs_init(void)
{
/*
* No support for 32bit formats
@ -2736,10 +3119,8 @@ void __init intel_pebs_init(void)
break;
case 6:
if (x86_pmu.intel_cap.pebs_baseline) {
if (x86_pmu.intel_cap.pebs_baseline)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_READ;
x86_pmu.late_setup = intel_pmu_late_setup;
}
fallthrough;
case 5:
x86_pmu.pebs_ept = 1;
@ -2789,6 +3170,14 @@ void __init intel_pebs_init(void)
}
}
void __init intel_pebs_init(void)
{
if (x86_pmu.intel_cap.pebs_format == 0xf)
intel_arch_pebs_init();
else
intel_ds_pebs_init();
}
void perf_restore_debug_store(void)
{
struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);

View File

@ -283,8 +283,9 @@ struct cpu_hw_events {
* Intel DebugStore bits
*/
struct debug_store *ds;
void *ds_pebs_vaddr;
void *ds_bts_vaddr;
/* DS based PEBS or arch-PEBS buffer address */
void *pebs_vaddr;
u64 pebs_enabled;
int n_pebs;
int n_large_pebs;
@ -303,6 +304,8 @@ struct cpu_hw_events {
/* Intel ACR configuration */
u64 acr_cfg_b[X86_PMC_IDX_MAX];
u64 acr_cfg_c[X86_PMC_IDX_MAX];
/* Cached CFG_C values */
u64 cfg_c_val[X86_PMC_IDX_MAX];
/*
* Intel LBR bits
@ -708,6 +711,12 @@ enum hybrid_pmu_type {
hybrid_big_small_tiny = hybrid_big | hybrid_small_tiny,
};
struct arch_pebs_cap {
u64 caps;
u64 counters;
u64 pdists;
};
struct x86_hybrid_pmu {
struct pmu pmu;
const char *name;
@ -752,6 +761,8 @@ struct x86_hybrid_pmu {
mid_ack :1,
enabled_ack :1;
struct arch_pebs_cap arch_pebs_cap;
u64 pebs_data_source[PERF_PEBS_DATA_SOURCE_MAX];
};
@ -906,7 +917,7 @@ struct x86_pmu {
union perf_capabilities intel_cap;
/*
* Intel DebugStore bits
* Intel DebugStore and PEBS bits
*/
unsigned int bts :1,
bts_active :1,
@ -917,7 +928,8 @@ struct x86_pmu {
pebs_no_tlb :1,
pebs_no_isolation :1,
pebs_block :1,
pebs_ept :1;
pebs_ept :1,
arch_pebs :1;
int pebs_record_size;
int pebs_buffer_size;
u64 pebs_events_mask;
@ -929,6 +941,11 @@ struct x86_pmu {
u64 rtm_abort_event;
u64 pebs_capable;
/*
* Intel Architectural PEBS
*/
struct arch_pebs_cap arch_pebs_cap;
/*
* Intel LBR
*/
@ -1124,7 +1141,6 @@ static struct perf_pmu_format_hybrid_attr format_attr_hybrid_##_name = {\
.pmu_type = _pmu, \
}
int is_x86_event(struct perf_event *event);
struct pmu *x86_get_pmu(unsigned int cpu);
extern struct x86_pmu x86_pmu __read_mostly;
@ -1217,7 +1233,7 @@ int x86_reserve_hardware(void);
void x86_release_hardware(void);
int x86_pmu_max_precise(void);
int x86_pmu_max_precise(struct pmu *pmu);
void hw_perf_lbr_event_destroy(struct perf_event *event);
@ -1604,6 +1620,14 @@ extern void intel_cpuc_finish(struct cpu_hw_events *cpuc);
int intel_pmu_init(void);
int alloc_arch_pebs_buf_on_cpu(int cpu);
void release_arch_pebs_buf_on_cpu(int cpu);
void init_arch_pebs_on_cpu(int cpu);
void fini_arch_pebs_on_cpu(int cpu);
void init_debug_store_on_cpu(int cpu);
void fini_debug_store_on_cpu(int cpu);
@ -1760,6 +1784,8 @@ void intel_pmu_pebs_data_source_cmt(void);
void intel_pmu_pebs_data_source_lnl(void);
u64 intel_get_arch_pebs_data_config(struct perf_event *event);
int intel_pmu_setup_lbr_filter(struct perf_event *event);
void intel_pt_interrupt(void);
@ -1792,6 +1818,11 @@ static inline int intel_pmu_max_num_pebs(struct pmu *pmu)
return fls((u32)hybrid(pmu, pebs_events_mask));
}
static inline bool intel_pmu_has_pebs(void)
{
return x86_pmu.ds_pebs || x86_pmu.arch_pebs;
}
#else /* CONFIG_CPU_SUP_INTEL */
static inline void reserve_ds_buffers(void)

View File

@ -198,6 +198,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
#define ALTINSTR_ENTRY(ft_flags) \
".pushsection .altinstructions,\"a\"\n" \
ANNOTATE_DATA_SPECIAL \
" .long 771b - .\n" /* label */ \
" .long 774f - .\n" /* new instruction */ \
" .4byte " __stringify(ft_flags) "\n" /* feature + flags */ \
@ -207,6 +208,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
#define ALTINSTR_REPLACEMENT(newinstr) /* replacement */ \
".pushsection .altinstr_replacement, \"ax\"\n" \
ANNOTATE_DATA_SPECIAL \
"# ALT: replacement\n" \
"774:\n\t" newinstr "\n775:\n" \
".popsection\n"
@ -337,6 +339,7 @@ void nop_func(void);
* instruction. See apply_alternatives().
*/
.macro altinstr_entry orig alt ft_flags orig_len alt_len
ANNOTATE_DATA_SPECIAL
.long \orig - .
.long \alt - .
.4byte \ft_flags
@ -365,6 +368,7 @@ void nop_func(void);
.popsection ; \
.pushsection .altinstr_replacement,"ax" ; \
743: \
ANNOTATE_DATA_SPECIAL ; \
newinst ; \
744: \
.popsection ;

View File

@ -2,6 +2,8 @@
#ifndef _ASM_X86_ASM_H
#define _ASM_X86_ASM_H
#include <linux/annotate.h>
#ifdef __ASSEMBLER__
# define __ASM_FORM(x, ...) x,## __VA_ARGS__
# define __ASM_FORM_RAW(x, ...) x,## __VA_ARGS__
@ -132,6 +134,7 @@ static __always_inline __pure void *rip_rel_ptr(void *p)
# define _ASM_EXTABLE_TYPE(from, to, type) \
.pushsection "__ex_table","a" ; \
.balign 4 ; \
ANNOTATE_DATA_SPECIAL ; \
.long (from) - . ; \
.long (to) - . ; \
.long type ; \
@ -179,6 +182,7 @@ static __always_inline __pure void *rip_rel_ptr(void *p)
# define _ASM_EXTABLE_TYPE(from, to, type) \
" .pushsection \"__ex_table\",\"a\"\n" \
" .balign 4\n" \
ANNOTATE_DATA_SPECIAL \
" .long (" #from ") - .\n" \
" .long (" #to ") - .\n" \
" .long " __stringify(type) " \n" \
@ -187,6 +191,7 @@ static __always_inline __pure void *rip_rel_ptr(void *p)
# define _ASM_EXTABLE_TYPE_REG(from, to, type, reg) \
" .pushsection \"__ex_table\",\"a\"\n" \
" .balign 4\n" \
ANNOTATE_DATA_SPECIAL \
" .long (" #from ") - .\n" \
" .long (" #to ") - .\n" \
DEFINE_EXTABLE_TYPE_REG \

View File

@ -7,6 +7,11 @@
#include <linux/objtool.h>
#include <asm/asm.h>
#ifndef __ASSEMBLY__
struct bug_entry;
extern void __WARN_trap(struct bug_entry *bug, ...);
#endif
/*
* Although some emulators terminate on UD2, we use it for WARN().
*/
@ -31,52 +36,77 @@
#define BUG_UD2 0xfffe
#define BUG_UD1 0xfffd
#define BUG_UD1_UBSAN 0xfffc
#define BUG_UD1_WARN 0xfffb
#define BUG_UDB 0xffd6
#define BUG_LOCK 0xfff0
#ifdef CONFIG_GENERIC_BUG
#ifdef CONFIG_X86_32
# define __BUG_REL(val) ".long " val
#else
# define __BUG_REL(val) ".long " val " - ."
#endif
#ifdef CONFIG_DEBUG_BUGVERBOSE
#define __BUG_ENTRY(file, line, flags) \
"2:\t" __BUG_REL("1b") "\t# bug_entry::bug_addr\n" \
"\t" __BUG_REL(file) "\t# bug_entry::file\n" \
"\t.word " line "\t# bug_entry::line\n" \
"\t.word " flags "\t# bug_entry::flags\n"
#define __BUG_ENTRY_VERBOSE(file, line) \
"\t.long " file " - .\t# bug_entry::file\n" \
"\t.word " line "\t# bug_entry::line\n"
#else
#define __BUG_ENTRY(file, line, flags) \
"2:\t" __BUG_REL("1b") "\t# bug_entry::bug_addr\n" \
"\t.word " flags "\t# bug_entry::flags\n"
#define __BUG_ENTRY_VERBOSE(file, line)
#endif
#define _BUG_FLAGS_ASM(ins, file, line, flags, size, extra) \
"1:\t" ins "\n" \
".pushsection __bug_table,\"aw\"\n" \
__BUG_ENTRY(file, line, flags) \
#if defined(CONFIG_X86_64) || defined(CONFIG_DEBUG_BUGVERBOSE_DETAILED)
#define HAVE_ARCH_BUG_FORMAT
#define __BUG_ENTRY_FORMAT(format) \
"\t.long " format " - .\t# bug_entry::format\n"
#else
#define __BUG_ENTRY_FORMAT(format)
#endif
#ifdef CONFIG_X86_64
#define HAVE_ARCH_BUG_FORMAT_ARGS
#endif
#define __BUG_ENTRY(format, file, line, flags) \
"\t.long 1b - ." "\t# bug_entry::bug_addr\n" \
__BUG_ENTRY_FORMAT(format) \
__BUG_ENTRY_VERBOSE(file, line) \
"\t.word " flags "\t# bug_entry::flags\n"
#define _BUG_FLAGS_ASM(format, file, line, flags, size, extra) \
".pushsection __bug_table,\"aw\"\n\t" \
ANNOTATE_DATA_SPECIAL \
"2:\n\t" \
__BUG_ENTRY(format, file, line, flags) \
"\t.org 2b + " size "\n" \
".popsection\n" \
extra
#define _BUG_FLAGS(ins, flags, extra) \
#ifdef CONFIG_DEBUG_BUGVERBOSE_DETAILED
#define WARN_CONDITION_STR(cond_str) cond_str
#else
#define WARN_CONDITION_STR(cond_str) ""
#endif
#define _BUG_FLAGS(cond_str, ins, flags, extra) \
do { \
asm_inline volatile(_BUG_FLAGS_ASM(ins, "%c0", \
"%c1", "%c2", "%c3", extra) \
: : "i" (__FILE__), "i" (__LINE__), \
"i" (flags), \
"i" (sizeof(struct bug_entry))); \
asm_inline volatile("1:\t" ins "\n" \
_BUG_FLAGS_ASM("%c[fmt]", "%c[file]", \
"%c[line]", "%c[fl]", \
"%c[size]", extra) \
: : [fmt] "i" (WARN_CONDITION_STR(cond_str)), \
[file] "i" (__FILE__), \
[line] "i" (__LINE__), \
[fl] "i" (flags), \
[size] "i" (sizeof(struct bug_entry))); \
} while (0)
#define ARCH_WARN_ASM(file, line, flags, size) \
_BUG_FLAGS_ASM(ASM_UD2, file, line, flags, size, "")
".pushsection .rodata.str1.1, \"aMS\", @progbits, 1\n" \
"99:\n" \
"\t.string \"\"\n" \
".popsection\n" \
"1:\t " ASM_UD2 "\n" \
_BUG_FLAGS_ASM("99b", file, line, flags, size, "")
#else
#define _BUG_FLAGS(ins, flags, extra) asm volatile(ins)
#define _BUG_FLAGS(cond_str, ins, flags, extra) asm volatile(ins)
#endif /* CONFIG_GENERIC_BUG */
@ -84,7 +114,7 @@ do { \
#define BUG() \
do { \
instrumentation_begin(); \
_BUG_FLAGS(ASM_UD2, 0, ""); \
_BUG_FLAGS("", ASM_UD2, 0, ""); \
__builtin_unreachable(); \
} while (0)
@ -97,14 +127,69 @@ do { \
#define ARCH_WARN_REACHABLE ANNOTATE_REACHABLE(1b)
#define __WARN_FLAGS(flags) \
#define __WARN_FLAGS(cond_str, flags) \
do { \
__auto_type __flags = BUGFLAG_WARNING|(flags); \
instrumentation_begin(); \
_BUG_FLAGS(ASM_UD2, __flags, ARCH_WARN_REACHABLE); \
_BUG_FLAGS(cond_str, ASM_UD2, __flags, ARCH_WARN_REACHABLE); \
instrumentation_end(); \
} while (0)
#ifdef HAVE_ARCH_BUG_FORMAT_ARGS
#ifndef __ASSEMBLY__
#include <linux/static_call_types.h>
DECLARE_STATIC_CALL(WARN_trap, __WARN_trap);
struct pt_regs;
struct sysv_va_list { /* from AMD64 System V ABI */
unsigned int gp_offset;
unsigned int fp_offset;
void *overflow_arg_area;
void *reg_save_area;
};
struct arch_va_list {
unsigned long regs[6];
struct sysv_va_list args;
};
extern void *__warn_args(struct arch_va_list *args, struct pt_regs *regs);
#endif /* __ASSEMBLY__ */
#define __WARN_bug_entry(flags, format) ({ \
struct bug_entry *bug; \
asm_inline volatile("lea (2f)(%%rip), %[addr]\n1:\n" \
_BUG_FLAGS_ASM("%c[fmt]", "%c[file]", \
"%c[line]", "%c[fl]", \
"%c[size]", "") \
: [addr] "=r" (bug) \
: [fmt] "i" (format), \
[file] "i" (__FILE__), \
[line] "i" (__LINE__), \
[fl] "i" (flags), \
[size] "i" (sizeof(struct bug_entry))); \
bug; })
#define __WARN_print_arg(flags, format, arg...) \
do { \
int __flags = (flags) | BUGFLAG_WARNING | BUGFLAG_ARGS ; \
static_call_mod(WARN_trap)(__WARN_bug_entry(__flags, format), ## arg); \
asm (""); /* inhibit tail-call optimization */ \
} while (0)
#define __WARN_printf(taint, fmt, arg...) \
__WARN_print_arg(BUGFLAG_TAINT(taint), fmt, ## arg)
#define WARN_ONCE(cond, format, arg...) ({ \
int __ret_warn_on = !!(cond); \
if (unlikely(__ret_warn_on)) { \
__WARN_print_arg(BUGFLAG_ONCE|BUGFLAG_TAINT(TAINT_WARN),\
format, ## arg); \
} \
__ret_warn_on; \
})
#endif /* HAVE_ARCH_BUG_FORMAT_ARGS */
#include <asm-generic/bug.h>
#endif /* _ASM_X86_BUG_H */
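/*
 * Usage sketch (hypothetical caller, not part of this patch): with
 * HAVE_ARCH_BUG_FORMAT_ARGS the format string is stashed in the
 * bug_table and the arguments reach __WARN_trap() through the UD1-based
 * static call; call sites themselves are unchanged.
 */
static int check_len(int len)
{
	if (WARN_ONCE(len < 0, "negative length: %d\n", len))
		return -EINVAL;
	return len;
}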

View File

@ -101,6 +101,7 @@ static __always_inline bool _static_cpu_has(u16 bit)
asm goto(ALTERNATIVE_TERNARY("jmp 6f", %c[feature], "", "jmp %l[t_no]")
".pushsection .altinstr_aux,\"ax\"\n"
"6:\n"
ANNOTATE_DATA_SPECIAL
" testb %[bitnum], %a[cap_byte]\n"
" jnz %l[t_yes]\n"
" jmp %l[t_no]\n"

View File

@ -44,4 +44,6 @@ enum insn_mmio_type {
enum insn_mmio_type insn_decode_mmio(struct insn *insn, int *bytes);
bool insn_is_nop(struct insn *insn);
#endif /* _ASM_X86_INSN_EVAL_H */

View File

@ -312,7 +312,6 @@ static inline int insn_offset_immediate(struct insn *insn)
/**
* for_each_insn_prefix() -- Iterate prefixes in the instruction
* @insn: Pointer to struct insn.
* @idx: Index storage.
* @prefix: Prefix byte.
*
* Iterate prefix bytes of given @insn. Each prefix byte is stored in @prefix
@ -321,8 +320,8 @@ static inline int insn_offset_immediate(struct insn *insn)
* Since prefixes.nbytes can be bigger than 4 if some prefixes
* are repeated, it cannot be used for looping over the prefixes.
*/
#define for_each_insn_prefix(insn, idx, prefix) \
for (idx = 0; idx < ARRAY_SIZE(insn->prefixes.bytes) && (prefix = insn->prefixes.bytes[idx]) != 0; idx++)
#define for_each_insn_prefix(insn, prefix) \
for (int idx = 0; idx < ARRAY_SIZE(insn->prefixes.bytes) && (prefix = insn->prefixes.bytes[idx]) != 0; idx++)
#define POP_SS_OPCODE 0x1f
#define MOV_SREG_OPCODE 0x8e

View File

@ -4,7 +4,15 @@
#include <linux/percpu-defs.h>
#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
#define PEBS_BUFFER_SIZE (PAGE_SIZE << 4)
#define PEBS_BUFFER_SHIFT 4
#define PEBS_BUFFER_SIZE (PAGE_SIZE << PEBS_BUFFER_SHIFT)
/*
* The largest PEBS record can consume a page; make sure at least
* one more record can be written after the PMI has been triggered.
*/
#define ARCH_PEBS_THRESH_MULTI ((PEBS_BUFFER_SIZE - PAGE_SIZE) >> PEBS_BUFFER_SHIFT)
#define ARCH_PEBS_THRESH_SINGLE 1
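/*
 * Worked numbers, assuming 4 KiB pages (illustrative arithmetic, not
 * part of the patch): PEBS_BUFFER_SIZE = 4096 << 4 = 64 KiB, so
 * ARCH_PEBS_THRESH_MULTI = (65536 - 4096) >> 4 = 3840 units of 16 bytes,
 * i.e. the PMI threshold sits exactly one page below the buffer end.
 */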
/* The maximal number of PEBS events: */
#define MAX_PEBS_EVENTS_FMT4 8

View File

@ -15,6 +15,7 @@
#define JUMP_TABLE_ENTRY(key, label) \
".pushsection __jump_table, \"aw\" \n\t" \
_ASM_ALIGN "\n\t" \
ANNOTATE_DATA_SPECIAL \
".long 1b - . \n\t" \
".long " label " - . \n\t" \
_ASM_PTR " " key " - . \n\t" \

View File

@ -327,6 +327,26 @@
PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE | \
PERF_CAP_PEBS_TIMING_INFO)
/* Arch PEBS */
#define MSR_IA32_PEBS_BASE 0x000003f4
#define MSR_IA32_PEBS_INDEX 0x000003f5
#define ARCH_PEBS_OFFSET_MASK 0x7fffff
#define ARCH_PEBS_INDEX_WR_SHIFT 4
#define ARCH_PEBS_RELOAD 0xffffffff
#define ARCH_PEBS_CNTR_ALLOW BIT_ULL(35)
#define ARCH_PEBS_CNTR_GP BIT_ULL(36)
#define ARCH_PEBS_CNTR_FIXED BIT_ULL(37)
#define ARCH_PEBS_CNTR_METRICS BIT_ULL(38)
#define ARCH_PEBS_LBR_SHIFT 40
#define ARCH_PEBS_LBR (0x3ull << ARCH_PEBS_LBR_SHIFT)
#define ARCH_PEBS_VECR_XMM BIT_ULL(49)
#define ARCH_PEBS_GPR BIT_ULL(61)
#define ARCH_PEBS_AUX BIT_ULL(62)
#define ARCH_PEBS_EN BIT_ULL(63)
#define ARCH_PEBS_CNTR_MASK (ARCH_PEBS_CNTR_GP | ARCH_PEBS_CNTR_FIXED | \
ARCH_PEBS_CNTR_METRICS)
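/*
 * Sketch of consuming the 2-bit LBR field above; the helper name and
 * its 'caps' parameter are assumptions for illustration only.
 */
static inline unsigned int arch_pebs_lbr_fmt(u64 caps)
{
	return (unsigned int)((caps & ARCH_PEBS_LBR) >> ARCH_PEBS_LBR_SHIFT);
}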
#define MSR_IA32_RTIT_CTL 0x00000570
#define RTIT_CTL_TRACEEN BIT(0)
#define RTIT_CTL_CYCLEACC BIT(1)

View File

@ -144,13 +144,13 @@
#define PEBS_DATACFG_GP BIT_ULL(1)
#define PEBS_DATACFG_XMMS BIT_ULL(2)
#define PEBS_DATACFG_LBRS BIT_ULL(3)
#define PEBS_DATACFG_LBR_SHIFT 24
#define PEBS_DATACFG_CNTR BIT_ULL(4)
#define PEBS_DATACFG_METRICS BIT_ULL(5)
#define PEBS_DATACFG_LBR_SHIFT 24
#define PEBS_DATACFG_CNTR_SHIFT 32
#define PEBS_DATACFG_CNTR_MASK GENMASK_ULL(15, 0)
#define PEBS_DATACFG_FIX_SHIFT 48
#define PEBS_DATACFG_FIX_MASK GENMASK_ULL(7, 0)
#define PEBS_DATACFG_METRICS BIT_ULL(5)
/* Steal the highest bit of pebs_data_cfg for SW usage */
#define PEBS_UPDATE_DS_SW BIT_ULL(63)
@ -200,6 +200,8 @@ union cpuid10_edx {
#define ARCH_PERFMON_EXT_LEAF 0x00000023
#define ARCH_PERFMON_NUM_COUNTER_LEAF 0x1
#define ARCH_PERFMON_ACR_LEAF 0x2
#define ARCH_PERFMON_PEBS_CAP_LEAF 0x4
#define ARCH_PERFMON_PEBS_COUNTER_LEAF 0x5
union cpuid35_eax {
struct {
@ -210,7 +212,10 @@ union cpuid35_eax {
unsigned int acr_subleaf:1;
/* Events Sub-Leaf */
unsigned int events_subleaf:1;
unsigned int reserved:28;
/* arch-PEBS Sub-Leaves */
unsigned int pebs_caps_subleaf:1;
unsigned int pebs_cnts_subleaf:1;
unsigned int reserved:26;
} split;
unsigned int full;
};
@ -432,6 +437,8 @@ static inline bool is_topdown_idx(int idx)
#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
#define GLOBAL_STATUS_TRACE_TOPAPMI_BIT 55
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
#define GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT 54
#define GLOBAL_STATUS_ARCH_PEBS_THRESHOLD BIT_ULL(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT)
#define GLOBAL_STATUS_PERF_METRICS_OVF_BIT 48
#define GLOBAL_CTRL_EN_PERF_METRICS BIT_ULL(48)
@ -502,6 +509,107 @@ struct pebs_cntr_header {
#define INTEL_CNTR_METRICS 0x3
/*
* Arch PEBS
*/
union arch_pebs_index {
struct {
u64 rsvd:4,
wr:23,
rsvd2:4,
full:1,
en:1,
rsvd3:3,
thresh:23,
rsvd4:5;
};
u64 whole;
};
struct arch_pebs_header {
union {
u64 format;
struct {
u64 size:16, /* Record size */
rsvd:14,
mode:1, /* 64BIT_MODE */
cont:1,
rsvd2:3,
cntr:5,
lbr:2,
rsvd3:7,
xmm:1,
ymmh:1,
rsvd4:2,
opmask:1,
zmmh:1,
h16zmm:1,
rsvd5:5,
gpr:1,
aux:1,
basic:1;
};
};
u64 rsvd6;
};
struct arch_pebs_basic {
u64 ip;
u64 applicable_counters;
u64 tsc;
u64 retire :16, /* Retire Latency */
valid :1,
rsvd :47;
u64 rsvd2;
u64 rsvd3;
};
struct arch_pebs_aux {
u64 address;
u64 rsvd;
u64 rsvd2;
u64 rsvd3;
u64 rsvd4;
u64 aux;
u64 instr_latency :16,
pad2 :16,
cache_latency :16,
pad3 :16;
u64 tsx_tuning;
};
struct arch_pebs_gprs {
u64 flags, ip, ax, cx, dx, bx, sp, bp, si, di;
u64 r8, r9, r10, r11, r12, r13, r14, r15, ssp;
u64 rsvd;
};
struct arch_pebs_xer_header {
u64 xstate;
u64 rsvd;
};
#define ARCH_PEBS_LBR_NAN 0x0
#define ARCH_PEBS_LBR_NUM_8 0x1
#define ARCH_PEBS_LBR_NUM_16 0x2
#define ARCH_PEBS_LBR_NUM_VAR 0x3
#define ARCH_PEBS_BASE_LBR_ENTRIES 8
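/*
 * Hypothetical mapping of the encodings above to entry counts; the
 * variable-depth case is resolved elsewhere, so it returns -1 here.
 * Everything except the macros is an assumption.
 */
static inline int arch_pebs_lbr_entries(unsigned int fmt)
{
	switch (fmt) {
	case ARCH_PEBS_LBR_NAN:    return 0;
	case ARCH_PEBS_LBR_NUM_8:  return ARCH_PEBS_BASE_LBR_ENTRIES;
	case ARCH_PEBS_LBR_NUM_16: return 2 * ARCH_PEBS_BASE_LBR_ENTRIES;
	default:                   return -1;	/* ARCH_PEBS_LBR_NUM_VAR */
	}
}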
struct arch_pebs_lbr_header {
u64 rsvd;
u64 ctl;
u64 depth;
u64 ler_from;
u64 ler_to;
u64 ler_info;
};
struct arch_pebs_cntr_header {
u32 cntr;
u32 fixed;
u32 metrics;
u32 reserved;
};
/*
* AMD Extended Performance Monitoring and Debug cpuid feature detection
*/

View File

@ -109,7 +109,7 @@ int common_cpu_up(unsigned int cpunum, struct task_struct *tidle);
int native_kick_ap(unsigned int cpu, struct task_struct *tidle);
int native_cpu_disable(void);
void __noreturn hlt_play_dead(void);
void native_play_dead(void);
void __noreturn native_play_dead(void);
void play_dead_common(void);
void wbinvd_on_cpu(int cpu);
void wbinvd_on_all_cpus(void);

View File

@ -325,4 +325,6 @@ static inline void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled
extern void arch_scale_freq_tick(void);
#define arch_scale_freq_tick arch_scale_freq_tick
extern int arch_sched_node_distance(int from, int to);
#endif /* _ASM_X86_TOPOLOGY_H */

View File

@ -0,0 +1,41 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_X86_UNWIND_USER_H
#define _ASM_X86_UNWIND_USER_H
#ifdef CONFIG_HAVE_UNWIND_USER_FP
#include <asm/ptrace.h>
#include <asm/uprobes.h>
#define ARCH_INIT_USER_FP_FRAME(ws) \
.cfa_off = 2*(ws), \
.ra_off = -1*(ws), \
.fp_off = -2*(ws), \
.use_fp = true,
#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws) \
.cfa_off = 1*(ws), \
.ra_off = -1*(ws), \
.fp_off = 0, \
.use_fp = false,
static inline int unwind_user_word_size(struct pt_regs *regs)
{
/* We can't unwind VM86 stacks */
if (regs->flags & X86_VM_MASK)
return 0;
#ifdef CONFIG_X86_64
if (!user_64bit_mode(regs))
return sizeof(int);
#endif
return sizeof(long);
}
static inline bool unwind_user_at_function_start(struct pt_regs *regs)
{
return is_uprobe_at_func_entry(regs);
}
#endif /* CONFIG_HAVE_UNWIND_USER_FP */
#endif /* _ASM_X86_UNWIND_USER_H */
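/*
 * To make the offsets concrete (an illustrative reading of the macros,
 * not part of the patch): for 64-bit user code ws == 8, so
 * ARCH_INIT_USER_FP_FRAME describes the standard frame after
 * "push %rbp; mov %rsp, %rbp": CFA = rbp + 16, the return address at
 * CFA - 8 and the saved frame pointer at CFA - 16. The entry variant
 * covers the moment before the push: CFA = rsp + 8, only the return
 * address on the stack, and no saved frame pointer (fp_off = 0).
 */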

View File

@ -62,4 +62,13 @@ struct arch_uprobe_task {
unsigned int saved_tf;
};
#ifdef CONFIG_UPROBES
extern bool is_uprobe_at_func_entry(struct pt_regs *regs);
#else
static inline bool is_uprobe_at_func_entry(struct pt_regs *regs)
{
return false;
}
#endif /* CONFIG_UPROBES */
#endif /* _ASM_UPROBES_H */

View File

@ -9,6 +9,7 @@
#include <asm/text-patching.h>
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/ibt.h>
#include <asm/set_memory.h>
#include <asm/nmi.h>
@ -345,25 +346,6 @@ static void add_nop(u8 *buf, unsigned int len)
*buf = INT3_INSN_OPCODE;
}
/*
* Matches NOP and NOPL, not any of the other possible NOPs.
*/
static bool insn_is_nop(struct insn *insn)
{
/* Anything NOP, but no REP NOP */
if (insn->opcode.bytes[0] == 0x90 &&
(!insn->prefixes.nbytes || insn->prefixes.bytes[0] != 0xF3))
return true;
/* NOPL */
if (insn->opcode.bytes[0] == 0x0F && insn->opcode.bytes[1] == 0x1F)
return true;
/* TODO: more nops */
return false;
}
/*
* Find the offset of the first non-NOP instruction starting at @offset
* but no further than @len.
@ -559,7 +541,7 @@ EXPORT_SYMBOL(BUG_func);
* Rewrite the "call BUG_func" replacement to point to the target of the
* indirect pv_ops call "call *disp(%ip)".
*/
static int alt_replace_call(u8 *instr, u8 *insn_buff, struct alt_instr *a)
static unsigned int alt_replace_call(u8 *instr, u8 *insn_buff, struct alt_instr *a)
{
void *target, *bug = &BUG_func;
s32 disp;
@ -643,7 +625,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
* order.
*/
for (a = start; a < end; a++) {
int insn_buff_sz = 0;
unsigned int insn_buff_sz = 0;
/*
* In case of nested ALTERNATIVE()s the outer alternative might
@ -683,11 +665,8 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
memcpy(insn_buff, replacement, a->replacementlen);
insn_buff_sz = a->replacementlen;
if (a->flags & ALT_FLAG_DIRECT_CALL) {
if (a->flags & ALT_FLAG_DIRECT_CALL)
insn_buff_sz = alt_replace_call(instr, insn_buff, a);
if (insn_buff_sz < 0)
continue;
}
for (; insn_buff_sz < a->instrlen; insn_buff_sz++)
insn_buff[insn_buff_sz] = 0x90;
@ -2244,21 +2223,34 @@ int alternatives_text_reserved(void *start, void *end)
* See entry_{32,64}.S for more details.
*/
/*
* We define the int3_magic() function in assembly to control the calling
* convention such that we can 'call' it from assembly.
*/
extern void int3_magic(unsigned int *ptr); /* defined in asm */
extern void int3_selftest_asm(unsigned int *ptr);
asm (
" .pushsection .init.text, \"ax\", @progbits\n"
" .type int3_magic, @function\n"
"int3_magic:\n"
" .type int3_selftest_asm, @function\n"
"int3_selftest_asm:\n"
ANNOTATE_NOENDBR
" movl $1, (%" _ASM_ARG1 ")\n"
/*
* INT3 padded with NOPs to CALL_INSN_SIZE. The INT3 triggers an
* exception, then the int3_exception_nb notifier emulates a call to
* int3_selftest_callee().
*/
" int3; nop; nop; nop; nop\n"
ASM_RET
" .size int3_magic, .-int3_magic\n"
" .size int3_selftest_asm, . - int3_selftest_asm\n"
" .popsection\n"
);
extern void int3_selftest_callee(unsigned int *ptr);
asm (
" .pushsection .init.text, \"ax\", @progbits\n"
" .type int3_selftest_callee, @function\n"
"int3_selftest_callee:\n"
ANNOTATE_NOENDBR
" movl $0x1234, (%" _ASM_ARG1 ")\n"
ASM_RET
" .size int3_selftest_callee, . - int3_selftest_callee\n"
" .popsection\n"
);
@ -2267,7 +2259,7 @@ extern void int3_selftest_ip(void); /* defined in asm below */
static int __init
int3_exception_notify(struct notifier_block *self, unsigned long val, void *data)
{
unsigned long selftest = (unsigned long)&int3_selftest_ip;
unsigned long selftest = (unsigned long)&int3_selftest_asm;
struct die_args *args = data;
struct pt_regs *regs = args->regs;
@ -2282,7 +2274,7 @@ int3_exception_notify(struct notifier_block *self, unsigned long val, void *data
if (regs->ip - INT3_INSN_SIZE != selftest)
return NOTIFY_DONE;
int3_emulate_call(regs, (unsigned long)&int3_magic);
int3_emulate_call(regs, (unsigned long)&int3_selftest_callee);
return NOTIFY_STOP;
}
@ -2298,19 +2290,11 @@ static noinline void __init int3_selftest(void)
BUG_ON(register_die_notifier(&int3_exception_nb));
/*
* Basically: int3_magic(&val); but really complicated :-)
*
* INT3 padded with NOP to CALL_INSN_SIZE. The int3_exception_nb
* notifier above will emulate CALL for us.
* Basically: int3_selftest_callee(&val); but really complicated :-)
*/
asm volatile ("int3_selftest_ip:\n\t"
ANNOTATE_NOENDBR
" int3; nop; nop; nop; nop\n\t"
: ASM_CALL_CONSTRAINT
: __ASM_SEL_RAW(a, D) (&val)
: "memory");
int3_selftest_asm(&val);
BUG_ON(val != 1);
BUG_ON(val != 0x1234);
unregister_die_notifier(&int3_exception_nb);
}

View File

@ -173,6 +173,7 @@ static struct resource lapic_resource = {
.flags = IORESOURCE_MEM | IORESOURCE_BUSY,
};
/* APIC timer ticks per jiffy (1/HZ seconds). */
unsigned int lapic_timer_period = 0;
static void apic_pm_activate(void);
@ -792,6 +793,7 @@ static int __init calibrate_APIC_clock(void)
{
struct clock_event_device *levt = this_cpu_ptr(&lapic_events);
u64 tsc_perj = 0, tsc_start = 0;
long delta_tsc_khz, bus_khz;
unsigned long jif_start;
unsigned long deltaj;
long delta, deltatsc;
@ -894,14 +896,15 @@ static int __init calibrate_APIC_clock(void)
apic_pr_verbose("..... calibration result: %u\n", lapic_timer_period);
if (boot_cpu_has(X86_FEATURE_TSC)) {
apic_pr_verbose("..... CPU clock speed is %ld.%04ld MHz.\n",
(deltatsc / LAPIC_CAL_LOOPS) / (1000000 / HZ),
(deltatsc / LAPIC_CAL_LOOPS) % (1000000 / HZ));
delta_tsc_khz = (deltatsc * HZ) / (1000 * LAPIC_CAL_LOOPS);
apic_pr_verbose("..... CPU clock speed is %ld.%03ld MHz.\n",
delta_tsc_khz / 1000, delta_tsc_khz % 1000);
}
apic_pr_verbose("..... host bus clock speed is %u.%04u MHz.\n",
lapic_timer_period / (1000000 / HZ),
lapic_timer_period % (1000000 / HZ));
bus_khz = (long)lapic_timer_period * HZ / 1000;
apic_pr_verbose("..... host bus clock speed is %ld.%03ld MHz.\n",
bus_khz / 1000, bus_khz % 1000);
/*
* Do a sanity check on the APIC calibration result

View File

@ -2864,7 +2864,7 @@ int mp_irqdomain_alloc(struct irq_domain *domain, unsigned int virq,
ioapic = mp_irqdomain_ioapic_idx(domain);
pin = info->ioapic.pin;
if (irq_find_mapping(domain, (irq_hw_number_t)pin) > 0)
if (irq_resolve_mapping(domain, (irq_hw_number_t)pin))
return -EEXIST;
data = kzalloc(sizeof(*data), GFP_KERNEL);

View File

@ -181,7 +181,7 @@ static void show_regs_if_on_stack(struct stack_info *info, struct pt_regs *regs,
* in false positive reports. Disable instrumentation to avoid those.
*/
__no_kmsan_checks
static void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
static void __show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
unsigned long *stack, const char *log_lvl)
{
struct unwind_state state;
@ -303,6 +303,25 @@ static void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
}
}
static void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
unsigned long *stack, const char *log_lvl)
{
/*
* Disable KASAN to avoid false positives while walking another
* task's stack, as values on that stack may change concurrently
* with the task's execution.
*/
bool disable_kasan = task && task != current;
if (disable_kasan)
kasan_disable_current();
__show_trace_log_lvl(task, regs, stack, log_lvl);
if (disable_kasan)
kasan_enable_current();
}
void show_stack(struct task_struct *task, unsigned long *sp,
const char *loglvl)
{

View File

@ -141,7 +141,6 @@ bool can_boost(struct insn *insn, void *addr)
{
kprobe_opcode_t opcode;
insn_byte_t prefix;
int i;
if (search_exception_tables((unsigned long)addr))
return false; /* Page fault may occur on this address. */
@ -154,7 +153,7 @@ bool can_boost(struct insn *insn, void *addr)
if (insn->opcode.nbytes != 1)
return false;
for_each_insn_prefix(insn, i, prefix) {
for_each_insn_prefix(insn, prefix) {
insn_attr_t attr;
attr = inat_get_opcode_attribute(prefix);

View File

@ -103,7 +103,6 @@ static void synthesize_set_arg1(kprobe_opcode_t *addr, unsigned long val)
asm (
".pushsection .rodata\n"
"optprobe_template_func:\n"
".global optprobe_template_entry\n"
"optprobe_template_entry:\n"
#ifdef CONFIG_X86_64
@ -160,9 +159,6 @@ asm (
"optprobe_template_end:\n"
".popsection\n");
void optprobe_template_func(void);
STACK_FRAME_NON_STANDARD(optprobe_template_func);
#define TMPL_CLAC_IDX \
((long)optprobe_template_clac - (long)optprobe_template_entry)
#define TMPL_MOVE_IDX \

View File

@ -97,6 +97,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
DEBUGP("%s relocate section %u to %u\n",
apply ? "Applying" : "Clearing",
relsec, sechdrs[relsec].sh_info);
for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
size_t size;
@ -162,15 +163,17 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
if (apply) {
if (memcmp(loc, &zero, size)) {
pr_err("x86/modules: Invalid relocation target, existing value is nonzero for type %d, loc %p, val %Lx\n",
(int)ELF64_R_TYPE(rel[i].r_info), loc, val);
pr_err("x86/modules: Invalid relocation target, existing value is nonzero for sec %u, idx %u, type %d, loc %lx, val %llx\n",
relsec, i, (int)ELF64_R_TYPE(rel[i].r_info),
(unsigned long)loc, val);
return -ENOEXEC;
}
write(loc, &val, size);
} else {
if (memcmp(loc, &val, size)) {
pr_warn("x86/modules: Invalid relocation target, existing value does not match expected value for type %d, loc %p, val %Lx\n",
(int)ELF64_R_TYPE(rel[i].r_info), loc, val);
pr_warn("x86/modules: Invalid relocation target, existing value does not match expected value for sec %u, idx %u, type %d, loc %lx, val %llx\n",
relsec, i, (int)ELF64_R_TYPE(rel[i].r_info),
(unsigned long)loc, val);
return -ENOEXEC;
}
write(loc, &zero, size);
@ -179,8 +182,8 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
return 0;
overflow:
pr_err("overflow in relocation type %d val %Lx\n",
(int)ELF64_R_TYPE(rel[i].r_info), val);
pr_err("overflow in relocation type %d val %llx sec %u idx %d\n",
(int)ELF64_R_TYPE(rel[i].r_info), val, relsec, i);
pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
me->name);
return -ENOEXEC;

View File

@ -515,6 +515,76 @@ static void __init build_sched_topology(void)
set_sched_topology(topology);
}
#ifdef CONFIG_NUMA
static int sched_avg_remote_distance;
static int avg_remote_numa_distance(void)
{
int i, j;
int distance, nr_remote, total_distance;
if (sched_avg_remote_distance > 0)
return sched_avg_remote_distance;
nr_remote = 0;
total_distance = 0;
for_each_node_state(i, N_CPU) {
for_each_node_state(j, N_CPU) {
distance = node_distance(i, j);
if (distance >= REMOTE_DISTANCE) {
nr_remote++;
total_distance += distance;
}
}
}
if (nr_remote)
sched_avg_remote_distance = total_distance / nr_remote;
else
sched_avg_remote_distance = REMOTE_DISTANCE;
return sched_avg_remote_distance;
}
int arch_sched_node_distance(int from, int to)
{
int d = node_distance(from, to);
switch (boot_cpu_data.x86_vfm) {
case INTEL_GRANITERAPIDS_X:
case INTEL_ATOM_DARKMONT_X:
if (!x86_has_numa_in_package || topology_max_packages() == 1 ||
d < REMOTE_DISTANCE)
return d;
/*
* With SNC enabled there can be several distinct remote NUMA
* distances, which would create sched-domain NUMA levels that
* mix local nodes with a subset of the remote nodes.
*
* For the purpose of building sched domains, drop this finer
* distance tuning for nodes in a remote package and group all
* NUMA nodes of the remote package into the same sched group.
* This simplifies the NUMA domains and avoids extra NUMA levels
* that span both remote and local nodes.
*
* GNR and CWF systems are not expected to have more than 2
* packages or more than 2 hops between packages. A single
* average remote distance would not be appropriate with more
* than 2 packages, as the average distance to different remote
* packages could differ.
*/
WARN_ONCE(topology_max_packages() > 2,
"sched: Expect only up to 2 packages for GNR or CWF, "
"but saw %d packages when building sched domains.",
topology_max_packages());
d = avg_remote_numa_distance();
}
return d;
}
#endif /* CONFIG_NUMA */
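/*
 * Hypothetical numbers for a two-package GNR system with SNC-2:
 * node_distance() might report 10 within a node, 12 to the sibling node
 * in the same package, and 21/24 to the two remote nodes. Averaging the
 * remote distances to (21 + 24) / 2 = 22 makes both remote nodes
 * equidistant, so they land in one sched group and no extra NUMA level
 * mixing local and partially-remote nodes is created. The distances are
 * illustrative, not measured.
 */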
void set_cpu_sibling_map(int cpu)
{
bool has_smt = __max_threads_per_core > 1;
@ -1328,11 +1398,7 @@ void __noreturn hlt_play_dead(void)
native_halt();
}
/*
* native_play_dead() is essentially a __noreturn function, but it can't
* be marked as such as the compiler may complain about it.
*/
void native_play_dead(void)
void __noreturn native_play_dead(void)
{
if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS))
__update_spec_ctrl(0);
@ -1351,7 +1417,7 @@ int native_cpu_disable(void)
return -ENOSYS;
}
void native_play_dead(void)
void __noreturn native_play_dead(void)
{
BUG();
}

View File

@ -26,6 +26,11 @@ static const u8 xor5rax[] = { 0x2e, 0x2e, 0x2e, 0x31, 0xc0 };
static const u8 retinsn[] = { RET_INSN_OPCODE, 0xcc, 0xcc, 0xcc, 0xcc };
/*
* ud1 (%edx),%rdi -- see __WARN_trap() / decode_bug()
*/
static const u8 warninsn[] = { 0x67, 0x48, 0x0f, 0xb9, 0x3a };
static u8 __is_Jcc(u8 *insn) /* Jcc.d32 */
{
u8 ret = 0;
@ -69,7 +74,10 @@ static void __ref __static_call_transform(void *insn, enum insn_type type,
emulate = code;
code = &xor5rax;
}
if (func == &__WARN_trap) {
emulate = code;
code = &warninsn;
}
break;
case NOP:
@ -128,7 +136,8 @@ static void __static_call_validate(u8 *insn, bool tail, bool tramp)
} else {
if (opcode == CALL_INSN_OPCODE ||
!memcmp(insn, x86_nops[5], 5) ||
!memcmp(insn, xor5rax, 5))
!memcmp(insn, xor5rax, 5) ||
!memcmp(insn, warninsn, 5))
return;
}

View File

@ -31,6 +31,7 @@
#include <linux/kexec.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/static_call.h>
#include <linux/timer.h>
#include <linux/init.h>
#include <linux/bug.h>
@ -102,25 +103,37 @@ __always_inline int is_valid_bugaddr(unsigned long addr)
* UBSan{0}: 67 0f b9 00 ud1 (%eax),%eax
* UBSan{10}: 67 0f b9 40 10 ud1 0x10(%eax),%eax
* static_call: 0f b9 cc ud1 %esp,%ecx
* __WARN_trap: 67 48 0f b9 3a ud1 (%edx),%reg
*
* Notably UBSAN uses EAX, static_call uses ECX.
* Notably, since __WARN_trap can use all registers, the distinction
* between UD1 users is made through the ModRM R/M field.
*/
__always_inline int decode_bug(unsigned long addr, s32 *imm, int *len)
{
unsigned long start = addr;
u8 v, reg, rm, rex = 0;
int type = BUG_UD1;
bool lock = false;
u8 v;
if (addr < TASK_SIZE_MAX)
return BUG_NONE;
for (;;) {
v = *(u8 *)(addr++);
if (v == INSN_ASOP)
v = *(u8 *)(addr++);
continue;
if (v == INSN_LOCK) {
lock = true;
v = *(u8 *)(addr++);
continue;
}
if ((v & 0xf0) == 0x40) {
rex = v;
continue;
}
break;
}
switch (v) {
@ -156,18 +169,33 @@ __always_inline int decode_bug(unsigned long addr, s32 *imm, int *len)
if (X86_MODRM_MOD(v) != 3 && X86_MODRM_RM(v) == 4)
addr++; /* SIB */
reg = X86_MODRM_REG(v) + 8*!!X86_REX_R(rex);
rm = X86_MODRM_RM(v) + 8*!!X86_REX_B(rex);
/* Decode immediate, if present */
switch (X86_MODRM_MOD(v)) {
case 0: if (X86_MODRM_RM(v) == 5)
addr += 4; /* RIP + disp32 */
if (rm == 0) /* (%eax) */
type = BUG_UD1_UBSAN;
if (rm == 2) { /* (%edx) */
*imm = reg;
type = BUG_UD1_WARN;
}
break;
case 1: *imm = *(s8 *)addr;
addr += 1;
if (rm == 0) /* (%eax) */
type = BUG_UD1_UBSAN;
break;
case 2: *imm = *(s32 *)addr;
addr += 4;
if (rm == 0) /* (%eax) */
type = BUG_UD1_UBSAN;
break;
case 3: break;
@ -176,12 +204,76 @@ __always_inline int decode_bug(unsigned long addr, s32 *imm, int *len)
/* record instruction length */
*len = addr - start;
if (X86_MODRM_REG(v) == 0) /* EAX */
return BUG_UD1_UBSAN;
return BUG_UD1;
return type;
}
static inline unsigned long pt_regs_val(struct pt_regs *regs, int nr)
{
int offset = pt_regs_offset(regs, nr);
if (WARN_ON_ONCE(offset < 0))
return 0;
return *((unsigned long *)((void *)regs + offset));
}
#ifdef HAVE_ARCH_BUG_FORMAT_ARGS
DEFINE_STATIC_CALL(WARN_trap, __WARN_trap);
EXPORT_STATIC_CALL_TRAMP(WARN_trap);
/*
* Create a va_list from an exception context.
*/
void *__warn_args(struct arch_va_list *args, struct pt_regs *regs)
{
/*
* Register save area; populate with function call argument registers
*/
args->regs[0] = regs->di;
args->regs[1] = regs->si;
args->regs[2] = regs->dx;
args->regs[3] = regs->cx;
args->regs[4] = regs->r8;
args->regs[5] = regs->r9;
/*
* From the ABI document:
*
* @gp_offset - the element holds the offset in bytes from
* reg_save_area to the place where the next available general purpose
* argument register is saved. In case all argument registers have
* been exhausted, it is set to the value 48 (6*8).
*
* @fp_offset - the element holds the offset in bytes from
* reg_save_area to the place where the next available floating point
* argument is saved. In case all argument registers have been
* exhausted, it is set to the value 176 (6*8 + 8*16)
*
* @overflow_arg_area - this pointer is used to fetch arguments passed
* on the stack. It is initialized with the address of the first
* argument passed on the stack, if any, and then always updated to
* point to the start of the next argument on the stack.
*
* @reg_save_area - the element points to the start of the register
* save area.
*
* Notably, the varargs start with the second argument, and there are
* no floating-point arguments in the kernel.
*/
args->args.gp_offset = 1*8;
args->args.fp_offset = 6*8 + 8*16;
args->args.reg_save_area = &args->regs;
args->args.overflow_arg_area = (void *)regs->sp;
/*
* If the exception came from __WARN_trap(), there is a return
* address on the stack; skip it. This is why every __WARN_trap()
* caller must inhibit tail-call optimization.
*/
if ((void *)regs->ip == &__WARN_trap)
args->args.overflow_arg_area += 8;
return &args->args;
}
#endif /* HAVE_ARCH_BUG_FORMAT_ARGS */
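/*
 * Consumption sketch: how a report path might hand the reconstructed
 * va_list to the standard vararg printer. The function is hypothetical;
 * only __warn_args() and vprintk() are existing interfaces.
 */
static void warn_print(struct pt_regs *regs, const char *fmt)
{
	struct arch_va_list storage;
	va_list *ap = __warn_args(&storage, regs);

	vprintk(fmt, *ap);
}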
static nokprobe_inline int
do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str,
@ -334,6 +426,11 @@ static noinstr bool handle_bug(struct pt_regs *regs)
raw_local_irq_enable();
switch (ud_type) {
case BUG_UD1_WARN:
if (report_bug_entry((void *)pt_regs_val(regs, ud_imm), regs) == BUG_TRAP_TYPE_WARN)
handled = true;
break;
case BUG_UD2:
if (report_bug(regs->ip, regs) == BUG_TRAP_TYPE_WARN) {
handled = true;

View File

@ -17,6 +17,7 @@
#include <linux/kdebug.h>
#include <asm/processor.h>
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/mmu_context.h>
#include <asm/nops.h>
@ -258,9 +259,8 @@ static volatile u32 good_2byte_insns[256 / 32] = {
static bool is_prefix_bad(struct insn *insn)
{
insn_byte_t p;
int i;
for_each_insn_prefix(insn, i, p) {
for_each_insn_prefix(insn, p) {
insn_attr_t attr;
attr = inat_get_opcode_attribute(p);
@ -1158,35 +1158,12 @@ void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
mmap_write_unlock(mm);
}
static bool insn_is_nop(struct insn *insn)
{
return insn->opcode.nbytes == 1 && insn->opcode.bytes[0] == 0x90;
}
static bool insn_is_nopl(struct insn *insn)
{
if (insn->opcode.nbytes != 2)
return false;
if (insn->opcode.bytes[0] != 0x0f || insn->opcode.bytes[1] != 0x1f)
return false;
if (!insn->modrm.nbytes)
return false;
if (X86_MODRM_REG(insn->modrm.bytes[0]) != 0)
return false;
/* 0f 1f /0 - NOPL */
return true;
}
static bool can_optimize(struct insn *insn, unsigned long vaddr)
{
if (!insn->x86_64 || insn->length != 5)
return false;
if (!insn_is_nop(insn) && !insn_is_nopl(insn))
if (!insn_is_nop(insn))
return false;
/* We can't do cross page atomic writes yet. */
@ -1426,19 +1403,14 @@ static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
{
u8 opc1 = OPCODE1(insn);
insn_byte_t p;
int i;
/* x86_nops[insn->length]; same as jmp with .offs = 0 */
if (insn->length <= ASM_NOP_MAX &&
!memcmp(insn->kaddr, x86_nops[insn->length], insn->length))
if (insn_is_nop(insn))
goto setup;
switch (opc1) {
case 0xeb: /* jmp 8 */
case 0xe9: /* jmp 32 */
break;
case 0x90: /* prefix* + nop; same as jmp with .offs = 0 */
goto setup;
case 0xe8: /* call relative */
branch_clear_offset(auprobe, insn);
@ -1463,7 +1435,7 @@ static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
* Intel and AMD behavior differ in 64-bit mode: Intel ignores 66 prefix.
* No one uses these insns, reject any branch insns with such prefix.
*/
for_each_insn_prefix(insn, i, p) {
for_each_insn_prefix(insn, p) {
if (p == 0x66)
return -ENOTSUPP;
}
@ -1819,3 +1791,35 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
else
return regs->sp <= ret->stack;
}
/*
* Heuristic check whether a uprobe is installed at a function entry.
*
* Under the assumption that user code is compiled with frame pointers,
* a `push %rbp/%ebp` first instruction is a good indicator that we
* indeed are.
*
* Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
* If we get this wrong, the captured stack trace might have one extra
* bogus entry, but the rest of the trace will still be meaningful.
*/
bool is_uprobe_at_func_entry(struct pt_regs *regs)
{
struct arch_uprobe *auprobe;
if (!current->utask)
return false;
auprobe = current->utask->auprobe;
if (!auprobe)
return false;
/* push %rbp/%ebp */
if (auprobe->insn[0] == 0x55)
return true;
/* endbr64 (64-bit only) */
if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
return true;
return false;
}

View File

@ -63,11 +63,10 @@ static bool is_string_insn(struct insn *insn)
bool insn_has_rep_prefix(struct insn *insn)
{
insn_byte_t p;
int i;
insn_get_prefixes(insn);
for_each_insn_prefix(insn, i, p) {
for_each_insn_prefix(insn, p) {
if (p == 0xf2 || p == 0xf3)
return true;
}
@ -92,13 +91,13 @@ bool insn_has_rep_prefix(struct insn *insn)
static int get_seg_reg_override_idx(struct insn *insn)
{
int idx = INAT_SEG_REG_DEFAULT;
int num_overrides = 0, i;
int num_overrides = 0;
insn_byte_t p;
insn_get_prefixes(insn);
/* Look for any segment override prefixes. */
for_each_insn_prefix(insn, i, p) {
for_each_insn_prefix(insn, p) {
insn_attr_t attr;
attr = inat_get_opcode_attribute(p);
@ -1676,3 +1675,147 @@ enum insn_mmio_type insn_decode_mmio(struct insn *insn, int *bytes)
return type;
}
/*
* Recognise typical NOP patterns for both 32bit and 64bit.
*
* Notably:
* - NOP, but not: REP NOP aka PAUSE
* - NOPL
* - MOV %reg, %reg
* - LEA 0(%reg),%reg
* - JMP +0
*
* Must not produce false positives: instructions identified as a NOP may
* be emulated as a NOP (uprobes) or run-length encoded into a larger NOP
* (alternatives).
*
* False negatives are fine; the list need not be exhaustive.
*/
bool insn_is_nop(struct insn *insn)
{
u8 b3 = 0, x3 = 0, r3 = 0;
u8 b4 = 0, x4 = 0, r4 = 0, m = 0;
u8 modrm, modrm_mod, modrm_reg, modrm_rm;
u8 sib = 0, sib_scale, sib_index, sib_base;
u8 nrex, rex;
u8 p, rep = 0;
if ((nrex = insn->rex_prefix.nbytes)) {
rex = insn->rex_prefix.bytes[nrex-1];
r3 = !!X86_REX_R(rex);
x3 = !!X86_REX_X(rex);
b3 = !!X86_REX_B(rex);
if (nrex > 1) {
r4 = !!X86_REX2_R(rex);
x4 = !!X86_REX2_X(rex);
b4 = !!X86_REX2_B(rex);
m = !!X86_REX2_M(rex);
}
} else if (insn->vex_prefix.nbytes) {
/*
* Ignore VEX encoded NOPs
*/
return false;
}
if (insn->modrm.nbytes) {
modrm = insn->modrm.bytes[0];
modrm_mod = X86_MODRM_MOD(modrm);
modrm_reg = X86_MODRM_REG(modrm) + 8*r3 + 16*r4;
modrm_rm = X86_MODRM_RM(modrm) + 8*b3 + 16*b4;
modrm = 1;
}
if (insn->sib.nbytes) {
sib = insn->sib.bytes[0];
sib_scale = X86_SIB_SCALE(sib);
sib_index = X86_SIB_INDEX(sib) + 8*x3 + 16*x4;
sib_base = X86_SIB_BASE(sib) + 8*b3 + 16*b4;
sib = 1;
modrm_rm = sib_base;
}
for_each_insn_prefix(insn, p) {
if (p == 0xf3) /* REPE */
rep = 1;
}
/*
* Opcode map munging:
*
* REX2: 0 - single byte opcode
* 1 - 0f second byte opcode
*/
switch (m) {
case 0: break;
case 1: insn->opcode.value <<= 8;
insn->opcode.value |= 0x0f;
break;
default:
return false;
}
switch (insn->opcode.bytes[0]) {
case 0x0f: /* 2nd byte */
break;
case 0x89: /* MOV */
if (modrm_mod != 3) /* register-direct */
return false;
/* native size */
if (insn->opnd_bytes != 4 * (1 + insn->x86_64))
return false;
return modrm_reg == modrm_rm; /* MOV %reg, %reg */
case 0x8d: /* LEA */
if (modrm_mod == 0 || modrm_mod == 3) /* register-indirect with disp */
return false;
/* native size */
if (insn->opnd_bytes != 4 * (1 + insn->x86_64))
return false;
if (insn->displacement.value != 0)
return false;
if (sib && (sib_scale != 0 || sib_index != 4)) /* (%reg, %eiz, 1) */
return false;
for_each_insn_prefix(insn, p) {
if (p != 0x3e) /* DS */
return false;
}
return modrm_reg == modrm_rm; /* LEA 0(%reg), %reg */
case 0x90: /* NOP */
if (b3 || b4) /* XCHG %r{8,16,24},%rax */
return false;
if (rep) /* REP NOP := PAUSE */
return false;
return true;
case 0xe9: /* JMP.d32 */
case 0xeb: /* JMP.d8 */
return insn->immediate.value == 0; /* JMP +0 */
default:
return false;
}
switch (insn->opcode.bytes[1]) {
case 0x1f:
return modrm_reg == 0; /* 0f 1f /0 -- NOPL */
default:
return false;
}
}
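/*
 * Usage sketch (the wrapper is hypothetical; insn_decode() and
 * INSN_MODE_KERN are the existing decoder API):
 */
static bool buf_is_nop(const void *kaddr, int len)
{
	struct insn insn;

	if (insn_decode(&insn, kaddr, len, INSN_MODE_KERN))
		return false;

	return insn_is_nop(&insn);
}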

View File

@ -39,7 +39,7 @@ asmlinkage void mul_Xsig_Xsig(Xsig *dest, const Xsig *mult);
asmlinkage void shr_Xsig(Xsig *, const int n);
asmlinkage int round_Xsig(Xsig *);
asmlinkage int norm_Xsig(Xsig *);
asmlinkage void div_Xsig(Xsig *x1, const Xsig *x2, const Xsig *dest);
asmlinkage void div_Xsig(Xsig *x1, const Xsig *x2, Xsig *dest);
/* Macro to extract the most significant 32 bits from a long long */
#define LL_MSW(x) (((unsigned long *)&x)[1])

View File

@ -180,7 +180,7 @@ static int dev_mkdir(const char *name, umode_t mode)
if (IS_ERR(dentry))
return PTR_ERR(dentry);
dentry = vfs_mkdir(&nop_mnt_idmap, d_inode(path.dentry), dentry, mode);
dentry = vfs_mkdir(&nop_mnt_idmap, d_inode(path.dentry), dentry, mode, NULL);
if (!IS_ERR(dentry))
/* mark as kernel-created inode */
d_inode(dentry)->i_private = &thread;
@ -231,7 +231,7 @@ static int handle_create(const char *nodename, umode_t mode, kuid_t uid,
return PTR_ERR(dentry);
err = vfs_mknod(&nop_mnt_idmap, d_inode(path.dentry), dentry, mode,
dev->devt);
dev->devt, NULL);
if (!err) {
struct iattr newattrs;
@ -261,7 +261,7 @@ static int dev_rmdir(const char *name)
return PTR_ERR(dentry);
if (d_inode(dentry)->i_private == &thread)
err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
dentry);
dentry, NULL);
else
err = -EPERM;

View File

@ -768,18 +768,10 @@ EXPORT_SYMBOL_NS_GPL(dma_buf_export, "DMA_BUF");
*/
int dma_buf_fd(struct dma_buf *dmabuf, int flags)
{
int fd;
if (!dmabuf || !dmabuf->file)
return -EINVAL;
fd = get_unused_fd_flags(flags);
if (fd < 0)
return fd;
fd_install(fd, dmabuf->file);
return fd;
return FD_ADD(flags, dmabuf->file);
}
EXPORT_SYMBOL_NS_GPL(dma_buf_fd, "DMA_BUF");
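/*
 * The conversions in this series come in two shapes: FD_ADD() for the
 * plain "allocate fd, install file" case above, and FD_PREPARE() +
 * fd_publish() when the file must be touched before it becomes visible.
 * A minimal sketch of the latter, using only helpers visible in this
 * diff; the object and fops names are placeholders.
 */
static int example_get_fd(struct example_obj *obj)
{
	FD_PREPARE(fdf, O_CLOEXEC,
		   anon_inode_getfile("example", &example_fops, obj, O_RDWR));
	if (fdf.err)
		return fdf.err;

	obj->file = fd_prepare_file(fdf);	/* fd not installed yet */
	return fd_publish(fdf);			/* install and return fd */
}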

View File

@ -298,12 +298,13 @@ static const struct file_operations linehandle_fileops = {
#endif
};
DEFINE_FREE(linehandle_free, struct linehandle_state *, if (!IS_ERR_OR_NULL(_T)) linehandle_free(_T))
static int linehandle_create(struct gpio_device *gdev, void __user *ip)
{
struct gpiohandle_request handlereq;
struct linehandle_state *lh;
struct file *file;
int fd, i, ret;
struct linehandle_state *lh __free(linehandle_free) = NULL;
int i, ret;
u32 lflags;
if (copy_from_user(&handlereq, ip, sizeof(handlereq)))
@ -327,10 +328,8 @@ static int linehandle_create(struct gpio_device *gdev, void __user *ip)
lh->label = kstrndup(handlereq.consumer_label,
sizeof(handlereq.consumer_label) - 1,
GFP_KERNEL);
if (!lh->label) {
ret = -ENOMEM;
goto out_free_lh;
}
if (!lh->label)
return -ENOMEM;
}
lh->num_descs = handlereq.lines;
@ -340,20 +339,18 @@ static int linehandle_create(struct gpio_device *gdev, void __user *ip)
u32 offset = handlereq.lineoffsets[i];
struct gpio_desc *desc = gpio_device_get_desc(gdev, offset);
if (IS_ERR(desc)) {
ret = PTR_ERR(desc);
goto out_free_lh;
}
if (IS_ERR(desc))
return PTR_ERR(desc);
ret = gpiod_request_user(desc, lh->label);
if (ret)
goto out_free_lh;
return ret;
lh->descs[i] = desc;
linehandle_flags_to_desc_flags(handlereq.flags, &desc->flags);
ret = gpiod_set_transitory(desc, false);
if (ret < 0)
goto out_free_lh;
return ret;
/*
* Lines have to be requested explicitly for input
@ -364,11 +361,11 @@ static int linehandle_create(struct gpio_device *gdev, void __user *ip)
ret = gpiod_direction_output_nonotify(desc, val);
if (ret)
goto out_free_lh;
return ret;
} else if (lflags & GPIOHANDLE_REQUEST_INPUT) {
ret = gpiod_direction_input_nonotify(desc);
if (ret)
goto out_free_lh;
return ret;
}
gpiod_line_state_notify(desc, GPIO_V2_LINE_CHANGED_REQUESTED);
@ -377,44 +374,23 @@ static int linehandle_create(struct gpio_device *gdev, void __user *ip)
offset);
}
fd = get_unused_fd_flags(O_RDONLY | O_CLOEXEC);
if (fd < 0) {
ret = fd;
goto out_free_lh;
}
FD_PREPARE(fdf, O_RDONLY | O_CLOEXEC,
anon_inode_getfile("gpio-linehandle", &linehandle_fileops,
lh, O_RDONLY | O_CLOEXEC));
if (fdf.err)
return fdf.err;
retain_and_null_ptr(lh);
file = anon_inode_getfile("gpio-linehandle",
&linehandle_fileops,
lh,
O_RDONLY | O_CLOEXEC);
if (IS_ERR(file)) {
ret = PTR_ERR(file);
goto out_put_unused_fd;
}
handlereq.fd = fd;
if (copy_to_user(ip, &handlereq, sizeof(handlereq))) {
/*
* fput() will trigger the release() callback, so do not go onto
* the regular error cleanup path here.
*/
fput(file);
put_unused_fd(fd);
handlereq.fd = fd_prepare_fd(fdf);
if (copy_to_user(ip, &handlereq, sizeof(handlereq)))
return -EFAULT;
}
fd_install(fd, file);
fd_publish(fdf);
dev_dbg(&gdev->dev, "registered chardev handle for %d lines\n",
lh->num_descs);
return 0;
out_put_unused_fd:
put_unused_fd(fd);
out_free_lh:
linehandle_free(lh);
return ret;
}
#endif /* CONFIG_GPIO_CDEV_V1 */

View File

@ -1870,8 +1870,6 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
struct hv_partition_creation_properties creation_properties = {};
union hv_partition_isolation_properties isolation_properties = {};
struct mshv_partition *partition;
struct file *file;
int fd;
long ret;
if (copy_from_user(&args, user_arg, sizeof(args)))
@ -1938,29 +1936,13 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
goto delete_partition;
ret = mshv_init_async_handler(partition);
if (ret)
goto remove_partition;
fd = get_unused_fd_flags(O_CLOEXEC);
if (fd < 0) {
ret = fd;
goto remove_partition;
if (!ret) {
ret = FD_ADD(O_CLOEXEC, anon_inode_getfile("mshv_partition",
&mshv_partition_fops,
partition, O_RDWR));
if (ret >= 0)
return ret;
}
file = anon_inode_getfile("mshv_partition", &mshv_partition_fops,
partition, O_RDWR);
if (IS_ERR(file)) {
ret = PTR_ERR(file);
goto put_fd;
}
fd_install(fd, file);
return fd;
put_fd:
put_unused_fd(fd);
remove_partition:
remove_partition(partition);
delete_partition:
hv_call_delete_partition(partition->pt_id);

View File

@ -53,6 +53,10 @@ extern void
usnic_uiom_interval_tree_remove(struct usnic_uiom_interval_node *node,
struct rb_root_cached *root);
extern struct usnic_uiom_interval_node *
usnic_uiom_interval_tree_subtree_search(struct usnic_uiom_interval_node *node,
unsigned long start,
unsigned long last);
extern struct usnic_uiom_interval_node *
usnic_uiom_interval_tree_iter_first(struct rb_root_cached *root,
unsigned long start,
unsigned long last);

View File

@ -282,8 +282,6 @@ EXPORT_SYMBOL_GPL(media_request_get_by_fd);
int media_request_alloc(struct media_device *mdev, int *alloc_fd)
{
struct media_request *req;
struct file *filp;
int fd;
int ret;
/* Either both are NULL or both are non-NULL */
@ -297,19 +295,6 @@ int media_request_alloc(struct media_device *mdev, int *alloc_fd)
if (!req)
return -ENOMEM;
fd = get_unused_fd_flags(O_CLOEXEC);
if (fd < 0) {
ret = fd;
goto err_free_req;
}
filp = anon_inode_getfile("request", &request_fops, NULL, O_CLOEXEC);
if (IS_ERR(filp)) {
ret = PTR_ERR(filp);
goto err_put_fd;
}
filp->private_data = req;
req->mdev = mdev;
req->state = MEDIA_REQUEST_STATE_IDLE;
req->num_incomplete_objects = 0;
@ -320,19 +305,24 @@ int media_request_alloc(struct media_device *mdev, int *alloc_fd)
req->updating_count = 0;
req->access_count = 0;
*alloc_fd = fd;
FD_PREPARE(fdf, O_CLOEXEC,
anon_inode_getfile("request", &request_fops, NULL,
O_CLOEXEC));
if (fdf.err) {
ret = fdf.err;
goto err_free_req;
}
fd_prepare_file(fdf)->private_data = req;
*alloc_fd = fd_publish(fdf);
snprintf(req->debug_str, sizeof(req->debug_str), "%u:%d",
atomic_inc_return(&mdev->request_id), fd);
atomic_inc_return(&mdev->request_id), *alloc_fd);
dev_dbg(mdev->dev, "request: allocated %s\n", req->debug_str);
fd_install(fd, filp);
return 0;
err_put_fd:
put_unused_fd(fd);
err_free_req:
if (mdev->ops->req_free)
mdev->ops->req_free(req);

View File

@ -721,21 +721,12 @@ static struct ntsync_obj *ntsync_alloc_obj(struct ntsync_device *dev,
static int ntsync_obj_get_fd(struct ntsync_obj *obj)
{
struct file *file;
int fd;
fd = get_unused_fd_flags(O_CLOEXEC);
if (fd < 0)
return fd;
file = anon_inode_getfile("ntsync", &ntsync_obj_fops, obj, O_RDWR);
if (IS_ERR(file)) {
put_unused_fd(fd);
return PTR_ERR(file);
}
obj->file = file;
fd_install(fd, file);
return fd;
FD_PREPARE(fdf, O_CLOEXEC,
anon_inode_getfile("ntsync", &ntsync_obj_fops, obj, O_RDWR));
if (fdf.err)
return fdf.err;
obj->file = fd_prepare_file(fdf);
return fd_publish(fdf);
}
static int ntsync_create_sem(struct ntsync_device *dev, void __user *argp)

View File

@ -491,7 +491,7 @@ static int gc2235_s_power(struct v4l2_subdev *sd, int on)
return ret;
}
static int startup(struct v4l2_subdev *sd)
static int gc2235_startup(struct v4l2_subdev *sd)
{
struct gc2235_device *dev = to_gc2235_sensor(sd);
struct i2c_client *client = v4l2_get_subdevdata(sd);
@ -556,7 +556,7 @@ static int gc2235_set_fmt(struct v4l2_subdev *sd,
return 0;
}
ret = startup(sd);
ret = gc2235_startup(sd);
if (ret) {
dev_err(&client->dev, "gc2235 startup err\n");
goto err;

View File

@ -600,7 +600,7 @@ static int ov2722_s_power(struct v4l2_subdev *sd, int on)
}
/* TODO: remove it. */
static int startup(struct v4l2_subdev *sd)
static int ov2722_startup(struct v4l2_subdev *sd)
{
struct ov2722_device *dev = to_ov2722_sensor(sd);
struct i2c_client *client = v4l2_get_subdevdata(sd);
@ -662,7 +662,7 @@ static int ov2722_set_fmt(struct v4l2_subdev *sd,
dev->pixels_per_line = dev->res->pixels_per_line;
dev->lines_per_frame = dev->res->lines_per_frame;
ret = startup(sd);
ret = ov2722_startup(sd);
if (ret) {
int i = 0;
@ -677,7 +677,7 @@ static int ov2722_set_fmt(struct v4l2_subdev *sd,
dev_err(&client->dev, "power up failed, continue\n");
continue;
}
ret = startup(sd);
ret = ov2722_startup(sd);
if (ret) {
dev_err(&client->dev, " startup FAILED!\n");
} else {

View File

@ -438,7 +438,7 @@ static irqreturn_t ser_tx_int(int irq, void *dev_id)
* ---------------------------------------------------------------
*/
static int startup(struct tty_struct *tty, struct serial_state *info)
static int rs_startup(struct tty_struct *tty, struct serial_state *info)
{
struct tty_port *port = &info->tport;
unsigned long flags;
@ -513,7 +513,7 @@ static int startup(struct tty_struct *tty, struct serial_state *info)
* This routine will shutdown a serial port; interrupts are disabled, and
* DTR is dropped if the hangup on close termio flag is on.
*/
static void shutdown(struct tty_struct *tty, struct serial_state *info)
static void rs_shutdown(struct tty_struct *tty, struct serial_state *info)
{
unsigned long flags;
@ -975,7 +975,7 @@ static int set_serial_info(struct tty_struct *tty, struct serial_struct *ss)
change_speed(tty, state, NULL);
}
} else
retval = startup(tty, state);
retval = rs_startup(tty, state);
tty_unlock(tty);
return retval;
}
@ -1251,7 +1251,7 @@ static void rs_close(struct tty_struct *tty, struct file * filp)
*/
rs_wait_until_sent(tty, state->timeout);
}
shutdown(tty, state);
rs_shutdown(tty, state);
rs_flush_buffer(tty);
tty_ldisc_flush(tty);
@ -1325,7 +1325,7 @@ static void rs_hangup(struct tty_struct *tty)
struct serial_state *info = tty->driver_data;
rs_flush_buffer(tty);
shutdown(tty, info);
rs_shutdown(tty, info);
info->tport.count = 0;
tty_port_set_active(&info->tport, false);
info->tport.tty = NULL;
@ -1349,7 +1349,7 @@ static int rs_open(struct tty_struct *tty, struct file * filp)
port->tty = tty;
tty->driver_data = info;
retval = startup(tty, info);
retval = rs_startup(tty, info);
if (retval) {
return retval;
}

View File

@ -589,6 +589,23 @@ static inline void legacy_pty_init(void) { }
#ifdef CONFIG_UNIX98_PTYS
static struct cdev ptmx_cdev;
static struct file *ptm_open_peer_file(struct file *master,
struct tty_struct *tty, int flags)
{
struct path path;
struct file *file;
/* Compute the slave's path */
path.mnt = devpts_mntget(master, tty->driver_data);
if (IS_ERR(path.mnt))
return ERR_CAST(path.mnt);
path.dentry = tty->link->driver_data;
file = dentry_open(&path, flags, current_cred());
mntput(path.mnt);
return file;
}
/**
* ptm_open_peer - open the peer of a pty
* @master: the open struct file of the ptmx device node
@ -601,42 +618,10 @@ static struct cdev ptmx_cdev;
*/
int ptm_open_peer(struct file *master, struct tty_struct *tty, int flags)
{
int fd;
struct file *filp;
int retval = -EINVAL;
struct path path;
if (tty->driver != ptm_driver)
return -EIO;
fd = get_unused_fd_flags(flags);
if (fd < 0) {
retval = fd;
goto err;
}
/* Compute the slave's path */
path.mnt = devpts_mntget(master, tty->driver_data);
if (IS_ERR(path.mnt)) {
retval = PTR_ERR(path.mnt);
goto err_put;
}
path.dentry = tty->link->driver_data;
filp = dentry_open(&path, flags, current_cred());
mntput(path.mnt);
if (IS_ERR(filp)) {
retval = PTR_ERR(filp);
goto err_put;
}
fd_install(fd, filp);
return fd;
err_put:
put_unused_fd(fd);
err:
return retval;
return FD_ADD(flags, ptm_open_peer_file(master, tty, flags));
}
static int pty_unix98_ioctl(struct tty_struct *tty,

View File

@ -760,7 +760,7 @@ static void load_code(struct icom_port *icom_port)
dma_free_coherent(&dev->dev, 4096, new_page, temp_pci);
}
static int startup(struct icom_port *icom_port)
static int icom_startup(struct icom_port *icom_port)
{
unsigned long temp;
unsigned char cable_id, raw_cable_id;
@ -832,7 +832,7 @@ static int startup(struct icom_port *icom_port)
return 0;
}
static void shutdown(struct icom_port *icom_port)
static void icom_shutdown(struct icom_port *icom_port)
{
unsigned long temp;
unsigned char cmdReg;
@ -1311,7 +1311,7 @@ static int icom_open(struct uart_port *port)
int retval;
kref_get(&icom_port->adapter->kref);
retval = startup(icom_port);
retval = icom_startup(icom_port);
if (retval) {
kref_put(&icom_port->adapter->kref, icom_kref_release);
@ -1333,7 +1333,7 @@ static void icom_close(struct uart_port *port)
cmdReg = readb(&icom_port->dram->CmdReg);
writeb(cmdReg & ~CMD_RCV_ENABLE, &icom_port->dram->CmdReg);
shutdown(icom_port);
icom_shutdown(icom_port);
kref_put(&icom_port->adapter->kref, icom_kref_release);
}

View File

@ -407,9 +407,9 @@ static void wr_reg32(struct slgt_info *info, unsigned int addr, __u32 value);
static void msc_set_vcr(struct slgt_info *info);
static int startup(struct slgt_info *info);
static int startup_hw(struct slgt_info *info);
static int block_til_ready(struct tty_struct *tty, struct file * filp,struct slgt_info *info);
static void shutdown(struct slgt_info *info);
static void shutdown_hw(struct slgt_info *info);
static void program_hw(struct slgt_info *info);
static void change_params(struct slgt_info *info);
@ -622,7 +622,7 @@ static int open(struct tty_struct *tty, struct file *filp)
if (info->port.count == 1) {
/* 1st open on this device, init hardware */
retval = startup(info);
retval = startup_hw(info);
if (retval < 0) {
mutex_unlock(&info->port.mutex);
goto cleanup;
@ -666,7 +666,7 @@ static void close(struct tty_struct *tty, struct file *filp)
flush_buffer(tty);
tty_ldisc_flush(tty);
shutdown(info);
shutdown_hw(info);
mutex_unlock(&info->port.mutex);
tty_port_close_end(&info->port, tty);
@ -687,7 +687,7 @@ static void hangup(struct tty_struct *tty)
flush_buffer(tty);
mutex_lock(&info->port.mutex);
shutdown(info);
shutdown_hw(info);
spin_lock_irqsave(&info->port.lock, flags);
info->port.count = 0;
@ -1445,7 +1445,7 @@ static int hdlcdev_open(struct net_device *dev)
spin_unlock_irqrestore(&info->netlock, flags);
/* claim resources and init adapter */
if ((rc = startup(info)) != 0) {
if ((rc = startup_hw(info)) != 0) {
spin_lock_irqsave(&info->netlock, flags);
info->netcount=0;
spin_unlock_irqrestore(&info->netlock, flags);
@ -1455,7 +1455,7 @@ static int hdlcdev_open(struct net_device *dev)
/* generic HDLC layer open processing */
rc = hdlc_open(dev);
if (rc) {
shutdown(info);
shutdown_hw(info);
spin_lock_irqsave(&info->netlock, flags);
info->netcount = 0;
spin_unlock_irqrestore(&info->netlock, flags);
@ -1499,7 +1499,7 @@ static int hdlcdev_close(struct net_device *dev)
netif_stop_queue(dev);
/* shutdown adapter and release resources */
shutdown(info);
shutdown_hw(info);
hdlc_close(dev);
@ -2328,7 +2328,7 @@ static irqreturn_t slgt_interrupt(int dummy, void *dev_id)
return IRQ_HANDLED;
}
static int startup(struct slgt_info *info)
static int startup_hw(struct slgt_info *info)
{
DBGINFO(("%s startup\n", info->device_name));
@ -2361,7 +2361,7 @@ static int startup(struct slgt_info *info)
/*
* called by close() and hangup() to shutdown hardware
*/
static void shutdown(struct slgt_info *info)
static void shutdown_hw(struct slgt_info *info)
{
unsigned long flags;

View File

@ -299,10 +299,8 @@ static int vfio_group_ioctl_get_device_fd(struct vfio_group *group,
char __user *arg)
{
struct vfio_device *device;
struct file *filep;
char *buf;
int fdno;
int ret;
int fd;
buf = strndup_user(arg, PAGE_SIZE);
if (IS_ERR(buf))
@ -313,26 +311,10 @@ static int vfio_group_ioctl_get_device_fd(struct vfio_group *group,
if (IS_ERR(device))
return PTR_ERR(device);
fdno = get_unused_fd_flags(O_CLOEXEC);
if (fdno < 0) {
ret = fdno;
goto err_put_device;
}
filep = vfio_device_open_file(device);
if (IS_ERR(filep)) {
ret = PTR_ERR(filep);
goto err_put_fdno;
}
fd_install(fdno, filep);
return fdno;
err_put_fdno:
put_unused_fd(fdno);
err_put_device:
fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
if (fd < 0)
vfio_device_put_registration(device);
return ret;
return fd;
}
static int vfio_group_ioctl_get_status(struct vfio_group *group,

View File

@ -410,7 +410,7 @@ static char *join(const char *dir, const char *name)
return (!buffer) ? ERR_PTR(-ENOMEM) : buffer;
}
static char **split(char *strings, unsigned int len, unsigned int *num)
static char **split_strings(char *strings, unsigned int len, unsigned int *num)
{
char *p, **ret;
@ -448,7 +448,7 @@ char **xenbus_directory(struct xenbus_transaction t,
if (IS_ERR(strings))
return ERR_CAST(strings);
return split(strings, len, num);
return split_strings(strings, len, num);
}
EXPORT_SYMBOL_GPL(xenbus_directory);

View File

@ -280,27 +280,8 @@ static int __anon_inode_getfd(const char *name,
const struct inode *context_inode,
bool make_inode)
{
int error, fd;
struct file *file;
error = get_unused_fd_flags(flags);
if (error < 0)
return error;
fd = error;
file = __anon_inode_getfile(name, fops, priv, flags, context_inode,
make_inode);
if (IS_ERR(file)) {
error = PTR_ERR(file);
goto err_put_unused_fd;
}
fd_install(fd, file);
return fd;
err_put_unused_fd:
put_unused_fd(fd);
return error;
return FD_ADD(flags, __anon_inode_getfile(name, fops, priv, flags,
context_inode, make_inode));
}
/**

View File

@ -415,7 +415,7 @@ EXPORT_SYMBOL(may_setattr);
* performed on the raw inode simply pass @nop_mnt_idmap.
*/
int notify_change(struct mnt_idmap *idmap, struct dentry *dentry,
struct iattr *attr, struct inode **delegated_inode)
struct iattr *attr, struct delegated_inode *delegated_inode)
{
struct inode *inode = dentry->d_inode;
umode_t mode = inode->i_mode;

View File

@ -16,6 +16,7 @@
#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <uapi/linux/mount.h>
#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/uaccess.h>
@ -27,6 +28,9 @@
#include <linux/magic.h>
#include <linux/fs_context.h>
#include <linux/fs_parser.h>
#include "../mount.h"
#include <linux/ns_common.h>
/* This is the range of ioctl() numbers we claim as ours */
#define AUTOFS_IOC_FIRST AUTOFS_IOC_READY
@ -114,6 +118,7 @@ struct autofs_sb_info {
int pipefd;
struct file *pipe;
struct pid *oz_pgrp;
u64 mnt_ns_id;
int version;
int sub_version;
int min_proto;

View File

@ -231,32 +231,14 @@ static int test_by_type(const struct path *path, void *p)
*/
static int autofs_dev_ioctl_open_mountpoint(const char *name, dev_t devid)
{
int err, fd;
fd = get_unused_fd_flags(O_CLOEXEC);
if (likely(fd >= 0)) {
struct file *filp;
struct path path;
struct path path __free(path_put) = {};
int err;
err = find_autofs_mount(name, &path, test_by_dev, &devid);
if (err)
goto out;
filp = dentry_open(&path, O_RDONLY, current_cred());
path_put(&path);
if (IS_ERR(filp)) {
err = PTR_ERR(filp);
goto out;
}
fd_install(fd, filp);
}
return fd;
out:
put_unused_fd(fd);
return err;
return FD_ADD(O_CLOEXEC, dentry_open(&path, O_RDONLY, current_cred()));
}
/* Open a file descriptor on an autofs mount point */
@ -381,6 +363,7 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
swap(sbi->oz_pgrp, new_pid);
sbi->pipefd = pipefd;
sbi->pipe = pipe;
sbi->mnt_ns_id = to_ns_common(current->nsproxy->mnt_ns)->ns_id;
sbi->flags &= ~AUTOFS_SBI_CATATONIC;
}
out:

View File

@ -251,6 +251,7 @@ static struct autofs_sb_info *autofs_alloc_sbi(void)
sbi->min_proto = AUTOFS_MIN_PROTO_VERSION;
sbi->max_proto = AUTOFS_MAX_PROTO_VERSION;
sbi->pipefd = -1;
sbi->mnt_ns_id = to_ns_common(current->nsproxy->mnt_ns)->ns_id;
set_autofs_type_indirect(&sbi->type);
mutex_init(&sbi->wq_mutex);

View File

@ -341,6 +341,14 @@ static struct vfsmount *autofs_d_automount(struct path *path)
if (autofs_oz_mode(sbi))
return NULL;
/* Refuse to trigger mount if current namespace is not the owner
* and the mount is propagation private.
*/
if (sbi->mnt_ns_id != to_ns_common(current->nsproxy->mnt_ns)->ns_id) {
if (vfsmount_to_propagation_flags(path->mnt) & MS_PRIVATE)
return ERR_PTR(-EPERM);
}
/*
* If an expire request is pending everyone must wait.
* If the expire fails we're still mounted so continue

View File

@ -904,14 +904,9 @@ static noinline int btrfs_mksubvol(struct dentry *parent,
struct fscrypt_str name_str = FSTR_INIT((char *)qname->name, qname->len);
int ret;
ret = down_write_killable_nested(&dir->i_rwsem, I_MUTEX_PARENT);
if (ret == -EINTR)
return ret;
dentry = lookup_one(idmap, qname, parent);
ret = PTR_ERR(dentry);
dentry = start_creating_killable(idmap, parent, qname);
if (IS_ERR(dentry))
goto out_unlock;
return PTR_ERR(dentry);
ret = btrfs_may_create(idmap, dir, dentry);
if (ret)
@ -940,9 +935,7 @@ static noinline int btrfs_mksubvol(struct dentry *parent,
out_up_read:
up_read(&fs_info->subvol_sem);
out_dput:
dput(dentry);
out_unlock:
btrfs_inode_unlock(BTRFS_I(dir), 0);
end_creating(dentry);
return ret;
}
@ -2417,18 +2410,10 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
goto free_subvol_name;
}
ret = down_write_killable_nested(&dir->i_rwsem, I_MUTEX_PARENT);
if (ret == -EINTR)
goto free_subvol_name;
dentry = lookup_one(idmap, &QSTR(subvol_name), parent);
dentry = start_removing_killable(idmap, parent, &QSTR(subvol_name));
if (IS_ERR(dentry)) {
ret = PTR_ERR(dentry);
goto out_unlock_dir;
}
if (d_really_is_negative(dentry)) {
ret = -ENOENT;
goto out_dput;
goto out_end_removing;
}
inode = d_inode(dentry);
@ -2449,7 +2434,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
*/
ret = -EPERM;
if (!btrfs_test_opt(fs_info, USER_SUBVOL_RM_ALLOWED))
goto out_dput;
goto out_end_removing;
/*
* Do not allow deletion if the parent dir is the same
@ -2460,21 +2445,21 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
*/
ret = -EINVAL;
if (root == dest)
goto out_dput;
goto out_end_removing;
ret = inode_permission(idmap, inode, MAY_WRITE | MAY_EXEC);
if (ret)
goto out_dput;
goto out_end_removing;
}
/* check if subvolume may be deleted by a user */
ret = btrfs_may_delete(idmap, dir, dentry, 1);
if (ret)
goto out_dput;
goto out_end_removing;
if (btrfs_ino(BTRFS_I(inode)) != BTRFS_FIRST_FREE_OBJECTID) {
ret = -EINVAL;
goto out_dput;
goto out_end_removing;
}
btrfs_inode_lock(BTRFS_I(inode), 0);
@ -2483,10 +2468,8 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
if (!ret)
d_delete_notify(dir, dentry);
out_dput:
dput(dentry);
out_unlock_dir:
btrfs_inode_unlock(BTRFS_I(dir), 0);
out_end_removing:
end_removing(dentry);
free_subvol_name:
kfree(subvol_name_ptr);
free_parent:

View File

@ -9,6 +9,7 @@
#include <linux/mount.h>
#include <linux/xattr.h>
#include <linux/file.h>
#include <linux/namei.h>
#include <linux/falloc.h>
#include <trace/events/fscache.h>
#include "internal.h"
@ -428,10 +429,12 @@ static bool cachefiles_invalidate_cookie(struct fscache_cookie *cookie)
if (!old_tmpfile) {
struct cachefiles_volume *volume = object->volume;
struct dentry *fan = volume->fanout[(u8)cookie->key_hash];
struct dentry *obj;
inode_lock_nested(d_inode(fan), I_MUTEX_PARENT);
cachefiles_bury_object(volume->cache, object, fan,
old_file->f_path.dentry,
obj = start_removing_dentry(fan, old_file->f_path.dentry);
if (!IS_ERR(obj))
cachefiles_bury_object(volume->cache, object,
fan, obj,
FSCACHE_OBJECT_INVALIDATED);
}
fput(old_file);

View File

@ -93,12 +93,11 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
_enter(",,%s", dirname);
/* search the current directory for the element name */
inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
retry:
ret = cachefiles_inject_read_error();
if (ret == 0)
subdir = lookup_one(&nop_mnt_idmap, &QSTR(dirname), dir);
subdir = start_creating(&nop_mnt_idmap, dir, &QSTR(dirname));
else
subdir = ERR_PTR(ret);
trace_cachefiles_lookup(NULL, dir, subdir);
@ -129,10 +128,12 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
if (ret < 0)
goto mkdir_error;
ret = cachefiles_inject_write_error();
if (ret == 0)
subdir = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), subdir, 0700);
else
if (ret == 0) {
subdir = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), subdir, 0700, NULL);
} else {
end_creating(subdir);
subdir = ERR_PTR(ret);
}
if (IS_ERR(subdir)) {
trace_cachefiles_vfs_error(NULL, d_inode(dir), ret,
cachefiles_trace_mkdir_error);
@ -141,7 +142,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
trace_cachefiles_mkdir(dir, subdir);
if (unlikely(d_unhashed(subdir) || d_is_negative(subdir))) {
dput(subdir);
end_creating(subdir);
goto retry;
}
ASSERT(d_backing_inode(subdir));
@ -154,7 +155,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
/* Tell rmdir() it's not allowed to delete the subdir */
inode_lock(d_inode(subdir));
inode_unlock(d_inode(dir));
end_creating_keep(subdir);
if (!__cachefiles_mark_inode_in_use(NULL, d_inode(subdir))) {
pr_notice("cachefiles: Inode already in use: %pd (B=%lx)\n",
@ -196,14 +197,11 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
return ERR_PTR(-EBUSY);
mkdir_error:
inode_unlock(d_inode(dir));
if (!IS_ERR(subdir))
dput(subdir);
end_creating(subdir);
pr_err("mkdir %s failed with error %d\n", dirname, ret);
return ERR_PTR(ret);
lookup_error:
inode_unlock(d_inode(dir));
ret = PTR_ERR(subdir);
pr_err("Lookup %s failed with error %d\n", dirname, ret);
return ERR_PTR(ret);
@ -263,6 +261,8 @@ static int cachefiles_unlink(struct cachefiles_cache *cache,
* - File backed objects are unlinked
* - Directory backed objects are stuffed into the graveyard for userspace to
* delete
* On entry dir must be locked. It will be unlocked on exit.
* On entry there must be at least 2 refs on rep, one will be dropped on exit.
*/
int cachefiles_bury_object(struct cachefiles_cache *cache,
struct cachefiles_object *object,
@ -278,27 +278,23 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
_enter(",'%pd','%pd'", dir, rep);
if (rep->d_parent != dir) {
inode_unlock(d_inode(dir));
end_removing(rep);
_leave(" = -ESTALE");
return -ESTALE;
}
/* non-directories can just be unlinked */
if (!d_is_dir(rep)) {
dget(rep); /* Stop the dentry being negated if it's only pinned
* by a file struct.
*/
ret = cachefiles_unlink(cache, object, dir, rep, why);
dput(rep);
end_removing(rep);
inode_unlock(d_inode(dir));
_leave(" = %d", ret);
return ret;
}
/* directories have to be moved to the graveyard */
_debug("move stale object to graveyard");
inode_unlock(d_inode(dir));
end_removing(rep);
try_again:
/* first step is to make up a grave dentry in the graveyard */
@ -425,13 +421,12 @@ int cachefiles_delete_object(struct cachefiles_object *object,
_enter(",OBJ%x{%pD}", object->debug_id, object->file);
/* Stop the dentry being negated if it's only pinned by a file struct. */
dget(dentry);
inode_lock_nested(d_backing_inode(fan), I_MUTEX_PARENT);
dentry = start_removing_dentry(fan, dentry);
if (IS_ERR(dentry))
ret = PTR_ERR(dentry);
else
ret = cachefiles_unlink(volume->cache, object, fan, dentry, why);
inode_unlock(d_backing_inode(fan));
dput(dentry);
end_removing(dentry);
return ret;
}
@ -644,8 +639,12 @@ bool cachefiles_look_up_object(struct cachefiles_object *object)
if (!d_is_reg(dentry)) {
pr_err("%pd is not a file\n", dentry);
inode_lock_nested(d_inode(fan), I_MUTEX_PARENT);
ret = cachefiles_bury_object(volume->cache, object, fan, dentry,
struct dentry *de = start_removing_dentry(fan, dentry);
if (IS_ERR(de))
ret = PTR_ERR(de);
else
ret = cachefiles_bury_object(volume->cache, object,
fan, de,
FSCACHE_OBJECT_IS_WEIRD);
dput(dentry);
if (ret < 0)
@ -679,36 +678,41 @@ bool cachefiles_commit_tmpfile(struct cachefiles_cache *cache,
_enter(",%pD", object->file);
inode_lock_nested(d_inode(fan), I_MUTEX_PARENT);
ret = cachefiles_inject_read_error();
if (ret == 0)
dentry = lookup_one(&nop_mnt_idmap, &QSTR(object->d_name), fan);
dentry = start_creating(&nop_mnt_idmap, fan, &QSTR(object->d_name));
else
dentry = ERR_PTR(ret);
if (IS_ERR(dentry)) {
trace_cachefiles_vfs_error(object, d_inode(fan), PTR_ERR(dentry),
cachefiles_trace_lookup_error);
_debug("lookup fail %ld", PTR_ERR(dentry));
goto out_unlock;
goto out;
}
if (!d_is_negative(dentry)) {
/*
* This loop will only execute more than once if some other thread
* races to create the object we are trying to create.
*/
while (!d_is_negative(dentry)) {
ret = cachefiles_unlink(volume->cache, object, fan, dentry,
FSCACHE_OBJECT_IS_STALE);
if (ret < 0)
goto out_dput;
goto out_end;
end_creating(dentry);
dput(dentry);
ret = cachefiles_inject_read_error();
if (ret == 0)
dentry = lookup_one(&nop_mnt_idmap, &QSTR(object->d_name), fan);
dentry = start_creating(&nop_mnt_idmap, fan,
&QSTR(object->d_name));
else
dentry = ERR_PTR(ret);
if (IS_ERR(dentry)) {
trace_cachefiles_vfs_error(object, d_inode(fan), PTR_ERR(dentry),
cachefiles_trace_lookup_error);
_debug("lookup fail %ld", PTR_ERR(dentry));
goto out_unlock;
goto out;
}
}
@ -729,10 +733,9 @@ bool cachefiles_commit_tmpfile(struct cachefiles_cache *cache,
success = true;
}
out_dput:
dput(dentry);
out_unlock:
inode_unlock(d_inode(fan));
out_end:
end_creating(dentry);
out:
_leave(" = %u", success);
return success;
}
@ -748,26 +751,20 @@ static struct dentry *cachefiles_lookup_for_cull(struct cachefiles_cache *cache,
struct dentry *victim;
int ret = -ENOENT;
inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
victim = start_removing(&nop_mnt_idmap, dir, &QSTR(filename));
victim = lookup_one(&nop_mnt_idmap, &QSTR(filename), dir);
if (IS_ERR(victim))
goto lookup_error;
if (d_is_negative(victim))
goto lookup_put;
if (d_inode(victim)->i_flags & S_KERNEL_FILE)
goto lookup_busy;
return victim;
lookup_busy:
ret = -EBUSY;
lookup_put:
inode_unlock(d_inode(dir));
dput(victim);
end_removing(victim);
return ERR_PTR(ret);
lookup_error:
inode_unlock(d_inode(dir));
ret = PTR_ERR(victim);
if (ret == -ENOENT)
return ERR_PTR(-ESTALE); /* Probably got retired by the netfs */
@ -815,18 +812,17 @@ int cachefiles_cull(struct cachefiles_cache *cache, struct dentry *dir,
ret = cachefiles_bury_object(cache, NULL, dir, victim,
FSCACHE_OBJECT_WAS_CULLED);
dput(victim);
if (ret < 0)
goto error;
fscache_count_culled();
dput(victim);
_leave(" = 0");
return 0;
error_unlock:
inode_unlock(d_inode(dir));
end_removing(victim);
error:
dput(victim);
if (ret == -ENOENT)
return -ESTALE; /* Probably got retired by the netfs */

View File

@ -7,6 +7,7 @@
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/namei.h>
#include "internal.h"
#include <trace/events/fscache.h>
@ -58,8 +59,10 @@ void cachefiles_acquire_volume(struct fscache_volume *vcookie)
if (ret < 0) {
if (ret != -ESTALE)
goto error_dir;
inode_lock_nested(d_inode(cache->store), I_MUTEX_PARENT);
cachefiles_bury_object(cache, NULL, cache->store, vdentry,
vdentry = start_removing_dentry(cache->store, vdentry);
if (!IS_ERR(vdentry))
cachefiles_bury_object(cache, NULL, cache->store,
vdentry,
FSCACHE_VOLUME_IS_WEIRD);
cachefiles_put_directory(volume->dentry);
cond_resched();

View File

@ -403,7 +403,7 @@ static struct dentry *debugfs_start_creating(const char *name,
return dentry;
}
static struct dentry *failed_creating(struct dentry *dentry)
static struct dentry *debugfs_failed_creating(struct dentry *dentry)
{
inode_unlock(d_inode(dentry->d_parent));
dput(dentry);
@ -411,7 +411,7 @@ static struct dentry *failed_creating(struct dentry *dentry)
return ERR_PTR(-ENOMEM);
}
static struct dentry *end_creating(struct dentry *dentry)
static struct dentry *debugfs_end_creating(struct dentry *dentry)
{
inode_unlock(d_inode(dentry->d_parent));
return dentry;
@ -435,7 +435,7 @@ static struct dentry *__debugfs_create_file(const char *name, umode_t mode,
return dentry;
if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
failed_creating(dentry);
debugfs_failed_creating(dentry);
return ERR_PTR(-EPERM);
}
@ -443,7 +443,7 @@ static struct dentry *__debugfs_create_file(const char *name, umode_t mode,
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create file '%s'\n",
name);
return failed_creating(dentry);
return debugfs_failed_creating(dentry);
}
inode->i_mode = mode;
@ -458,7 +458,7 @@ static struct dentry *__debugfs_create_file(const char *name, umode_t mode,
d_instantiate(dentry, inode);
fsnotify_create(d_inode(dentry->d_parent), dentry);
return end_creating(dentry);
return debugfs_end_creating(dentry);
}
struct dentry *debugfs_create_file_full(const char *name, umode_t mode,
@ -585,7 +585,7 @@ struct dentry *debugfs_create_dir(const char *name, struct dentry *parent)
return dentry;
if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
failed_creating(dentry);
debugfs_failed_creating(dentry);
return ERR_PTR(-EPERM);
}
@ -593,7 +593,7 @@ struct dentry *debugfs_create_dir(const char *name, struct dentry *parent)
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create directory '%s'\n",
name);
return failed_creating(dentry);
return debugfs_failed_creating(dentry);
}
inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
@ -605,7 +605,7 @@ struct dentry *debugfs_create_dir(const char *name, struct dentry *parent)
d_instantiate(dentry, inode);
inc_nlink(d_inode(dentry->d_parent));
fsnotify_mkdir(d_inode(dentry->d_parent), dentry);
return end_creating(dentry);
return debugfs_end_creating(dentry);
}
EXPORT_SYMBOL_GPL(debugfs_create_dir);
@ -632,7 +632,7 @@ struct dentry *debugfs_create_automount(const char *name,
return dentry;
if (!(debugfs_allow & DEBUGFS_ALLOW_API)) {
failed_creating(dentry);
debugfs_failed_creating(dentry);
return ERR_PTR(-EPERM);
}
@ -640,7 +640,7 @@ struct dentry *debugfs_create_automount(const char *name,
if (unlikely(!inode)) {
pr_err("out of free dentries, can not create automount '%s'\n",
name);
return failed_creating(dentry);
return debugfs_failed_creating(dentry);
}
make_empty_dir_inode(inode);
@ -652,7 +652,7 @@ struct dentry *debugfs_create_automount(const char *name,
d_instantiate(dentry, inode);
inc_nlink(d_inode(dentry->d_parent));
fsnotify_mkdir(d_inode(dentry->d_parent), dentry);
return end_creating(dentry);
return debugfs_end_creating(dentry);
}
EXPORT_SYMBOL(debugfs_create_automount);
@ -699,13 +699,13 @@ struct dentry *debugfs_create_symlink(const char *name, struct dentry *parent,
pr_err("out of free dentries, can not create symlink '%s'\n",
name);
kfree(link);
return failed_creating(dentry);
return debugfs_failed_creating(dentry);
}
inode->i_mode = S_IFLNK | S_IRWXUGO;
inode->i_op = &debugfs_symlink_inode_operations;
inode->i_link = link;
d_instantiate(dentry, inode);
return end_creating(dentry);
return debugfs_end_creating(dentry);
}
EXPORT_SYMBOL_GPL(debugfs_create_symlink);
@ -842,7 +842,8 @@ int __printf(2, 3) debugfs_change_name(struct dentry *dentry, const char *fmt, ...)
int error = 0;
const char *new_name;
struct name_snapshot old_name;
struct dentry *parent, *target;
struct dentry *target;
struct renamedata rd = {};
struct inode *dir;
va_list ap;
@ -855,36 +856,31 @@ int __printf(2, 3) debugfs_change_name(struct dentry *dentry, const char *fmt, .
if (!new_name)
return -ENOMEM;
parent = dget_parent(dentry);
dir = d_inode(parent);
inode_lock(dir);
rd.old_parent = dget_parent(dentry);
rd.new_parent = rd.old_parent;
rd.flags = RENAME_NOREPLACE;
target = lookup_noperm_unlocked(&QSTR(new_name), rd.new_parent);
if (IS_ERR(target))
return PTR_ERR(target);
error = start_renaming_two_dentries(&rd, dentry, target);
if (error) {
if (error == -EEXIST && target == dentry)
/* it isn't an error to rename a thing to itself */
error = 0;
goto out;
}
dir = d_inode(rd.old_parent);
take_dentry_name_snapshot(&old_name, dentry);
if (WARN_ON_ONCE(dentry->d_parent != parent)) {
error = -EINVAL;
goto out;
}
if (strcmp(old_name.name.name, new_name) == 0)
goto out;
target = lookup_noperm(&QSTR(new_name), parent);
if (IS_ERR(target)) {
error = PTR_ERR(target);
goto out;
}
if (d_really_is_positive(target)) {
dput(target);
error = -EINVAL;
goto out;
}
simple_rename_timestamp(dir, dentry, dir, target);
d_move(dentry, target);
dput(target);
simple_rename_timestamp(dir, dentry, dir, rd.new_dentry);
d_move(dentry, rd.new_dentry);
fsnotify_move(dir, dir, &old_name.name, d_is_dir(dentry), NULL, dentry);
out:
release_dentry_name_snapshot(&old_name);
inode_unlock(dir);
dput(parent);
end_renaming(&rd);
out:
dput(rd.old_parent);
dput(target);
kfree_const(new_name);
return error;
}

View File

@ -24,18 +24,26 @@
#include <linux/unaligned.h>
#include "ecryptfs_kernel.h"
static int lock_parent(struct dentry *dentry,
struct dentry **lower_dentry,
struct inode **lower_dir)
static struct dentry *ecryptfs_start_creating_dentry(struct dentry *dentry)
{
struct dentry *lower_dir_dentry;
struct dentry *parent = dget_parent(dentry);
struct dentry *ret;
lower_dir_dentry = ecryptfs_dentry_to_lower(dentry->d_parent);
*lower_dir = d_inode(lower_dir_dentry);
*lower_dentry = ecryptfs_dentry_to_lower(dentry);
ret = start_creating_dentry(ecryptfs_dentry_to_lower(parent),
ecryptfs_dentry_to_lower(dentry));
dput(parent);
return ret;
}
inode_lock_nested(*lower_dir, I_MUTEX_PARENT);
return (*lower_dentry)->d_parent == lower_dir_dentry ? 0 : -EINVAL;
static struct dentry *ecryptfs_start_removing_dentry(struct dentry *dentry)
{
struct dentry *parent = dget_parent(dentry);
struct dentry *ret;
ret = start_removing_dentry(ecryptfs_dentry_to_lower(parent),
ecryptfs_dentry_to_lower(dentry));
dput(parent);
return ret;
}
static int ecryptfs_inode_test(struct inode *inode, void *lower_inode)
@ -141,15 +149,12 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry,
struct inode *lower_dir;
int rc;
rc = lock_parent(dentry, &lower_dentry, &lower_dir);
dget(lower_dentry); // don't even try to make the lower negative
if (!rc) {
if (d_unhashed(lower_dentry))
rc = -EINVAL;
else
rc = vfs_unlink(&nop_mnt_idmap, lower_dir, lower_dentry,
NULL);
}
lower_dentry = ecryptfs_start_removing_dentry(dentry);
if (IS_ERR(lower_dentry))
return PTR_ERR(lower_dentry);
lower_dir = lower_dentry->d_parent->d_inode;
rc = vfs_unlink(&nop_mnt_idmap, lower_dir, lower_dentry, NULL);
if (rc) {
printk(KERN_ERR "Error in vfs_unlink; rc = [%d]\n", rc);
goto out_unlock;
@ -158,8 +163,7 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry,
set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink);
inode_set_ctime_to_ts(inode, inode_get_ctime(dir));
out_unlock:
dput(lower_dentry);
inode_unlock(lower_dir);
end_removing(lower_dentry);
if (!rc)
d_drop(dentry);
return rc;
@ -186,10 +190,11 @@ ecryptfs_do_create(struct inode *directory_inode,
struct inode *lower_dir;
struct inode *inode;
rc = lock_parent(ecryptfs_dentry, &lower_dentry, &lower_dir);
if (!rc)
rc = vfs_create(&nop_mnt_idmap, lower_dir,
lower_dentry, mode, true);
lower_dentry = ecryptfs_start_creating_dentry(ecryptfs_dentry);
if (IS_ERR(lower_dentry))
return ERR_CAST(lower_dentry);
lower_dir = lower_dentry->d_parent->d_inode;
rc = vfs_create(&nop_mnt_idmap, lower_dentry, mode, NULL);
if (rc) {
printk(KERN_ERR "%s: Failure to create dentry in lower fs; "
"rc = [%d]\n", __func__, rc);
@ -205,7 +210,7 @@ ecryptfs_do_create(struct inode *directory_inode,
fsstack_copy_attr_times(directory_inode, lower_dir);
fsstack_copy_inode_size(directory_inode, lower_dir);
out_lock:
inode_unlock(lower_dir);
end_creating(lower_dentry);
return inode;
}
@ -433,8 +438,10 @@ static int ecryptfs_link(struct dentry *old_dentry, struct inode *dir,
file_size_save = i_size_read(d_inode(old_dentry));
lower_old_dentry = ecryptfs_dentry_to_lower(old_dentry);
rc = lock_parent(new_dentry, &lower_new_dentry, &lower_dir);
if (!rc)
lower_new_dentry = ecryptfs_start_creating_dentry(new_dentry);
if (IS_ERR(lower_new_dentry))
return PTR_ERR(lower_new_dentry);
lower_dir = lower_new_dentry->d_parent->d_inode;
rc = vfs_link(lower_old_dentry, &nop_mnt_idmap, lower_dir,
lower_new_dentry, NULL);
if (rc || d_really_is_negative(lower_new_dentry))
@ -448,7 +455,7 @@ static int ecryptfs_link(struct dentry *old_dentry, struct inode *dir,
ecryptfs_inode_to_lower(d_inode(old_dentry))->i_nlink);
i_size_write(d_inode(new_dentry), file_size_save);
out_lock:
inode_unlock(lower_dir);
end_creating(lower_new_dentry);
return rc;
}
@ -468,9 +475,11 @@ static int ecryptfs_symlink(struct mnt_idmap *idmap,
size_t encoded_symlen;
struct ecryptfs_mount_crypt_stat *mount_crypt_stat = NULL;
rc = lock_parent(dentry, &lower_dentry, &lower_dir);
if (rc)
goto out_lock;
lower_dentry = ecryptfs_start_creating_dentry(dentry);
if (IS_ERR(lower_dentry))
return PTR_ERR(lower_dentry);
lower_dir = lower_dentry->d_parent->d_inode;
mount_crypt_stat = &ecryptfs_superblock_to_private(
dir->i_sb)->mount_crypt_stat;
rc = ecryptfs_encrypt_and_encode_filename(&encoded_symname,
@ -480,7 +489,7 @@ static int ecryptfs_symlink(struct mnt_idmap *idmap,
if (rc)
goto out_lock;
rc = vfs_symlink(&nop_mnt_idmap, lower_dir, lower_dentry,
encoded_symname);
encoded_symname, NULL);
kfree(encoded_symname);
if (rc || d_really_is_negative(lower_dentry))
goto out_lock;
@ -490,7 +499,7 @@ static int ecryptfs_symlink(struct mnt_idmap *idmap,
fsstack_copy_attr_times(dir, lower_dir);
fsstack_copy_inode_size(dir, lower_dir);
out_lock:
inode_unlock(lower_dir);
end_creating(lower_dentry);
if (d_really_is_negative(dentry))
d_drop(dentry);
return rc;
@ -501,14 +510,16 @@ static struct dentry *ecryptfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
{
int rc;
struct dentry *lower_dentry;
struct dentry *lower_dir_dentry;
struct inode *lower_dir;
rc = lock_parent(dentry, &lower_dentry, &lower_dir);
if (rc)
goto out;
lower_dentry = ecryptfs_start_creating_dentry(dentry);
if (IS_ERR(lower_dentry))
return lower_dentry;
lower_dir_dentry = dget(lower_dentry->d_parent);
lower_dir = lower_dir_dentry->d_inode;
lower_dentry = vfs_mkdir(&nop_mnt_idmap, lower_dir,
lower_dentry, mode);
lower_dentry, mode, NULL);
rc = PTR_ERR(lower_dentry);
if (IS_ERR(lower_dentry))
goto out;
@ -522,7 +533,7 @@ static struct dentry *ecryptfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
fsstack_copy_inode_size(dir, lower_dir);
set_nlink(dir, lower_dir->i_nlink);
out:
inode_unlock(lower_dir);
end_creating(lower_dentry);
if (d_really_is_negative(dentry))
d_drop(dentry);
return ERR_PTR(rc);
@ -534,21 +545,18 @@ static int ecryptfs_rmdir(struct inode *dir, struct dentry *dentry)
struct inode *lower_dir;
int rc;
rc = lock_parent(dentry, &lower_dentry, &lower_dir);
dget(lower_dentry); // don't even try to make the lower negative
if (!rc) {
if (d_unhashed(lower_dentry))
rc = -EINVAL;
else
rc = vfs_rmdir(&nop_mnt_idmap, lower_dir, lower_dentry);
}
lower_dentry = ecryptfs_start_removing_dentry(dentry);
if (IS_ERR(lower_dentry))
return PTR_ERR(lower_dentry);
lower_dir = lower_dentry->d_parent->d_inode;
rc = vfs_rmdir(&nop_mnt_idmap, lower_dir, lower_dentry, NULL);
if (!rc) {
clear_nlink(d_inode(dentry));
fsstack_copy_attr_times(dir, lower_dir);
set_nlink(dir, lower_dir->i_nlink);
}
dput(lower_dentry);
inode_unlock(lower_dir);
end_removing(lower_dentry);
if (!rc)
d_drop(dentry);
return rc;
@ -562,10 +570,12 @@ ecryptfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
struct dentry *lower_dentry;
struct inode *lower_dir;
rc = lock_parent(dentry, &lower_dentry, &lower_dir);
if (!rc)
rc = vfs_mknod(&nop_mnt_idmap, lower_dir,
lower_dentry, mode, dev);
lower_dentry = ecryptfs_start_creating_dentry(dentry);
if (IS_ERR(lower_dentry))
return PTR_ERR(lower_dentry);
lower_dir = lower_dentry->d_parent->d_inode;
rc = vfs_mknod(&nop_mnt_idmap, lower_dir, lower_dentry, mode, dev, NULL);
if (rc || d_really_is_negative(lower_dentry))
goto out;
rc = ecryptfs_interpose(lower_dentry, dentry, dir->i_sb);
@ -574,7 +584,7 @@ ecryptfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
fsstack_copy_attr_times(dir, lower_dir);
fsstack_copy_inode_size(dir, lower_dir);
out:
inode_unlock(lower_dir);
end_removing(lower_dentry);
if (d_really_is_negative(dentry))
d_drop(dentry);
return rc;
@ -590,7 +600,6 @@ ecryptfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
struct dentry *lower_new_dentry;
struct dentry *lower_old_dir_dentry;
struct dentry *lower_new_dir_dentry;
struct dentry *trap;
struct inode *target_inode;
struct renamedata rd = {};
@ -605,31 +614,13 @@ ecryptfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
target_inode = d_inode(new_dentry);
trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
if (IS_ERR(trap))
return PTR_ERR(trap);
dget(lower_new_dentry);
rc = -EINVAL;
if (lower_old_dentry->d_parent != lower_old_dir_dentry)
goto out_lock;
if (lower_new_dentry->d_parent != lower_new_dir_dentry)
goto out_lock;
if (d_unhashed(lower_old_dentry) || d_unhashed(lower_new_dentry))
goto out_lock;
/* source should not be ancestor of target */
if (trap == lower_old_dentry)
goto out_lock;
/* target should not be ancestor of source */
if (trap == lower_new_dentry) {
rc = -ENOTEMPTY;
goto out_lock;
}
rd.mnt_idmap = &nop_mnt_idmap;
rd.old_parent = lower_old_dir_dentry;
rd.old_dentry = lower_old_dentry;
rd.new_parent = lower_new_dir_dentry;
rd.new_dentry = lower_new_dentry;
rc = start_renaming_two_dentries(&rd, lower_old_dentry, lower_new_dentry);
if (rc)
return rc;
rc = vfs_rename(&rd);
if (rc)
goto out_lock;
@ -640,8 +631,7 @@ ecryptfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
if (new_dir != old_dir)
fsstack_copy_attr_all(old_dir, d_inode(lower_old_dir_dentry));
out_lock:
dput(lower_new_dentry);
unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
end_renaming(&rd);
return rc;
}
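The ecryptfs and debugfs conversions above share the start_renaming_two_dentries()/end_renaming() pattern, which folds lock_rename(), the d_parent/d_unhashed sanity checks and the trap-dentry ancestry checks into one helper. A hedged sketch of a caller; the helper names and the 'struct renamedata' fields are taken from these diffs, as the helpers are introduced by this series:

static int demo_rename(struct dentry *old_parent, struct dentry *old_dentry,
                       struct dentry *new_parent, struct dentry *new_dentry)
{
        struct renamedata rd = {};
        int err;

        rd.mnt_idmap  = &nop_mnt_idmap;
        rd.old_parent = old_parent;
        rd.new_parent = new_parent;

        /* Takes the rename locks and performs the parent/ancestry
         * checks that callers previously open-coded around
         * lock_rename() and the trap dentry. */
        err = start_renaming_two_dentries(&rd, old_dentry, new_dentry);
        if (err)
                return err;

        err = vfs_rename(&rd);
        end_renaming(&rd);      /* drops the locks and references */
        return err;
}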

View File

@ -378,9 +378,7 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
static int do_eventfd(unsigned int count, int flags)
{
struct eventfd_ctx *ctx;
struct file *file;
int fd;
struct eventfd_ctx *ctx __free(kfree) = NULL;
/* Check the EFD_* constants for consistency. */
BUILD_BUG_ON(EFD_CLOEXEC != O_CLOEXEC);
@ -398,26 +396,19 @@ static int do_eventfd(unsigned int count, int flags)
init_waitqueue_head(&ctx->wqh);
ctx->count = count;
ctx->flags = flags;
ctx->id = ida_alloc(&eventfd_ida, GFP_KERNEL);
flags &= EFD_SHARED_FCNTL_FLAGS;
flags |= O_RDWR;
fd = get_unused_fd_flags(flags);
if (fd < 0)
goto err;
file = anon_inode_getfile_fmode("[eventfd]", &eventfd_fops,
ctx, flags, FMODE_NOWAIT);
if (IS_ERR(file)) {
put_unused_fd(fd);
fd = PTR_ERR(file);
goto err;
}
fd_install(fd, file);
return fd;
err:
eventfd_free_ctx(ctx);
return fd;
FD_PREPARE(fdf, flags,
anon_inode_getfile_fmode("[eventfd]", &eventfd_fops, ctx,
flags, FMODE_NOWAIT));
if (fdf.err)
return fdf.err;
ctx->id = ida_alloc(&eventfd_ida, GFP_KERNEL);
retain_and_null_ptr(ctx);
return fd_publish(fdf);
}
SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)

View File

@ -2165,9 +2165,8 @@ static void clear_tfile_check_list(void)
*/
static int do_epoll_create(int flags)
{
int error, fd;
struct eventpoll *ep = NULL;
struct file *file;
int error;
struct eventpoll *ep;
/* Check the EPOLL_* constant for consistency. */
BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
@ -2184,26 +2183,15 @@ static int do_epoll_create(int flags)
* Creates all the items needed to setup an eventpoll file. That is,
* a file structure and a free file descriptor.
*/
fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
if (fd < 0) {
error = fd;
goto out_free_ep;
}
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
O_RDWR | (flags & O_CLOEXEC));
if (IS_ERR(file)) {
error = PTR_ERR(file);
goto out_free_fd;
}
ep->file = file;
fd_install(fd, file);
return fd;
out_free_fd:
put_unused_fd(fd);
out_free_ep:
FD_PREPARE(fdf, O_RDWR | (flags & O_CLOEXEC),
anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
O_RDWR | (flags & O_CLOEXEC)));
if (fdf.err) {
ep_clear_and_put(ep);
return error;
return fdf.err;
}
ep->file = fd_prepare_file(fdf);
return fd_publish(fdf);
}
SYSCALL_DEFINE1(epoll_create1, int, flags)

View File

@ -1280,10 +1280,9 @@ int begin_new_exec(struct linux_binprm * bprm)
/* Pass the opened binary to the interpreter. */
if (bprm->have_execfd) {
retval = get_unused_fd_flags(0);
retval = FD_ADD(0, bprm->executable);
if (retval < 0)
goto out_unlock;
fd_install(retval, bprm->executable);
bprm->executable = NULL;
bprm->execfd = retval;
}

View File

@ -445,6 +445,7 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
struct file *filp)
{
void __user *argp = (void __user *)arg;
struct delegation deleg;
int argi = (int)arg;
struct flock flock;
long err = -EINVAL;
@ -550,6 +551,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
case F_SET_RW_HINT:
err = fcntl_set_rw_hint(filp, arg);
break;
case F_GETDELEG:
if (copy_from_user(&deleg, argp, sizeof(deleg)))
return -EFAULT;
err = fcntl_getdeleg(filp, &deleg);
if (!err && copy_to_user(argp, &deleg, sizeof(deleg)))
return -EFAULT;
break;
case F_SETDELEG:
if (copy_from_user(&deleg, argp, sizeof(deleg)))
return -EFAULT;
err = fcntl_setdeleg(fd, filp, &deleg);
break;
default:
break;
}

View File

@ -404,32 +404,28 @@ static int handle_to_path(int mountdirfd, struct file_handle __user *ufh,
return retval;
}
static struct file *file_open_handle(struct path *path, int open_flag)
{
const struct export_operations *eops;
eops = path->mnt->mnt_sb->s_export_op;
if (eops->open)
return eops->open(path, open_flag);
return file_open_root(path, "", open_flag, 0);
}
static long do_handle_open(int mountdirfd, struct file_handle __user *ufh,
int open_flag)
{
long retval = 0;
long retval;
struct path path __free(path_put) = {};
struct file *file;
const struct export_operations *eops;
retval = handle_to_path(mountdirfd, ufh, &path, open_flag);
if (retval)
return retval;
CLASS(get_unused_fd, fd)(open_flag);
if (fd < 0)
return fd;
eops = path.mnt->mnt_sb->s_export_op;
if (eops->open)
file = eops->open(&path, open_flag);
else
file = file_open_root(&path, "", open_flag, 0);
if (IS_ERR(file))
return PTR_ERR(file);
fd_install(fd, file);
return take_fd(fd);
return FD_ADD(open_flag, file_open_handle(&path, open_flag));
}
/**

View File

@ -1380,28 +1380,25 @@ int replace_fd(unsigned fd, struct file *file, unsigned flags)
*/
int receive_fd(struct file *file, int __user *ufd, unsigned int o_flags)
{
int new_fd;
int error;
error = security_file_receive(file);
if (error)
return error;
new_fd = get_unused_fd_flags(o_flags);
if (new_fd < 0)
return new_fd;
FD_PREPARE(fdf, o_flags, file);
if (fdf.err)
return fdf.err;
get_file(file);
if (ufd) {
error = put_user(new_fd, ufd);
if (error) {
put_unused_fd(new_fd);
error = put_user(fd_prepare_fd(fdf), ufd);
if (error)
return error;
}
}
fd_install(new_fd, get_file(file));
__receive_sock(file);
return new_fd;
__receive_sock(fd_prepare_file(fdf));
return fd_publish(fdf);
}
EXPORT_SYMBOL_GPL(receive_fd);

View File

@ -1397,27 +1397,25 @@ int fuse_reverse_inval_entry(struct fuse_conn *fc, u64 parent_nodeid,
if (!parent)
return -ENOENT;
inode_lock_nested(parent, I_MUTEX_PARENT);
if (!S_ISDIR(parent->i_mode))
goto unlock;
goto put_parent;
err = -ENOENT;
dir = d_find_alias(parent);
if (!dir)
goto unlock;
goto put_parent;
name->hash = full_name_hash(dir, name->name, name->len);
entry = d_lookup(dir, name);
entry = start_removing_noperm(dir, name);
dput(dir);
if (!entry)
goto unlock;
if (IS_ERR(entry))
goto put_parent;
fuse_dir_changed(parent);
if (!(flags & FUSE_EXPIRE_ONLY))
d_invalidate(entry);
fuse_invalidate_entry_cache(entry);
if (child_nodeid != 0 && d_really_is_positive(entry)) {
if (child_nodeid != 0) {
inode_lock(d_inode(entry));
if (get_node_id(d_inode(entry)) != child_nodeid) {
err = -ENOENT;
@ -1445,10 +1443,9 @@ int fuse_reverse_inval_entry(struct fuse_conn *fc, u64 parent_nodeid,
} else {
err = 0;
}
dput(entry);
unlock:
inode_unlock(parent);
end_removing(entry);
put_parent:
iput(parent);
return err;
}
@ -2230,6 +2227,7 @@ static const struct file_operations fuse_dir_operations = {
.fsync = fuse_dir_fsync,
.unlocked_ioctl = fuse_dir_ioctl,
.compat_ioctl = fuse_dir_compat_ioctl,
.setlease = simple_nosetlease,
};
static const struct inode_operations fuse_common_inode_operations = {

View File

@ -157,7 +157,7 @@ int __init init_mknod(const char *filename, umode_t mode, unsigned int dev)
error = security_path_mknod(&path, dentry, mode, dev);
if (!error)
error = vfs_mknod(mnt_idmap(path.mnt), path.dentry->d_inode,
dentry, mode, new_decode_dev(dev));
dentry, mode, new_decode_dev(dev), NULL);
end_creating_path(&path, dentry);
return error;
}
@ -209,7 +209,7 @@ int __init init_symlink(const char *oldname, const char *newname)
error = security_path_symlink(&path, dentry, oldname);
if (!error)
error = vfs_symlink(mnt_idmap(path.mnt), path.dentry->d_inode,
dentry, oldname);
dentry, oldname, NULL);
end_creating_path(&path, dentry);
return error;
}
@ -233,7 +233,7 @@ int __init init_mkdir(const char *pathname, umode_t mode)
error = security_path_mkdir(&path, dentry, mode);
if (!error) {
dentry = vfs_mkdir(mnt_idmap(path.mnt), path.dentry->d_inode,
dentry, mode);
dentry, mode, NULL);
if (IS_ERR(dentry))
error = PTR_ERR(dentry);
}

View File

@ -67,6 +67,9 @@ int vfs_tmpfile(struct mnt_idmap *idmap,
const struct path *parentpath,
struct file *file, umode_t mode);
struct dentry *d_hash_and_lookup(struct dentry *, struct qstr *);
struct dentry *start_dirop(struct dentry *parent, struct qstr *name,
unsigned int lookup_flags);
int lookup_noperm_common(struct qstr *qname, struct dentry *base);
/*
* namespace.c

View File

@ -2290,27 +2290,25 @@ void stashed_dentry_prune(struct dentry *dentry)
cmpxchg(stashed, dentry, NULL);
}
/* parent must be held exclusive */
/**
* simple_start_creating - prepare to create a given name
* @parent: directory in which to prepare to create the name
* @name: the name to be created
*
* Required lock is taken and a lookup is performed prior to creating an
* object in a directory. No permission checking is performed.
*
* Returns: a negative dentry on which vfs_create() or similar may
* be attempted, or an error.
*/
struct dentry *simple_start_creating(struct dentry *parent, const char *name)
{
struct dentry *dentry;
struct inode *dir = d_inode(parent);
struct qstr qname = QSTR(name);
int err;
inode_lock(dir);
if (unlikely(IS_DEADDIR(dir))) {
inode_unlock(dir);
return ERR_PTR(-ENOENT);
}
dentry = lookup_noperm(&QSTR(name), parent);
if (IS_ERR(dentry)) {
inode_unlock(dir);
return dentry;
}
if (dentry->d_inode) {
dput(dentry);
inode_unlock(dir);
return ERR_PTR(-EEXIST);
}
return dentry;
err = lookup_noperm_common(&qname, parent);
if (err)
return ERR_PTR(err);
return start_dirop(parent, &qname, LOOKUP_CREATE | LOOKUP_EXCL);
}
EXPORT_SYMBOL(simple_start_creating);
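For callers outside libfs, the same contract is provided by start_creating()/end_creating(), which the cachefiles, ecryptfs and nfsd hunks in this view are converted to. A hedged sketch under the semantics described in the kernel-doc above; the helper names and the reworked vfs_create() signature are read off these diffs, not a released API:

static int demo_create_in_dir(struct dentry *parent, const char *name)
{
        struct dentry *child;
        int err;

        /* Locks the parent and returns a held, negative dentry
         * (or an ERR_PTR, e.g. -EEXIST if the name already exists). */
        child = start_creating(&nop_mnt_idmap, parent, &QSTR(name));
        if (IS_ERR(child))
                return PTR_ERR(child);

        err = vfs_create(&nop_mnt_idmap, child, 0600, NULL);

        /* Unlocks the parent and drops the dentry reference; the
         * cachefiles hunk above uses end_creating_keep() when it
         * wants to hang on to the dentry instead. */
        end_creating(child);
        return err;
}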

View File

@ -585,7 +585,7 @@ static const struct lease_manager_operations lease_manager_ops = {
/*
* Initialize a lease, use the default lock manager operations
*/
static int lease_init(struct file *filp, int type, struct file_lease *fl)
static int lease_init(struct file *filp, unsigned int flags, int type, struct file_lease *fl)
{
if (assign_type(&fl->c, type) != 0)
return -EINVAL;
@ -594,13 +594,13 @@ static int lease_init(struct file *filp, int type, struct file_lease *fl)
fl->c.flc_pid = current->tgid;
fl->c.flc_file = filp;
fl->c.flc_flags = FL_LEASE;
fl->c.flc_flags = flags;
fl->fl_lmops = &lease_manager_ops;
return 0;
}
/* Allocate a file_lock initialised to this type of lease */
static struct file_lease *lease_alloc(struct file *filp, int type)
static struct file_lease *lease_alloc(struct file *filp, unsigned int flags, int type)
{
struct file_lease *fl = locks_alloc_lease();
int error = -ENOMEM;
@ -608,7 +608,7 @@ static struct file_lease *lease_alloc(struct file *filp, int type)
if (fl == NULL)
return ERR_PTR(error);
error = lease_init(filp, type, fl);
error = lease_init(filp, flags, type, fl);
if (error) {
locks_free_lease(fl);
return ERR_PTR(error);
@ -1529,29 +1529,35 @@ any_leases_conflict(struct inode *inode, struct file_lease *breaker)
/**
* __break_lease - revoke all outstanding leases on file
* @inode: the inode of the file to return
* @mode: O_RDONLY: break only write leases; O_WRONLY or O_RDWR:
* break all leases
* @type: FL_LEASE: break leases and delegations; FL_DELEG: break
* only delegations
* @flags: LEASE_BREAK_* flags
*
* break_lease (inlined for speed) has checked there already is at least
* some kind of lock (maybe a lease) on this file. Leases are broken on
* a call to open() or truncate(). This function can sleep unless you
* specified %O_NONBLOCK to your open().
* a call to open() or truncate(). This function can block waiting for the
* lease break unless you specify LEASE_BREAK_NONBLOCK.
*/
int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
int __break_lease(struct inode *inode, unsigned int flags)
{
int error = 0;
struct file_lock_context *ctx;
struct file_lease *new_fl, *fl, *tmp;
struct file_lock_context *ctx;
unsigned long break_time;
int want_write = (mode & O_ACCMODE) != O_RDONLY;
unsigned int type;
LIST_HEAD(dispose);
bool want_write = !(flags & LEASE_BREAK_OPEN_RDONLY);
int error = 0;
new_fl = lease_alloc(NULL, want_write ? F_WRLCK : F_RDLCK);
if (flags & LEASE_BREAK_LEASE)
type = FL_LEASE;
else if (flags & LEASE_BREAK_DELEG)
type = FL_DELEG;
else if (flags & LEASE_BREAK_LAYOUT)
type = FL_LAYOUT;
else
return -EINVAL;
new_fl = lease_alloc(NULL, type, want_write ? F_WRLCK : F_RDLCK);
if (IS_ERR(new_fl))
return PTR_ERR(new_fl);
new_fl->c.flc_flags = type;
/* typically we will check that ctx is non-NULL before calling */
ctx = locks_inode_context(inode);
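For callers the conversion is mechanical: the old (mode, type) pair becomes a single flags word. A hedged one-function example using the flag names from the diff above:

static int demo_break_deleg(struct inode *inode)
{
        /* Old form: __break_lease(inode, O_WRONLY | O_NONBLOCK, FL_DELEG). */
        return __break_lease(inode, LEASE_BREAK_DELEG | LEASE_BREAK_NONBLOCK);
}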
@ -1596,7 +1602,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
if (list_empty(&ctx->flc_lease))
goto out;
if (mode & O_NONBLOCK) {
if (flags & LEASE_BREAK_NONBLOCK) {
trace_break_lease_noblock(inode, new_fl);
error = -EWOULDBLOCK;
goto out;
@ -1675,8 +1681,9 @@ void lease_get_mtime(struct inode *inode, struct timespec64 *time)
EXPORT_SYMBOL(lease_get_mtime);
/**
* fcntl_getlease - Enquire what lease is currently active
* __fcntl_getlease - Enquire what lease is currently active
* @filp: the file
* @flavor: type of lease flags to check
*
* The value returned by this function will be one of
* (if no lease break is pending):
@ -1697,7 +1704,7 @@ EXPORT_SYMBOL(lease_get_mtime);
* XXX: sfr & willy disagree over whether F_INPROGRESS
* should be returned to userspace.
*/
int fcntl_getlease(struct file *filp)
static int __fcntl_getlease(struct file *filp, unsigned int flavor)
{
struct file_lease *fl;
struct inode *inode = file_inode(filp);
@ -1713,6 +1720,7 @@ int fcntl_getlease(struct file *filp)
list_for_each_entry(fl, &ctx->flc_lease, c.flc_list) {
if (fl->c.flc_file != filp)
continue;
if (fl->c.flc_flags & flavor)
type = target_leasetype(fl);
break;
}
@ -1724,6 +1732,19 @@ int fcntl_getlease(struct file *filp)
return type;
}
int fcntl_getlease(struct file *filp)
{
return __fcntl_getlease(filp, FL_LEASE);
}
int fcntl_getdeleg(struct file *filp, struct delegation *deleg)
{
if (deleg->d_flags != 0 || deleg->__pad != 0)
return -EINVAL;
deleg->d_type = __fcntl_getlease(filp, FL_DELEG);
return 0;
}
/**
* check_conflicting_open - see if the given file points to an inode that has
* an existing open that would conflict with the
@ -1929,11 +1950,19 @@ static int generic_delete_lease(struct file *filp, void *owner)
int generic_setlease(struct file *filp, int arg, struct file_lease **flp,
void **priv)
{
struct inode *inode = file_inode(filp);
if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
return -EINVAL;
switch (arg) {
case F_UNLCK:
return generic_delete_lease(filp, *priv);
case F_RDLCK:
case F_WRLCK:
if (S_ISDIR(inode->i_mode))
return -EINVAL;
fallthrough;
case F_RDLCK:
if (!(*flp)->fl_lmops->lm_break) {
WARN_ON_ONCE(1);
return -ENOLCK;
@ -2018,8 +2047,6 @@ vfs_setlease(struct file *filp, int arg, struct file_lease **lease, void **priv)
if ((!vfsuid_eq_kuid(vfsuid, current_fsuid())) && !capable(CAP_LEASE))
return -EACCES;
if (!S_ISREG(inode->i_mode))
return -EINVAL;
error = security_file_lock(filp, arg);
if (error)
return error;
@ -2027,13 +2054,13 @@ vfs_setlease(struct file *filp, int arg, struct file_lease **lease, void **priv)
}
EXPORT_SYMBOL_GPL(vfs_setlease);
static int do_fcntl_add_lease(unsigned int fd, struct file *filp, int arg)
static int do_fcntl_add_lease(unsigned int fd, struct file *filp, unsigned int flavor, int arg)
{
struct file_lease *fl;
struct fasync_struct *new;
int error;
fl = lease_alloc(filp, arg);
fl = lease_alloc(filp, flavor, arg);
if (IS_ERR(fl))
return PTR_ERR(fl);
@ -2064,9 +2091,33 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, int arg)
*/
int fcntl_setlease(unsigned int fd, struct file *filp, int arg)
{
if (S_ISDIR(file_inode(filp)->i_mode))
return -EINVAL;
if (arg == F_UNLCK)
return vfs_setlease(filp, F_UNLCK, NULL, (void **)&filp);
return do_fcntl_add_lease(fd, filp, arg);
return do_fcntl_add_lease(fd, filp, FL_LEASE, arg);
}
/**
* fcntl_setdeleg - sets a delegation on an open file
* @fd: open file descriptor
* @filp: file pointer
* @deleg: delegation request from userland
*
* Call this fcntl to establish a delegation on the file.
* Note that you also need to call %F_SETSIG to
* receive a signal when the lease is broken.
*/
int fcntl_setdeleg(unsigned int fd, struct file *filp, struct delegation *deleg)
{
/* For now, no flags are supported */
if (deleg->d_flags != 0 || deleg->__pad != 0)
return -EINVAL;
if (deleg->d_type == F_UNLCK)
return vfs_setlease(filp, F_UNLCK, NULL, (void **)&filp);
return do_fcntl_add_lease(fd, filp, FL_DELEG, deleg->d_type);
}
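From userspace, the new pair mirrors F_GETLEASE/F_SETLEASE but passes a struct instead of an int. A hedged sketch; 'struct delegation' and the F_SETDELEG/F_GETDELEG constants are introduced by this series, and the field names (d_type, d_flags, __pad) are read off the diff above, so the final UAPI may differ:

#include <fcntl.h>
#include <string.h>

static int request_read_delegation(int fd)
{
        struct delegation d;

        memset(&d, 0, sizeof(d));       /* d_flags and __pad must be zero */
        d.d_type = F_RDLCK;             /* ask for a read delegation */
        if (fcntl(fd, F_SETDELEG, &d) < 0)
                return -1;

        /* Query what the kernel actually granted. */
        memset(&d, 0, sizeof(d));
        if (fcntl(fd, F_GETDELEG, &d) < 0)
                return -1;
        return d.d_type;                /* F_RDLCK, F_WRLCK or F_UNLCK */
}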
/**

File diff suppressed because it is too large

View File

@ -3100,19 +3100,7 @@ static struct file *vfs_open_tree(int dfd, const char __user *filename, unsigned
SYSCALL_DEFINE3(open_tree, int, dfd, const char __user *, filename, unsigned, flags)
{
int fd;
struct file *file __free(fput) = NULL;
file = vfs_open_tree(dfd, filename, flags);
if (IS_ERR(file))
return PTR_ERR(file);
fd = get_unused_fd_flags(flags & O_CLOEXEC);
if (fd < 0)
return fd;
fd_install(fd, no_free_ptr(file));
return fd;
return FD_ADD(flags, vfs_open_tree(dfd, filename, flags));
}
/*
@ -4281,10 +4269,10 @@ static unsigned int attr_flags_to_mnt_flags(u64 attr_flags)
SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
unsigned int, attr_flags)
{
struct path new_path __free(path_put) = {};
struct mnt_namespace *ns;
struct fs_context *fc;
struct file *file;
struct path newmount;
struct vfsmount *new_mnt;
struct mount *mnt;
unsigned int mnt_flags = 0;
long ret;
@ -4322,35 +4310,36 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
fc = fd_file(f)->private_data;
ret = mutex_lock_interruptible(&fc->uapi_mutex);
if (ret < 0)
ACQUIRE(mutex_intr, uapi_mutex)(&fc->uapi_mutex);
ret = ACQUIRE_ERR(mutex_intr, &uapi_mutex);
if (ret)
return ret;
/* There must be a valid superblock or we can't mount it */
ret = -EINVAL;
if (!fc->root)
goto err_unlock;
return ret;
ret = -EPERM;
if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
errorfcp(fc, "VFS", "Mount too revealing");
goto err_unlock;
return ret;
}
ret = -EBUSY;
if (fc->phase != FS_CONTEXT_AWAITING_MOUNT)
goto err_unlock;
return ret;
if (fc->sb_flags & SB_MANDLOCK)
warn_mandlock();
newmount.mnt = vfs_create_mount(fc);
if (IS_ERR(newmount.mnt)) {
ret = PTR_ERR(newmount.mnt);
goto err_unlock;
}
newmount.dentry = dget(fc->root);
newmount.mnt->mnt_flags = mnt_flags;
new_mnt = vfs_create_mount(fc);
if (IS_ERR(new_mnt))
return PTR_ERR(new_mnt);
new_mnt->mnt_flags = mnt_flags;
new_path.dentry = dget(fc->root);
new_path.mnt = new_mnt;
/* We've done the mount bit - now move the file context into more or
* less the same state as if we'd done an fspick(). We don't want to
@ -4360,38 +4349,27 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
vfs_clean_context(fc);
ns = alloc_mnt_ns(current->nsproxy->mnt_ns->user_ns, true);
if (IS_ERR(ns)) {
ret = PTR_ERR(ns);
goto err_path;
}
mnt = real_mount(newmount.mnt);
if (IS_ERR(ns))
return PTR_ERR(ns);
mnt = real_mount(new_path.mnt);
ns->root = mnt;
ns->nr_mounts = 1;
mnt_add_to_ns(ns, mnt);
mntget(newmount.mnt);
mntget(new_path.mnt);
/* Attach to an apparent O_PATH fd with a note that we need to unmount
* it, not just simply put it.
*/
file = dentry_open(&newmount, O_PATH, fc->cred);
if (IS_ERR(file)) {
dissolve_on_fput(newmount.mnt);
ret = PTR_ERR(file);
goto err_path;
FD_PREPARE(fdf, (flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0,
dentry_open(&new_path, O_PATH, fc->cred));
if (fdf.err) {
dissolve_on_fput(new_path.mnt);
return fdf.err;
}
file->f_mode |= FMODE_NEED_UNMOUNT;
ret = get_unused_fd_flags((flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0);
if (ret >= 0)
fd_install(ret, file);
else
fput(file);
err_path:
path_put(&newmount);
err_unlock:
mutex_unlock(&fc->uapi_mutex);
return ret;
/*
* Attach to an apparent O_PATH fd with a note that we
* need to unmount it, not just simply put it.
*/
fd_prepare_file(fdf)->f_mode |= FMODE_NEED_UNMOUNT;
return fd_publish(fdf);
}
static inline int vfs_move_mount(const struct path *from_path,
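The fsmount() rewrite above also swaps the open-coded mutex_lock_interruptible()/mutex_unlock() pairing for a scoped guard, so every early return drops the lock automatically. A hedged sketch of the ACQUIRE()/ACQUIRE_ERR() shape as used in the diff; the guard machinery comes from the kernel's cleanup.h infrastructure:

static int demo_locked_check(struct fs_context *fc)
{
        /* The guard takes fc->uapi_mutex here and releases it
         * automatically when 'uapi_mutex' goes out of scope. */
        ACQUIRE(mutex_intr, uapi_mutex)(&fc->uapi_mutex);
        int ret = ACQUIRE_ERR(mutex_intr, &uapi_mutex);

        if (ret)
                return ret;     /* interrupted; nothing to unlock */
        if (!fc->root)
                return -EINVAL; /* guard unlocks on this return too */
        return 0;
}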
@ -5033,19 +5011,17 @@ SYSCALL_DEFINE5(open_tree_attr, int, dfd, const char __user *, filename,
unsigned, flags, struct mount_attr __user *, uattr,
size_t, usize)
{
struct file __free(fput) *file = NULL;
int fd;
if (!uattr && usize)
return -EINVAL;
file = vfs_open_tree(dfd, filename, flags);
if (IS_ERR(file))
return PTR_ERR(file);
FD_PREPARE(fdf, flags, vfs_open_tree(dfd, filename, flags));
if (fdf.err)
return fdf.err;
if (uattr) {
int ret;
struct mount_kattr kattr = {};
struct file *file = fd_prepare_file(fdf);
int ret;
if (flags & OPEN_TREE_CLONE)
kattr.kflags = MOUNT_KATTR_IDMAP_REPLACE;
@ -5061,12 +5037,7 @@ SYSCALL_DEFINE5(open_tree_attr, int, dfd, const char __user *, filename,
return ret;
}
fd = get_unused_fd_flags(flags & O_CLOEXEC);
if (fd < 0)
return fd;
fd_install(fd, no_free_ptr(file));
return fd;
return fd_publish(fdf);
}
int show_path(struct seq_file *m, struct dentry *root)
@ -5148,6 +5119,12 @@ static u64 mnt_to_propagation_flags(struct mount *m)
return propagation;
}
u64 vfsmount_to_propagation_flags(struct vfsmount *mnt)
{
return mnt_to_propagation_flags(real_mount(mnt));
}
EXPORT_SYMBOL_GPL(vfsmount_to_propagation_flags);
static void statmount_sb_basic(struct kstatmount *s)
{
struct super_block *sb = s->mnt->mnt_sb;

View File

@ -431,6 +431,8 @@ void nfs42_ssc_unregister_ops(void)
static int nfs4_setlease(struct file *file, int arg, struct file_lease **lease,
void **priv)
{
if (!S_ISREG(file_inode(file)->i_mode))
return -EINVAL;
return nfs4_proc_setlease(file, arg, lease, priv);
}

View File

@ -1086,7 +1086,7 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
struct auth_domain *client,
struct svc_fh *fhp,
unsigned int may_flags, struct file *file,
struct nfsd_file **pnf, bool want_gc)
umode_t type, bool want_gc, struct nfsd_file **pnf)
{
unsigned char need = may_flags & NFSD_FILE_MAY_MASK;
struct nfsd_file *new, *nf;
@ -1097,13 +1097,13 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
int ret;
retry:
if (rqstp) {
status = fh_verify(rqstp, fhp, S_IFREG,
if (rqstp)
status = fh_verify(rqstp, fhp, type,
may_flags|NFSD_MAY_OWNER_OVERRIDE);
} else {
status = fh_verify_local(net, cred, client, fhp, S_IFREG,
else
status = fh_verify_local(net, cred, client, fhp, type,
may_flags|NFSD_MAY_OWNER_OVERRIDE);
}
if (status != nfs_ok)
return status;
inode = d_inode(fhp->fh_dentry);
@ -1176,15 +1176,18 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
open_file:
trace_nfsd_file_alloc(nf);
if (type == S_IFREG)
nf->nf_mark = nfsd_file_mark_find_or_create(inode);
if (nf->nf_mark) {
if (type != S_IFREG || nf->nf_mark) {
if (file) {
get_file(file);
nf->nf_file = file;
status = nfs_ok;
trace_nfsd_file_opened(nf, status);
} else {
ret = nfsd_open_verified(fhp, may_flags, &nf->nf_file);
ret = nfsd_open_verified(fhp, type, may_flags, &nf->nf_file);
if (ret == -EOPENSTALE && stale_retry) {
stale_retry = false;
nfsd_file_unhash(nf);
@@ -1246,7 +1249,7 @@ nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp,
                      unsigned int may_flags, struct nfsd_file **pnf)
 {
         return nfsd_file_do_acquire(rqstp, SVC_NET(rqstp), NULL, NULL,
-                                    fhp, may_flags, NULL, pnf, true);
+                                    fhp, may_flags, NULL, S_IFREG, true, pnf);
 }
 /**
@@ -1271,7 +1274,7 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
                   unsigned int may_flags, struct nfsd_file **pnf)
 {
         return nfsd_file_do_acquire(rqstp, SVC_NET(rqstp), NULL, NULL,
-                                    fhp, may_flags, NULL, pnf, false);
+                                    fhp, may_flags, NULL, S_IFREG, false, pnf);
 }
 /**
@@ -1314,8 +1317,8 @@ nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
         const struct cred *save_cred = get_current_cred();
         __be32 beres;
 
-        beres = nfsd_file_do_acquire(NULL, net, cred, client,
-                                     fhp, may_flags, NULL, pnf, false);
+        beres = nfsd_file_do_acquire(NULL, net, cred, client, fhp, may_flags,
+                                     NULL, S_IFREG, false, pnf);
         put_cred(revert_creds(save_cred));
         return beres;
 }
@@ -1344,7 +1347,33 @@ nfsd_file_acquire_opened(struct svc_rqst *rqstp, struct svc_fh *fhp,
                          struct nfsd_file **pnf)
 {
         return nfsd_file_do_acquire(rqstp, SVC_NET(rqstp), NULL, NULL,
-                                    fhp, may_flags, file, pnf, false);
+                                    fhp, may_flags, file, S_IFREG, false, pnf);
 }
+
+/**
+ * nfsd_file_acquire_dir - Get a struct nfsd_file with an open directory
+ * @rqstp: the RPC transaction being executed
+ * @fhp: the NFS filehandle of the file to be opened
+ * @pnf: OUT: new or found "struct nfsd_file" object
+ *
+ * The nfsd_file object returned by this API is reference-counted
+ * but not garbage-collected. The object is unhashed after the
+ * final nfsd_file_put(). This opens directories only, and only
+ * in O_RDONLY mode.
+ *
+ * Return values:
+ *   %nfs_ok - @pnf points to an nfsd_file with its reference
+ *   count boosted.
+ *
+ * On error, an nfsstat value in network byte order is returned.
+ */
+__be32
+nfsd_file_acquire_dir(struct svc_rqst *rqstp, struct svc_fh *fhp,
+                      struct nfsd_file **pnf)
+{
+        return nfsd_file_do_acquire(rqstp, SVC_NET(rqstp), NULL, NULL, fhp,
+                                    NFSD_MAY_READ|NFSD_MAY_64BIT_COOKIE,
+                                    NULL, S_IFDIR, false, pnf);
+}
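
All of the nfsd_file_acquire_*() wrappers, including the new nfsd_file_acquire_dir(), funnel into nfsd_file_do_acquire(), which now takes a umode_t type so fh_verify()/fh_verify_local() can check for S_IFDIR as well as S_IFREG, and which skips the fsnotify mark for non-regular files. The shape is one worker parameterized by type plus thin wrappers that bake the constants in; a toy model (the names and the gc flag here are illustrative stand-ins, not the nfsd types):

#include <stdio.h>
#include <sys/stat.h>

/* toy worker: verify the object has the expected type, then "open" it */
static int do_acquire(mode_t type, mode_t actual, int want_gc)
{
        if ((actual & S_IFMT) != type)
                return -1;              /* the fh_verify() analogue */
        printf("acquired %s (gc=%d)\n",
               type == S_IFDIR ? "directory" : "regular file", want_gc);
        return 0;
}

/* thin wrappers bake the constants in, like nfsd_file_acquire_*() */
static int acquire_file(mode_t m)    { return do_acquire(S_IFREG, m, 0); }
static int acquire_file_gc(mode_t m) { return do_acquire(S_IFREG, m, 1); }
static int acquire_dir(mode_t m)     { return do_acquire(S_IFDIR, m, 0); }

int main(void)
{
        acquire_file(S_IFREG);
        acquire_file_gc(S_IFREG);
        return acquire_dir(S_IFDIR);
}

nfsd_file_acquire_dir() additionally hard-codes NFSD_MAY_READ|NFSD_MAY_64BIT_COOKIE, matching the "directories only, and only in O_RDONLY mode" wording of its kernel-doc above.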


@@ -82,5 +82,7 @@ __be32 nfsd_file_acquire_opened(struct svc_rqst *rqstp, struct svc_fh *fhp,
 __be32 nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
                                struct auth_domain *client, struct svc_fh *fhp,
                                unsigned int may_flags, struct nfsd_file **pnf);
+__be32 nfsd_file_acquire_dir(struct svc_rqst *rqstp, struct svc_fh *fhp,
+                             struct nfsd_file **pnf);
 int nfsd_file_cache_stats_show(struct seq_file *m, void *v);
 #endif /* _FS_NFSD_FILECACHE_H */


@@ -281,14 +281,11 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
         if (host_err)
                 return nfserrno(host_err);
 
-        inode_lock_nested(inode, I_MUTEX_PARENT);
-        child = lookup_one(&nop_mnt_idmap,
-                           &QSTR_LEN(argp->name, argp->len),
-                           parent);
+        child = start_creating(&nop_mnt_idmap, parent,
+                               &QSTR_LEN(argp->name, argp->len));
         if (IS_ERR(child)) {
                 status = nfserrno(PTR_ERR(child));
-                goto out;
+                goto out_write;
         }
 
         if (d_really_is_negative(child)) {
@@ -344,7 +341,7 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
         status = fh_fill_pre_attrs(fhp);
         if (status != nfs_ok)
                 goto out;
-        host_err = vfs_create(&nop_mnt_idmap, inode, child, iap->ia_mode, true);
+        host_err = vfs_create(&nop_mnt_idmap, child, iap->ia_mode, NULL);
         if (host_err < 0) {
                 status = nfserrno(host_err);
                 goto out;
@@ -367,9 +364,8 @@ nfsd3_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
         status = nfsd_create_setattr(rqstp, fhp, resfhp, &attrs);
 out:
-        inode_unlock(inode);
-        if (child && !IS_ERR(child))
-                dput(child);
+        end_creating(child);
+out_write:
         fh_drop_write(fhp);
         return status;
 }
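
start_creating() folds the inode_lock_nested(parent) + lookup_one() sequence into a single call, and end_creating() replaces the dput()/inode_unlock() pair. Judging by the converted error paths, a failed start_creating() leaves nothing for the caller to undo (both create paths jump straight to out_write), and end_creating() copes with an ERR_PTR dentry (the nfs4recover.c hunk below reaches it with vfs_mkdir()'s return value). A toy pthread model of the paired-helper shape; the bodies are assumptions, not the VFS implementation:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t parent_lock = PTHREAD_MUTEX_INITIALIZER;

/* lock the parent and look up (here: fabricate) the child in one step;
 * on failure, drop the lock so the caller has nothing to undo */
static const char *start_creating(const char *name)
{
        pthread_mutex_lock(&parent_lock);
        if (!name) {
                pthread_mutex_unlock(&parent_lock);
                return NULL;
        }
        return name;
}

/* release everything start_creating() took, whatever happened since */
static void end_creating(const char *child)
{
        (void)child;    /* the kernel version also drops the dentry ref */
        pthread_mutex_unlock(&parent_lock);
}

int main(void)
{
        const char *child = start_creating("newfile");

        if (!child)
                return 1;       /* nothing held, nothing to clean up */
        printf("creating %s under the locked parent\n", child);
        end_creating(child);
        return 0;
}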


@@ -264,14 +264,11 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
         if (is_create_with_attrs(open))
                 nfsd4_acl_to_attr(NF4REG, open->op_acl, &attrs);
 
-        inode_lock_nested(inode, I_MUTEX_PARENT);
-        child = lookup_one(&nop_mnt_idmap,
-                           &QSTR_LEN(open->op_fname, open->op_fnamelen),
-                           parent);
+        child = start_creating(&nop_mnt_idmap, parent,
+                               &QSTR_LEN(open->op_fname, open->op_fnamelen));
         if (IS_ERR(child)) {
                 status = nfserrno(PTR_ERR(child));
-                goto out;
+                goto out_write;
         }
 
         if (d_really_is_negative(child)) {
@@ -379,10 +376,9 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp,
         if (attrs.na_aclerr)
                 open->op_bmval[0] &= ~FATTR4_WORD0_ACL;
 out:
-        inode_unlock(inode);
+        end_creating(child);
         nfsd_attrs_free(&attrs);
-        if (child && !IS_ERR(child))
-                dput(child);
+out_write:
         fh_drop_write(fhp);
         return status;
 }
@@ -2342,6 +2338,13 @@ nfsd4_get_dir_delegation(struct svc_rqst *rqstp,
                          union nfsd4_op_u *u)
 {
         struct nfsd4_get_dir_delegation *gdd = &u->get_dir_delegation;
+        struct nfs4_delegation *dd;
+        struct nfsd_file *nf;
+        __be32 status;
+
+        status = nfsd_file_acquire_dir(rqstp, &cstate->current_fh, &nf);
+        if (status != nfs_ok)
+                return status;
 
         /*
@@ -2355,8 +2358,21 @@ nfsd4_get_dir_delegation(struct svc_rqst *rqstp,
          * return NFS4_OK with a non-fatal status of GDD4_UNAVAIL in this
          * situation.
          */
+        dd = nfsd_get_dir_deleg(cstate, gdd, nf);
+        nfsd_file_put(nf);
+        if (IS_ERR(dd)) {
+                int err = PTR_ERR(dd);
+
+                if (err != -EAGAIN)
+                        return nfserrno(err);
+                gdd->gddrnf_status = GDD4_UNAVAIL;
+                return nfs_ok;
+        }
+        gdd->gddrnf_status = GDD4_OK;
+        memcpy(&gdd->gddr_stateid, &dd->dl_stid.sc_stateid, sizeof(gdd->gddr_stateid));
+        nfs4_put_stid(&dd->dl_stid);
         return nfs_ok;
 }
 
 #ifdef CONFIG_NFSD_PNFS
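
The notable part of the nfsd4_get_dir_delegation() wiring above is the error policy: only -EAGAIN from nfsd_get_dir_deleg() is treated as "no delegation available right now" and reported as the non-fatal GDD4_UNAVAIL that the RFC 8881 comment argues for, while any other error aborts the operation through nfserrno(). Note also the reference discipline: the nfsd_file is put as soon as nfsd_get_dir_deleg() returns, and the stateid is copied out before nfs4_put_stid() drops the delegation. A condensed sketch of the status mapping (the enum values mirror the hunk; everything else is simplified):

#include <errno.h>
#include <stdio.h>

enum gdd_status { GDD4_OK, GDD4_UNAVAIL };

/* map a delegation-acquisition result onto the wire-visible status */
static int get_dir_delegation(int deleg_err, enum gdd_status *out)
{
        if (deleg_err) {
                if (deleg_err != -EAGAIN)
                        return deleg_err;       /* fatal: abort the op */
                *out = GDD4_UNAVAIL;            /* non-fatal: op succeeds */
                return 0;
        }
        *out = GDD4_OK;
        return 0;
}

int main(void)
{
        enum gdd_status st;

        get_dir_delegation(-EAGAIN, &st);
        printf("-EAGAIN -> %s\n", st == GDD4_UNAVAIL ? "GDD4_UNAVAIL" : "GDD4_OK");
        return get_dir_delegation(0, &st);
}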


@@ -195,13 +195,11 @@ nfsd4_create_clid_dir(struct nfs4_client *clp)
                 goto out_creds;
         dir = nn->rec_file->f_path.dentry;
 
-        /* lock the parent */
-        inode_lock(d_inode(dir));
-        dentry = lookup_one(&nop_mnt_idmap, &QSTR(dname), dir);
+        dentry = start_creating(&nop_mnt_idmap, dir, &QSTR(dname));
         if (IS_ERR(dentry)) {
                 status = PTR_ERR(dentry);
-                goto out_unlock;
+                goto out;
         }
         if (d_really_is_positive(dentry))
                 /*
@@ -212,15 +210,13 @@ nfsd4_create_clid_dir(struct nfs4_client *clp)
                  * In the 4.0 case, we should never get here; but we may
                  * as well be forgiving and just succeed silently.
                  */
-                goto out_put;
-        dentry = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), dentry, S_IRWXU);
+                goto out_end;
+        dentry = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), dentry, 0700, NULL);
         if (IS_ERR(dentry))
                 status = PTR_ERR(dentry);
-out_put:
-        if (!status)
-                dput(dentry);
-out_unlock:
-        inode_unlock(d_inode(dir));
+out_end:
+        end_creating(dentry);
+out:
         if (status == 0) {
                 if (nn->in_grace)
                         __nfsd4_create_reclaim_record_grace(clp, dname,
@@ -328,20 +324,12 @@ nfsd4_unlink_clid_dir(char *name, struct nfsd_net *nn)
         dprintk("NFSD: nfsd4_unlink_clid_dir. name %s\n", name);
 
         dir = nn->rec_file->f_path.dentry;
-        inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
-        dentry = lookup_one(&nop_mnt_idmap, &QSTR(name), dir);
-        if (IS_ERR(dentry)) {
-                status = PTR_ERR(dentry);
-                goto out_unlock;
-        }
-        status = -ENOENT;
-        if (d_really_is_negative(dentry))
-                goto out;
-        status = vfs_rmdir(&nop_mnt_idmap, d_inode(dir), dentry);
-out:
-        dput(dentry);
-out_unlock:
-        inode_unlock(d_inode(dir));
+        dentry = start_removing(&nop_mnt_idmap, dir, &QSTR(name));
+        if (IS_ERR(dentry))
+                return PTR_ERR(dentry);
+        status = vfs_rmdir(&nop_mnt_idmap, d_inode(dir), dentry, NULL);
+        end_removing(dentry);
         return status;
 }
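
The removal side gets the same treatment: start_removing() subsumes the lock/lookup steps and, given that the explicit -ENOENT check for a negative dentry is gone, apparently reports a missing name as an error itself, while end_removing() replaces the dput()/inode_unlock() pair. Reduced to its control-flow shape (helper names from the diff, bodies assumed):

#include <errno.h>
#include <stdio.h>

/* stand-ins; the real helpers operate on dentries and the parent lock */
static int start_removing(const char *name) { return name ? 0 : -ENOENT; }
static void end_removing(void)              { /* drop ref, unlock parent */ }
static int do_rmdir(const char *name)       { printf("rmdir %s\n", name); return 0; }

/* the new shape: one early-out, one paired release, no goto labels */
static int unlink_clid_dir(const char *name)
{
        int status = start_removing(name);

        if (status)
                return status;
        status = do_rmdir(name);
        end_removing();
        return status;
}

int main(void)
{
        return unlink_clid_dir("recovery-dir"); /* placeholder name */
}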
@@ -427,7 +415,7 @@ purge_old(struct dentry *parent, struct dentry *child, struct nfsd_net *nn)
         if (nfs4_has_reclaimed_state(name, nn))
                 goto out_free;
 
-        status = vfs_rmdir(&nop_mnt_idmap, d_inode(parent), child);
+        status = vfs_rmdir(&nop_mnt_idmap, d_inode(parent), child, NULL);
         if (status)
                 printk("failed to remove client recovery directory %pd\n",
                        child);

Some files were not shown because too many files have changed in this diff.