linux/Documentation
Paul E. McKenney 96d3fd0d31 rcu: Break call_rcu() deadlock involving scheduler and perf
Dave Jones got the following lockdep splat:

>  ======================================================
>  [ INFO: possible circular locking dependency detected ]
>  3.12.0-rc3+ #92 Not tainted
>  -------------------------------------------------------
>  trinity-child2/15191 is trying to acquire lock:
>   (&rdp->nocb_wq){......}, at: [<ffffffff8108ff43>] __wake_up+0x23/0x50
>
> but task is already holding lock:
>   (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff811500ff>] __perf_event_task_sched_out+0x2df/0x5e0
>         [<ffffffff81091b83>] perf_event_task_sched_out+0x93/0xa0
>         [<ffffffff81732052>] __schedule+0x1d2/0xa20
>         [<ffffffff81732f30>] preempt_schedule_irq+0x50/0xb0
>         [<ffffffff817352b6>] retint_kernel+0x26/0x30
>         [<ffffffff813eed04>] tty_flip_buffer_push+0x34/0x50
>         [<ffffffff813f0504>] pty_write+0x54/0x60
>         [<ffffffff813e900d>] n_tty_write+0x32d/0x4e0
>         [<ffffffff813e5838>] tty_write+0x158/0x2d0
>         [<ffffffff811c4850>] vfs_write+0xc0/0x1f0
>         [<ffffffff811c52cc>] SyS_write+0x4c/0xa0
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff81733f90>] _raw_spin_lock+0x40/0x80
>         [<ffffffff810980b2>] wake_up_new_task+0xc2/0x2e0
>         [<ffffffff81054336>] do_fork+0x126/0x460
>         [<ffffffff81054696>] kernel_thread+0x26/0x30
>         [<ffffffff8171ff93>] rest_init+0x23/0x140
>         [<ffffffff81ee1e4b>] start_kernel+0x3f6/0x403
>         [<ffffffff81ee1571>] x86_64_start_reservations+0x2a/0x2c
>         [<ffffffff81ee1664>] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff810979d1>] try_to_wake_up+0x31/0x350
>         [<ffffffff81097d62>] default_wake_function+0x12/0x20
>         [<ffffffff81084af8>] autoremove_wake_function+0x18/0x40
>         [<ffffffff8108ea38>] __wake_up_common+0x58/0x90
>         [<ffffffff8108ff59>] __wake_up+0x39/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111b8d>] call_rcu+0x1d/0x20
>         [<ffffffff81093697>] cpu_attach_domain+0x287/0x360
>         [<ffffffff81099d7e>] build_sched_domains+0xe5e/0x10a0
>         [<ffffffff81efa7fc>] sched_init_smp+0x3b7/0x47a
>         [<ffffffff81ee1f4e>] kernel_init_freeable+0xf6/0x202
>         [<ffffffff817200be>] kernel_init+0xe/0x190
>         [<ffffffff8173d22c>] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
>         [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>         [<ffffffff810cc243>] lock_acquire+0x93/0x200
>         [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>         [<ffffffff8108ff43>] __wake_up+0x23/0x50
>         [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [<ffffffff81111450>] __call_rcu+0x140/0x820
>         [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>         [<ffffffff81149abf>] put_ctx+0x4f/0x70
>         [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>         [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>         [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>         [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>         [<ffffffff8173d4e4>] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
>   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
>   Possible unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(&ctx->lock);
>                                 lock(&rq->lock);
>                                 lock(&ctx->lock);
>    lock(&rdp->nocb_wq);
>
>  *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
>  #0:  (&ctx->lock){-.-...}, at: [<ffffffff81154c19>] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
>  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
>  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
>  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
>  [<ffffffff8172a363>] dump_stack+0x4e/0x82
>  [<ffffffff81726741>] print_circular_bug+0x200/0x20f
>  [<ffffffff810cb7ca>] __lock_acquire+0x191a/0x1be0
>  [<ffffffff810c6439>] ? get_lock_stats+0x19/0x60
>  [<ffffffff8100b2f4>] ? native_sched_clock+0x24/0x80
>  [<ffffffff810cc243>] lock_acquire+0x93/0x200
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8173419b>] _raw_spin_lock_irqsave+0x4b/0x90
>  [<ffffffff8108ff43>] ? __wake_up+0x23/0x50
>  [<ffffffff8108ff43>] __wake_up+0x23/0x50
>  [<ffffffff8110d4f8>] __call_rcu_nocb_enqueue+0xa8/0xc0
>  [<ffffffff81111450>] __call_rcu+0x140/0x820
>  [<ffffffff8109bc8f>] ? local_clock+0x3f/0x50
>  [<ffffffff81111bb0>] kfree_call_rcu+0x20/0x30
>  [<ffffffff81149abf>] put_ctx+0x4f/0x70
>  [<ffffffff81154c3e>] perf_event_exit_task+0x12e/0x230
>  [<ffffffff81056b8d>] do_exit+0x30d/0xcc0
>  [<ffffffff810c9af5>] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [<ffffffff810c9bcd>] ? trace_hardirqs_on+0xd/0x10
>  [<ffffffff8105893c>] do_group_exit+0x4c/0xc0
>  [<ffffffff810589c4>] SyS_exit_group+0x14/0x20
>  [<ffffffff8173d4e4>] tracesys+0xdd/0xe2

The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks.  The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.

One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held.  Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:

1.	Determine when it is likely that a relevant scheduler lock is held.

2.	Defer the wakeup in such cases.

3.	Ensure that all deferred wakeups eventually happen, preferably
	sooner rather than later.

We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held.  This works because the relevant locks are always acquired
with interrupts disabled.  We may defer more often than needed, but that
is at least safe.

The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.

This flag is checked by the RCU core processing.  The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set.  Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!).  So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.

This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2013-12-03 10:10:18 -08:00
..
ABI Main batch of InfiniBand/RDMA changes for 3.13: 2013-11-18 15:36:04 -08:00
DocBook doc: fix generation of device-drivers 2013-11-27 20:40:04 -08:00
EDID
PCI
RCU rcu: Break call_rcu() deadlock involving scheduler and perf 2013-12-03 10:10:18 -08:00
accounting
acpi gpiolib / ACPI: document the GPIO descriptor based interface 2013-10-19 23:32:50 +02:00
aoe
arm Allwinner sunXi SoCs machine additions for 3.13 2013-10-28 10:19:38 -07:00
arm64 arm64: Use 42-bit address space with 64K pages 2013-11-05 17:23:52 +00:00
auxdisplay
backlight backlight: lp855x_bl: support new LP8555 device 2013-11-13 12:09:14 +09:00
blackfin
block
blockdev floppy: Correct documentation of driver options when used as a module. 2013-11-08 09:10:31 -07:00
bus-devices
cdrom
cgroups memcg: support hierarchical memory.numa_stats 2013-11-13 12:09:06 +09:00
connector
console
cpu-freq cpufreq: Implement light weight ->target_index() routine 2013-10-25 22:42:24 +02:00
cpuidle cpuidle: remove cpuidle_unregister_governor() 2013-10-30 01:21:24 +01:00
cris
crypto
development-process
device-mapper dm cache: resolve small nits and improve Documentation 2013-11-12 13:11:09 -05:00
devicetree ARM: SoC fixes for 3.13-rc 2013-11-26 11:18:37 -08:00
driver-model spi: Updates for v3.13 2013-11-12 15:01:39 +09:00
dvb
early-userspace
extcon
fault-injection
fb
filesystems Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2013-11-22 08:38:55 -08:00
firmware_class
fmc
frv
gpio Documentation: gpiolib: document new interface 2013-11-25 09:02:30 +01:00
hid
hwmon Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging 2013-11-15 16:35:10 -08:00
i2c i2c: i801: Add Device IDs for Intel Wildcat Point-LP PCH 2013-11-14 18:38:04 +01:00
i2o
ia64
ide
infiniband
input Input: clarify gamepad API ABS values 2013-10-15 23:42:07 -07:00
ioctl ALSA: add DICE driver 2013-10-17 21:18:32 +02:00
isdn
ja_JP
kbuild
kdump
ko_KR
laptops thinkpad-acpi: Add mute and mic-mute LED functionality 2013-10-17 14:38:44 +02:00
leds
m68k
make
memory-devices
metag
mic
mips
misc-devices
mmc
mn10300
mtd
namespaces
netlabel
networking tcp: tsq: restore minimal amount of queueing 2013-11-14 16:25:14 -05:00
nfc
parisc
pcmcia
power More ACPI and power management updates for 3.13-rc1 2013-11-20 13:25:04 -08:00
powerpc
pps
prctl
pti
ptp
rapidio
s390 s390/s390dbf: add debug_level_enabled() function 2013-10-24 17:16:53 +02:00
scheduler H8/300 has been dead for several years, the kernel for it has 2013-11-12 14:13:14 +09:00
scsi
security ima: new templates management mechanism 2013-10-25 17:17:04 -04:00
serial serial: core: delete .set_wake() callback 2013-10-16 13:16:19 -07:00
sh
sound ALSA: Fix typo in documentation/alsa 2013-10-29 11:38:04 +01:00
spi
sysctl vsprintf: check real user/group id for %pK 2013-11-13 12:09:14 +09:00
target target: Remove TF_CIT_TMPL macro 2013-10-16 13:35:02 -07:00
thermal
timers doc: add missing files to timers/00-INDEX 2013-10-27 21:55:50 +00:00
tpm
trace Documentation/trace/tracepoints.txt: add links to TRACE_EVENT documentation 2013-11-13 12:09:32 +09:00
usb doc: usb: Fix typo in Documentation/usb/gadget_configs.txt 2013-10-31 13:31:39 +01:00
vDSO
video4linux
virtual Merge branch 'kvm-ppc-queue' of git://github.com/agraf/linux-2.6 into queue 2013-11-04 10:20:57 +02:00
vm x86, mm: do not leak page->ptl for pmd page tables 2013-11-21 16:42:28 -08:00
w1
watchdog
wimax
x86 Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-11-12 10:48:30 +09:00
xtensa
zh_CN
.gitignore
00-INDEX
BUG-HUNTING
Changes remove obsolete references to powertweak 2013-11-27 20:34:32 -08:00
CodingStyle
DMA-API-HOWTO.txt
DMA-API.txt
DMA-ISA-LPC.txt
DMA-attributes.txt
HOWTO
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
Intel-IOMMU.txt
Makefile
ManagementStyle
SAK.txt
SM501.txt
SecurityBugs
SubmitChecklist
SubmittingDrivers
SubmittingPatches
VGA-softcursor.txt
applying-patches.txt
assoc_array.txt
atomic_ops.txt
bad_memory.txt
basic_profiling.txt
bcache.txt
binfmt_misc.txt
braille-console.txt
bt8xxgpio.txt
btmrvl.txt
bus-virt-phys-mapping.txt
cachetlb.txt
circular-buffers.txt
clk.txt
coccinelle.txt
cpu-hotplug.txt MAINTAINERS: update Zwane Mwaikambo's e-mail address 2013-11-13 12:09:14 +09:00
cpu-load.txt
cputopology.txt
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
devices.txt
digsig.txt
dma-buf-sharing.txt
dmaengine.txt
dmatest.txt dmatest: add a 'wait' parameter 2013-11-14 11:04:40 -08:00
dontdiff
dynamic-debug-howto.txt
edac.txt
efi-stub.txt
eisa.txt
email-clients.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcov.txt gcov: compile specific gcov implementation based on gcc version 2013-11-13 12:09:34 +09:00
highuid.txt
hw_random.txt
hwspinlock.txt
init.txt
initrd.txt
intel_txt.txt
io-mapping.txt
io_ordering.txt
iostats.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt
kernel-docs.txt
kernel-parameters.txt Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2013-11-21 19:46:00 -08:00
kernel-per-CPU-kthreads.txt
kmemcheck.txt
kmemleak.txt
kobject.txt
kprobes.txt
kref.txt
ldm.txt
local_ops.txt
lockdep-design.txt
lockstat.txt
lockup-watchdogs.txt
logo.gif
logo.txt
magic-number.txt
md.txt
media-framework.txt
memory-barriers.txt
memory-hotplug.txt
mono.txt
mutex-design.txt locking/doc: Update references to kernel/mutex.c 2013-11-11 12:41:33 +01:00
nommu-mmap.txt
numastat.txt
oops-tracing.txt
padata.txt
parport-lowlevel.txt
parport.txt
percpu-rw-semaphore.txt
phy.txt
pi-futex.txt
pinctrl.txt pinctrl: add documentation for pinctrl_get_group_pins() 2013-10-16 15:35:21 +02:00
pnp.txt
preempt-locking.txt
printk-formats.txt
pwm.txt Documentation/pwm: Fix trivial typos 2013-10-24 10:51:33 +02:00
ramoops.txt
rbtree.txt
remoteproc.txt
rfkill.txt
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
rt-mutex-design.txt
rt-mutex.txt
rtc.txt
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
smsc_ece1099.txt
sparse.txt
spinlocks.txt
stable_api_nonsense.txt
stable_kernel_rules.txt
static-keys.txt
svga.txt
sysfs-rules.txt
sysrq.txt sysrq: Allow magic SysRq key functions to be disabled through Kconfig 2013-10-16 13:01:44 -07:00
this_cpu_ops.txt
unaligned-memory-access.txt
unicode.txt
unshare.txt
vfio.txt
vgaarbiter.txt
video-output.txt
vme_api.txt
volatile-considered-harmful.txt
workqueue.txt
ww-mutex-design.txt
xz.txt
zorro.txt